Professionals familiar with conventional database management platforms may be intimidated by big data, at least at first. Big data does require a new database management mindset. However, applying the principles of database management to big data is not difficult once you understand how big data is structured.
Big data was developed to address the problem of the data explosion. Consider that Google processed more than 2 million search queries per minute in 2012, and today it receives more than 4 million per minute. The global Internet population is at least 2.4 billion strong and growing, which means the amount of social data is growing at an incredible rate. There are 2.078 billion active social media accounts, including 1.68 billion active mobile social media accounts. Mobile users are driving Internet traffic even faster. Gartner reports that mobile data traffic is expected to grow by 59 percent this year to reach 51.8 million terabytes, growing to 79.5 million terabytes by 2016. Then there is the Internet of Things (IoT). According to Gartner, the number of devices connected to the Internet will grow to 26 billion units in five years, a 30-fold increase from 2009.
All that data can be analyzed to answer business-critical questions. Big data evolved because conventional database management can’t handle that much data, nor can it effectively manage unstructured data such as social media conversations. However, the roots of big data are still firmly planted in well-defined SQL and established database management strategies.
Beyond the Data Warehouse
Big data is defined as data that exceeds the capacity of conventional database systems because it is too big, arrives too fast, or doesn’t fit the structure of most database architectures. More importantly, big data offers a means to distill vast amounts of information into insights that promote better business decisions relating to product development, market development, operational efficiency, customer demand, and market predictions.
Big data addresses three database challenges that conventional database management can’t handle at scale: storage, processing, and management.
Data storage is the first problem. Some companies think they can throw more enterprise storage at big data. Data storage is cheap, but RAID arrays can’t keep pace with data growth. Buying more storage capacity every six months is not only expensive, it also puts stress on the IT team, and it’s still insufficient for most big data projects. That’s why most big data projects are leveraging the elasticity of cloud data storage.
Part of the problem with any kind of big data storage is that most of the data is duplicated or synthesized. Different departments and groups duplicate records for their own use and what starts as 100 terabytes can become a petabyte of data, all of which is stored and backed up without deduplication.
The solution to the duplication problem is to create a distributed data repository where you access one copy of the actual data without creating network overhead. Virtualization is a great tool to meet this need. Virtualization reduces the data footprint and centralizes management of the data. Using a virtual data model you can process data more efficiently, manage access and security from a central location even though the data is distributed, and get analysis that is fast and more accurate since there is less duplicate data.
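To make the "one copy of the actual data" idea concrete, here is a minimal sketch of deduplication by content hashing. It is an illustration only, not how any particular virtualization product works: the `DedupStore` class, the department file names, and the sample records are all hypothetical. Each logical name is a cheap reference to a single stored payload, so three departments "copying" the same record consume the space of one.

```python
import hashlib


class DedupStore:
    """Toy content-addressed store (hypothetical): identical payloads are kept once."""

    def __init__(self):
        self.blobs = {}  # content hash -> bytes (the single physical copy)
        self.names = {}  # logical name -> content hash (a cheap reference)

    def put(self, name, payload):
        # Identical content always hashes to the same key, so a repeated
        # payload adds a reference but no new physical copy.
        digest = hashlib.sha256(payload).hexdigest()
        self.blobs.setdefault(digest, payload)
        self.names[name] = digest
        return digest

    def get(self, name):
        return self.blobs[self.names[name]]

    def logical_bytes(self):
        # Size the data would occupy if every name held its own copy.
        return sum(len(self.blobs[d]) for d in self.names.values())

    def physical_bytes(self):
        # Size actually stored after deduplication.
        return sum(len(b) for b in self.blobs.values())


store = DedupStore()
record = b"customer_id=42,region=EMEA,ltv=1800"  # made-up sample record
for dept in ("sales", "marketing", "finance"):
    store.put(f"{dept}/customers.csv", record)
store.put("finance/ledger.csv", b"account=7,balance=100")

print(store.logical_bytes(), "logical bytes vs", store.physical_bytes(), "stored")
```

Comparing `logical_bytes()` with `physical_bytes()` gives a rough measure of how much duplication a repository is carrying. A production system would also have to handle chunking, hash collisions policy, and distribution across nodes, none of which this sketch attempts.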
That’s the enterprise data model. Now consider the data management model.
Big Data and SQL, NoSQL, and Hadoop
Conventional database management is based on SQL, which provides a common structure for data stored in the data warehouse. Big data adds NoSQL ("not only SQL") databases, which allow you to bring in other data sources, such as unstructured data, for analytics. NoSQL does not exclude SQL; it merely encompasses additional data beyond the SQL data in the data warehouse.
This is where your distributed database infrastructure comes into play. SQL databases manage the structured, archived data in the warehouse. NoSQL databases relax the conventions of SQL structure to support different forms of highly distributed data, providing a means to organize all data for analysis. Hadoop is the framework that handles scalability and distributed storage; its file system, HDFS, spreads data across the nodes of a cluster. Hadoop is fault tolerant and open source, and it provides access to structured and unstructured data stored on clusters of up to thousands of nodes, both in the enterprise and in the cloud.
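As a rough illustration of SQL and schemaless data working side by side, the sketch below keeps customers in a relational table (using Python's built-in `sqlite3` as a stand-in for the data warehouse) and social posts as JSON documents whose fields vary from record to record, the way a document-style NoSQL store would hold them. The table, the sample posts, and the `mentions_by_region` helper are all invented for this example.

```python
import json
import sqlite3

# Structured side: a fixed-schema table, standing in for the data warehouse.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [(1, "Acme", "EMEA"), (2, "Globex", "APAC")],
)

# Unstructured side: documents with no shared schema beyond customer_id.
raw_posts = [
    '{"customer_id": 1, "text": "Love the new release", "likes": 12}',
    '{"customer_id": 2, "text": "Support was slow", "tags": ["support"]}',
    '{"customer_id": 1, "text": "Upgrade went fine", "source": "forum"}',
]
posts = [json.loads(p) for p in raw_posts]


def mentions_by_region(conn, posts):
    """Count posts per region by joining documents to the SQL table."""
    region_of = dict(conn.execute("SELECT id, region FROM customers"))
    counts = {}
    for post in posts:
        region = region_of.get(post["customer_id"], "unknown")
        counts[region] = counts.get(region, 0) + 1
    return counts


result = mentions_by_region(conn, posts)
print(result)
```

The join happens in application code here only to keep the example self-contained; the point is that the documents never had to be forced into the table's schema before they could be analyzed alongside it.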
Working together, virtualization, SQL, NoSQL, Hadoop, and related tools such as MapReduce, Pig, and Hive provide the means to assimilate large data sets and organize them for analysis. In a nutshell, big data takes the data warehouse and lets it scale out across many machines, using tools like NoSQL to accommodate unstructured data sets and using Hadoop to access data wherever it is stored. Strategies like virtualization cut down on network overhead for data calls so analytics are faster, in some cases fast enough to handle real-time responses.
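For readers who have not seen MapReduce before, the following single-process sketch shows the shape of the programming model with the classic word-count example. It does not use Hadoop or any Hadoop API; the function names (`map_phase`, `shuffle`, `reduce_phase`, `map_reduce`) are illustrative, and a real cluster would run the map and reduce steps in parallel across many nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


def map_reduce(documents):
    # On a cluster each document (or file block) would be mapped on a
    # different node; here everything runs in one process.
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


counts = map_reduce(["big data is big", "data is distributed"])
print(counts)
```

Because each map call looks at only one document and each reduce call at only one key, the work can be split across machines without coordination beyond the shuffle step, which is what makes the model a natural fit for distributed storage.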