Big data has been getting a lot of hype as a revolution in database management. Actually, when you consider what goes into big data, you will realize that big data is the next logical step in database management. For decades, companies have been gathering information in structured database formats for use in business intelligence. Database management has been responsible for archiving and accessing that data to get a historical picture of sales, profits, inventory, and other information structured using SQL or some similar database format. Big data now combines that structured data with unstructured data to expand the data set, resulting in better insights.
The value of big data is its ability to integrate structured and unstructured data in the same analysis. Along with historical data and valuable structured information that is part of conventional database management, big data brings new unstructured information from social media conversations, email, documents, even graphics and video into the mix.
Adding Unstructured Data
According to Wikibon, unstructured data is growing faster than structured data. Big data analytics have proven that structured data has value. The challenge now is to take what we have learned from database management and incorporate unstructured data into the mix for analysis.
Unstructured data is defined as digital information that has no metadata and no classification data, i.e., no descriptors that make it suitable for conventional database management. Unstructured data can include video footage, images, PowerPoint files, file shares, web data, log files, retail customer data, and a host of other types of digital information.
Bringing unstructured data into big data requires:
bringing structure to the unstructured data with classification and metadata;
deduplicating the file system to avoid replication and ensure that data can be analyzed either in place or using the metadata; and
integrating the data into big data analytics.
In essence, the unstructured data is normalized and given structure so it can be included for big data analysis.
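The first two steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `classify` rules, the record fields, and the sample file names are hypothetical, and a real pipeline would read files from storage rather than from an in-memory dictionary.

```python
import hashlib

def classify(name: str) -> str:
    """Assign a coarse category from the file extension (hypothetical rule set)."""
    ext = name.rsplit(".", 1)[-1].lower() if "." in name else ""
    categories = {"mp4": "video", "png": "image", "jpg": "image",
                  "pptx": "presentation", "log": "log", "txt": "document"}
    return categories.get(ext, "other")

def build_catalog(files: dict[str, bytes]) -> list[dict]:
    """Return one metadata record per unique piece of content."""
    catalog = {}
    for name, content in sorted(files.items()):
        digest = hashlib.sha256(content).hexdigest()
        if digest in catalog:
            # Same bytes already seen: record the duplicate, keep one copy.
            catalog[digest]["duplicates"].append(name)
            continue
        catalog[digest] = {
            "name": name,
            "sha256": digest,
            "category": classify(name),
            "size_bytes": len(content),
            "duplicates": [],
        }
    return list(catalog.values())

# Hypothetical sample data standing in for a file share.
files = {
    "promo.mp4": b"\x00\x01video-bytes",
    "promo_copy.mp4": b"\x00\x01video-bytes",
    "server.log": b"GET /index.html 200\n",
}
catalog = build_catalog(files)
for record in catalog:
    print(record["name"], record["category"], record["duplicates"])
```

The resulting records are structured rows, so they can be loaded into whatever analytics store is in use, which corresponds to the third step.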
SQL, NoSQL, and Hadoop
The purpose of big data software tools such as NoSQL (not only SQL) is to normalize data so that it can be used for analysis. NoSQL, for example, doesn’t exclude SQL databases but includes other types of data, making NoSQL scalable as well as non-relational. Hadoop is responsible for managing the big data pieces by running applications on systems with thousands of nodes using a distributed file system.
When you look at how relational database management (RDBMS), NoSQL, and Hadoop function, it becomes clear how they work together.
RDBMS has been around for some time using SQL to handle simple transactions. It has worked for years because it adheres to the ACID properties (atomicity, consistency, isolation, and durability), the elements that ensure that database transactions are handled reliably. With the coming of the web, we now have more data and more transactions than a traditional RDBMS can handle. RDBMS wasn’t designed to manage the volume of Amazon or eBay transactions, so now we are in big data territory.
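To make the atomicity part of ACID concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The `accounts` table, the names, and the `transfer` helper are assumptions made up for illustration; the point is only that a transaction either applies all of its updates or none of them.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE accounts ("
    "name TEXT PRIMARY KEY, "
    "balance INTEGER CHECK (balance >= 0))"
)
conn.executemany("INSERT INTO accounts VALUES (?, ?)",
                 [("alice", 100), ("bob", 50)])
conn.commit()

def transfer(conn, src, dst, amount):
    """Move funds between two rows; both updates succeed or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, dst))
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, src))
        return True
    except sqlite3.IntegrityError:
        # The CHECK constraint rejected a negative balance, so the credit
        # applied by the first UPDATE was rolled back as well.
        return False

ok = transfer(conn, "alice", "bob", 30)       # fits within alice's balance
failed = transfer(conn, "alice", "bob", 500)  # would overdraw alice
balances = dict(conn.execute("SELECT name, balance FROM accounts"))
print(ok, failed, balances)
```

The second transfer credits bob before the debit fails, yet bob's balance is unchanged afterwards. That all-or-nothing behavior is what a relational database guarantees and what becomes expensive to maintain across many distributed nodes.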
NoSQL addresses the problem of scale by relaxing the ACID principles. NoSQL applies the concept of eventual consistency. Since NoSQL is highly distributed, different nodes may briefly hold different versions of the same record, depending on where each update arrived. Resolving consistency is pushed up to the application; if no newer information has been delivered to storage, the current information is considered correct. Eventually, all the records reflect the same data and the data is in sync and correct. The advantage of this approach is distributed scalability.
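A toy model can show how replicas drift apart and then converge. The sketch below assumes a simple last-write-wins rule keyed on a timestamp, which is only one of several reconciliation strategies used in practice; the `Replica` class and the `cart:42` key are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Replica:
    """One storage node holding key -> (timestamp, value)."""
    store: dict = field(default_factory=dict)

    def write(self, key, value, timestamp):
        # Accept the write only if it is newer than what this node holds.
        current = self.store.get(key)
        if current is None or timestamp > current[0]:
            self.store[key] = (timestamp, value)

    def read(self, key):
        entry = self.store.get(key)
        return entry[1] if entry else None

    def sync_from(self, other):
        # Replay the other node's entries; older ones are ignored by write().
        for key, (timestamp, value) in other.store.items():
            self.write(key, value, timestamp)

a, b = Replica(), Replica()
a.write("cart:42", ["book"], timestamp=1)          # update lands on node a
b.write("cart:42", ["book", "lamp"], timestamp=2)  # later update lands on node b

before = (a.read("cart:42"), b.read("cart:42"))    # nodes disagree
a.sync_from(b)
b.sync_from(a)
after = (a.read("cart:42"), b.read("cart:42"))     # nodes agree
print(before, after)
```

Between the writes and the sync, a reader gets a different answer depending on which node it asks. Once the nodes exchange state, both return the newest value.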
Hadoop is not a database system; at its core is a file system. The Hadoop Distributed File System (HDFS) is open source, scalable, and distributed, and it offers fault tolerance. Hadoop and MapReduce, which is the heart of Hadoop, provide a software framework that lets you write programs that access unstructured data across thousands of clustered network nodes. The scalability of the file system is what enables the “big” in big data.
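The MapReduce idea can be illustrated without a cluster. The following is a single-process Python sketch of the map, shuffle, and reduce phases applied to a word count; it does not use Hadoop's API, and the function names are chosen here for illustration. On a real cluster, the map and reduce steps would run in parallel on the nodes that hold the data.

```python
from collections import defaultdict

def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1

def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)

def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values)
                for key, values in shuffle(pairs).items())

counts = word_count(["big data big insights", "data at scale"])
print(counts)
```

Because each document is mapped independently and each key is reduced independently, the work can be split across as many nodes as there are chunks of data.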
Working together, RDBMS, NoSQL, and Hadoop consolidate all the data required for analytics. The RDBMS works on the data warehouse, extracting SQL data from the legacy database. NoSQL brings in additional unstructured and structured data that is delivered using a Hadoop-enabled file system. Once the data sources come together, the data is analyzed for specific patterns that deliver new insights about market trends, customer preferences, business operations, and other information useful to business. The results are typically digested and presented in a graphic format programmed in R to make them easy to understand.
When you consider the end-to-end process, big data actually starts with database management and extends beyond the data warehouse, bringing in external sources and resources, many of which reside outside the firewall. Expertise in RDBMS is a solid foundation for building a robust big data business.