As more organizations adopt big data to gain better insight about their products, customers, operations, and potential markets, IT managers consider the impact big data has on their database management strategies. Do they have to mothball their data warehouse? Does their legacy data have any bearing on new big data projects? How do they include big data as part of their overall database management strategy?
Database management systems (DBMS) let you store and retrieve structured information from a data warehouse. Usually that data is accessed using structured query language (SQL), so the format of the data is structured to accommodate SQL queries. With big data, you now have to accommodate the three Vs – volume (more data), variety (data in different formats), and velocity (rapid, real-time results). These three Vs are what make big data unmanageable using conventional database management tools.
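To make the structured-data model concrete, here is a minimal sketch using Python's built-in sqlite3 module: every row fits a fixed schema, which is exactly what lets SQL queries work. The table and column names are hypothetical.

```python
import sqlite3

# In-memory database standing in for a data warehouse table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, units INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("east", 120), ("west", 95), ("east", 40)],
)

# SQL aggregation works because every row shares the same structure.
total_east = conn.execute(
    "SELECT SUM(units) FROM sales WHERE region = 'east'"
).fetchone()[0]
print(total_east)  # 160
conn.close()
```

Unstructured inputs such as video or free-form text have no fixed schema to declare, which is why they fall outside this model.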
The volume and variety of data are growing exponentially. IDC says that the amount of digital data will grow to 35 trillion gigabytes by 2020. In 2011, the amount of digital information created exceeded 1.8 trillion gigabytes, a nine-fold increase in five years. All that data can be harnessed for analysis if you can determine how to capture it, store it, manage it, and analyze it.
Start with the Data Warehouse
When selling big data solutions, start with the data warehouse as your foundation and then expand the database management strategy. Business intelligence is derived from data controlled within the enterprise, e.g., using historical sales data to project future sales. Big data starts with the data in the warehouse and then brings in additional outside data sources. For example, using historical sales data in the data warehouse combined with social media commentary and information about market conditions, you can get a more accurate snapshot of product demand for the coming year.
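The demand-forecasting idea above can be sketched in a few lines: take a baseline from the warehouse's historical sales and nudge it by an external signal. All the numbers, the sentiment score, and the 10% weighting are illustrative assumptions, not a real model.

```python
# Historical quarterly sales pulled from the data warehouse (made-up figures).
historical_sales = {"Q1": 1000, "Q2": 1100, "Q3": 1050, "Q4": 1250}

# Net-positive social media sentiment from an outside feed (hypothetical).
social_sentiment = 0.15

def project_demand(history, sentiment, weight=0.10):
    # Baseline comes from enterprise data; the external signal adjusts it.
    baseline = sum(history.values()) / len(history)
    return baseline * (1 + weight * sentiment)

projection = round(project_demand(historical_sales, social_sentiment), 1)
print(projection)  # 1116.5
```

A real forecast would weight many external sources; the point is that the warehouse supplies the foundation and outside data refines it.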
If you start with the data warehouse and build outward, you find that the pieces missing to support big data are volume, variety, and velocity – the ability to process more information faster.
Let’s start with volume. You need more data storage for big data; the challenge is determining how much. Depending on the external data sources you include in your analytics, you could need a little extra storage or a lot.
From a sales standpoint, you could recommend adding to current enterprise storage with network-attached storage (NAS) and RAID arrays to address immediate needs. However, trying to buy enough storage to stay ahead of big data demand is a losing proposition. You get more flexibility (and more potential profit) from cloud storage systems that can expand with big data needs. Now you have data storage that is extensible beyond the data warehouse.
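The capacity-planning problem can be shown with simple compound-growth arithmetic. The starting capacity and the 40% annual growth rate below are assumptions for illustration only.

```python
# Hypothetical figures: 100 TB today, data growing ~40% per year.
current_tb = 100.0
growth_rate = 0.40

# Capacity needed after three years of compound growth.
year3_tb = round(current_tb * (1 + growth_rate) ** 3, 1)
print(year3_tb)  # 274.4
```

Nearly tripling capacity in three years is why fixed provisioning loses to elastic cloud storage here.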
To support scalability you need tools to manage data distributed across the enterprise and the cloud. This is where you can offer Hadoop expertise. Hadoop’s distributed file system (HDFS) manages data stored across thousands of network nodes, and its processing engine moves computation to where the data lives, applying distributed processing to make data handling more efficient.
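Hadoop's processing model follows the map/reduce pattern: each node processes its local partition of the data independently, then the partial results are merged. A toy sketch in plain Python, where the "partitions" are just in-memory lists rather than HDFS blocks:

```python
from collections import Counter

# Stand-ins for data blocks stored on separate HDFS nodes.
partitions = [
    ["error", "ok", "error"],
    ["ok", "ok", "error"],
]

def map_partition(records):
    # In Hadoop, this step runs on the node that holds the block.
    return Counter(records)

def reduce_counts(partials):
    # Merge the per-node partial counts into a final result.
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total

counts = reduce_counts(map_partition(p) for p in partitions)
print(counts["error"])  # 3
```

Because each map step touches only local data, the work scales out across nodes instead of funneling everything through one machine.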
Where most database management systems handle only structured data queried through SQL, big data tools have the advantage of handling data of all varieties, including video, graphics, email, and text. This unstructured data can’t be processed using SQL, so big data relies on other tools, such as NoSQL databases, to handle unstructured data and virtual resources.
When the relational database management system (RDBMS) can no longer handle the data, it’s time to think about NoSQL. NoSQL is not a substitute for SQL but rather a superset: the name stands for “not only SQL,” so it encompasses SQL workloads alongside other types of data sets.
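The key difference from an RDBMS is that NoSQL document stores don't require records to share a schema. A dict-backed sketch of the idea (this is an illustration of the document model, not a real driver API such as MongoDB's):

```python
# A "collection" where each document can have its own shape.
store = []

# An email record and a video record coexist with different fields;
# a relational table would force both into one fixed schema.
store.append({"type": "email", "to": "ops@example.com", "subject": "outage"})
store.append({"type": "video", "codec": "h264", "duration_s": 94})

# Queries filter on whatever fields a document happens to have.
videos = [doc for doc in store if doc.get("type") == "video"]
print(len(videos))  # 1
```

That schema flexibility is what lets NoSQL systems absorb the variety of big data that breaks rigid relational tables.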
When you start to run up against the limitations of RDBMS, it’s time to sell NoSQL programming to take database management to the next level.
Finally, there is velocity: the rate at which data is handled within the organization. With such a vast pool of potential data scattered throughout the cloud, conventional data processing strategies will take too long to deliver results, especially if you are looking for real-time responses. You need to address the problem of velocity in big data.
Offering virtualization is part of the solution. Virtualizing big data resources offloads data processing to distributed virtual machines and optimizes the number of read/writes. Developing a big data virtualization strategy will address much of the problem of data velocity.
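The offloading idea can be sketched with Python's standard concurrent.futures module: split the work into chunks and hand them to parallel workers, the same pattern a virtualization strategy applies at the scale of virtual machines. The `process_chunk` function is a hypothetical stand-in for real I/O or analysis.

```python
from concurrent.futures import ThreadPoolExecutor

def process_chunk(chunk):
    # Placeholder for a heavier per-chunk job (reads, parsing, analysis).
    return sum(chunk)

# Data split into chunks that can be processed independently.
chunks = [[1, 2, 3], [4, 5], [6]]

# Each worker handles a chunk in parallel instead of one machine
# grinding through the whole data set serially.
with ThreadPoolExecutor(max_workers=3) as pool:
    results = list(pool.map(process_chunk, chunks))

total = sum(results)
print(total)  # 21
```

For I/O-bound work, spreading chunks across workers cuts wall-clock time roughly in proportion to the number of workers, which is the velocity gain virtualized big data processing is after.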
When selling big data solutions, starting on familiar ground such as database management will help you demonstrate what resources are required and how you can add value. If you start with the existing data warehouse and look outward, you can lead your customer through the logical steps to expand their current DBMS infrastructure to embrace big data.