As we said before, big data consulting requires pulling different types of data from multiple, disparate sources and assembling that data for analysis. That requires a different kind of enterprise infrastructure, with different data processing and storage requirements. Most customers are at a loss as to how to build a big data infrastructure that can support Hadoop analytics, so a large part of the value of big data consulting lies in developing an infrastructure that delivers ongoing insight.
Network infrastructure is one of the top priorities for IT executives struggling with big data deployment. In a study conducted by QuinStreet, 40 percent of the 540 IT executives surveyed said that increasing network bandwidth to accommodate big data was a primary concern. IT managers are also adding more data storage, on the assumption that everything must be stored before anyone can determine which data sets are useful for big data analytics.
Hadoop, Not Hadump
Another survey by Ventana shows that Hadoop is taking hold fast with big data users. Of IT managers polled, 54 percent said they are using Hadoop for large-scale data processing, 87 percent are planning or performing new types of data analysis using large data sets, and 82 percent say they would benefit from faster analysis and better use of available computing resources. The same survey shows that 94 percent of Hadoop users are performing analytics on large volumes of data they couldn’t analyze before, including 88 percent who analyze data in greater detail and 82 percent who retain more data.
The problem with a lot of Hadoop development, however, is that it becomes “Hadump”: data from the data warehouse, business intelligence systems, and outside sources is dumped into a Hadoop framework without rhyme or reason. There are two rationales for big data success: start with a specific use case to solve, or store all the data you can for later analysis. Applying the right use case delivers immediate returns, while developing an infrastructure that can expand to accommodate more data provides returns in the long run.
In most big data initiatives, only 30 percent of data collected has value; you just don’t know which 30 percent. The cost of data storage continues to drop, while the cost of analysis and data science remains high, so you want to develop an infrastructure that can streamline analytics while accommodating more types of data storage.
Data Analytics from Hadoop to SQL
Most business intelligence architectures pass data through a data warehouse for analysis. Big data is, in effect, a larger form of business intelligence, so the structure is similar. Place the Hadoop cluster between the raw data and the data warehouse: users with repetitive queries, reports, or dashboards get the data they need from the data warehouse, while power users can access the data warehouse or the Hadoop cluster directly for more complex queries.
The data sources feed the Hadoop cluster with raw information for analysis. The Hadoop cluster serves as the staging area for incoming data, whether it's structured data from operational systems, semi-structured data such as log files or machine-generated data, or unstructured data from the web, text, audio, and video.
The raw data is analyzed in the Hadoop cluster using MapReduce and other processing frameworks. From the Hadoop cluster, the results are fed into a data warehousing hub, where they are distributed to downstream systems such as data marts and analytical sandboxes. These downstream data stores can use conventional SQL tools for reporting and analysis.
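To make the pipeline concrete, here is a minimal sketch of the MapReduce pattern in plain Python, simulating the map, shuffle/sort, and reduce phases a Hadoop job would run. The log lines and the status-code aggregation are hypothetical illustrations, not a real workload; in practice the aggregated output is what would be loaded into the data warehousing hub.

```python
from itertools import groupby

# Hypothetical semi-structured input: web-server log lines.
raw_logs = [
    "2015-03-01 GET /products 200",
    "2015-03-01 GET /cart 500",
    "2015-03-02 GET /products 200",
    "2015-03-02 GET /products 404",
]

def mapper(line):
    """Map phase: emit a (status_code, 1) pair for each log line."""
    date, method, path, status = line.split()
    yield status, 1

def reducer(key, values):
    """Reduce phase: sum the counts for one key."""
    return key, sum(values)

# Shuffle/sort phase: collect mapper output and group it by key,
# as the Hadoop framework would between the map and reduce steps.
mapped = sorted(kv for line in raw_logs for kv in mapper(line))
results = [reducer(k, [v for _, v in group])
           for k, group in groupby(mapped, key=lambda kv: kv[0])]

print(results)  # [('200', 2), ('404', 1), ('500', 1)]
```

The same mapper and reducer logic could run as scripts under Hadoop Streaming; the aggregated rows that come out of the reduce step are small and structured enough for conventional SQL reporting downstream.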
Scaling High-Performance Data Storage
As part of your big data consulting, you need to be ready with storage solutions. Big data has outgrown conventional enterprise storage, so storage needs to scale transparently. Adding modules and arrays is one approach, and more big data infrastructures are turning to cloud storage because it is elastic and can expand to accommodate almost any data need.
Another challenge is the sheer number of files. Using file-system metadata to track files can limit scalability and performance, so big data architectures are adopting object-based storage, which handles large datasets without adding metadata overhead. Object storage can manage billions of files and can scale geographically.
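The scalability advantage comes from the flat namespace: every object is addressed by a single opaque key, so a lookup is one hash probe rather than a walk through nested directory metadata. A minimal, in-memory sketch (a hypothetical toy, not a real object-store API):

```python
class ObjectStore:
    """Toy flat key -> bytes mapping, in the style of object storage."""

    def __init__(self):
        self._objects = {}  # one flat dictionary, no directory tree

    def put(self, key, data):
        """Store an object under a single opaque key."""
        self._objects[key] = data

    def get(self, key):
        """Retrieve an object with one direct lookup."""
        return self._objects[key]

store = ObjectStore()
# Keys may look hierarchical, but they are opaque strings, not directories,
# so billions of them can live in one flat, shardable namespace.
store.put("logs/2015/03/01/server-a.log", b"GET /products 200")
store.put("logs/2015/03/01/server-b.log", b"GET /cart 500")

print(store.get("logs/2015/03/01/server-a.log"))  # b'GET /products 200'
```

Because keys are opaque, the namespace can be partitioned across many nodes or regions by hashing the key, which is what lets object stores scale geographically.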
Latency is another storage challenge. Many big data applications operate in real time and cannot tolerate latency. Clustering storage nodes can increase capacity, and object-based storage can support parallel data streams to reduce latency. Many big data infrastructures also require high IOPS performance for server virtualization, which typically means solid-state storage devices.
Big data consulting requires assembling the right pieces to deliver optimal performance, both in terms of the speed of data delivery and the quality of the data insight. Understanding how to deploy Hadoop and the challenges of high-performance data storage is a good place to start.