Big data applications have a need for speed. Big data is built around the concept of the “three V’s”: volume, variety, and velocity. For big data projects to be valuable, they have to be able to process lots of data (volume) in various structured and unstructured forms (variety) extremely fast (velocity). The value of big data is in the insight it delivers, so the closer to real time you can process data, the more value it has.
In the past, data was analyzed using batch processing. Chunks of information would be accumulated and submitted to the server. This scheme works fine when the rate of incoming data is slower than the batch processing rate. With big data you have more data and more types of data requiring processing right away, so the batch process breaks down. Big data applications require a new strategy to handle data at a rate approaching real time.
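The point at which batch processing breaks down can be illustrated with a little arithmetic. The following is a minimal sketch in Python, not a model of any real system: the function name, the arrival rates, and the batch capacity are all made-up values chosen for illustration.

```python
def simulate_backlog(arrival_rate, batch_capacity, intervals):
    """Return the unprocessed record count left after each batch interval.

    All parameters are hypothetical: arrival_rate is records arriving per
    interval, batch_capacity is records one batch run can process.
    """
    backlog = 0
    history = []
    for _ in range(intervals):
        backlog += arrival_rate                  # records that arrived this interval
        backlog -= min(backlog, batch_capacity)  # the batch job processes what it can
        history.append(backlog)
    return history


# Incoming data is slower than the batch rate: the backlog stays at zero.
keeping_up = simulate_backlog(arrival_rate=800, batch_capacity=1000, intervals=5)

# Incoming data is faster than the batch rate: the backlog grows every interval.
falling_behind = simulate_backlog(arrival_rate=1200, batch_capacity=1000, intervals=5)

print(keeping_up)      # [0, 0, 0, 0, 0]
print(falling_behind)  # [200, 400, 600, 800, 1000]
```

Once arrivals outpace capacity, the backlog never recovers, so results fall further behind real time with each interval.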
Being able to use real-time data as part of big data analytics provides a new level of insight, and that insight is what makes big data valuable. Consider the amount of real-time information that can be made available for decision-making, such as data from mechanical sensors on the production line or unstructured data from social media. Twitter alone generates 58 million tweets per day, which averages out to roughly 670 tweets per second. If you are analyzing sentiment on trending topics and use Twitter as part of the big data mix, batch processing won’t be able to keep up. To deliver valuable business insight, big data applications have to be designed for speed.
“Watson, I Need You, NOW!”
IBM’s Watson computer project showed the power of real-time big data in 2011 when a computer was able to beat two human champions at “Jeopardy.” The demonstration showed that a new kind of architecture that uses more in-memory processing rather than accessing data stored on disk was faster and better suited to big data. A new kind of programming approach needs to be adopted to make better use of in-memory hardware.
This presents a programming paradox. Apache Hadoop has become the most popular platform for developing big data applications, yet Hadoop was developed as a batch-oriented system. When Yahoo! introduced Hadoop it proved to be well-suited for analytics and offered the added advantages of being open source and highly scalable. However, its real-time possibilities are still being proven, and faster alternatives such as NoSQL databases and in-memory data stores are filling the gap.
Programmers have been trying to compensate for the batch-processing structure of Hadoop’s MapReduce using tools like the HBase database, SQL interfaces such as Impala, and in-memory frameworks such as Spark. However, for real-time data processing, NoSQL databases often still deliver better performance.
To deal with the problem, the Apache Software Foundation released Spark v1.0 in May 2014, which allows developers to write big data applications in Java, Scala, or Python using more than 80 high-level operators. The new Spark release is said to run up to 100 times faster than Hadoop MapReduce when the data fits in memory.
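To give a feel for what programming with high-level operators over in-memory data looks like, here is a small word count written in plain Python. It is only a stand-in for the style, not Spark's API. The sample lines are invented, and the whole dataset sits in an ordinary Python list rather than a distributed collection.

```python
from collections import Counter
from functools import reduce

# Invented sample data, held entirely in memory.
lines = [
    "big data needs speed",
    "speed needs memory",
    "big memory big speed",
]

# Flat-map style step: split every line into individual words.
words = [word for line in lines for word in line.split()]

# Map style step: turn each word into a one-entry count.
pairs = map(lambda word: Counter({word: 1}), words)

# Reduce style step: merge the per-word counts into one total.
counts = reduce(lambda left, right: left + right, pairs, Counter())

print(counts["big"], counts["speed"], counts["memory"])  # 3 3 2
```

Each step works on data that is already in memory, so nothing is written to disk between stages. This is the property in-memory frameworks rely on for speed.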
Architecting for Big Data
To facilitate speed, big data applications have to be aided by smart hardware design. Solid-state drives (SSDs) based on NAND flash memory are well suited to big data applications because they offer far lower latency than spinning disks. SSDs can be used to speed performance anywhere in the big data infrastructure, including host cache, network cache, storage arrays, or even hybrid storage arrays with an SSD tier.
Data virtualization is another strategy that improves performance for big data applications. Virtualization creates an abstraction layer within the IT architecture, pulling data from various sources and combining it with other data for analytics. Data virtualization speeds performance by allowing retrieval and manipulation of data without having to know how the data is stored or formatted. Virtualizing Hadoop promotes high availability and better performance for big data applications.
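The abstraction-layer idea can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not how any particular data virtualization product works. The `VirtualView` class, the reader functions, and the sensor records are all hypothetical, and the two "sources" are just in-memory strings in CSV and JSON form.

```python
import csv
import io
import json

# Two hypothetical sources holding the same kind of record in different formats.
CSV_SOURCE = "sensor,reading\nline-1,71.5\nline-2,68.0\n"
JSON_SOURCE = '[{"sensor": "line-3", "reading": 74.2}]'


def read_csv(text):
    """Yield records from CSV text as plain dictionaries."""
    for row in csv.DictReader(io.StringIO(text)):
        yield {"sensor": row["sensor"], "reading": float(row["reading"])}


def read_json(text):
    """Yield records from JSON text as plain dictionaries."""
    for item in json.loads(text):
        yield {"sensor": item["sensor"], "reading": float(item["reading"])}


class VirtualView:
    """Presents several differently formatted sources as one stream of rows."""

    def __init__(self):
        self._sources = []

    def register(self, reader, payload):
        # Remember which reader understands which source.
        self._sources.append((reader, payload))

    def rows(self):
        # Callers iterate over rows without knowing the underlying format.
        for reader, payload in self._sources:
            yield from reader(payload)


view = VirtualView()
view.register(read_csv, CSV_SOURCE)
view.register(read_json, JSON_SOURCE)

# The analytics code sees one uniform set of records.
hot = [row["sensor"] for row in view.rows() if row["reading"] > 70]
print(hot)  # ['line-1', 'line-3']
```

The query at the end never mentions CSV or JSON. Adding a new source means registering another reader, while the analytics code stays unchanged.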
So putting the velocity in big data is about promoting high availability for streams of data. You can compensate for the Hadoop batch processing architecture by using smarter programming tools and by applying an architecture that makes the data more accessible to big data applications. Performance should be a major consideration in any big data initiative. The closer you can get to real-time data processing, the more value the results from big data will have for customers.
What’s your preferred approach to design for big data speed? Is it more of a hardware problem, a software problem, or both?