Value-added resellers eyeing the big data market appreciate the need for data storage, lots of data storage. But which data storage technologies do they need to add to their catalog? And what types of data storage are best suited to big data applications? How do you maximize profits from big data storage and still choose the technology that delivers the best results?
Big data, by definition, requires lots of data storage, or, as Gartner defines it, high-volume, high-velocity, and wide-variety data assets. Big data can comprise both structured and unstructured data, which means large volumes of data that change fast and frequently. The payoff from big data, of course, is that it provides actionable insight unavailable from conventional database technology. Big data systems ingest vast amounts of structured and unstructured data to yield valuable results, which means big data storage has to expand to accommodate the demands of any big data project.
Big data also is driving demand for more data storage, as the market projections demonstrate. IDC forecasts a 40 percent compound annual growth rate and predicts the data storage market will grow from $3.2 billion in 2010 to $17 billion in 2015. A Microsoft survey also revealed that 62 percent of enterprise respondents are already storing at least 100 terabytes of data. And while big data storage is booming, cloud data storage is catching up: Wikibon estimates that the cloud market for big data will grow from $0.36 billion in 2011 to $3.65 billion by 2017.
So where does the reseller invest in data storage to support big data? As they say, “it depends.”
Big Data Applications Dictate Storage Types
When you consider the demands of big data analytics, it’s clear that one type of storage isn’t going to suit all analytics applications. For example, big data often requires large unstructured data sets for rapid, real-time analytics, such as those used to trigger targeted, interest-based ads in Google or Amazon. That requires incredibly fast file access.
The data storage methodology depends on the big data application. For large-capacity applications, such as oil and gas exploration or life sciences, the concern is having sufficient bandwidth to transfer large files while ensuring performance doesn’t degrade as more files are added. For analytics applications, the emphasis is on fast response time, i.e., the time to transfer data to and from the storage platform. Capacity and throughput, or sustained bandwidth, are both important considerations.
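To make the capacity-versus-throughput tradeoff concrete, here is a rough back-of-the-envelope sizing sketch in Python. The file size and bandwidth figures are purely illustrative, not drawn from any particular platform.

```python
# Rough sizing: how long does it take to move a large file at a given
# sustained bandwidth? Useful for sanity-checking a storage choice.

def transfer_time_seconds(file_size_gb: float, bandwidth_mb_per_s: float) -> float:
    """Seconds to transfer a file of file_size_gb at a sustained rate."""
    return (file_size_gb * 1024) / bandwidth_mb_per_s

# Example: a 500 GB exploration data set over a 400 MB/s sustained link
# takes about 21 minutes, before any contention from other workloads.
print(transfer_time_seconds(500, 400))  # 1280.0 seconds
```

The same arithmetic run in reverse (target response time divided into data volume) gives the sustained bandwidth a storage platform has to deliver.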
Big Data Storage Options
Here are some of the storage options to consider for your big data practice:
1. Direct Attached Storage:
For the Hadoop Distributed File System (HDFS), developers typically use a scale-out architecture that applies distributed processing with low-cost, off-the-shelf servers. To minimize latency and cost, Hadoop implementations usually use SATA (Serial ATA) drives connected directly to the servers, the idea being to use massive scale-out to move the computing capacity closer to the data. These configurations typically demand direct-attached storage (DAS). Using HDFS on DAS has a number of advantages: large files are broken into blocks and distributed across the cluster, delivering low latency without shared storage.
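The block-distribution idea can be sketched in a few lines of Python. This is a simplified illustration of HDFS-style placement, not HDFS itself: the node names are invented, the 128 MB block size is a common default, and real HDFS also replicates each block and weighs rack topology rather than assigning round-robin.

```python
# Illustrative sketch: a large file is cut into fixed-size blocks, and
# each block is assigned to a data node's local (DAS) disk. Real HDFS
# adds replication and topology-aware placement on top of this idea.

BLOCK_SIZE_MB = 128  # a common HDFS default; configurable in practice

def place_blocks(file_size_mb: int, data_nodes: list[str]) -> list[tuple[int, str]]:
    """Assign each block of a file to a data node, round-robin."""
    num_blocks = -(-file_size_mb // BLOCK_SIZE_MB)  # ceiling division
    return [(i, data_nodes[i % len(data_nodes)]) for i in range(num_blocks)]

# A 1000 MB file across three hypothetical nodes yields 8 blocks:
for block_id, node in place_blocks(1000, ["node-a", "node-b", "node-c"]):
    print(f"block {block_id} -> {node}")
```

Because each node computes against the blocks sitting on its own directly attached disks, the work moves to the data instead of the data moving across shared storage.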
2. Storage Area Networks:
Using HDFS on a storage area network (SAN) functions much like DAS, but with logical volumes in storage arrays; HDFS can’t tell the difference, since the volumes still appear as locally attached disks. What makes a SAN different from DAS is that instead of storing data on a directly attached disk, data lives in one or more arrays attached to the data nodes through the SAN. The storage looks local, each array has its own cache, redundancy, and replication, and any node on the SAN can access any array volume.
3. Network Attached Storage:
Scale-out or clustered network-attached storage (NAS) is ideal for big data since the storage architecture can be expanded as needed. With scale-out (as opposed to scale-up), new hardware can be added and configured to use resources more efficiently. This saves you from having to buy massive arrays up front and hope they are big enough down the road.
4. Object storage:
One of the challenges with clustered NAS is that the system becomes unwieldy as you add more files, because NAS relies on traditional, tree-like file systems. Object storage applies a flat data structure, assigning a unique identifier to each file and indexing the data and its location, so it can handle far more data objects than a hierarchical structure. Object-based storage can expand file counts into the billions and can scale geographically.
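The flat-namespace idea is easy to show in miniature. The sketch below is a toy in-memory illustration, not any vendor's object store: one index maps unique identifiers straight to data and metadata, with no directory tree to traverse, which is what lets object counts scale.

```python
# Toy illustration of an object store's flat namespace: every object
# gets a unique ID, and a single index maps IDs to data and metadata.
import uuid

class ObjectStore:
    def __init__(self):
        self._index = {}  # object_id -> (data, metadata); flat, no hierarchy

    def put(self, data: bytes, metadata: dict) -> str:
        object_id = str(uuid.uuid4())  # unique identifier per object
        self._index[object_id] = (data, metadata)
        return object_id

    def get(self, object_id: str) -> bytes:
        data, _ = self._index[object_id]
        return data

store = ObjectStore()
oid = store.put(b"sensor readings", {"source": "rig-7"})  # invented metadata
print(store.get(oid))  # b'sensor readings'
```

Compare this with a file system, where finding a file means walking a path through nested directories; here every lookup is a single index hit regardless of how many objects exist, and the index itself can be partitioned across sites for geographic scale.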
5. Cloud storage:
The elasticity of the cloud makes it ideal for big data applications. That said, Hadoop clusters put a heavy load on storage, and cloud storage performance is often inadequate for big data analytics, although many vendors are finding ways to address that problem. Cloud capabilities for big data support vary with each vendor, but while analytics in the cloud may be challenging, the extensible data storage capacity is certainly there. Virtualization infrastructures can address many of these issues by abstracting data storage and providing a heterogeneous pool of resources.
No one storage solution suits every big data application. But if you understand how to weigh these five big data storage strategies, you can build a balanced big data infrastructure that delivers both performance and scalability.