When developing a big data project, how you handle big data storage matters. The big data infrastructure has to be able to handle large amounts of data while delivering enough input/output operations per second (IOPS) for analytics. The type of storage you choose has to support analytics while also meeting other requirements, such as capacity and privacy.
The biggest problem with big data storage is sheer volume. According to IDC, by 2020 the amount of data created annually worldwide will reach 40 zettabytes, which is 50 times more data than was created in 2010. For clarification, a zettabyte is 1,000 exabytes, or 1 million petabytes, or 1 billion terabytes. Even dipping a bucket in the ocean of available data requires a lot of big data storage.
So when choosing big data storage providers, you have to consider the type of storage that best serves the big data project.
NAS vs. Object Storage
There are basically two ways to approach big data storage in a manner that delivers storage scalability and analytics processing power: network-attached storage (NAS) and object storage.
NAS provides a dedicated storage device, such as a RAID array, with its own IP address to provide file-based data storage to any device on the network. For big data, NAS is clustered to provide scalable storage and compute capacity, often using parallel file systems stored across many nodes.
Object storage offers a more scalable approach because of its simplicity. Rather than using a directory hierarchy, object storage stores files in a flat organization, indexing each file with a unique identifier and location. Object storage requires less overhead and uses less metadata, so it can handle billions of files, making it ideal for big data storage. It is also better suited for cloud data storage, since it spreads storage across multiple virtual resources, including geographically.
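The flat organization described above can be sketched in a few lines of Python. This is an illustrative model, not any vendor's implementation: objects live in a flat namespace keyed by a generated unique identifier, with a small metadata record alongside each one, and there is no directory tree to traverse.

```python
import uuid

class ObjectStore:
    """Minimal sketch of a flat object store: no directory hierarchy,
    every object is indexed by a unique identifier plus metadata."""

    def __init__(self):
        self._objects = {}   # object_id -> raw bytes
        self._metadata = {}  # object_id -> metadata dict

    def put(self, data, metadata=None):
        # The identifier is generated, not a path chosen by the caller.
        object_id = str(uuid.uuid4())
        self._objects[object_id] = data
        self._metadata[object_id] = metadata or {}
        return object_id

    def get(self, object_id):
        return self._objects[object_id]

store = ObjectStore()
oid = store.put(b"sensor readings", {"source": "plant-7"})
print(store.get(oid))  # b'sensor readings'
```

Because lookup is a single key-to-object mapping rather than a walk through nested directories, the same design scales to billions of objects spread across many physical locations.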
Depending on your potential data needs, you may want to plan for throughput with NAS or greater flexibility and scalability with object storage.
Hadoop Clusters vs. NoSQL
The Apache Hadoop framework uses the Hadoop Distributed File System (HDFS) to create a “shared nothing” architecture. In a distributed computing cluster, compute nodes are connected but storage is structured as direct-attached storage (DAS), usually eight to ten disks per node configured as RAID.
One of the objectives is to reduce latency, which is why the data is stored as close to the compute node as possible. If you are going to adopt Hadoop, be conscious of how you virtualize your data storage and be sure that your big data storage approach can reduce processing latency.
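The locality idea behind this can be sketched as a simple scheduling rule. This is a hypothetical model, not HDFS code: each block is replicated on a few nodes, and the scheduler prefers to run a task on a node that already holds the block locally, so the data never crosses the network.

```python
# Hypothetical block-location map: each block is replicated on 3 nodes,
# mimicking HDFS-style replication over direct-attached storage.
block_locations = {
    "block-1": {"node-a", "node-b", "node-c"},
    "block-2": {"node-b", "node-d", "node-e"},
}

def pick_node(block_id, available_nodes):
    """Prefer a node that stores the block locally (no network read);
    fall back to any available node if none of the replicas is free."""
    local = block_locations[block_id] & available_nodes
    return sorted(local)[0] if local else sorted(available_nodes)[0]

print(pick_node("block-2", {"node-a", "node-d"}))  # node-d: local read
print(pick_node("block-1", {"node-d", "node-e"}))  # node-d: remote read
```

The second call shows the fallback: when no replica holder is free, the task still runs, but it pays the latency cost of fetching the block over the network, which is exactly what the shared-nothing layout tries to avoid.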
The alternative to Hadoop is NoSQL. NoSQL databases are designed for fast storage and retrieval of data without using the tabular structure of SQL databases. Where Hadoop uses parallel processing and data storage across commodity hardware, NoSQL is useful for accessing “key-value” pairs, so it is more valuable for applications that revolve around one key piece of data.
NoSQL is optimized for very fast performance by capturing and quickly storing a single identifying key. This makes it easier for NoSQL to access data from multiple distributed sources and rapidly store large numbers of transactions. However, NoSQL is not as adept at dealing with lots of structured, semi-structured, and unstructured data.
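The key-value access pattern described above can be sketched in Python. The record keys and fields here are illustrative, not any particular database's schema: every record is written and fetched by a single identifying key, with no tables, joins, or query planner involved.

```python
# Minimal key-value sketch of the NoSQL access pattern: a single
# identifying key maps directly to a record.
kv = {}

def put(key, value):
    """Store a record under its identifying key."""
    kv[key] = value

def get(key, default=None):
    """Fetch a record by key in one lookup; no tabular query needed."""
    return kv.get(key, default)

put("user:1001", {"name": "Ada", "last_login": "2014-05-02"})
put("user:1002", {"name": "Grace", "last_login": "2014-05-03"})
print(get("user:1001")["name"])  # Ada
```

Each write and read touches exactly one key, which is why this model handles very high transaction rates across distributed sources; the trade-off is that any question not phrased as "give me the record for this key" falls outside its sweet spot.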
Big Data Storage Shopping Criteria
So as you can see, depending on the nature of your big data project, your data storage needs will have to adapt to the data analytics requirements. When shopping for big data storage vendors, consider:
Capacity: Can your storage platform scale to accommodate the petabytes of data you might need for analysis? Can your storage handle a large number of files, perhaps even billions, without suffering from file system management problems?
Latency: Do your analytics require real-time data? Many big data environments will require high IOPS performance, which will require virtualization, including cloud storage.
Access: As your big data deployment evolves, you will want to include different data sets for comparison. To accommodate new data, storage infrastructures are starting to include global file systems that support multiple hosts for file access.
Cost: In addition to storage for processing, you will also need archival storage. To control costs, prioritize what data is needed for immediate analysis and what data can be parked in less accessible locations for access when needed. You will need to balance local and cloud-based data resources.
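The cost criterion above amounts to a tiering policy. The rule below is a simple illustrative sketch, not a vendor feature: data touched within a recent window stays on fast "hot" storage for immediate analysis, while everything else is parked on cheaper archival storage.

```python
from datetime import date, timedelta

# Illustrative policy: data accessed in the last 30 days is "hot";
# older data moves to cheaper archival storage. The 30-day window
# is an assumed threshold, not a standard.
HOT_WINDOW = timedelta(days=30)

def tier(last_accessed, today):
    """Classify a dataset as hot (fast storage) or archive (cheap storage)."""
    return "hot" if today - last_accessed <= HOT_WINDOW else "archive"

today = date(2014, 6, 1)
datasets = {
    "clickstream-may": date(2014, 5, 28),
    "logs-2013": date(2013, 12, 31),
}
for name, last_touched in datasets.items():
    print(name, "->", tier(last_touched, today))
```

A real deployment would likely weigh access frequency and retrieval cost, not just age, but even this crude rule shows how separating immediate-analysis data from parked data keeps the fast tier small.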
Chances are that you are going to use multiple big data storage companies. Big data lends itself well to virtualization, and you will want to balance the capacity of all your data storage systems with performance.
So what are the dominant criteria for your big data system? Capacity? Throughput? Cost? Where do you make your big data storage compromises?