In the world of big data, the more data you have, the better the analytics and the more insight you can extract, so more organizations are storing ever larger volumes of data. It’s no longer uncommon for organizations to store petabytes of data (a petabyte is 1,000 terabytes, or 10^15 bytes). So finding the right big data storage solution is an important part of any big data initiative.
Big data is a loosely defined term for data sets too large to process using conventional technology, and big data can consist of anything: structured data, unstructured data, social media conversations, text. According to Gartner, enterprise data will grow 650 percent in the next five years, and IDC adds that the world’s information doubles every 18 months. IDC says that in 2011 we created 1.8 zettabytes (1.8 trillion gigabytes) of information, and we will generate 40 zettabytes by 2020. No wonder big data storage is a challenge for any enterprise.
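The units quoted above are easier to compare once they share a common base. The following is a minimal sketch, assuming decimal (SI) prefixes; the `UNITS` table and the `convert` helper are illustrative names, not part of any particular tool.

```python
# Decimal (SI) storage units expressed in bytes.
UNITS = {
    "GB": 10**9,
    "TB": 10**12,
    "PB": 10**15,
    "ZB": 10**21,
}


def convert(value, from_unit, to_unit):
    """Convert a quantity between the storage units listed in UNITS."""
    return value * UNITS[from_unit] / UNITS[to_unit]


print(convert(1, "PB", "TB"))    # terabytes in one petabyte
print(convert(1.8, "ZB", "GB"))  # gigabytes in 1.8 zettabytes
```

With these definitions, one petabyte works out to 1,000 terabytes and 1.8 zettabytes to roughly 1.8 trillion gigabytes, matching the figures in the text.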
Data Storage Is Cheap, But Not Free
One of the drivers for big data is the fact that data storage has become so cheap. As George Dyson observed, "Big data is what happened when the cost of storing information became less than the cost of making the decision to throw it away."
It’s tempting to keep everything in case you find a brilliant data scientist who can uncover a needle in the big data haystack. The challenge, of course, is that the bigger the haystack, the harder it is to find the needle. Storing everything is unrealistic. Is that data going to increase in value with time? Is your big data storage going to ensure that the data you need is accessible, available, usable, verifiable, protected, and ultimately affordable to include in big data analytics?
Most data’s relevance declines over time, and some data subsets will have less value for your needs than others. Prioritize your big data storage so you are only archiving information that will deliver insight, and consider disposing of outdated data to make room for fresh information. Ironically, the less data you store, the more efficient and cost-effective your big data project will be.
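One way to make that prioritization concrete is a simple age-based rule. The sketch below is only an illustration: the thresholds, the `must_retain` flag, and the `storage_action` function are hypothetical and would need to reflect your own retention and compliance requirements.

```python
from datetime import date

# Hypothetical thresholds; tune them to your own retention requirements.
ARCHIVE_AFTER_DAYS = 365
DISPOSE_AFTER_DAYS = 5 * 365


def storage_action(last_used, today, must_retain=False):
    """Return 'keep', 'archive', or 'dispose' for a data set based on its age."""
    age_days = (today - last_used).days
    if age_days >= DISPOSE_AFTER_DAYS:
        # Data that must be retained is archived rather than disposed of.
        return "archive" if must_retain else "dispose"
    if age_days >= ARCHIVE_AFTER_DAYS:
        return "archive"
    return "keep"


today = date(2013, 6, 1)
print(storage_action(date(2013, 5, 1), today))
print(storage_action(date(2012, 1, 1), today))
print(storage_action(date(2005, 1, 1), today))
```

A real policy would probably weigh more than age, such as how often the data is queried or how expensive it was to collect, but even a crude rule like this forces the question of what is worth keeping.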
Big Data Storage Checks and Balances
As Tim Stammers, senior analyst for Ovum, notes, there is no one solution for big data storage: “It depends on your application.” Vendors will recommend object storage, clustered NAS (network-attached storage), SANs (storage area networks), or whatever they understand best.
When choosing a big data storage solution you need to consider your performance requirements – how accessible do you need data to be for effective analytics? How critical is the data? Do you replace your data findings with new information each day? Has the data taken a long time to gather, which means you need reliable mirroring or backup?
One concern many customers have is scalability. If they invest in 500 terabytes of data storage today, will they need another 500 TB in six months? How easy will it be to integrate more storage?
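A rough projection can help frame the scalability question. The sketch below assumes steady exponential growth, which real workloads may not follow; the function name `months_until_full` and the example inputs are hypothetical, and the 18-month doubling period is borrowed from the IDC figure quoted earlier purely for illustration.

```python
import math


def months_until_full(used_tb, capacity_tb, doubling_months):
    """Months until stored data reaches capacity, assuming steady exponential growth."""
    if used_tb <= 0 or doubling_months <= 0:
        raise ValueError("used_tb and doubling_months must be positive")
    if used_tb >= capacity_tb:
        return 0.0
    # Number of doublings needed to reach capacity, times the doubling period.
    return doubling_months * math.log2(capacity_tb / used_tb)


# Hypothetical example: 250 TB stored today on a 500 TB system.
print(months_until_full(250, 500, 18))
```

Under those assumptions, a system that is half full has one doubling period of headroom left, so a customer whose data doubles faster than every 18 months would need to expand sooner.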
Many Hadoop programmers write programs on the assumption that data storage and data processing will be on the same machine. Naturally, the closer the data is to the processing, the better the performance will be. Using SAN storage or cloud data storage may reduce performance, or may require you to rethink how you allocate your Hadoop processing. On the other hand, if you choose to store the data using a SAN or cloud storage system, you can use that data for other purposes as well, such as CRM, ERP, supply chain planning, or financials.
Other Big Data Storage Concerns
There are other considerations specific to big data storage. Security and compliance, for example, may be a factor depending on the customer. Financial data, medical records, insurance records, and government data all have specific security standards and are subject to government regulations. When designing your big data infrastructure you have to be careful about cross-referencing data in a way that could reveal sensitive information.
What about data archiving and disaster recovery? Tape is still the most economical archiving medium, and some tape cartridges can hold several terabytes. If, however, you need to access archived data quickly, as in the case of an audit, you may want to consider a different storage medium.
Big data is also driving sales of commodity hardware to contain costs. Software companies are developing their own “white box” data storage software to run on off-the-shelf hardware specifically for big data.
Work with your customers to define their big data needs before recommending a storage system. Chances are that you will need to integrate different kinds of data storage to build a successful big data infrastructure. What do you look for when you consider requirements for big data storage?