Big data has turned the entire concept of enterprise storage and backups upside down. As part of your big data consulting practice you are undoubtedly telling clients that the more data they have available to mine, the better their big data analytics results will be. Data storage becomes more of an asset than ever before. And even though the cost of data storage continues to fall, the volume of information required for big data projects continues to outpace available storage. So for our next big data consulting tip, we want to review how you should think about big data storage, especially as it relates to analytics and backup.
Big data is clearly fueling the data storage market. IDC predicts that big data storage will maintain a 40 percent compound annual growth rate, growing from $3.2 billion to $17 billion by 2015. A survey Microsoft conducted of its enterprise customers reveals that 62 percent are storing at least 100 terabytes of data. Big data is going to continue to drive storage sales through the roof.
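As a quick sanity check on IDC's figures, you can solve for how many years of 40 percent compound growth it takes to get from $3.2 billion to $17 billion. The helper function below is illustrative; only the three quoted numbers come from the article.

```python
import math

def years_to_grow(start: float, end: float, cagr: float) -> float:
    """Solve start * (1 + cagr) ** n = end for n (years of compound growth)."""
    return math.log(end / start) / math.log(1 + cagr)

# IDC's figures: $3.2B growing to $17B at a 40% CAGR
n = years_to_grow(3.2, 17.0, 0.40)
print(round(n, 1))  # about 5 years of growth
```

In other words, the projection is internally consistent with roughly a five-year horizon ending in 2015.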
As part of your big data consulting practice you need to be prepared to make recommendations as to how to best balance storage needs, including backup, with big data resources. How much enterprise storage will you want to allocate? What about data archives? How much cloud capacity? Answers to these questions will depend on the nature of the analytics, how big a data pool is required, how often you have to dip into that pool, and how much of that data you can archive.
What Data to Store
The challenge with big data archiving strategies is that it's hard to tell what's significant. You may need short-term data for immediate, real-time decision making, such as tracking stock trades, and you may also need to archive data for later analysis, such as assessing market performance or identifying trends in historical data.
In either case, to provide accurate analytics you need a complete set of raw data. You can't predict which data sets are going to be valuable for analytics, so you have to keep them all. Much of that data may have to be stored in the cloud for ready access by queries, while other data may be archived on more static media since it's seldom called for.
Start by assessing the backup technology you already have in place. For data sets that you know have to be frequently accessed, consider using local data stores with storage snapshots to make access easier. You can add local storage or archival storage to accommodate big data storage, but be sure that whatever storage you add is compatible with existing data archives and backup systems. For data that has to be archived but is not likely to be needed for analytics, consider using a legacy tape system; tape is reliable and cost-effective.
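The routing logic above can be sketched as a simple decision function. This is only an illustration: the access-frequency threshold and the storage-target names are assumptions, not recommendations for any particular product.

```python
# Illustrative sketch: route a data set to a storage target based on how
# often it is accessed and whether analytics will ever need it.
# The threshold (10 accesses/month) and target names are assumptions.

def storage_target(accesses_per_month: int, needed_for_analytics: bool) -> str:
    if needed_for_analytics and accesses_per_month >= 10:
        return "local-disk-with-snapshots"   # frequent analytic access
    if needed_for_analytics:
        return "cloud-object-store"          # queryable, but colder
    return "tape-archive"                    # retained, rarely read

print(storage_target(50, True))   # local-disk-with-snapshots
print(storage_target(1, False))   # tape-archive
```

The point of encoding the rule, even this crudely, is that it forces you to state the access assumptions behind each tier explicitly before you buy hardware.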
Don’t Back Up Everything – Restore or Reproduce
Remember that archiving big data is different from archiving big data analytics; raw big data archives will take far more storage than the analytics derived from them. However, full data backups aren’t an option in either case: there is too much data, and it’s too expensive to store it all.
Instead, determine what data doesn’t have to be backed up because it’s easier and cheaper to reproduce. A database report, for example, can be regenerated more easily than it can be restored. Exclude that data from your backup.
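That triage can be expressed as a small rule of thumb: back up anything that can't be reproduced, and for reproducible data compare the cost of regenerating it against the cost of storing a copy. The function and cost figures below are hypothetical, purely to make the decision explicit.

```python
# Hypothetical backup-triage rule: back up only data that cannot be
# reproduced, or whose reproduction would cost more than storing a copy.
# The cost inputs are illustrative, not measured values.

def should_back_up(reproducible: bool,
                   reproduce_cost: float,
                   store_cost: float) -> bool:
    if not reproducible:
        return True                       # time/event data: no second chance
    return reproduce_cost > store_cost    # e.g. an expensive overnight job

# A database report that regenerates cheaply is excluded from backup:
print(should_back_up(True, reproduce_cost=1.0, store_cost=5.0))   # False
# Time-stamped machine data can't be replayed, so it is backed up:
print(should_back_up(False, reproduce_cost=0.0, store_cost=5.0))  # True
```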
The data that you can’t reproduce is usually time- and event-sensitive. Machine data generated by sensors, for example, is specific to a point in time and can’t be replicated. This kind of data usually consists of lots of small files with a great deal of duplication, so it’s best suited to disk storage, which can handle small-file transfers and deduplication.
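The deduplication idea is straightforward to sketch: hash each file's contents and keep only one copy per unique hash. This is a minimal content-addressed illustration, not how any particular dedup appliance works; the file names and payloads are made up.

```python
import hashlib

# Minimal content-addressed deduplication sketch: many small sensor files
# carry identical payloads, so store one copy per unique content hash.
# (File names and contents here are invented for illustration.)

def dedupe(files: dict) -> dict:
    store = {}
    for _name, payload in files.items():
        digest = hashlib.sha256(payload).hexdigest()
        store.setdefault(digest, payload)   # keep the first copy only
    return store

readings = {
    "sensor-a/0001.log": b"temp=21.4",
    "sensor-b/0001.log": b"temp=21.4",   # duplicate payload
    "sensor-a/0002.log": b"temp=21.9",
}
print(len(dedupe(readings)))  # 2 unique blocks instead of 3 files
```

Real deduplicating disk targets do this at the block level rather than the file level, but the storage saving comes from the same principle.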
As part of your backup strategy, consider using disk storage in conjunction with tape archives. The deduplicated disk data can be transferred to tape for safekeeping, since the data being stored can’t be reproduced. That way you have a secure archive on a cost-efficient and stable storage medium in the event of a corrupted disk or disk failure.
Developing big data backup strategies is part of big data consulting, and it will require an understanding of how to balance storage media with the demand for immediate data access for analytics. Consider adopting a tiered data storage strategy that encompasses solid-state storage, high-speed disks, tape, cloud storage, and other platforms. Data protection could be a limiting factor in your big data design, so triage the data, determining which data sets are critical, which can be reproduced, and which can be archived against future need.