Big data analysis requires both server computing power and storage, but storage is clearly the dominant hardware cost. While Hadoop does an excellent job of distributing analytics processing across multiple servers, providing data storage that offers high availability and scalability is a persistent challenge, and that challenge grows with the terabytes or petabytes of data you are trying to analyze.
Big data storage strategies are clearly more efficient than conventional data warehousing. One five-year analysis of the total cost of data (TCOD) found that handling 500 TB of stored data costs less than one-third as much with a big data platform as with a data warehouse: roughly $9.5 million for the Hadoop-based system, because Hadoop distributes data more efficiently, versus $30 million for the warehouse. Hadoop lowers the overall hardware requirements, which makes storage the greatest hardware overhead.
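A quick back-of-the-envelope check of those TCOD figures, using only the dollar amounts cited above, shows how the "less than one-third" claim works out:

```python
# Back-of-the-envelope check of the five-year TCOD comparison cited above.
# The dollar amounts come from the article; everything else is arithmetic.
hadoop_cost_usd = 9_500_000       # big data (Hadoop) system, 500 TB over five years
warehouse_cost_usd = 30_000_000   # traditional data warehouse, same 500 TB

ratio = hadoop_cost_usd / warehouse_cost_usd
print(f"Hadoop TCOD is {ratio:.0%} of the warehouse TCOD")  # ~32%, under one-third

# Per-terabyte cost for each approach over the same five years
tb = 500
print(f"Hadoop: ${hadoop_cost_usd / tb:,.0f}/TB vs warehouse: ${warehouse_cost_usd / tb:,.0f}/TB")
```

That works out to $19,000 per terabyte for the Hadoop system against $60,000 per terabyte for the warehouse.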
Most big data architects start with a single array of clustered network-attached storage (NAS). As their big data needs grow, they add more arrays. The problem with this strategy is that it isn’t sustainable: at some point the cost and effort of managing such a NAS infrastructure become unwieldy. You need a different strategy.
So what specifications do you look for in big data storage? Here are eight considerations for your checklist:
- Scalability – Trying to anticipate big data storage requirements is virtually impossible. You would have to calculate the data needed to run applications and predictive models for each big data category, including future demands. However, there is probably one primary application that drives most of the company’s revenue and is the focal point for your big data initiative. Use that application to gauge your initial storage requirements. As your needs grow, you can buy more disk space or add cloud storage. Whatever strategy you choose, make sure storage scalability has minimal impact on data throughput and administration overhead.
- High data availability – In order to have value, data has to be readily accessible. However, as you archive more data, managing availability becomes more challenging. Traditional policy engines spread new data across RAID architectures, but RAID becomes less viable as storage demands grow. A better approach to consider is “wide area storage,” where data is represented as objects and the objects are dispersed across multiple storage nodes.
- Supports tiered storage – A big data storage architecture needs to be able to prioritize data, keeping frequently analyzed data readily available while archiving data you don’t need right away. Most big data storage systems use a storage hierarchy spanning flash memory, disk, tape, and other media.
- Self-managing – Most enterprise storage systems are shared by multiple applications and users, so the data storage system has to tier data across media types without conflicting with other applications’ storage. You need to be able to program and automate big data movement, including how long data is kept before being moved to slower media. For example, if older data is accessed from the same tape archive multiple times in a week, the system should move that data to disk for faster access.
- Wide accessibility – Stored content needs to be available throughout the organization. Distributing data geographically so it’s closer to remote users has become increasingly important for performance, which is why big data architects are adopting more cloud storage.
- Supports both analytics and content applications – Data is seldom dedicated to a single task such as big data analytics. Unstructured files such as web logs or financial data may be needed in user applications as well as in big data analytics. Big data storage needs to accommodate both analytics and applications in a single shared architecture.
- Self-healing – A well-architected big data storage system can automatically work around component failures so users see no disruption in service.
- Supports various cloud configurations – Installing sufficient data storage on site for most big data applications isn’t practical, so whatever big data storage architecture you create needs to be able to accommodate public, private, and hybrid cloud environments from the outset.
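The “wide area storage” idea from the availability bullet above can be illustrated with a tiny placement function: each object is deterministically mapped to several storage nodes, so no single node failure makes it unreachable. This is only a sketch; the node names and replica count are hypothetical, and production systems use consistent hashing or erasure coding rather than this simple scheme.

```python
# Illustrative sketch of object dispersal across storage nodes.
# Node names and the replica count are made up for the example.
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d", "node-e"]
REPLICAS = 3

def placement(object_id: str) -> list[str]:
    """Pick REPLICAS distinct nodes for an object, deterministically."""
    digest = int(hashlib.sha256(object_id.encode()).hexdigest(), 16)
    start = digest % len(NODES)
    # Walk the node ring so the same object id always lands on the same nodes.
    return [NODES[(start + i) % len(NODES)] for i in range(REPLICAS)]

print(placement("customer-orders-2024.parquet"))  # same id, same nodes, every time
```

Because placement is deterministic, any client can locate an object’s replicas without a central directory, and losing one node still leaves two copies reachable.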
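The self-managing bullet above describes a promotion rule: data read from tape several times in a week should move to disk. A minimal sketch of that policy, assuming hypothetical tier names and an arbitrary threshold (real systems expose this through their own policy engines, not an API like this one):

```python
# Minimal sketch of a self-managing tiering policy. Tier names and the
# promotion threshold are assumptions for illustration only.
from dataclasses import dataclass

PROMOTE_THRESHOLD = 3  # accesses per week before tape-resident data moves to disk

@dataclass
class DataSet:
    name: str
    tier: str = "tape"        # "flash", "disk", or "tape"
    weekly_accesses: int = 0

def record_access(ds: DataSet) -> None:
    """Count an access and promote tape-resident data that is read often."""
    ds.weekly_accesses += 1
    if ds.tier == "tape" and ds.weekly_accesses >= PROMOTE_THRESHOLD:
        ds.tier = "disk"      # faster media for data that is suddenly hot

archive = DataSet("q3-weblogs")
for _ in range(3):
    record_access(archive)
print(archive.tier)  # prints "disk" after repeated access
```

A real policy engine would also demote data that cools off again; the point is that movement between tiers is automated rather than handled by an administrator.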
These basic big data storage considerations should give you a place to start. There are many other storage challenges you may have to deal with depending on the environment and applications. The important thing to remember is that big data storage requires high availability, enough bandwidth to transfer large files, and the scalability to handle a growing file count without degrading throughput. So where do you see the greatest reseller revenue opportunity from big data storage?