What kind of big data solution does your customer need? Choosing the right platforms for compute clusters, data storage, and analytics is a matter of choosing the right-sized tools for the job. When designing a big data solution, start by asking the right questions. Once you understand the questions you want to answer and the resources you have available, you can start designing the right big data solution.
Start by understanding where the hidden costs lie. Hadoop is open source and provides a low-cost means of processing large data sets on commodity hardware, but you still need hardware and storage. By comparison, the traditional database approach is cost-prohibitive at this scale, and it can’t scale fast enough. For example, a petabyte Hadoop cluster requires about 250 nodes at a cost of $1 million, which is a fraction of the cost of an enterprise data warehouse. Even so, there are hidden costs in adding storage and redundancy to the ecosystem.
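To make those figures concrete, here is a rough back-of-the-envelope sketch in Python. It uses only the numbers quoted above (one petabyte, about 250 nodes, $1 million). It assumes the petabyte is raw capacity in decimal terabytes, which the figures do not actually specify. The function name and the `replication_factor` parameter are illustrative, not part of any Hadoop tool.

```python
# Figures quoted in the text above.
CLUSTER_CAPACITY_TB = 1000      # 1 PB, assumed to be raw capacity (decimal TB)
NODE_COUNT = 250
CLUSTER_COST_USD = 1_000_000


def cluster_unit_costs(capacity_tb, nodes, cost_usd, replication_factor=1):
    """Return rough unit costs for a cluster (hypothetical helper).

    replication_factor is the number of copies kept of each block;
    each extra copy reduces usable capacity and raises cost per usable TB.
    """
    return {
        "cost_per_node_usd": cost_usd / nodes,
        "raw_tb_per_node": capacity_tb / nodes,
        "cost_per_usable_tb_usd": cost_usd * replication_factor / capacity_tb,
    }


no_redundancy = cluster_unit_costs(CLUSTER_CAPACITY_TB, NODE_COUNT, CLUSTER_COST_USD)
three_copies = cluster_unit_costs(
    CLUSTER_CAPACITY_TB, NODE_COUNT, CLUSTER_COST_USD, replication_factor=3
)

print(no_redundancy)
print(three_copies)
```

Under these assumptions each node costs about $4,000 and holds about 4 TB. With three copies of every block (the default HDFS replication factor), the cost per usable terabyte triples from $1,000 to $3,000, which is one way the hidden cost of redundancy shows up.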
To determine the right size big data solution, you have to start by asking the right questions.
What Do You Want to Know?
Usually an organization has a general idea of the type of data it wants to analyze, but its motivations for initiating a big data project are seldom clear-cut. Once you start delving into the data and uncovering patterns, one result may lead to another question that requires further data gathering and analysis. To reveal these “unknown unknowns,” you need to create a baseline with a few basic use cases: gather key data, build some predictive and statistical models to answer your questions, and then see what insights result. Now you have a better idea of the scope of your investigation and you can start looking at your big data requirements.
Consider the three basic components of big data as defined by Gartner:
- Velocity – the speed of incoming data and the number of events and elements being stored.
- Variety – the various forms of structured and unstructured data that need to be processed and normalized for analysis.
- Volume – how many terabytes or petabytes of data you need to manage.
Now add a fourth consideration – complexity. How are you going to handle data distribution and processing across on-premises, cloud, and hybrid platforms?
When you map the parameters of the question to the velocity, variety, and volume required for statistical analysis, you start to see the scope of your requirements and get a sense of what you will need in a big data solution.
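As a rough illustration of that mapping, the sketch below turns a velocity estimate into a volume estimate. The function name and all input values (events per second, average event size, retention period) are hypothetical placeholders, not figures from any real deployment. Substitute numbers from your own baseline use cases.

```python
SECONDS_PER_DAY = 86_400


def estimate_raw_volume_tb(events_per_second, avg_event_bytes, retention_days):
    """Estimate raw storage in decimal terabytes (hypothetical helper).

    Ignores compression, indexing, and replication, all of which change
    the real footprint.
    """
    total_bytes = (
        events_per_second * avg_event_bytes * SECONDS_PER_DAY * retention_days
    )
    return total_bytes / 1e12


# Hypothetical workload: 10,000 events per second, 1,000 bytes each, kept for a year.
yearly_volume_tb = estimate_raw_volume_tb(
    events_per_second=10_000, avg_event_bytes=1_000, retention_days=365
)
print(f"{yearly_volume_tb:.2f} TB")
```

With these placeholder inputs the estimate is roughly 315 TB of raw data per year, before any redundancy is added. Even a crude calculation like this shows whether you are planning for terabytes or petabytes.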
What Do You Have in Place?
Now that you have started to gauge the size of the elephant, take an inventory of what you already have available in your big data ecosystem. Are your datasets large or small: terabytes or petabytes? Can you expand your existing data warehouse? Do you have to throw data away because you can’t store it? Is there a lot of “low-touch” data that is seldom needed for analytics? Do you want to explore complex or large amounts of data? Answering these types of questions will tell you if you can expand your current infrastructure.
What about the cost of expansion? What tools and technologies exist in your current data warehouse? Is the existing system scalable? Do you have adequate processing power and data storage capacity? Are the right governance and policies in place? Answering these kinds of questions will tell you if it’s cost-effective to expand your existing infrastructure into a big data solution.
Consider your future requirements as well. Adding new data sources can be costly, so can you solve your business problems with existing data? Assess your application portfolio, too. Do you need a basic Hadoop platform, or will you need to add specialized software? Is it worth investing in a more robust Hadoop solution if you need it only for a current use case?
Can You Implement Big Data in Stages?
You don’t have to eat the elephant all at once. Most big data solutions can be incrementally implemented. You need to define the scope of the business problem in measurable terms, and match those terms to revenue gains.
If the business case is too limited in scope, then the business benefits won’t be achieved. If the scope of the project is too large, then you won’t have the resources to achieve the results in a cost-effective or timely way. If you can define the core requirements of the project from the outset, you can be more confident that you are building the right big data solution.
As with most projects, careful planning makes for a more successful outcome. Ask the right questions, consider your data needs, apply use cases, and test your assumptions before investing in a big data solution. What is your biggest concern when setting the scope for a big data project?