The merging of big data and software defined networking (SDN) is inevitable. SDN decouples the network control and forwarding functions so that network control can be programmed through an SDN controller, abstracting the underlying architecture for more efficient delivery of network services. What makes SDN inevitable for big data is that both draw on virtual network resources as needed, including storage. Big data relies on highly distributed, diverse networking, including cloud-based storage, so the only way to get the agility and responsiveness that big data analytics demands is with software defined storage (SDS). So how does your big data team start to make use of software defined storage?
According to sources at IBM Storage Software Development, the amount of information is doubling every 18 to 24 months and data storage is growing at 20 to 40 percent per year, although storage budgets are only growing at 1 to 5 percent annually. So there is an avalanche of new data available for big data analytics scattered across servers and cloud resources.
How do you manage storage growth, contain storage costs, and still provide access to the data you need for big data analytics? That’s where software defined storage comes into play.
Accessing Software Defined Storage
To take advantage of software defined storage as part of big data, the best approach is for your big data team to tackle it one phase at a time, starting with using SDN to access big data for analytics.
SDN includes both northbound and southbound interfaces to manage traffic with components up and down the network. Northbound APIs carry communication between the SDN controller and the applications and orchestration systems above it, while the southbound interface communicates with the routers and switches below to manage traffic. OpenFlow has become the most popular open protocol for the southbound interface.
OpenFlow is ideal for both SDN and big data since it allows easy, programmable routing and switching of virtual resources. Both SDN and big data rely on virtualization for elasticity and agility, and OpenFlow routing decisions can be programmed using the SDN controller. OpenFlow enables the SDN controller to discover network topology, including storage sources, and to install the network flows that applications request through the northbound APIs.
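As a rough illustration, an OpenFlow flow entry pairs a match on packet headers with an action the switch should take. The sketch below builds such an entry as a plain Python structure; the field names are illustrative and not tied to any specific controller's API:

```python
def make_flow_entry(dst_ip, out_port, priority=100):
    """Build a simplified OpenFlow-style flow entry: match traffic
    destined for a storage node and forward it out a given port."""
    return {
        "priority": priority,
        "match": {
            "eth_type": 0x0800,   # IPv4 traffic
            "ipv4_dst": dst_ip,   # destination storage node
        },
        "actions": [{"type": "OUTPUT", "port": out_port}],
    }

# Steer traffic bound for a storage server out switch port 3
entry = make_flow_entry("10.0.0.20", out_port=3)
```

In a real deployment the controller would translate a structure like this into OpenFlow messages and push it to the switches over the southbound interface.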
Using SDN to Access Hadoop Clusters
Hadoop is typically used for big data stores. Hadoop provides a distributed file system (HDFS) that spreads data storage across clusters, along with a framework (MapReduce) to distribute the processing of large data sets, whether that data is structured or unstructured.
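A minimal sketch of the MapReduce pattern helps here: the classic word count, written as plain Python functions. With Hadoop Streaming the same mapper and reducer logic would read stdin and write stdout across a cluster, but the map, shuffle/sort, and reduce phases are the same:

```python
from itertools import groupby

def mapper(line):
    """Map phase: emit a (word, 1) pair for every word in a line."""
    for word in line.split():
        yield (word.lower(), 1)

def reducer(word, counts):
    """Reduce phase: sum the counts for a single word."""
    return (word, sum(counts))

def run_mapreduce(lines):
    # Map phase: apply the mapper to every input line
    pairs = [kv for line in lines for kv in mapper(line)]
    # Shuffle/sort phase: group all pairs by key (the word)
    pairs.sort(key=lambda kv: kv[0])
    # Reduce phase: one reducer call per distinct word
    return dict(reducer(word, (c for _, c in grp))
                for word, grp in groupby(pairs, key=lambda kv: kv[0]))

counts = run_mapreduce(["big data big storage", "data flows"])
# counts["big"] == 2 and counts["data"] == 2
```

Hadoop's value is that it runs these phases in parallel across the cluster, moving the computation to where the data already lives.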
The SDN infrastructure orchestrates access to network resources and servers. Without orchestration it can take weeks to set up new Hadoop clusters, which defeats the purpose of big data. Together, OpenStack and OpenFlow support software defined storage in a way that enables real-time analytics: the SDN infrastructure manages data flows to virtualized servers and cloud resources while enforcing quality of service and bandwidth allocation. Hadoop is a natural fit for this model, handling data access and processing while OpenStack serves as the traffic manager for the virtual resources.
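To make the orchestration idea concrete, a cluster request can be expressed as a declarative spec that an orchestrator turns into VMs, networks, and attached storage (OpenStack's Sahara project provisions Hadoop clusters in a broadly similar way). The field names below are hypothetical, chosen only to sketch the shape of such a request:

```python
def hadoop_cluster_spec(name, workers, flavor, storage_gb):
    """Describe a Hadoop cluster declaratively; an orchestrator
    would turn this into running VMs and attached volumes.
    All field names here are illustrative, not a real API."""
    return {
        "name": name,
        "node_groups": [
            {"role": "master", "count": 1, "flavor": flavor},
            {"role": "worker", "count": workers, "flavor": flavor,
             "volume_gb": storage_gb},   # HDFS storage per worker
        ],
    }

spec = hadoop_cluster_spec("analytics", workers=4,
                           flavor="m1.large", storage_gb=500)
```

The point of the declarative style is that standing up a new cluster becomes a repeatable request measured in minutes, not a weeks-long manual build.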
Getting Started with SDS and Big Data
So to implement software defined storage as part of your big data infrastructure, start slow and build as you go:
- Create a data pool – Identify data storage sources you want to use for big data analytics, including local servers and storage, remote storage clusters, and cloud resources. Develop a hybrid cloud or some other strategy to pool your data sources.
- Set up a software defined storage test in your data center – Choose an SDS provider to help you virtualize your data storage.
- Test the external interfaces – Be sure that the SDS system works with various storage systems, including exporting and importing data, interfacing with cloud storage, and managing storage resources.
- Create and assign policies – Assign policies to virtual machines that take advantage of the features in various storage arrays and resources.
- Build a Hadoop application – Once the software defined storage system is in place, you can write Hadoop applications that take advantage of the virtualized infrastructure. Using Hadoop you should be able to apply OpenStack to optimize the infrastructure for big data analytics.
- Automate provisioning – The SDN infrastructure allows you to automate storage provisioning based on parameters such as utilization and workload. Automated provisioning keeps storage capacity aligned with analytics demand.
- Keep it simple and build on what you learn – Don’t try to automate everything at the start. Begin with the basics, then refine both your SDN controls and your Hadoop code based on what you learn to make the most of available resources.
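As a simple illustration of the automation step above, provisioning can be driven by a utilization policy. The thresholds and pool model here are hypothetical, a starting point you would tune to your own workload:

```python
def provisioning_decision(used_gb, capacity_gb,
                          expand_above=0.80, shrink_below=0.30):
    """Return a provisioning action for a storage pool based on
    utilization. Thresholds are illustrative defaults."""
    utilization = used_gb / capacity_gb
    if utilization > expand_above:
        return "expand"   # add capacity before analytics jobs stall
    if utilization < shrink_below:
        return "shrink"   # release idle capacity to contain cost
    return "hold"

provisioning_decision(850, 1000)  # high utilization, so "expand"
```

Even a rule this simple, run on a schedule against each storage pool, captures the "start slow" spirit: automate one clear decision first, then layer on more parameters as you learn.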
Of course, this is a gross simplification of the process, but it should provide your big data team with an idea of what it takes to apply software defined storage as part of big data.