Security is a major consideration in big data consulting. Part of your job is to identify and consolidate data silos and incorporate new data management systems, such as cloud storage. This means new procedures for the way data is accessed and processed, including new security protocols. Unlike security policies and procedures designed specifically to deal with individual data silos, big data requires a new, holistic approach to security. Part of any big data consulting engagement has to be developing a new security framework to suit a new big data analytics model.
Gartner’s security analysts predict that while big data is booming, by 2016 more than 80 percent of organizations using big data will fail to develop a consolidated data security policy, resulting in potential security breaches, regulatory noncompliance, and financial liability. Securing private and cloud-based data systems will require monitoring and auditing data access using data-centric audit and protection (DCAP) tools that span all data silos.
When the Apache community started developing Hadoop as an open source project, security was not top of mind, so Hadoop lacks tools that address enterprise security, policy enforcement, and regulatory compliance. Vendors accustomed to delivering off-the-shelf database security solutions have started to step in, but they are adapting existing products that were never designed to secure a clustered environment.
By its very nature, big data is designed to look at data as a whole rather than as discrete data repositories. That means nodes and storage in the enterprise and in the cloud work in tandem to appear as one unit, which presents a number of security challenges with Hadoop:
- Securing distributed computing – Central data repositories, even massive ones, are easier to secure than distributed environments. In a big data infrastructure, processing is spread across many nodes, and parallel computing systems present a far larger attack surface.
- Insecure data access – Most database security relies on Role-Based Access Control (RBAC). Hadoop provides access control only at the schema level, with no built-in way to differentiate users by role.
- Fragmented data – Hadoop data is fluid: it is fragmented and replicated across nodes for redundancy. Spreading those data fragments across multiple servers creates yet another security challenge.
- Insecure communications – Node-to-node communication uses the RPC and TCP/IP protocols, which are not inherently secure.
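The access-control gap above is easiest to see in code. The sketch below is purely illustrative (the roles, dataset names, and permission table are hypothetical): a schema-level control grants all-or-nothing access to a store, whereas a per-role check lets you differentiate users before handing out data, which is what add-on authorization layers for Hadoop provide.

```python
# Hypothetical sketch of role-based access control (RBAC).
# Roles and dataset names are illustrative, not from any real system.
ROLE_PERMISSIONS = {
    "analyst": {"web_logs"},                # may read log data only
    "auditor": {"web_logs", "payments"},    # may also read payment records
}

def can_read(role: str, dataset: str) -> bool:
    """Return True if the given role may read the named dataset."""
    return dataset in ROLE_PERMISSIONS.get(role, set())

# Schema-level control would admit or reject everyone identically;
# a role check distinguishes the analyst from the auditor.
print(can_read("analyst", "payments"))
print(can_read("auditor", "payments"))
```

In a real deployment this policy table would live in a central policy store rather than in application code, so that access rules can be audited and updated in one place.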
With a data warehouse, all your sensitive information sits in a central location, so you can secure it with firewalls and intrusion detection software. These approaches, however, offer no protection within a Hadoop cluster. As Forrester notes in its report “Future of Data Security and Privacy: Controlling Big Data,” once a hacker breaks through the perimeter, they have unlimited data access, and that is even truer for clusters. The solution, Forrester suggests, is to place security closer to the data store itself.
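One way to move protection closer to the data, rather than relying on the perimeter, is tokenization: sensitive values are replaced with opaque tokens before they enter the cluster. The sketch below is a minimal illustration, assuming a hypothetical secret key; because the token is deterministic, joins and group-bys still work inside the cluster without ever exposing the plaintext.

```python
import hashlib
import hmac

# Assumption: in practice this key would come from a key-management
# service, never be hard-coded like this.
SECRET_KEY = b"replace-with-managed-key"

def tokenize(value: str) -> str:
    """Replace a sensitive value with a deterministic HMAC-SHA256 token.

    Identical inputs map to identical tokens, so records can still be
    joined or aggregated on the token inside the cluster.
    """
    return hmac.new(SECRET_KEY, value.encode(), hashlib.sha256).hexdigest()[:16]

# Two records with the same SSN produce the same token.
print(tokenize("123-45-6789") == tokenize("123-45-6789"))
```

A keyed HMAC is used rather than a plain hash so that an attacker who obtains the tokens cannot brute-force short identifiers such as Social Security numbers without also stealing the key.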
Adopting a Holistic Security Approach
As part of big data consulting you need to know how to secure the big data repositories. Here are some best practice procedures to consider.
- Risk assessment – Determine how much of the big data actually has to be protected. A survey of Hadoop projects found that most of the data consisted of log files, followed by DBMS files and unstructured data files. Identify the sensitive data and how it needs to be protected, taking privacy policies and compliance requirements into account.
- Locate sensitive data – Use data classification to understand where sensitive data is stored within the enterprise, and put security controls and policies in place to protect it.
- Understand compliance exposure risk – Scrutinize data capture and storage procedures to identify compliance risk. For example, personal data such as names, addresses, and even Social Security numbers can end up in a Hadoop project and will require adequate protection to meet privacy regulations. Credit card information also makes its way into Hadoop data stores, even though the Payment Card Industry Data Security Standard (PCI DSS) recommends that organizations not store it.
- Data masking and encryption – Data masking protects data stored in Hadoop directories by disguising sensitive values with realistic but false information while preserving application logic. Encrypting data at rest and in transit is also a recommended practice.
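The classification and masking steps above can be sketched together. The example below is a simplified illustration, not a production scanner: it flags Social Security numbers by pattern and candidate card numbers by the Luhn checksum, then masks them while keeping the last four digits so downstream applications retain a usable reference.

```python
import re

def luhn_valid(number: str) -> bool:
    """Luhn checksum, used to distinguish real card numbers from noise."""
    checksum = 0
    for i, ch in enumerate(reversed(number)):
        d = int(ch)
        if i % 2 == 1:          # double every second digit from the right
            d *= 2
            if d > 9:
                d -= 9
        checksum += d
    return checksum % 10 == 0

SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")
CARD_RE = re.compile(r"\b\d{13,16}\b")

def mask_line(line: str) -> str:
    """Mask SSNs and Luhn-valid card numbers, keeping the last 4 digits."""
    line = SSN_RE.sub("XXX-XX-XXXX", line)

    def mask_card(match: re.Match) -> str:
        num = match.group(0)
        if luhn_valid(num):
            return "X" * (len(num) - 4) + num[-4:]
        return num  # not a valid card number; leave untouched

    return CARD_RE.sub(mask_card, line)

print(mask_line("SSN 123-45-6789 paid with 4111111111111111"))
```

Real DCAP and masking products use far richer detection (context, dictionaries, validation services), but the principle is the same: find the sensitive values wherever they live, then transform them before they spread through the cluster.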
So when big data consulting includes data security, remember that big data can’t be treated the same way as conventional data stores. You have to develop a security scheme that allows for distributed computing, parallel data processing, and the fluid movement of data that Hadoop applications require. Whenever possible, protect the data itself, not just the repository, and understand where your data risks lie so you can prioritize big data security measures.