As any IT manager knows, when it comes to data center management, the question isn’t if a data center outage will occur, but when. A data center is an interdependent system with many components, and the failure of any one of them can cause downtime. While an outage may be inevitable, you can learn from past system failures and apply those lessons to prevent future ones. As a solution provider, you are in an even better position to evaluate data center outages, because you have experience dealing with multiple data center customers; what you learn at one customer site, you can apply to the next.
CIOs worry about data center outages because they are disruptive and very expensive. The Ponemon Institute keeps track of the cost of data center outages, and those costs are rising. The latest survey of 63 data center operators in 2015 showed that the average cost of a data center outage is up 38 percent since 2010 to $740,357, or about $8,851 per minute. For example, consider the cost of downtime for JetBlue when its Verizon data center suffered a power outage. In this situation, the failover plan failed, which is something a comprehensive data center evaluation should have identified and prevented.
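To put the per-minute figure in perspective for a customer, a rough back-of-the-envelope calculation can help. The sketch below simply multiplies outage duration by the $8,851 per-minute average cited above; the function name and the sample durations are illustrative assumptions, and real costs vary widely by business and rarely scale in a straight line.

```python
# Rough outage cost estimate based on the per-minute average cited above.
# Assumption: cost scales linearly with duration, which is a simplification.
COST_PER_MINUTE_USD = 8_851


def estimate_outage_cost(minutes: float,
                         cost_per_minute: float = COST_PER_MINUTE_USD) -> float:
    """Return an approximate outage cost in US dollars."""
    if minutes < 0:
        raise ValueError("minutes must be non-negative")
    return minutes * cost_per_minute


if __name__ == "__main__":
    # Hypothetical outage durations, in minutes.
    for duration in (5, 30, 90):
        cost = estimate_outage_cost(duration)
        print(f"{duration:>3} min outage: about ${cost:,.0f}")
```

Substituting a customer's own per-minute estimate for the default gives a more meaningful number than the industry average.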
So how do you evaluate for data center outages? Here are three of the most common points of failure that require scrutiny:
1. Failed power – According to Ponemon, the number-one cause of data center outages is failed power. A faulty uninterruptible power supply (UPS) or battery backup accounts for 25 percent of data center outages.
Granted, UPS systems are fairly robust, but in the event of a major power outage, other problems can surface, such as an installed battery backup system that isn’t sufficient to carry the load. Batteries and power generators tend to be the weak points in data center operations, so when a hurricane or major snowstorm hits and the power grid goes down, the backup power system may not be able to handle the added load.
Assess the power failover system to be sure it can deliver sufficient electricity, and perform routine tests on power systems to help prevent a critical outage. Creating power failover systems is also recommended, but every failover system should be assessed for load capacity and tested regularly as part of routine maintenance.
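As a starting point for a load-capacity assessment, the comparison can be reduced to simple arithmetic. The sketch below is a simplified illustration, not a substitute for an engineering review: the class, the function names, the 20 percent headroom and all sample figures are hypothetical, and it ignores factors such as battery aging, power factor and generator start-up time.

```python
from dataclasses import dataclass


@dataclass
class BackupPowerSystem:
    """Hypothetical description of a UPS or battery backup system."""
    name: str
    rated_capacity_kw: float   # maximum load the system can carry
    usable_energy_kwh: float   # energy available from the batteries


def can_carry_load(system: BackupPowerSystem, load_kw: float,
                   headroom: float = 0.2) -> bool:
    """True if the system can carry the load plus a safety margin.

    The 20 percent default headroom is an assumed value for illustration.
    """
    return system.rated_capacity_kw >= load_kw * (1 + headroom)


def runtime_minutes(system: BackupPowerSystem, load_kw: float) -> float:
    """Approximate battery runtime at a constant load (idealized)."""
    if load_kw <= 0:
        raise ValueError("load_kw must be positive")
    return system.usable_energy_kwh / load_kw * 60


if __name__ == "__main__":
    ups = BackupPowerSystem("ups-a", rated_capacity_kw=120,
                            usable_energy_kwh=40)
    for load in (80, 110):  # hypothetical critical loads in kW
        ok = can_carry_load(ups, load)
        print(f"{load} kW load: sufficient={ok}, "
              f"runtime about {runtime_minutes(ups, load):.0f} min")
```

A result like this only flags where to look more closely; the routine tests described above are what confirm that the system actually performs under load.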
2. Cybercrime – Hackers and cybercriminals have become the second-most common cause of downtime. The Ponemon study shows that cybercrime jumped from 2 percent in 2010 to 22 percent last year as the primary reason for data center failure. Distributed denial-of-service attacks were the most prevalent type of security breach.
Of course, security is an ongoing concern for any data center, but the more preventive measures you implement to protect the data center from attack, the less likely an outage becomes. One of the most common strategies is to make sure there are no direct connections to the Internet. Data center staff also should be vetted with regular background checks, and physical security of the data center should be assessed to make sure unauthorized users don’t have access.
3. Human error – Operational errors ranked as the third-greatest cause of data center outages, making up 22 percent of failures. The nature of human error is varied and contributes to other types of failures, so it probably ranks higher than number three. The biggest culprit seems to be management-related failures, which stem from inadequate training, staffing and procedures.
Your customers can implement protocols and procedures to protect their data centers, such as keeping soda cans out of the computer room. Most human errors occur during routine maintenance, so developing strict maintenance procedures is also important. For example, a design flaw could create a single point of failure, which means that a flawed maintenance sequence or a mistake made during maintenance could result in an outage. That’s why most maintenance is performed during off-peak hours.
As a solution provider, you are in an ideal position to assist with assessing protocols and procedures and providing adequate training in order to minimize human errors. You can review the system architecture, looking to identify weak points that require special attention or call for redundant systems. And if your data center customer is short-handed, you can even take on some of the workload with services such as remote monitoring and backup.
Of course, these three points of failure are interrelated, so you can’t address just one concern; you have to look at the data center as a whole. For example, a UPS may fail because it wasn’t tested properly or frequently enough. In that case, a UPS failure, a failure of system protocols and a lack of training all contribute to the same outage.
Although research has identified these three areas (power, cybercrime and human error) as the primary causes of data center outages, they certainly aren’t the only culprits. Because any data center functions as a symbiotic infrastructure, you have to assess it from end to end to identify potential points of failure. It’s your expertise in data center infrastructure as well as specific systems that puts you in the ideal role to be a data center troubleshooter for your customers.