Sep 1 ’10

IT Sense: Teachable Moments From the Gulf Oil Spill

by Editor in z/Journal

While I’m hopeful that, by the time this column goes to press, some resolution will have been found for the broken wellhead that’s currently spewing millions of gallons of crude oil into the Gulf of Mexico each day, I can’t help but see some compelling parallels between the evolution of this disaster and the state of today’s data center. Perhaps we can learn from this tragedy and improve our own thinking about disaster prevention in the digital world.

First, there’s the decision-making that led to the catastrophe aboard the Deepwater Horizon rig. Consensus is growing that poor choices about the safety of the well-capping process set the stage for the disaster. These are characterized as shortcuts taken by BP managers in the quest for profit, a view reinforced by evidence that engineers’ concerns about the sealing of the well were overridden by business managers focused on schedules and production efficiencies.

This interpretation aligns with our general understanding of market dynamics that compel businesses to do whatever they can to reduce costs and improve profits. “Cost containment” and “top-line growth” are two of the three components of Harvard Business Review’s triangular metaphor for business value—the third component being “risk reduction.” In this case, managers—perhaps without even thinking about broader consequences—appear to have preferred improved operational efficiency (top-line growth) to safety (risk reduction). They may well have viewed the likelihood of a calamity as so small that they didn’t adequately weigh the consequences of a low-probability blowout.

Parallel to corporate IT today: In many organizations, those responsible for business continuity and disaster recovery planning have been shown the door in an effort to trim labor costs. Their perceived value to the organization is diminished by the statistic, advanced by Gartner and others, that less than 5 percent of data center outages take the form of catastrophic disasters. In other words, companies doing without any continuity planning capability are, like their counterparts at BP, adopting a risk posture that favors cost containment and improved profit while ignoring the outsized consequences of a low-probability event: a smoke-and-rubble disaster or a severe weather event.
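
To see why that posture is shortsighted, it helps to run the numbers. The sketch below is a back-of-the-envelope expected-loss comparison in Python; every figure in it (the annual disaster probability, the outage costs, the planning budget) is a hypothetical assumption chosen for illustration, not data from Gartner or anyone else.

    # Back-of-the-envelope expected annual loss, with and without a continuity plan.
    # All figures are hypothetical assumptions used only to illustrate the reasoning.
    p_disaster = 0.02            # assumed annual probability of a catastrophic outage
    loss_no_plan = 50_000_000    # assumed loss (USD) with no tested continuity plan
    loss_with_plan = 5_000_000   # assumed residual loss (USD) with a tested plan in place
    plan_cost = 250_000          # assumed annual cost of maintaining the planning capability

    exposure_no_plan = p_disaster * loss_no_plan                  # 1,000,000 per year
    exposure_with_plan = p_disaster * loss_with_plan + plan_cost  #   350,000 per year

    print(f"Expected annual exposure without a plan: ${exposure_no_plan:,.0f}")
    print(f"Expected annual exposure with a plan:    ${exposure_with_plan:,.0f}")

Even at probabilities that round to “almost never,” the arithmetic favors keeping the capability; yet the cost-containment posture persists.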

This is as understandable as it is shortsighted. If and when a big disaster happens, companies lacking current business continuity plans and logistics stand to lose everything. In addition to the impact on shareholder value, the lack of a continuity capability will harm a company’s supply chain partnerships, damage its customers, and disrupt the lives of employees and their families.

A second teaching point from the oil rig disaster: vendor hubris. Undoubtedly, BP was told its blowout preventer was “fail-safe.” This is similar to the claims made by just about every technology vendor regarding its system or storage array, its hypervisor, or its software process for de-duplicating, compressing, or otherwise manipulating a company’s most irreplaceable data asset.

IBM used to claim that Big Iron was bulletproof. Even if there were a fire, a power cutoff followed by sprinklers would resolve the immediate crisis. To my knowledge, such claims are no longer made by Big Blue, but I hear echoes of the same hubris when VMware talks about its Site Recovery Manager, or when Microsoft or Oracle talk about high-availability failover clusters. It gets my back up when I hear software folks insisting that nothing could possibly go wrong with their wares, despite the fact that most software ships to market only 80 percent complete. As one developer recently said, “If you don’t get a ton of complaints back over your version 1.0 release, you shipped it too late.”

Finally, there is the issue of poor regulation helping to set the stage for the oil spill disaster. In short, toothless regulation reduced the emphasis on risk mitigation. Concerns about ongoing oversight and the consequences of non-compliance were insufficient to encourage good safety practices.

In too many business organizations today, we see the same lackadaisical attitude toward disjointed and confusing regulatory requirements for data protection, preservation, and stewardship. This varies from firm to firm and by industry segment, but that only underscores the central point: Absent a coherent set of best practices that non-technical auditors can grasp, data today is at high risk. Tape is treated as a dead or dying technology. Companies are seriously considering cloud woo. Plans for migrating workloads off mainframes and onto x86 virtual servers, while on hold in some firms, are still in the offing. All of these “strategies” ignore the core issue of how to platform data safely and securely, yet they raise no red flags in the corporate governance/risk/compliance office. This is scary, given what we’re seeing in the Gulf of Mexico today.