IT Management

IT Sense: Disaster Recovery Needs to Evolve

There have always been two camps in the business world relative to Disaster Recovery Planning (DRP): those who see the need to protect our most irreplaceable asset—data—and allocate time and resources to do something about it, and those who pay lip service to the need for DRP and make it a priority only once a disaster occurs.

Surveys show that fewer than 50 percent of responding companies claim to have a DR capability. Of those that do, fewer than half have ever tested their plan, which is tantamount to not having one at all.

Too often, DRP comes down to filling a couple of binders with logistical boilerplate that, while it may pass muster with auditors, will prove worthless in any real emergency.

As someone who has helped develop nearly 90 DRPs, and has facilitated implementations of plans following many disaster events over the last 20 years or so, I can make the following observations with some authority. 

The real value of DR planning is the creation of performance-based test objectives. These objectives describe what needs to be done, what is required to do it, and what standards will be used to evaluate the outcome. By sequencing the objectives chronologically, so that the interdependencies among them are respected, you have the ingredients for a test plan.
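
To make the mechanics concrete, here is a minimal sketch in Python of turning objectives with declared interdependencies into a chronological sequence. The objective names, time standards, and field names are hypothetical, not drawn from any actual plan.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical performance-based test objectives. Each records what must be
# done, the standard used to judge the outcome, and its prerequisites.
objectives = {
    "declare-disaster": {
        "standard": "decision logged and teams notified within 1 hour",
        "depends_on": [],
    },
    "recover-network": {
        "standard": "link to the backup site operational within 4 hours",
        "depends_on": ["declare-disaster"],
    },
    "restore-critical-data": {
        "standard": "critical databases restored and validated within 8 hours",
        "depends_on": ["recover-network"],
    },
    "resume-order-entry": {
        "standard": "order entry accepting transactions within 12 hours",
        "depends_on": ["restore-critical-data"],
    },
}

# Sequence the objectives chronologically so that no objective is scheduled
# before the objectives it depends on -- the skeleton of a test plan.
graph = {name: obj["depends_on"] for name, obj in objectives.items()}
for step, name in enumerate(TopologicalSorter(graph).static_order(), start=1):
    print(f"{step}. {name} -- evaluate against: {objectives[name]['standard']}")
```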

The test plan is the real goal of a DRP initiative. With it, you can uncover the real value of DR: rehearsing a cadre of personnel in the activities that must be completed to recover critical business operations in a timely way. The mission of testing is not merely the identification of procedural gaps or the ferreting out of faulty assumptions. That is important information, of course, but it isn’t the real value of testing. Simply put, testing rehearses recovery teams to act rationally in the face of a great irrationality.

Disaster events have their own personalities and nuances, and the plan can’t possibly script for every variable. If there is to be a successful recovery from an unplanned interruption, companies need trained personnel whose wits have been honed through repeated rehearsal and who are able to keep their heads, innovate, and overcome obstacles.

It also helps if you have a recoverable infrastructure. In the mainframe world, the infrastructure is highly predictable and codified. The same can’t be said of open systems environments, unless someone has taken recoverability requirements into account when selecting gear and designing software. Unfortunately, this is rarely the case.

So-called n-tier client/server computing is too often a major headache from a recovery standpoint. When messaging between multiple tiers of server and storage infrastructure is “hard coded” into platforms, simply replicating that infrastructure at a backup site can be a significant undertaking. Maintaining such a recovery platform over time, as configuration changes occur in the production environment, is doubly painful.

Most of the underlying challenges to recoverability could have been avoided if designers had simply used message-oriented middleware. Why they didn’t usually boils down to this: nobody asked them to.
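
As a sketch of the design point rather than of any particular product: when tiers address a logical queue on a broker instead of a hard-coded peer, failing over to a backup site means repointing one broker endpoint rather than re-creating the wiring on every platform. The broker stub, queue name, and endpoint below are all hypothetical.

```python
from collections import defaultdict, deque

class Broker:
    """Stand-in for a message-oriented middleware broker (not a real product API)."""

    def __init__(self, endpoint):
        self.endpoint = endpoint          # production or backup-site broker address
        self._queues = defaultdict(deque)

    def publish(self, queue, message):
        self._queues[queue].append(message)

    def consume(self, queue):
        return self._queues[queue].popleft() if self._queues[queue] else None

# Tiers bind to a logical queue name, never to one another's hostnames, so the
# only thing that changes at the backup site is a single configuration entry.
config = {"broker_endpoint": "broker.backup-site.example"}   # hypothetical value
broker = Broker(config["broker_endpoint"])

broker.publish("orders", "order-1001")    # web tier hands off work
print(broker.consume("orders"))           # application tier picks it up
```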

The key to successful business recovery, and also the most common cause of post-event recovery delays, is data recovery. Data is unique as a recovery target: It can’t be replaced or substituted; it must be made redundant. You need to make a copy of data—on tape or disk—and restore the copy to a usable state, or recovery will be impossible.
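
A minimal sketch of that requirement in Python, with hypothetical file paths: make the copy, bring it back, and prove the restored copy matches the original before counting on it.

```python
import hashlib
import shutil
from pathlib import Path

def sha256(path: Path) -> str:
    """Checksum a file so a restored copy can be compared with its original."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def copy_and_verify(source: Path, backup: Path, restore: Path) -> bool:
    """Make a redundant copy, restore it, and verify the restored copy is usable."""
    shutil.copy2(source, backup)      # the copy (disk here; tape works the same way)
    shutil.copy2(backup, restore)     # the restore
    return sha256(source) == sha256(restore)

# Hypothetical paths -- the point is the verification step, not the media:
# copy_and_verify(Path("orders.db"), Path("backup/orders.db"), Path("restore/orders.db"))
```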

Also, you don’t need to give all data the same level of protection. Only data that supports business-critical applications needs to be restored rapidly in an emergency; the rest can wait. The trick is to identify this subset and to apply the right tools and rules to ensure that a valid copy can be brought online within your company’s time-to-data requirements.

Defining business-critical data is an onerous task, but it’s also a fruitful one. For one thing, the same techniques and methods used to identify and segregate business-critical data are also used to identify and classify data that is subject to regulatory retention and deletion requirements. Moreover, classifying data is the first step toward optimizing storage utilization efficiency. So, a combined data management initiative can kill three birds with one stone.
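
A minimal sketch of that “three birds” point, with hypothetical class names, time-to-data targets, retention periods, and tiers: a single classification decision can drive recovery priority, regulatory retention, and storage placement at once.

```python
from dataclasses import dataclass

@dataclass
class DataClass:
    name: str
    time_to_data_hours: int   # how quickly a valid copy must be back online
    retention_years: int      # regulatory retention/deletion requirement
    storage_tier: str         # where the data should live day to day

# Hypothetical policy table: one classification serves DR, compliance, and
# storage utilization at the same time.
POLICY = {
    "business-critical": DataClass("business-critical", 8, 7, "replicated disk"),
    "important": DataClass("important", 72, 7, "disk"),
    "archival": DataClass("archival", 720, 10, "tape"),
}

def classify(supports_critical_app: bool, regulated: bool) -> DataClass:
    """Crude illustrative rule: criticality drives recovery speed, regulation drives retention."""
    if supports_critical_app:
        return POLICY["business-critical"]
    return POLICY["important"] if regulated else POLICY["archival"]

print(classify(supports_critical_app=True, regulated=True))
```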

This last point is critical: By casting your DR plan in the broader context of data management, you are more likely to get funding than if you were pitching DRP and preparedness alone. Management wants a full-fledged business case for IT programs and initiatives. DRP doesn’t have one; it’s only risk reduction. Data management, by contrast, serves many goals, including cost savings (managing data will help forestall new storage acquisitions), risk reduction (disaster avoidance and regulatory compliance), and process improvement (better application performance).

DRP needs to evolve into a broader data management context. The time to begin is now.