IT Management

When I began my career in IT nearly 30 years ago, one of the first things I was taught was that we were stewards of irreplaceable data assets: tasked with keeping them available to business managers and applications, and personally responsible for seeing that they were managed, stored and protected as an integral part of IT strategy. I took Disaster Recovery (DR) planning seriously, which helps explain all the writing I’ve done on the subject over the past several years. It also explains why I’m increasingly irritated when I chat with mainframers (not administrators of x86 TinkerToy servers, but mainframers!) who’ve started buying into vendor nonsense about the end of DR planning.

I had my morning coffee today over a phone call with Rebecca Levesque of 21st Century Software, whose company develops and markets enterprise DR solutions, just to compare notes on what we were seeing in the mainframe shops we visit. She observed that little is being said about the pitfalls of the current myopic focus on high availability at the expense of DR. As she said, “It’s all about ‘resiliency’; no one will talk about ‘DR.’”

Like me, she hears mainframers talk increasingly about how the world has changed: “We’re a VSAM environment now and we run just about everything in cache.” So, there’s no need for DR, these folks say, as they mirror or replicate disk-based data continuously and plan to fail over to a mirrored environment somewhere. (No mention is made of what’s done about all the data in cache, by the way.)

Hmmm ... That’s the sort of silliness I’ve frankly gotten used to hearing from the leading purveyor of hypervisor hype. But now it seems the theme has become a crossover comic book that’s finding its way from many vendors into Big Iron accounts, too.

The complete story goes something like this: “Resiliency” has become the new meme, replacing “recoverability.” On its face, there’s nothing wrong with that: I’ve always wanted the capability to recover a mission-critical application to be built in rather than bolted on. But the rhetoric in this case is “code” for data mirroring (between two storage crates within a fabric or Local Area Network [LAN]) or replication (between two crates across a Wide Area Network [WAN]), strategies that sell a lot of proprietary storage gear, since multiple crates of disk drives bearing the same vendor’s logo must be purchased to make hardware mirroring and replication work. Data must be replicated synchronously between these crates, with minimal data deltas over distance (a nifty trick), for failover to become even close to possible.
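To make the trade-off concrete, here is a minimal, purely illustrative sketch in Python. The class names and in-memory dictionaries are my own inventions; it models no vendor’s actual replication feature. It simply shows why a synchronous pair has nothing to lose at failover time while an asynchronous pair leaves a data delta behind.

```python
# Purely illustrative: a toy model of synchronous vs. asynchronous replication.
# Class names and the in-memory dictionaries are hypothetical; no real storage
# controller or vendor replication feature is being modeled here.

class Volume:
    def __init__(self, name):
        self.name = name
        self.blocks = {}  # writes that have actually landed on this copy


class SynchronousPair:
    """The application's write is acknowledged only after BOTH copies have it."""
    def __init__(self, primary, remote):
        self.primary, self.remote = primary, remote

    def write(self, key, value):
        self.primary.blocks[key] = value
        self.remote.blocks[key] = value    # must complete before the ack
        return "ack"


class AsynchronousPair:
    """The write is acknowledged locally; the remote copy lags behind."""
    def __init__(self, primary, remote):
        self.primary, self.remote = primary, remote
        self.pending = []                  # the data delta between the two sites

    def write(self, key, value):
        self.primary.blocks[key] = value
        self.pending.append((key, value))  # shipped across the WAN later
        return "ack"

    def drain(self):
        while self.pending:
            key, value = self.pending.pop(0)
            self.remote.blocks[key] = value


sync = SynchronousPair(Volume("local"), Volume("remote"))
sync.write("rec0", "ledger update")
print(sync.remote.blocks)   # {'rec0': 'ledger update'} -- nothing to lose

# Failover can only recover what actually reached the remote crate:
pair = AsynchronousPair(Volume("local"), Volume("remote"))
pair.write("rec1", "payroll update")
pair.write("rec2", "inventory update")
# Disaster strikes before drain() runs, so the remote copy is still empty.
print(pair.remote.blocks)   # {} -- the unshipped delta is the data you lose
```

Synchronous replication closes that gap, but only over distances short enough that the round-trip latency added to every write is tolerable, which is exactly the nifty trick the marketing slides tend to gloss over.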

Only, as Rebecca is fond of saying, mirrored failover is merely a “controlled disaster.” Failover may work OK, but data is usually lost in the process.

VSAM runs heavily in cache, and cached data is usually lost when an “unmanaged failover” (aka a real disaster) occurs. So, too, is the data in batch processes and applications, which exist in virtually every environment to one degree or another. Moreover, a failover process requires a high degree of consistency between local and remote data, and most strategies fail to take into account “data in flight” (that is, data queued or already moving across a wire from rig to rig), most of which may be lost in the event of a catastrophic interruption.
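The cache problem deserves its own illustration. Another deliberately simplified sketch follows; the names are invented and this is not VSAM internals or any controller’s caching algorithm. The point is that writes acknowledged into a write-back cache have not reached the disk yet, so even a perfect mirror of that disk is a perfect mirror of stale data.

```python
# Purely illustrative: acknowledged writes sitting in a write-back cache never
# reach the mirrored disk if the box dies before destage. Names are made up;
# this is not VSAM, DFSMS or any real controller's caching logic.

class MirroredDisk:
    def __init__(self):
        self.local = {}
        self.remote = {}

    def destage(self, key, value):
        self.local[key] = value
        self.remote[key] = value   # hardware mirroring keeps the copies in sync


class WriteBackCache:
    def __init__(self, disk):
        self.disk = disk
        self.dirty = {}            # acknowledged but not yet written to disk

    def write(self, key, value):
        self.dirty[key] = value    # the application gets its ack right here
        return "ack"

    def flush(self):
        for key, value in self.dirty.items():
            self.disk.destage(key, value)
        self.dirty.clear()


disk = MirroredDisk()
cache = WriteBackCache(disk)
cache.write("order-1001", "shipped")
cache.write("order-1002", "billed")
# Power fails before flush(): the mirror is intact, consistent -- and stale.
print(disk.remote)  # {} -- neither update survives the "unmanaged failover"
```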

In other words, hardware mirroring, data replication and failover sound good in principle and may even work well under test conditions, where variables are controlled, but they often don’t work in practice. We tend not to test application restart procedures. We think schedulers are magical and can get all processes back to normal, even when we have incorrect restart points and batch operations that stopped in mid-processing and have no way to come back up.
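What an actual restart point buys you is easy to show. The sketch below is a generic checkpoint/restart pattern written in Python for illustration only; the file name, record layout and checkpoint interval are all invented, and real batch restart on the mainframe involves the scheduler, the job’s commit scope and every data set the step touched.

```python
# Purely illustrative: a generic checkpoint/restart pattern for a batch step.
# The checkpoint file, interval and record handling are hypothetical; this is
# not how any particular scheduler or z/OS restart facility works.

import json
import os

CHECKPOINT = "step010.ckpt"   # invented name for this example

def load_checkpoint():
    """Return the last safely recorded record number, or 0 on a cold start."""
    if os.path.exists(CHECKPOINT):
        with open(CHECKPOINT) as f:
            return json.load(f)["last_committed"]
    return 0

def save_checkpoint(record_no):
    with open(CHECKPOINT, "w") as f:
        json.dump({"last_committed": record_no}, f)

def process(record):
    pass                      # stand-in for the real business logic

def run_step(records):
    start = load_checkpoint() # a known restart point, not a guess
    for i, record in enumerate(records, 1):
        if i <= start:
            continue          # already processed before the interruption
        process(record)
        if i % 100 == 0:
            save_checkpoint(i)  # every 100 records, persist how far we got
    save_checkpoint(len(records))

run_step([f"record-{n}" for n in range(1, 1001)])
```

Without something like that recorded restart point, a scheduler can resubmit the job after a failover, but it has no idea which records were already applied and which were lost mid-stream.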

These limitations are spelled out fairly well by Big Blue. And, my fellow mainframers, it would be wise to read your Redbooks instead of spending so much time perusing vendor marketing materials. Better yet, take a look at products that can tell you what’s actually going on in your environment before, during and after an interruption event.

Not long ago, an IT manager proudly proclaimed to me that his mainframe, supported by six or seven people, hadn’t experienced any downtime in several years. By contrast, his x86 server environment, consisting of hundreds of boxes supported by hundreds of staff, was down at least once a day. Do we really want to race to the bottom by protecting mainframe data no more carefully than we protect those boxes? Failing to do a good job of data stewardship will land us there for sure, via one or more data disasters.