Operations

In the coming days, I’ll be heading to the SHARE conference that’s taking place in Pittsburgh to give a talk on disaster recovery (DR) and agile data centers. Basically, I intend the talk to be a gentle nudge to the newbies, encouraging them to stop believing the nonsense they’re reading about cloud, the software-defined data center, Infrastructure as a Service (IaaS), etc., and their supposed high-availability architectures that “negate” the need for traditional continuity planning. I’ve been seeing this meme about “the death of DR” pop up in a lot of trade press articles and analyst reports of late, and it’s beginning to grind my gears.

Look, agility—the supposed objective of software-defined, IaaS and the rest—is a great goal, one toward which IT should always strive. The ability to turn on a dime in response to changing business requirements is a noble ideal. I mean, who doesn’t want to build a data center service capable of delivering better support than customers themselves can even envision? While we’re at it, who doesn’t want to support user mobility, providing Joe or Josephine in sales with secure access to data and applications wherever they’re meeting with their customer prospects? And, of course, integrity, security, resiliency and availability should be more than words on motivational posters; they should be our mission every day we come to work in our IT shops.

So, I’m down with the whole agility thing. The question is why we insist on conflating it with an architectural model that, while portrayed as new and evolutionary, is actually quite de-evolutionary. For example, we’ve always had the ability to replicate data across a wire between two or more identical (or more-or-less similar) storage devices. Over a short distance, this is called mirroring; over a longer distance, replication. And we can optimize the replication process so we don’t need to wait for a lot of data to amass at point A before copying it to point B. This process is key to supporting high-availability clustering, especially the active-active variety, where two production sites share workload until one or the other fails and all workload continues to be processed by the surviving site.
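Conceptually, the mechanics are simple enough. Here’s a minimal Python sketch of what “not waiting for data to amass” means in an active-active pair; the site objects, queue and block IDs are hypothetical stand-ins, not any vendor’s replication engine:

```python
import queue
import threading

class Site:
    """A toy storage endpoint: just a dictionary of block ID -> data."""
    def __init__(self, name):
        self.name = name
        self.blocks = {}

    def write(self, block_id, data):
        self.blocks[block_id] = data

def replicate(outbound, target, stop_event):
    """Drain writes from the source's outbound queue and apply them at the
    target as they arrive -- no waiting for a batch to amass at point A."""
    while not stop_event.is_set() or not outbound.empty():
        try:
            block_id, data = outbound.get(timeout=0.1)
        except queue.Empty:
            continue
        target.write(block_id, data)

# An active-active pair: site A forwards each write to site B as it happens.
site_a, site_b = Site("A"), Site("B")
a_outbound, stop = queue.Queue(), threading.Event()
worker = threading.Thread(target=replicate, args=(a_outbound, site_b, stop))
worker.start()

site_a.write("blk-001", "payroll record")      # local write at A...
a_outbound.put(("blk-001", "payroll record"))  # ...queued immediately for B

stop.set()
worker.join()
print(site_b.blocks)   # {'blk-001': 'payroll record'} -- B already holds A's data
```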

But note the subtle and all-important dependency here. For high availability to work, you need sound data replication going on continuously. That’s before you noodle out the clustering of systems and the programming for detecting failures and failing over workload. In the mainframe space, we seem to think that using products such as IBM’s TS7700 Virtualization Engine clustered in a grid configuration is all we need to do to guarantee continuity of operations. Or, we say we’ll reconstruct jobs using the scheduler following any interruption. See? No more need for that expensive DR planning stuff we hope we’ll never need to use.
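The failover logic itself is the easy part; a rough sketch (hypothetical site names, a made-up heartbeat timeout) is only a few lines. Everything it assumes—that the surviving site already holds current, replicated copies of the data—is where the real work lives:

```python
import time

def is_alive(last_heartbeat, now, timeout=5.0):
    """A site is presumed up only if it has been heard from within the timeout."""
    return (now - last_heartbeat) <= timeout

def fail_over(workload, sites, heartbeats, now):
    """Route all workload to the surviving sites. Only meaningful if those
    sites already hold current, replicated copies of the data."""
    survivors = [s for s in sites if is_alive(heartbeats[s], now)]
    if not survivors:
        raise RuntimeError("no surviving site -- time to invoke the DR plan")
    return {job: survivors[0] for job in workload}

now = time.time()
heartbeats = {"site_a": now - 30.0,   # silent for 30 seconds: presumed failed
              "site_b": now}
print(fail_over(["BATCH01", "BATCH02"], ["site_a", "site_b"], heartbeats, now))
# {'BATCH01': 'site_b', 'BATCH02': 'site_b'} -- all workload continues at B
```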

The problem is, however, that folks don’t read the fine print. IBM Redbooks on the TS7700 state pretty clearly that Rewind Unload (RUN) consistency points don’t provide access to the data of incomplete jobs, those still in the process of completing when the interruption occurs. We’re also warned about vulnerabilities created by deferred copy consistency points (data waiting to be written remotely) or when the product is run with no consistency points at all. The same goes for those relying on the scheduler to automatically recover an environment. The truth is, the scheduler is often unaware of the data copies used for restore, and without explicit operator intervention the result will be incorrect restart points and failed application restarts.
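In practice, that means recovery has to reconcile what the scheduler thinks finished with what was actually copied before the interruption. A hedged sketch of that reconciliation follows; the job records, data set names and copy log are invented for illustration and aren’t TS7700 or scheduler interfaces:

```python
from datetime import datetime, timedelta

def restartable_jobs(jobs, copy_log, interruption_time):
    """Decide which jobs can safely be restarted at the recovery site.
    A job qualifies only if it completed before the interruption AND its
    output data set was actually copied (not still sitting in a deferred queue)."""
    safe, at_risk = [], []
    for job in jobs:
        copied_at = copy_log.get(job["dataset"])
        if job["ended"] and job["ended"] <= interruption_time \
                and copied_at and copied_at <= interruption_time:
            safe.append(job["name"])
        else:
            at_risk.append(job["name"])
    return safe, at_risk

t0 = datetime(2014, 7, 1, 2, 0)                       # the moment of the outage
jobs = [
    {"name": "PAYROLL1", "dataset": "PAY.MASTER", "ended": t0 - timedelta(minutes=10)},
    {"name": "BILLING2", "dataset": "BILL.DAILY", "ended": None},   # still in flight
]
copy_log = {"PAY.MASTER": t0 - timedelta(minutes=5)}  # BILL.DAILY copy was deferred
print(restartable_jobs(jobs, copy_log, t0))
# (['PAYROLL1'], ['BILLING2']) -- the in-flight job needs operator attention
```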

These are just examples of a broader point: agility depends on availability first and foremost. Without protecting systems and data, your quest for greater agility is a precarious one. Availability doesn’t equal two copies. Those two copies need to be roughly synchronous and constantly refreshed. You need to know that you’re copying the right data (output data isn’t enough; you also need configuration data, copies of application software and middleware, etc.) and that you’re copying the data to and from the right locations (storage admins may move data around without telling the DR folks, resulting in lots of replicated “blank space” if you don’t check). And you also need to test things so you can gauge the real-world amount of time it’s going to take to restart recovered apps at Site B. These basics don’t go away just because you’re playing around in the software-defined or cloud sandbox.
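One of those basics, checking that the replication configuration still matches what the storage admins actually have cataloged, is easy enough to automate in principle. A toy sketch, with made-up data set names:

```python
def audit_replication(catalog, replication_map):
    """Compare what is actually cataloged against what the replication jobs
    are configured to copy. Anything cataloged but not replicated is
    unprotected; anything replicated but no longer cataloged is likely
    'blank space' being copied for no reason."""
    cataloged, replicated = set(catalog), set(replication_map)
    return {
        "unprotected": sorted(cataloged - replicated),
        "blank_space": sorted(replicated - cataloged),
    }

catalog = ["PROD.APP.LOADLIB", "PROD.DB.CONFIG", "PROD.OUTPUT.DAILY"]
replication_map = ["PROD.OUTPUT.DAILY", "OLD.VOLUME.MOVED"]   # stale config
print(audit_replication(catalog, replication_map))
# {'unprotected': ['PROD.APP.LOADLIB', 'PROD.DB.CONFIG'],
#  'blank_space': ['OLD.VOLUME.MOVED']}
```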

While we’re on the subject, there’s one more nit to pick: using resiliency synonymously with availability or recoverability. Resiliency is a key concept in the agile movement, but it isn’t a precise term. IBM researchers have tried to hang some metrics on resiliency to measure it better. The document titled “Quantifying Resiliency of IaaS Cloud” (available at http://mdslab.unime.it/documents/IBM_duke_Cloud_resiliency.pdf) makes it clear that resiliency is a measure of the responsiveness of the software-defined data center to changes in demand and capacity. So, resiliency seems to be more about the process for allocating resources to changing workloads in an efficient manner than it is about the availability of the infrastructure or the data it processes.
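To make the distinction concrete, here’s a toy illustration (my own simplification, not the metric defined in the paper): a resiliency-style figure measures how quickly the environment reacts to a change in demand, while an availability figure measures how much of the time the service was actually usable.

```python
def provisioning_response_time(requested_at, ready_at):
    """A resiliency-flavored metric: average time from a capacity request
    to that capacity actually being available."""
    return sum(r - q for q, r in zip(requested_at, ready_at)) / len(requested_at)

def availability(uptime_hours, total_hours):
    """An availability metric: fraction of time the service was usable."""
    return uptime_hours / total_hours

# An environment can score well on one and poorly on the other.
print(provisioning_response_time([0, 10, 20], [2, 11, 23]))  # 2.0 time units to react
print(availability(700, 720))                                # ~0.972, about 97 percent uptime
```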

I only make this point because I’ve been witnessing a lot of otherwise knowledgeable folks struggling to reconcile the high availability story of software-defined and agile with the need for efficient resource management—two very different challenges.

At the end of the day, the “agile data center” may sound new and visionary, but it’s what we’ve always (or should have always) been seeking. We want to deliver services exactly when they’re needed, and with the greatest economy and resiliency we can muster. Availability is also key, but you will still need the safety net of DR planning because things can and will go wrong.