Operations

Mar 13 ’13

Two vs. Three Data Centers

Having a third or even fourth data center that operates outside an enterprise’s immediate geographical area is extra insurance against regionwide disasters, but it also carries trade-offs in DR performance and in the cost of operating additional data center facilities.

Two data center DR within a single geographic or metropolitan region, where the data centers are relatively close together, comes with lower facility and staffing costs than a strategy that uses more than two data centers. It’s also possible to replicate data between the two sites synchronously, which avoids data loss entirely: a write isn’t complete until both sites hold the data, rather than trailing behind in the asynchronous, periodic update mode of a distant third data center. In the intra-region, two data center model, the server writes to disk and the data is hardened on the primary disk subsystem; the primary subsystem sends the data to the secondary disk subsystem, which replies to the primary; and the primary subsystem then replies to the server. Because there is continuous access to data with synchronous disk replication, both the RTO and RPO are zero, even if a disk subsystem fails.
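To make that sequence concrete, here is a minimal sketch in Python of the acknowledgment chain. The class and method names are illustrative only and don’t correspond to any vendor’s implementation.

    # Minimal sketch of the synchronous write path described above.
    class DiskSubsystem:
        def __init__(self, name, peer=None):
            self.name = name
            self.peer = peer      # secondary subsystem, if this is the primary
            self.blocks = {}      # "hardened" data

        def write(self, block_id, data):
            self.blocks[block_id] = data          # 1. harden the data locally
            if self.peer is not None:
                self.peer.write(block_id, data)   # 2. forward to the secondary and wait for its reply
            return "ack"                          # 3. only then acknowledge to the caller

    secondary = DiskSubsystem("site-B")
    primary = DiskSubsystem("site-A", peer=secondary)

    # The server's write returns only after both copies are hardened,
    # which is why a synchronous pair can deliver an RPO of zero.
    assert primary.write(42, b"payroll record") == "ack"
    assert secondary.blocks[42] == b"payroll record"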

In cases where the two data centers are farther apart, a data latency factor of approximately 1 millisecond for each 100 km of distance between sites (a direct function of the speed of light through fiber) enters in. The geographical distance gives the enterprise protection against any event that impacts a specific data center, but the latency induced by distance diminishes performance. That is why many enterprises have adopted an active-active configuration (i.e., where systems in two of their data centers run actively and in parallel with each other), typically with the two data centers separated by 20 km or less; failover is then effectively seamless, with an RTO near zero. This sharply contrasts with an RTO of roughly one hour for data centers that run in active-standby mode (where one system is active and the other sits in a standby “wait” mode, activated only when DR/failover is needed). RPO remains zero in either configuration because replication is still synchronous.
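As a rough illustration of how that rule of thumb plays out, the short sketch below simply applies the article’s approximation of 1 ms per 100 km to a few hypothetical distances:

    # Rough effect of distance on synchronous replication, using the
    # ~1 ms of added latency per 100 km rule of thumb cited above.
    MS_PER_100_KM = 1.0

    def replication_penalty_ms(distance_km: float) -> float:
        """Extra latency added to every synchronous write, in milliseconds."""
        return distance_km / 100.0 * MS_PER_100_KM

    for km in (20, 100, 300):
        print(f"{km:>4} km apart -> ~{replication_penalty_ms(km):.1f} ms added per write")
    # 20 km (typical active-active spacing) adds ~0.2 ms; 300 km adds ~3 ms.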

In a three data center configuration, the immediate issues are distance, latency and the fact that the asynchronous data updates common over long distances will introduce some data loss. To circumvent this within a metropolitan region, where distances aren’t great, GDPS Metro/z/OS Global Mirror (MzGM) or GDPS Metro/Global Mirror (MGM) can be used. Both deliver an RTO of minutes when sites are run in active-active mode and an RTO of less than one hour in active-standby mode. In all cases, RPO is zero.

However, because large spans of distance require asynchronous data replication, some data loss must be accepted. The good news for global enterprises is that IBM reports successful testing of asynchronous update processing between sites as far as 12,000 km apart, and commercial deployments at data centers between 4,000 and 5,000 km apart. In other words, while some data loss must be managed, there are effectively no distance constraints on asynchronous change updates. IBM says it has one global client with nearly 9,000 km between its in-territory and outside data centers.

What About the Data Loss?

This is where sites must weigh the cost of losing business against the level of IT investment they want to make in DR and failover. If you’re a manufacturer that can afford to operate for up to several days in a manual mode without a system, instantaneous data recovery may be less of an issue. But if you’re an active online business with more than 2,000 transactions coming through your system every second, you have to ask what it costs to lose 4,000 transactions when your DR data loss exposure is 2 seconds, and whether it makes sense to invest so you can cut that exposure window in half.
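As a back-of-the-envelope sketch of that trade-off (the per-transaction cost below is purely a placeholder; substitute your own figure):

    # Data loss exposure from the article's example.
    tx_per_second = 2_000      # transaction rate from the article
    rpo_seconds = 2.0          # asynchronous data loss exposure window
    cost_per_tx = 50.0         # hypothetical average value of a lost transaction

    lost_tx = tx_per_second * rpo_seconds               # 4,000 transactions
    exposure = lost_tx * cost_per_tx
    halved = tx_per_second * (rpo_seconds / 2) * cost_per_tx

    print(f"Exposure at {rpo_seconds:.0f}s RPO: {lost_tx:,.0f} transactions (~${exposure:,.0f})")
    print(f"Exposure at {rpo_seconds / 2:.1f}s RPO: ~${halved:,.0f}")
    # Compare the difference against the cost of the extra DR investment
    # needed to halve the exposure window.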

Key Decision Points

Determining a long-term data center strategy for DR isn’t always easy. Conventional practice has always centered on a two data center backup and recovery strategy. Only recently have enterprises begun to bring up multiple DR sites in different geographies so they can assure nonstop processing in a global economy.

These are the questions most sites want to ask:

• How do you best manage your risks? If your commitment to your stakeholders is continuous uptime, the most popular data center strategy is one kept within the metropolitan region in which your enterprise operates. Especially if you can maintain reasonable distances between data centers, the technology is there for you to operate in active-active mode so that system failover is seamless and no one but IT knows the difference. But if you’re a global enterprise and you determine the risks of keeping all data centers in a single metropolitan region are too great, a backup data center in an area remote from your headquarters can make good business sense. Many enterprises try to get the best of both worlds: they maintain zero RTO/RPO by keeping redundant data centers within a proximate metropolitan region, and use a “Plan C” third (or even fourth) data center in a remote geographical area that’s updated asynchronously.
• How long can you stay offline? A toothpaste manufacturer might be able to accept up to one week of downtime in its manufacturing facility, but a financial services company can’t. Best-in-class enterprises invest to assure their DR and failover solutions meet the needs of customers, investors, auditors, regulators, managers and the board.
• Do you toggle? This is an emerging trend. Enterprises are toggling production between two or more sites on a quarterly or semi-annual basis. A planned strategy for regularly migrating production gives confidence that moving it in an unplanned situation will also work.

Summary

There’s no question that more enterprises will strongly consider multiple (i.e., three or more) data center options as they continue to scale out IT to support global enterprise presence. Zero RPO and RTO times with systems running in parallel for seamless failovers will be the order of the day. However, if the region your parallel data centers are in gets hit with a major disaster, it will be equally reassuring to have a data center in a distant location that can be “up and running”—even if it can only be run with asynchronous updates that stretch out RTO and RPO.

Sites will also begin to take a closer look at the new automation built into DR and failover. Today, this automation notifies IT of impending failover events and recommends the next set of actions to take. It is also capable of completely failing over a system, based on business rule sets and parameters, without IT intervention. In the future, IT might take advantage of this capability for true “lights out” disaster recovery. But for now, there’s too much at stake for high-level business and IT managers to give up “pressing the button” themselves.
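As a sketch of what such rule-driven automation might look like, the example below distinguishes a notify-and-recommend posture from a fully automatic one; the rules, thresholds and names are hypothetical, not any product’s API.

    # Illustrative rule-driven failover decision, not any vendor's implementation.
    from dataclasses import dataclass

    @dataclass
    class SiteHealth:
        name: str
        disk_ok: bool
        network_ok: bool
        heartbeats_missed: int

    def next_action(primary: SiteHealth, auto_failover_allowed: bool) -> str:
        """Notify, recommend or act, depending on the business rule set."""
        healthy = primary.disk_ok and primary.network_ok and primary.heartbeats_missed == 0
        if healthy:
            return "no action"
        if not auto_failover_allowed:
            # Today's common posture: surface the event and let IT press the button.
            return "notify operations and recommend failover to the secondary site"
        # "Lights out" posture: fail over without human intervention.
        return "initiate failover to the secondary site"

    print(next_action(SiteHealth("site-A", disk_ok=False, network_ok=True,
                                 heartbeats_missed=3), auto_failover_allowed=False))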

 
