Operations

Are three data centers better than two for Disaster Recovery (DR)? With technologies such as Geographically Dispersed Parallel Sysplex (GDPS) facilitating near-continuous availability, more organizations are aiming for uninterrupted uptime. The rule of thumb used to be simple failover within a single data center, backed by a hot site or cold site and relatively relaxed Recovery Time Objectives (RTOs). Now more enterprises are looking at two separate data centers and, in some cases, a three data center DR and failover model that minimizes or eliminates RTO lag time altogether. Let’s consider these newer DR data center models and review the critical decision points for sites as they chart the role of data centers in their DR strategies.


Setting the Stage for Multiple Data Centers

First, let’s look at GDPS/Peer-to-Peer Remote Copy (PPRC). This is the continuous availability solution that most sites with zEnterprise mainframes use. Central to GDPS is HyperSwap software technology that swaps storage and large numbers of devices quickly so there’s minimal impact to application availability.

When sites first began setting up sysplexes encompassing two different data centers for DR and failover, they were limited as to how much distance could separate the data centers when using fiber optic cable. In the early GDPS/PPRC days of the late ’90s, for instance, the Enterprise Systems Connection (ESCON) limit was 20 kilometers (km). Two years later, that distance expanded to 40 km and then to 100 km. Today, thanks to technological advances and changes to communications protocols, organizations that employ multiple sites in a metropolitan area can maintain up to 200 km between data centers for DR and failover.

This expansion potential is significant, as some industries mandate minimum distances that must be maintained between data centers to protect data and IT infrastructure. The expanded kilometer ranges now enable most organizations using multiple data centers for DR and failover within a single metropolitan region to comfortably attain regulatory compliance. If the enterprise is located in an area not likely to be subject to events generating regional outages (e.g., hurricanes, earthquakes), a two data center DR and failover configuration in a single metropolitan area, or in some cases, a dual system DR solution contained within a single physical data center, can suffice.

According to IBM, most of its enterprise clients that have adopted a three-site strategy also keep two of the three sites close together in this way. Forty percent place two of their three data centers in separate facilities within the same metropolitan region, within a 40-km radius of each other; 60 percent establish two discrete logical data centers within the same building, either on different floors or on the same floor with a firewall separating them.

This enables them to deploy GDPS/PPRC, which uses synchronous disk replication to avoid data loss between the two data centers within their immediate geographical region. At the same time, more enterprises are starting to expand the distances between their primary data centers and their third data center to beyond 100 km. Typically, synchronous disk replication can’t be used at this distance due to the signal latency impact, but asynchronous data replication gives the enterprise a DR and failover alternative that’s outside its immediate service area, and that can be activated in the event of a regionwide disaster. This works very well for companies serving different global geographies, or for those whose geographical regions are subject to weather, power grid or other disruptions that could potentially bring down all their computing in a broad area.
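
The latency constraint can be made concrete with a rough calculation. The sketch below is a minimal illustration, not a sizing tool. It assumes light travels through optical fiber at roughly 200,000 km per second, counts only propagation delay (no switching, protocol or storage overhead), and treats the 100-km figure as a configurable cutoff. The function names are hypothetical.

```python
# Assumption: light in optical fiber covers roughly 200,000 km per second,
# about two-thirds of its speed in a vacuum.
FIBER_KM_PER_SECOND = 200_000.0


def round_trip_delay_ms(distance_km: float) -> float:
    """Propagation delay, in milliseconds, for one write and its acknowledgment."""
    if distance_km < 0:
        raise ValueError("distance_km must be non-negative")
    return 2 * distance_km * 1000 / FIBER_KM_PER_SECOND


def suggested_replication(distance_km: float, sync_limit_km: float = 100.0) -> str:
    """Pick a replication mode from distance alone (illustrative cutoff)."""
    return "synchronous" if distance_km <= sync_limit_km else "asynchronous"


if __name__ == "__main__":
    for km in (20, 40, 100, 1000):
        print(f"{km:>5} km: {round_trip_delay_ms(km):.2f} ms round trip, "
              f"{suggested_replication(km)}")
```

Under these assumptions, a synchronous write to a site 100 km away waits at least about 1 ms for the remote acknowledgment. Because that delay is added to every write, longer distances typically push sites toward asynchronous replication.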

Companies electing to move to three or four data center DR and failover models tend to be heavily concentrated in the financial services sector, where continuous availability of service is imperative. These same companies are confronted by pressures and conditions that can be addressed by having that third and even fourth data center located far away from the first two data center locations.

The primary drivers for more than two data centers are:

• A regulatory concern about whether the business will be able to “stay running” in the face of a widespread regional disaster. For instance, when the World Trade Center buildings in New York were hit by terrorists in the 9/11 attacks, a New York bank that maintained two data centers in lower Manhattan couldn’t access either of them. As part of its wide area DR and failover plan, the bank had a third data repository in New Jersey that it was able to recover from, although recovery took five days.
• A desire to avoid data loss if one of the two data centers in the enterprise’s primary region goes down. The first and second data centers within a specific geographical area can use disk replication to keep data at both sites in sync. This disk replication gets disrupted when one of the two sites fails. By maintaining a third data center that isn’t part of the same synchronous data replication process and is instead updated asynchronously at periodic intervals, companies have the assurance of another DR and failover mechanism if the primary or secondary site goes offline.
• A need to cover significant distances within a single geographical footprint. If two corporate locations are far apart in distance (even though they’re technically in the same region), IT might choose to implement a logically “dual” DR data center strategy within a single physical site, but also establish a third data center at a different (and distant) location within the region that uses asynchronous update mechanisms.
• A desire for data resiliency. Especially if an enterprise has a global reach, having a third data center in an entirely different geographical region from the primary and secondary data centers is a precautionary DR step. This strategy also allows IT to continue to use data replication between the two surviving sites, because if they only have two sites and one is lost, they’re forced back to a tape-based data recovery with an RTO/Recovery Point Objective (RPO) of 48 to 72 hours.
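
The reasoning behind the last driver can be expressed as a simple rule of thumb. The sketch below is a hypothetical illustration only, assuming every site holds a usable copy of the data. The function name and return strings are invented for this example.

```python
def posture_after_failure(total_sites: int, failed_sites: int) -> str:
    """Describe what protection remains once some sites are lost."""
    if total_sites < 1 or not 0 <= failed_sites <= total_sites:
        raise ValueError("need total_sites >= 1 and 0 <= failed_sites <= total_sites")
    surviving = total_sites - failed_sites
    if surviving == 0:
        return "no surviving site"
    if surviving == 1:
        # With a single site left there is no replication partner, so the
        # fallback is tape-based recovery (RTO/RPO of 48 to 72 hours).
        return "one site left: tape-based recovery (48 to 72 hours)"
    return "replication continues between surviving sites"


if __name__ == "__main__":
    print("2 sites, 1 lost:", posture_after_failure(2, 1))
    print("3 sites, 1 lost:", posture_after_failure(3, 1))
```

Comparing the two calls shows the point of the third site: losing one of three still leaves a replicating pair, while losing one of two leaves a single unprotected copy.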
