Jun 12 ’12

Raising the Bar on System Uptime

by Mary E. Shacklett in Enterprise Tech Journal

Global demands for continuous system uptime and availability are raising the bar for performance in corporate data centers. They're also causing organizations to revisit their plans for system backup and Disaster Recovery (DR). Essentially, organizations are looking to improve three things:

• Uptime and availability
• Data backup and recoverability
• Anticipation of emerging problems so they can be preempted.

As enterprises go about this task, some profound changes are occurring in how they think about DR and backup. These changes demand new strategic thinking about system uptime and availability and how resources and facilities can be deployed and exploited to increase system uptime.

DR and Backup: Onto the Front Burner

Organizations that operate 24x7 vary widely in their approaches to DR, depending on their industry sector. Financial services companies are the most aggressive. This is partly due to rigorous regulatory standards, but financial services companies also operate in an environment where you can palpably hear the downtime ticking in seconds and dollars. In contrast, a parts manufacturer also feels the impact of downtime, but perhaps not in seconds. A retailer is sensitive to downtime, but usually has a distributed network of small servers in each of its retail outlets so it’s possible to store transactions locally while central computing is offline, and then forward those transactions when central computing is restored. In short, the Recovery Time Objective (RTO) and Recovery Point Objective (RPO) standards for different industries vary, but everyone is interested in recovering from disasters, ensuring recoverability of data, and being able to perceive and resolve a problem before it becomes an issue.
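To make the retailer's store-and-forward pattern concrete, here is a minimal Python sketch under stated assumptions: the central-system client and its post call are hypothetical stand-ins, not any particular retail platform's API. Transactions are queued locally whenever the central system is unreachable and forwarded, in order, once it comes back online.

```python
from collections import deque

class StoreAndForwardClient:
    """Buffers transactions locally while the central system is offline,
    then forwards them in order once connectivity is restored."""

    def __init__(self, central):
        self.central = central   # hypothetical client for the central system
        self.pending = deque()   # locally stored transactions awaiting forwarding

    def record_transaction(self, txn):
        """Normal path: send immediately; if central is down, store locally."""
        try:
            self.central.post(txn)
        except ConnectionError:
            self.pending.append(txn)

    def forward_pending(self):
        """Called periodically; drains the local queue while central responds."""
        while self.pending:
            try:
                self.central.post(self.pending[0])  # peek first
                self.pending.popleft()              # remove only after a successful send
            except ConnectionError:
                break                               # still offline; retry on the next cycle
```

The essential design choice is that a transaction leaves the local queue only after the central system has accepted it, so nothing is lost if the forwarding pass is interrupted.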

Virtually every organization also understands that being “open” for business non-stop is now the minimal expectation in a global marketplace. This means system uptime and availability must be next to infallible; it changes the dynamics of DR and backup from a “side” or “background” project to one of the top priorities on CIOs’ to-do lists.

The push for this change in perspective is coming from the business, which now considers any kind of IT downtime an impact on revenue capture, customer satisfaction, and customer retention. While every organization approaches DR and backup based on its own unique situation, virtually every organization wants system redundancy (and continued uptime) for planned actions such as system maintenance, and seamless, automated failover for unexpected downtime or disasters. Enterprises also want a minimal number of system components in the mission-critical path of business processing; they want to avoid single points of failure in their IT infrastructures.

Data Center Changes

The baseline practice has always been to run a single data center and contract for hot site or cold site services in the event of an outage. However, a growing number of enterprises now operate two data centers. As continuous system availability and uptime have grown in importance, the trend has been to fully equip each data center so it can run all of production, and to insert technology that supports easy, transparent switchover of processing and other IT resources from one data center to the other when necessary. The second data center can operate in active-standby mode, where it remains on constant standby and is activated to assume the full production load during an outage; or it can operate in active-active mode, where it stays in sync with the primary data center and both data centers process in parallel, so there is no lapse at all for a failover or DR.
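The difference between the two modes can be sketched in a few lines of illustrative Python; the Site class and routing functions are hypothetical simplifications, not a GDPS interface. In active-standby mode, work reaches the standby only when the primary is down; in active-active mode, both sites carry production all the time, so losing one simply shifts its share to the survivor with no failover step.

```python
class Site:
    def __init__(self, name):
        self.name = name
        self.healthy = True

    def process(self, work):
        print(f"{self.name} processed {work}")

def route_active_standby(primary, standby, work):
    """Standby takes over only when the primary is detected as down."""
    site = primary if primary.healthy else standby
    site.process(work)

def route_active_active(sites, work, key):
    """Both sites run production in parallel; work is spread across the
    healthy ones, so a site loss shifts load without a failover event."""
    healthy = [s for s in sites if s.healthy]
    healthy[key % len(healthy)].process(work)
```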

The two-data center concept works especially well when both data centers are located in the same metropolitan area. Such proximity allows the use of communications topologies that can adroitly fail over from one site to the other. Two data centers located near each other can also draw on a central pool of IT talent that can operate out of either site. Moreover, two data centers give an enterprise insurance that it will keep running in the event of an outage.

With global activities increasingly ubiquitous, new data center thinking is rapidly gaining traction: a third data center that the enterprise operates in a remote geography. The risk enterprises want to address is what could happen if a sizable disaster brought down an entire geographical region, including all the data centers in that region. Some of the drive for the third data center is being fueled by industry regulators. In the financial services industry, for example, there's growing regulatory pressure to at least keep data at locations that are geographically far removed from major data center sites. The most compelling pressure, however, appears to be the non-stop service expectations of a now global community of customers, who expect service even if your main operation is knocked out by a regional disaster. In that situation, a third data center in a distant locale keeps an enterprise in business.

There’s also a trend toward fully equipping each data center to run the entire business. This level of redundancy requires enterprises to move away from cold site or “standby” thinking and into dynamic processing environments where the full production load can be toggled between data centers on demand. In this multi-production data center model, data center A might run all enterprise production for the first quarter of the year, with production moving over to data center B for the next quarter. By toggling production back and forth between data centers, an enterprise gains the peace of mind of total system resiliency and redundancy, and it continuously verifies that its DR, backup, and failover plan actually works.

Data Wellness

DR and backup require safeguarding data and ensuring it’s always available. Storage-based replication technology keeps copies of data current, and for IBM zEnterprise, there’s continuous, non-disruptive data protection for all data. The DB2 and IMS System Level Backup/Restore Utilities now exploit the storage-based, point-in-time copy technology called FlashCopy. In DB2, for instance, a non-disruptive backup of the database can be taken using FlashCopy, and the utility maintains consistency between database tables and logs. For data restoration, a site can bring all DB2 data and tables back to any point in time. IT can take frequent database backups, and the speed of the process lets sites run backups after each shift or every night instead of, say, once per week. Technology is even available to perform backups every two to four hours for rapid recovery if data is corrupted.
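The “restore to any point in time” idea rests on combining a consistent base copy with the log. The following sketch is illustrative only and uses hypothetical restore_backup and apply_log placeholders rather than the actual DB2 utilities: pick the newest backup taken at or before the target time, then roll forward through the log records up to that point.

```python
from bisect import bisect_right

def restore_backup(backup_id):
    print(f"restoring base copy {backup_id}")   # stand-in for mounting a point-in-time copy

def apply_log(record):
    print(f"applying log record {record}")      # stand-in for rolling the database forward

def restore_to_point_in_time(backups, logs, target_time):
    """backups: sorted (timestamp, backup_id) pairs for consistent copies.
    logs: (timestamp, record) entries in commit order.
    Restores the newest backup at or before target_time, then replays later logs."""
    times = [t for t, _ in backups]
    idx = bisect_right(times, target_time) - 1
    if idx < 0:
        raise ValueError("no backup exists before the requested point in time")
    base_time, backup_id = backups[idx]
    restore_backup(backup_id)
    for t, record in logs:
        if base_time < t <= target_time:
            apply_log(record)
```

The more frequently base copies are taken, the shorter the roll-forward step, which is why backups every few hours translate directly into faster recovery from corruption.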

Functions such as IBM FlashCopy let a site make point-in-time, full-volume copies of data, with the copies immediately available for read or write access. The copy can be used with standard backup tools to create backup copies on tape; data can also be copied to storage in an alternate data center for backup. FlashCopy, which comes with storage replication, supports quick copying of volumes or data sets and has helped many sites dramatically reduce their backup windows. This technology complements real-time, storage-based data replication, which of course isn’t a substitute for logical backups of data. Non-disruptive data backup is an area where many sites can further improve; backups are still essential for recovery when data is corrupted and must be restored.

A second option that preserves the wellness of data and keeps it accessible is HyperSwap, a z/OS capability that improves the availability of data. HyperSwap can keep data accessible in the event of a storage subsystem outage because the IBM Geographically Dispersed Parallel Sysplex (GDPS) controlling system and the production systems are constantly communicating. When a disk subsystem begins to fail, z/OS detects the condition and the controlling software (i.e., GDPS or Tivoli Productivity Center/Replication [TPC/R]) performs a disk “swap.” Swapped systems continue to run without interruption on the secondary disk subsystem after the swap, fully masking production from the disk subsystem failure. HyperSwap is also a z/OS function that IBM believes is currently underexploited.
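The swap itself can be pictured with a rough sketch, under the assumption of a synchronously mirrored pair; this is illustrative Python, not the GDPS/HyperSwap implementation. Because every write lands on both copies before it completes, redirecting I/O to the secondary is safe and effectively instantaneous.

```python
class MirroredDisk:
    """Illustrative stand-in for a pair of synchronously mirrored disk subsystems."""

    def __init__(self, primary, secondary):
        self.primary = primary
        self.secondary = secondary

    def write(self, block, data):
        # Synchronous mirroring: both copies are updated before the write
        # completes, which is what makes an instant swap safe.
        self.primary[block] = data
        self.secondary[block] = data

    def read(self, block):
        return self.primary[block]

    def swap(self):
        """Redirect all I/O to the mirrored copy; applications keep running."""
        self.primary, self.secondary = self.secondary, self.primary
```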

Precluding Problems

Critical for IT is a set of capable monitoring tools that can inform technicians of emerging hardware and software problems before they become major issues that force a failover. One example is an early warning system that can inform a technician when a server is running out of buffers, which could be symptomatic of a batch job looping and not releasing resources. The situation is potentially serious because the job holding onto a resource creates a lock on that resource, and other jobs queued for that resource must wait. In this case, effective monitoring and alerts would give a system operator notice of the problem in its early stages, allowing the operator to take preemptive action.
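An early warning check of this kind amounts to alerting on shrinking headroom rather than on exhaustion. The Python sketch below is illustrative; the thresholds and metric source are assumptions, not recommended values.

```python
def check_buffer_pool(used, total, warn_at=0.80, critical_at=0.95):
    """Returns an alert level well before the pool is exhausted, giving the
    operator time to find the job that is looping and holding resources."""
    utilization = used / total
    if utilization >= critical_at:
        return "CRITICAL: buffer pool nearly exhausted; intervene now"
    if utilization >= warn_at:
        return "WARNING: buffer usage trending high; check for looping jobs"
    return "OK"

# Example: 1,700 of 2,000 buffers in use trips the early warning.
print(check_buffer_pool(1700, 2000))
```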

GDPS extends monitoring across multiple sites and is designed to detect a site failure, raise alerts and then, on command, automatically perform a site failover of a Parallel Sysplex. This provides automated recovery from the failure of any resource associated with a Parallel Sysplex. In the area of monitoring, an enhanced z/OS “health checker” is also available in GDPS version 3.9. The health checker monitors facets of production performance such as temporary storage utilization, how many concurrent users are using a given application, and how various system parameters are set. The health checker comes back with performance reports and recommendations for parameter changes based on what it observes.
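Conceptually, a health checker of this sort compares observed metrics against thresholds and turns exceptions into recommendations. The sketch below is a hypothetical simplification in Python, with made-up check names; it is not the GDPS health checker itself.

```python
def health_report(metrics, thresholds):
    """metrics/thresholds: dicts keyed by check name (e.g., temporary storage
    utilization, concurrent users). Returns observations plus suggestions."""
    report = []
    for name, value in metrics.items():
        limit = thresholds.get(name)
        if limit is not None and value > limit:
            report.append(f"{name}: {value} exceeds {limit}; consider adjusting the related parameter")
        else:
            report.append(f"{name}: {value} within limits")
    return report

print("\n".join(health_report(
    {"temp_storage_pct": 87, "concurrent_users": 420},
    {"temp_storage_pct": 80, "concurrent_users": 500})))
```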

Industry Trends and Best Practices

For all the tools and practices available, there are still some fundamental areas of execution that sites need to tighten up; they can also better exploit certain system capabilities.

Eliminating single points of failure: The “best laid plans of mice and men” aren’t going to help you in a failure scenario if you have single points of failure within your IT infrastructure. If your infrastructure is organized around single Logical Partitions (LPARs) and single instances of applications, you have a single point of failure. If you cluster these virtual instances on a single server with multiple instances of operating systems and applications, you still have a single point of failure. This is where GDPS technology across processors, storage, and data centers pays off, since you can replicate both your data and your operations, quickly facilitating failover.

Taking advantage of GDPS: When GDPS first became available 14 years ago, many sites took advantage of IBM’s zero-pricing policy and quickly installed the product at a minimal level to capitalize on the offer. While this was a prudent cost move, many organizations have failed to move past the point of minimal installation, so they’re leaving on the table many of GDPS’s most beneficial features for addressing DR, backup, and failover situations.

Staying in touch with the communications side of DR: Organizations tend to focus only on the technology elements of DR and backup, but there’s also a critical communications element that should be part of any DR and business continuation plan. A DR plan typically includes a communications “tree” that defines who makes which decisions, when, and at what level of authority in any disaster, as well as who communicates the occurrence of an outage, progress on recovery, and restoration of service to corporate employees, stakeholders, partners, and customers. “Getting the word out” in an appropriate manner is an important ingredient in any DR scenario. It reduces the risk of business loss, lets IT go about its business of restoring systems, and should always be integral to any DR, backup, and failover strategy.
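A communications tree is, at bottom, a simple data structure. The Python sketch below uses hypothetical roles and a made-up message to show how a notification fans out from the decision maker through each audience; it is illustrative, not a prescribed org chart.

```python
from dataclasses import dataclass, field

@dataclass
class Contact:
    role: str                                     # e.g., "DR coordinator"
    notifies: list = field(default_factory=list)  # who this role informs next

def notify(contact, message, depth=0):
    """Walk the tree top-down so each audience hears from the right level of authority."""
    print("  " * depth + f"{contact.role}: {message}")
    for nxt in contact.notifies:
        notify(nxt, message, depth + 1)

# Hypothetical tree: the CIO declares the outage, then word fans out.
tree = Contact("CIO", [
    Contact("DR coordinator", [Contact("IT operations")]),
    Contact("Corporate communications", [Contact("Employees"), Contact("Customers and partners")]),
])
notify(tree, "Outage declared; failover to the secondary data center is under way")
```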

Summary

Non-stop worldwide availability of systems gives enterprises plenty to think about. For some, it has meant investing in new data centers that are individually capable of running the entire business. For others, it has meant the reconfiguration of staff so the enterprise has true “follow the sun” IT expertise available and “on the ground” in multiple geographies. For all, it’s opening the door to tools and capabilities in the data center and in systems that have long been available, but never fully exploited. As DR and backup make their way toward the top of many CIOs’ lists, this is likely to change.