DR and backup require safeguarding data and ensuring it’s always available. Storage-based replication technology handles the replication itself, and on IBM zEnterprise it provides continuous, non-disruptive protection for all data. The DB2 and IMS System Level Backup/Restore Utilities now exploit the storage-based data replication technology called FlashCopy. In DB2, for instance, the utility can use FlashCopy to take a non-disruptive backup of the database while maintaining consistency between database tables and logs. For restoration, a site can bring all DB2 data and tables back to any point in time. Because the process is fast, IT can take backups far more often: after each shift or every night instead of, say, once per week. Technology is even available to perform backups every two to four hours for rapid recovery if data is corrupted.
Functions such as IBM FlashCopy let a site make point-in-time, full-volume copies of data, with the copies immediately available for read or write access. The copy can be used with standard backup tools to create backup copies on tape, or it can be copied to storage in an alternate data center. FlashCopy, which comes with storage replication, supports quick copies of volumes or data sets and has helped many sites dramatically reduce their backup windows. This technology complements real-time, storage-based data replication, which, of course, isn’t a substitute for logical backups of data: replication mirrors corrupted data as faithfully as good data, so backups remain essential for recovery when data is corrupted. Non-disruptive data backup is an area where many sites can further improve.
A second option that preserves the integrity of data and also keeps it accessible is HyperSwap, a z/OS feature that improves data availability. HyperSwap can keep data available in the event of a storage subsystem outage because the IBM Geographically Dispersed Parallel Sysplex (GDPS) controlling system and the production systems are constantly communicating. When a disk subsystem begins to fail, z/OS detects various hardware or software triggers and works with management software (i.e., GDPS or Tivoli Productivity Center/Replication [TPC/R]) to perform a disk “swap.” The swapped systems continue to run without interruption on the alternate disk subsystem after the swap, fully masking the disk subsystem failure from production. HyperSwap is also a z/OS function that IBM believes is currently underexploited.
Critical for IT is a set of capable monitoring tools that can inform technicians of emerging hardware and software problems before these problems become major issues that can cause a failover. One example is an early warning system that can inform a technician when a server is running out of buffers, which could be symptomatic of a batch job looping and not releasing resources. The situation is potentially serious because the job holding onto a resource creates a lock on that resource, and other jobs in the queue for that resource must wait. In this case, effective monitoring and alerts would give a system operator notice of the problem in its early stages, allowing the operator to take preemptive action.
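The early-warning idea can be illustrated with a minimal sketch. It is not taken from any particular monitoring product: the function name, the two alert rules, and the threshold values are all assumptions chosen for illustration. It scans a series of free-buffer counts and raises an alert either when the count drops below a floor or when it has fallen for several samples in a row, the pattern a looping job that never releases resources might produce.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class BufferAlert:
    sample_index: int  # position of the sample that triggered the alert
    free_buffers: int  # free-buffer count observed at that sample
    reason: str        # which (hypothetical) rule fired


def check_free_buffers(samples: List[int],
                       warn_below: int,
                       falling_streak: int = 3) -> Optional[BufferAlert]:
    """Return the first early-warning alert in a series of free-buffer counts.

    Two illustrative rules are applied to each sample, in this order:
      1. the free-buffer count is below `warn_below`;
      2. the count has fallen for `falling_streak` consecutive samples,
         which may indicate a job that is not releasing resources.
    Returns None when neither rule fires.
    """
    streak = 0
    for i, free in enumerate(samples):
        if free < warn_below:
            return BufferAlert(i, free, "below threshold")
        # Count consecutive declines; any rise or plateau resets the count.
        if i > 0 and free < samples[i - 1]:
            streak += 1
        else:
            streak = 0
        if streak >= falling_streak:
            return BufferAlert(i, free, "steady decline")
    return None
```

With samples of 100, 98, 99, 97, 95, 93 and a floor of 50, the second rule fires at the sixth sample, well before the floor is reached. That is the point of an early warning: the operator hears about the trend while there is still time to act.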
GDPS extends monitoring across multiple sites and is designed to detect a site failure, raise alerts and then, on command, automatically perform a site failover of a Parallel Sysplex. This provides automated recovery from the failure of any resource associated with a Parallel Sysplex. In the area of monitoring, an enhanced z/OS “health checker” is also available in GDPS version 3.9. The health checker monitors facets of production performance such as temporary storage utilization, the number of concurrent users of a given application, and which system parameters are set. It then produces performance reports and recommends parameter changes based on what it observes.
Industry Trends and Best Practices
For all the tools and practices available, there are still some fundamental areas of execution that sites need to tighten up; they can also better exploit certain system capabilities.
Eliminating single points of failure: The “best laid plans of mice and men” aren’t going to help you in a failure scenario if you have single points of failure within your IT infrastructure. If your infrastructure is organized around single Logical Partitions (LPARs) and single instances of applications, you have a single point of failure. If you cluster multiple instances of operating systems and applications as virtual instances on a single server, you still have a single point of failure: the server itself. This is where GDPS technology across processors, storage and data centers pays off, since you can replicate both your data and your operations and so fail over quickly.
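The reasoning about single points of failure can be made mechanical. The following is a rough sketch rather than a description of any GDPS facility: the `Instance` record, its field names, and the sample inventory are hypothetical. Given an inventory of where each application instance runs, it reports the levels (LPAR, server, site) at which an application has no redundancy.

```python
from collections import defaultdict
from typing import Dict, List, NamedTuple


class Instance(NamedTuple):
    application: str
    lpar: str
    server: str
    site: str


def single_points_of_failure(instances: List[Instance]) -> Dict[str, List[str]]:
    """Map each application to the levels at which it has no redundancy.

    An application is exposed at a level when all of its instances share
    the same LPAR, the same server, or the same site. Applications with
    redundancy at every level are omitted from the result.
    """
    lpars = defaultdict(set)
    servers = defaultdict(set)
    sites = defaultdict(set)
    for inst in instances:
        # LPAR names may repeat across servers, so qualify them by server.
        lpars[inst.application].add((inst.server, inst.lpar))
        servers[inst.application].add(inst.server)
        sites[inst.application].add(inst.site)

    findings: Dict[str, List[str]] = {}
    for app in lpars:
        levels = [
            level
            for level, seen in (("lpar", lpars[app]),
                                ("server", servers[app]),
                                ("site", sites[app]))
            if len(seen) == 1
        ]
        if levels:
            findings[app] = levels
    return findings


# Hypothetical inventory: "payroll" is clustered, but only on one server.
INVENTORY = [
    Instance("payroll", "LPAR1", "CPC-A", "site-1"),
    Instance("payroll", "LPAR2", "CPC-A", "site-1"),
    Instance("orders", "LPAR1", "CPC-A", "site-1"),
    Instance("orders", "LPAR3", "CPC-B", "site-2"),
]
```

Here `single_points_of_failure(INVENTORY)` flags `payroll` at the server and site levels even though it runs in two LPARs, while `orders`, which spans two servers and two sites, is not flagged at all.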
Taking advantage of GDPS: When GDPS first became available 14 years ago, many sites took advantage of IBM’s zero-pricing policy and quickly installed the product at a minimal level to capitalize on the offer. While this was a prudent cost move, many organizations have failed to move past the point of minimal installation, so they’re leaving on the table many of GDPS’s most beneficial features for addressing DR, backup, and failover situations.
Staying in touch with the communications side of DR: Organizations tend to focus only on the technology elements of DR and backup, but there’s also a critical communications element that should be part of any DR and business continuity plan. A DR plan typically includes a communications “tree” that spells out who makes which decisions, when, and at what level of authority in any disaster, as well as who communicates the occurrence of an outage, progress on the recovery, and restoration of service to corporate employees, stakeholders, partners, and customers. “Getting the word out” in an appropriate manner is an important ingredient in any DR scenario. It reduces the risk of business loss, lets IT go about its business of restoring systems, and should always be integral to any DR, backup, and failover strategy.
Non-stop worldwide availability of systems gives enterprises plenty to think about. For some, it has meant investing in new data centers that are individually capable of running the entire business. For others, it has meant the reconfiguration of staff so the enterprise has true “follow the sun” IT expertise available and “on the ground” in multiple geographies. For all, it’s opening the door to tools and capabilities in the data center and in systems that have long been available, but never fully exploited. As DR and backup make their way toward the top of many CIOs’ lists, this is likely to change.