Aug 1 ’05

Ensuring Recoverability in Mainframe Environments

by Editor in z/Journal

Explosive information growth has many corporations struggling to protect their information assets. Data loss, never an attractive proposition, has become more risky than ever, especially due to new corporate and regulatory mandates for information retention. The threat of terrorism, regional blackouts, and natural disasters has many enterprises reviewing and retooling their existing plans and procedures to ensure their data is adequately protected and recoverable.
 

Major disasters aren't the only events that can create a need to recover data and applications. The most common are failures resulting from programmatic or human error, which often lead to data corruption or loss.

Protecting information assets isn’t simple. It requires resources and vigilance to ensure data is always recoverable, as well as in a consistent, coherent state. The challenge is how to balance recoverability against non-stop computing needs where data is constantly being generated and modified.


Recoverability vs. Continuous Availability

Organizations implement disaster recovery plans to help them strike this balance. Such plans take into account how critical each application is to business operations, and define recovery time and recovery point objectives for them.

The Recovery Time Objective (RTO) is the maximum allowable time an application may be offline. The Recovery Point Objective (RPO) is a measure of how much data can be lost. For example, if the last backup is 12 hours before the failure, the organization could potentially lose 12 hours of data. If the organization can’t afford to lose that much data, a shorter recovery point objective is required. Once these objectives are defined, the organization applies various data protection and recovery strategies to meet them.
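As a rough illustration of how backup frequency drives the recovery point, the short sketch below compares a worst-case data loss window against a stated RPO. The interval and objective values are hypothetical, chosen only to mirror the 12-hour example above.

# Hypothetical sketch: worst-case data loss exposure for a given backup interval.
# The figures are illustrative, not drawn from any real recovery plan.
def worst_case_data_loss(hours_between_backups):
    # If the failure hits just before the next backup runs, everything
    # written since the last backup is at risk.
    return hours_between_backups

rpo_hours = 4.0                 # business decision: tolerate at most 4 hours of lost data
backup_interval_hours = 12.0    # current schedule: one backup every 12 hours

exposure = worst_case_data_loss(backup_interval_hours)
if exposure > rpo_hours:
    print(f"Worst-case loss of {exposure}h exceeds the {rpo_hours}h RPO; "
          "shorten the backup interval or add PIT copies/replication.")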

Tape-based backup methods form the foundation of most data protection strategies. However, tape is slow relative to other technologies and recovery speed is becoming a critical element in many organizations’ recovery plans. So tape is now often the “last line of defense” in data protection rather than the primary strategy. In addition, tape has become the medium of choice for long-term data retention and archiving.

Disk-based backup and Point-in-Time (PIT) copies or snapshots provide faster recovery than standard tape backup. PIT copies also improve the recovery point: the amount of data that can be lost is limited to the interval between PIT copies. Asynchronous replication reduces the risk further, shrinking the RPO to only a few transactions or data writes.

For the best in terms of instant physical data recovery and the lowest RPO, data center managers turn to synchronous disk mirroring—particularly for local protection and recovery, as application performance can be adversely affected if the secondary copy is located at a distance from the primary. So synchronous mirroring and asynchronous replication are often used in combination for the most critical applications—mirroring for local protection and replication for distance protection. For example, an organization might implement synchronous mirroring within a data center to protect against local failures. To protect against a sitewide failure—such as a power failure or natural disaster—the data would be replicated to a second data center or Disaster Recovery (DR) site. Further protection could be provided with PIT copies and/or tape backups of either the local or remote copy of data. Clearly, such a multi-tiered, multi-site mirrored environment would represent a significant investment in terms of hardware, software, and management.
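To make such a tiered design concrete, the sketch below simply enumerates the layers described above and the approximate recovery point each one offers. The tiers, locations, and RPO characterizations are illustrative assumptions, not settings from any particular product.

# Hypothetical sketch of a multi-tiered protection scheme; values are illustrative.
protection_tiers = [
    {"tier": "synchronous mirror",   "location": "local data center", "approx_rpo": "last committed write"},
    {"tier": "asynchronous replica", "location": "remote DR site",    "approx_rpo": "a few writes/transactions"},
    {"tier": "PIT copies",           "location": "local or remote",   "approx_rpo": "interval between copies"},
    {"tier": "tape backup",          "location": "offsite vault",     "approx_rpo": "hours to a day"},
]

for t in protection_tiers:
    print(f"{t['tier']:<22} at {t['location']:<18} -> typical RPO: {t['approx_rpo']}")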

Benefits of Regular Testing

All these data protection strategies, from tape backup to synchronous mirroring, are physical recovery methods. However, it's vital to understand that logical data recovery is different, yet equally important. Logical data recovery focuses on recovering the business processes and business applications (i.e., recovering the data to a consistent, known state so applications can be properly restarted).

Generally, the faster the recovery and the more granular the recovery point, the higher the cost of implementation and ongoing management, and the greater the complexity of the storage and network environment. This added cost and complexity makes it even more important to test DR plans regularly to confirm data is recoverable. You don't want to make a substantial investment in hardware and software only to find your data isn't recoverable when you need it.

Another reason to regularly test DR plans and procedures is that even the highest level of synchronous mirroring may not protect against data corruption or loss resulting from logical failures or physical events. For example, if an application failure causes data corruption, the corrupt data will be copied to the mirror until the failure is discovered. Similarly, if mirrored data is deleted, the deletion is propagated to the secondary copy. Recovery then requires locating a copy of the data taken before the failure or deletion.

Another scenario that argues for regular testing of DR plans is a failure of the link between the mirrors. In synchronous mirroring, the application must receive write verification from both the primary and secondary copies before proceeding, so a link failure can leave the application hung. In asynchronous replication, a link failure can leave the secondary copy lagging significantly behind the primary, since the application doesn't wait for write verification from the secondary before proceeding. Swiftly diagnosing and repairing link failures helps prevent application slowdowns and minimizes the risk of data loss.

A further consideration in DR planning and testing is to ensure personnel at remote sites have the information and tools they need to rapidly restore operations if the primary site fails—especially in the event of a natural disaster or power failure that might disrupt communications or prevent personnel from traveling to the remote site.

All these factors taken together demonstrate the need to take a proactive stance toward recovery planning and to implement appropriate policies and procedures—supported by technology—to ensure recovery.

Software Technologies for Mainframe Recovery

The complexity of the modern data center environment is evident in:

-        Multiple applications with varying recovery objectives, sometimes sharing critical or non-critical data sets

-        Tiered storage and recovery architectures, including backup, replication and mirroring

-        A wide variety of storage devices from several different vendors.

Such complexity is the primary reason to deploy software technology to provide recovery assurance. What should an organization look for in recovery assurance software technology that helps ensure recoverability? Generally, managing and assuring recovery requires:

-        System recoverability: No data center is recoverable unless the operating system is intact. The recovery assurance software you choose should track all system-related data sets and ensure they're available at the remote site, whether on backup tape or on disk as backup files, PIT copies, or replicated or mirrored volumes.

-        Application recoverability: Once the system is available, the business-critical applications are the next priority for recovery. Even in a mirrored environment, a missing file can cause a delay in recovering and restarting these applications, so the technology you select should eliminate these delays by ensuring all critical data is available and intact. In addition, the solution you choose should be able to find critical data regardless of where it’s physically located.

-        Prioritization of application recovery: In a major outage or disaster, not all applications need to be restored simultaneously. Look for a solution that enables a customized, phased restore process and prevents the accidental overlay of restored data sets, especially if you have multiple applications with varying recovery objectives.

-        Automation of critical data set identification: Identifying critical, non-critical and allocate-only application data sets ensures your data is always recoverable. Automating this identification process, regardless of the media the data resides on, streamlines recovery management. Ongoing monitoring of application changes is necessary to ensure critical data isn't inadvertently left out of the recovery scenario. You should also look for a solution with reporting capabilities to present the analysis in a historical or daily mode and illustrate application interdependencies. A simple sketch of this kind of classification and availability check appears after this list.
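As a minimal sketch of that identification and availability check, the following code classifies a handful of data sets and flags any critical one with no recoverable copy at the DR site. The data set names, criticality classes, and copy locations are invented purely for illustration.

# Hypothetical sketch: classify application data sets and verify that every
# critical one has a usable copy at the DR site. Names and classes are invented.
inventory = {
    "PROD.PAYROLL.MASTER": {"class": "critical",      "dr_copies": ["mirror", "tape"]},
    "PROD.PAYROLL.WORK":   {"class": "allocate-only", "dr_copies": []},
    "PROD.REPORTS.HIST":   {"class": "non-critical",  "dr_copies": ["tape"]},
}

def recovery_gaps(inv):
    # Return critical data sets with no copy available at the remote site.
    return [name for name, info in inv.items()
            if info["class"] == "critical" and not info["dr_copies"]]

gaps = recovery_gaps(inventory)
if gaps:
    print("Critical data sets not recoverable at the DR site:", gaps)
else:
    print("All critical data sets have at least one recoverable copy.")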

Even minor physical or logical failures may result in suspended or halted batch processing. So organizations that leverage batch processing should incorporate swift resumption of batch operations into their recovery strategy. A key element in restoring or rerunning a batch cycle is using an appropriate synchronization point. To simplify restoration of batch operations, select recovery software that helps you identify the appropriate synch point and restart or rerun the batch as appropriate. Ideally, you should be able to isolate specific applications so more critical batches can run first.
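As a simple illustration of sync point selection, the sketch below picks the latest consistent checkpoint recorded before a failure, from which the batch cycle could be restarted or rerun. The checkpoint times and failure time are hypothetical.

# Hypothetical sketch: choose the latest sync point taken before the failure,
# so the batch cycle can be restarted or rerun from a consistent state.
from datetime import datetime

sync_points = [                      # checkpoints recorded during the batch cycle (illustrative)
    datetime(2005, 8, 1, 1, 0),
    datetime(2005, 8, 1, 3, 0),
    datetime(2005, 8, 1, 5, 0),
]
failure_time = datetime(2005, 8, 1, 4, 15)

candidates = [p for p in sync_points if p <= failure_time]
restart_point = max(candidates) if candidates else None
print("Restart the batch from:", restart_point)   # -> 2005-08-01 03:00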

It can be difficult to determine the resources necessary to recover in a given scenario, and it may not be practical to apply the same level of resource investment to each scenario. To help you size various recovery scenarios and reserve only the resources you need, simulation or modeling is helpful. Some recovery assurance solutions include the ability to model recovery scenarios tailored to the number of applications or jobs to be recovered; this streamlines planning, testing and resource allocation efforts, saving time and money. This type of modeling can also help identify unnecessary or redundant backups, enabling further cost savings.
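A toy version of such a sizing model might look like the following sketch, which estimates elapsed recovery time from a list of per-application restore durations and the number of parallel restore streams. The durations and stream counts are assumptions made only for illustration.

# Hypothetical sketch: estimate elapsed recovery time for a scenario, given the
# jobs to be recovered and how many can be restored in parallel.
def estimate_recovery_hours(job_hours, parallel_streams):
    # Greedy assignment of restore jobs to parallel streams (longest jobs first).
    streams = [0.0] * parallel_streams
    for hours in sorted(job_hours, reverse=True):
        streams[streams.index(min(streams))] += hours
    return max(streams)

scenario = [3.0, 2.5, 2.0, 1.5, 1.0, 0.5]   # per-application restore times, in hours
for streams in (1, 2, 4):
    print(f"{streams} restore stream(s): ~{estimate_recovery_hours(scenario, streams)} hours")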

To further simplify management, look for a solution that provides a central console from which to view the status of applications, backups, replicas, and mirrors. This capability, combined with reports, can substantially improve the ability to demonstrate recoverability to non-IT personnel—such as business management, auditors, and other interested parties.

Summary

A variety of backup and recovery technologies are available to protect enterprise data. However, these don’t necessarily provide complete protection from logical failures or external physical failures, particularly with respect to restoring the application to a production state.

Recovery assurance software can help guarantee that application data is correctly and consistently backed up, making a logical recovery possible. Software technologies specifically designed for this purpose will:

-        Identify all critical files and the interrelationships between applications

-        Identify synchronization points to ensure a logical recovery point for applications

-        Ensure files are backed up or mirrored and available offsite during a recovery

-        Continuously monitor all applications for daily operation and any new changes

-        Document all applications in detail, noting important modifications

-        Provide automated, staged, synchronized recovery in an emergency

-        Accommodate any anticipated application migration to or from other operations

-        Maintain an updated, active DR plan, integrating all IT groups.

In the future, as encryption becomes a standard practice in data protection and recovery, data center managers should look to add encryption key location and management to their overall recovery strategy. This will ensure that data can be decrypted as necessary during the recovery process.