Apr 1 ’05

Hardware and Software Solutions for Mainframe Database Recovery

by Editor in z/Journal

Imagine a haggard CIO being interviewed about a recent failure in his production applications. You might hear this lament:  

“Last year, I spent millions of dollars on a storage replication solution to protect my data in case of a disaster. Yesterday, my business lost millions of dollars due to some improper database system maintenance. The recovery action took hours to diagnose and prepare, and even longer to execute. We’ll probably lose several accounts permanently and will never be able to recoup the losses. The storage replication solution offered no protection from the outage. I wish I had some tools that could reduce or eliminate the downtime for ‘local’ outages as well as protect for disaster situations.”

Companies have invested millions of dollars in their mainframe database applications. These applications allow trains and planes to move, shipped packages to be tracked, financial transactions to execute, and manufacturing to proceed. If these applications are unavailable for any reason, the company isn’t receiving the expected gain on its investment. Further, some industries incur fines if certain transactions aren’t processed in a timely fashion. Recent legislation requires all publicly traded institutions to maintain a high-availability application plan, including recovery from total site disaster.  

On Aug. 14, 2003, the Northeastern U.S. experienced a massive power blackout. Independent surveys taken since then indicate that more than two-thirds of all respondents lost at least one full business day due to the blackout. The cost of the downtime ranged from $50,000 to more than $1 million per hour.  

The challenge facing a mainframe database application user is to maintain the recoverability of the database while not adversely impacting availability. There are several techniques to protect the database and ensure recovery to a consistent point. The techniques range from periodic dumps of storage onto transportable media, to synchronous I/O mirroring at a second site.

Most companies have at least a basic Disaster Recovery (DR) plan for mainframe database applications. In recent years, companies have begun to address the larger issue of business continuity, recognizing that ensuring application availability requires more than just a DR plan.  

This enlargement of scope presents challenges and opportunities for companies to consider. Many companies combine a variety of solutions to completely protect the strategic database asset. Recovery solutions may include both hardware and software technologies, which provide different protection for different exposures.  

The Hardware Solutions

Hardware solutions may be driven by disk storage technology or by host-based processor systems. Generally, the hardware replication solutions take two forms: point-in-time backups and remote replication.  

Point-in-time backups are based on creating a consistent local copy of data, and making the copy available at a remote site. This occurs via either dumping to tape and shipping the tapes to a vault, or electronically transmitting the backup data set(s) to a remote set of volumes. Depending on the vendor technology used, these backups may be full-volume or data set-level operations, and may create a complete copy of the data or merely generate additional directory pointers to data. Typical data loss in this scenario is 24 hours (although if additional log data is available, it may be applied to a point-in-time backup to lessen the data loss). An application outage is usually required to ensure data consistency for a point-in-time backup. Some examples of this technology include:  

-        EMC Symmetrix with TimeFinder (volume) or EMC Snap (data set)

-        IBM ESS with FlashCopy (V1 for volume, V2 for data set)

-        IBM PPRC (volume—within same enclosure; PPRC technology is supported by EMC, Hitachi Data Systems, and StorageTek)  

-        HDS ShadowImage (volume)

-        StorageTek Snapshot (pointer-based volume or data set).   
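The data-loss exposure of a point-in-time backup, and the way applying extra log data shrinks it, can be illustrated with a little date arithmetic. The following is a minimal sketch, not a vendor procedure; the function name `data_loss_window` and the sample timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def data_loss_window(failure: datetime,
                     backup: datetime,
                     last_log: Optional[datetime] = None) -> timedelta:
    """Return how much work is lost when recovering after `failure`.

    Without log data, recovery goes back to the backup itself. If log
    data reaching `last_log` survived the failure, it can be applied on
    top of the backup, moving the recovery point forward.
    """
    recovery_point = backup
    if last_log is not None and last_log > backup:
        recovery_point = last_log
    if recovery_point > failure:
        raise ValueError("recovery point cannot be later than the failure")
    return failure - recovery_point


# Hypothetical timeline: nightly backup at 02:00, failure at 01:30 the next night.
backup = datetime(2005, 4, 1, 2, 0)
failure = datetime(2005, 4, 2, 1, 30)

print(data_loss_window(failure, backup))  # backup only: 23:30:00
print(data_loss_window(failure, backup,
                       last_log=datetime(2005, 4, 2, 1, 0)))  # with log: 0:30:00
```

With the backup alone, almost a full day of updates is lost; with log data available up to 01:00, the exposure in this made-up timeline drops to 30 minutes.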

Remote replication is based on storage devices or processes that replicate updates to an alternate site as they occur (synchronous) or soon thereafter (asynchronous). Typical data loss is almost zero for synchronous solutions. Synchronous solutions are distance-sensitive because each write must wait for acknowledgment from the remote site. Typically, a synchronous mirror must be within 100 km to avoid local performance degradation. These solutions might be envisioned as a “campus” environment. Some examples of synchronous remote mirroring include:

-        EMC Symmetrix SRDF/S

-        IBM PPRC (also supported by EMC, HDS, and STK devices)

-        StorageTek PowerPPRC.
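The distance sensitivity of synchronous mirroring comes largely from signal propagation time. The sketch below estimates only that component. It assumes light travels roughly 200 km per millisecond in optical fiber and that each write needs one round trip to the remote site; real replication protocols may need more round trips, and equipment delays are ignored. The function name `added_write_latency_ms` is hypothetical.

```python
# Assumption: light in optical fiber covers roughly 200 km per millisecond
# (about two-thirds of its speed in a vacuum).
FIBER_KM_PER_MS = 200.0


def added_write_latency_ms(distance_km: float, round_trips: int = 1) -> float:
    """Estimate the propagation delay a synchronous mirror adds to each write.

    Each round trip covers the distance twice: out to the remote site
    and back with the acknowledgment.
    """
    return round_trips * 2 * distance_km / FIBER_KM_PER_MS


for km in (10, 100, 1000):
    print(f"{km} km adds about {added_write_latency_ms(km):.2f} ms per write")
```

Under these assumptions, a mirror 100 km away adds about 1 ms to every write, while one 1,000 km away adds about 10 ms, which is why extended distances generally call for asynchronous techniques.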

Asynchronous solutions exist to allow for an extended-distance, remote site location, but care must be taken to ensure the remote site data consistency. For that reason, some of the technologies involve an intermediate “hop” site or volume as part of the replication process, timestamp the I/O to the remote site, or use other techniques to ensure consistency. Some examples of asynchronous remote mirroring include:  

-        IBM XRC (host-based solution, supported by EMC, HDS, and STK)

-        EMC SRDF/A

-        HDS HARC.
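One way to picture the timestamp technique mentioned above is as a cutoff: the remote site applies only those writes whose timestamps are no later than the earliest "latest timestamp" received from any source. The sketch below is a simplified illustration of that idea, not any vendor's actual algorithm. The type `Write`, the function names, and the sample data are all hypothetical, and each source is assumed to deliver its writes in timestamp order.

```python
from typing import Dict, List, Tuple

# Hypothetical record layout: (timestamp, volume, description of the update)
Write = Tuple[int, str, str]


def consistent_cutoff(received: Dict[str, List[Write]]) -> int:
    """Return the latest timestamp through which every source is complete.

    A source is only known to be complete up to its newest received
    write, so the safe cutoff is the minimum of those newest timestamps.
    A source that has sent nothing yet forces the cutoff to 0.
    """
    return min((stream[-1][0] if stream else 0)
               for stream in received.values())


def apply_consistent(received: Dict[str, List[Write]]) -> List[Write]:
    """Return the writes that are safe to apply, in timestamp order."""
    cutoff = consistent_cutoff(received)
    ready = [w for stream in received.values() for w in stream
             if w[0] <= cutoff]
    return sorted(ready)


# Two storage controllers feeding one remote site.
received = {
    "controller_A": [(1, "VOL001", "debit"), (4, "VOL001", "commit")],
    "controller_B": [(2, "VOL002", "credit")],
}

print(apply_consistent(received))
```

Here the write stamped 4 is held back: controller_B has reported nothing later than timestamp 2, so an update stamped 3 could still be in flight, and applying the later write first would leave the remote copy inconsistent.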

There are some limitations to the backup and replication solutions. For instance, they can’t correct data corruption caused by a bad transaction or user error, and they can grow quite expensive to implement and maintain. Users may have to double their storage footprint, and some solutions result in several copies of production data at multiple sites. Creating a consistent point-in-time backup usually requires an application outage. Many vendors are delivering techniques to ensure data consistency across storage devices. Several large database customers have successfully implemented some form of hardware replication in the name of DR preparedness.

The replication solutions can allow for a simplified process at the time of a disaster; the user performs some operations at the remote site to render the backup or mirror available to a processor and restarts the database applications. Processing can resume, usually within a few hours of arrival at the remote site. These solutions are attractive to a customer willing to spend the extra money for replicating data. The user can control data loss and reduce recovery time. The hardware solutions are effective tools for restoring to a point-in-time backup or restarting from a mirror in the event of a site-wide disaster.

Software Solutions

What can cause a database outage? Some outage events are planned:

-        Application database maintenance

-        Data migration

-        Schema change implementation

-        Hardware upgrades (processor, storage)

-        Operating system or DBMS maintenance

-        Disaster recovery preparation.  

Other events are unplanned (and may happen at inopportune times):

-        Site disasters (floods, power outages, storms, fire, etc.)

-        Hardware failures (disk, CPU, network, etc.)  

-        Operating system failures

-        DBMS failures

-        Operation errors

-        Batch cycle errors

-        Improper data feeds  

-        User errors  

-        Deliberate data corruption

-        Application software errors

-        Fallback from application change migrations.  

Some of these outage possibilities aren’t protected in a hardware replication environment. For instance, if a user inadvertently corrupts some data or applies bad system maintenance, a remote replication environment would also be impacted. A restore from a point-in-time backup might entail too much data loss. Remote replication is a highly effective solution for a site-wide disaster, but those are rare events. It’s more likely that a user will experience damage in a database that requires more functionality than a hardware backup or replication process provides.
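The difference between a mirror and a log-based recovery can be shown with a toy model. This is a minimal sketch under stated assumptions, not a representation of any DBMS utility: the database is a dictionary, the log is a list of hypothetical `(sequence number, key, new value)` records, and `recover_to_point` is an invented name. A mirror is modeled as a copy that receives every write, good or bad.

```python
from typing import Dict, List, Tuple

# Hypothetical log record: (sequence number, key, new value)
LogRecord = Tuple[int, str, int]


def recover_to_point(image_copy: Dict[str, int],
                     log: List[LogRecord],
                     stop_before: int) -> Dict[str, int]:
    """Rebuild the database from an image copy plus log records.

    Log records are applied in sequence order, stopping just before the
    record numbered `stop_before`.
    """
    db = dict(image_copy)  # work on a copy; leave the image copy intact
    for seq, key, value in sorted(log):
        if seq >= stop_before:
            break
        db[key] = value
    return db


image_copy = {"acct1": 100, "acct2": 200}
log = [
    (1, "acct1", 150),
    (2, "acct2", 250),
    (3, "acct1", -999),  # the bad update
]

# A mirror faithfully receives every write, including the bad one.
mirror = recover_to_point(image_copy, log, stop_before=4)
# Log-based recovery can stop just before the damage occurred.
recovered = recover_to_point(image_copy, log, stop_before=3)

print(mirror)     # {'acct1': -999, 'acct2': 250}
print(recovered)  # {'acct1': 150, 'acct2': 250}
```

Note that stopping at a point in time also discards any good updates logged after the bad one; selectively removing only the bad transaction takes more sophisticated log processing than this sketch attempts.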

To completely protect the database environment, a customer needs a variety of tools to support database copy, log processing, and recovery management activities. These tools can reduce or eliminate the downtime for both planned and unplanned events. Using the right recovery software tools, the customer can:

-        Produce a consistent database image copy with minimal outage

-        Extract the effects of a bad update transaction with no outage

-        Recover a complex application with innovative log processing

-        Manage the log environment and prepare for optimum recovery, including dropped DB2 objects  

-        Define application recovery groups and simulate or estimate recovery

-        Leverage any investment in intelligent storage for backup and recovery

-        Prepare for automated database recovery, with coordination between DBMS applications (to any point-in-time for both local and disaster recovery events)  

-        Leverage the recovery tools in daily operations (e.g., data migration, log reporting, and reduced resource consumption).

The proper database recovery tool kit will protect the application database for both local and disaster recovery, allowing for recoverability while ensuring the highest level of availability. Using hardware and software recovery tools isn’t mutually exclusive; many companies employ both technologies.  

For instance, some software backup tools can exploit the investment in hardware replication technology by automating the process or providing reports. Some companies replicate a few applications and use normal software recovery tools for the rest. The acquisition of such tools can be cost-justified based on their impact on reducing or eliminating downtime and their ability to improve resource consumption for daily operations.

In today’s high-availability, complex e-business world, it’s imperative to protect the corporate data asset. If the database is unavailable, the millions of dollars invested in IT aren’t returning a benefit to the business, and IT becomes a cost burden. Software-based recovery tools aren’t a luxury or an insurance policy; they’re a key component of strategic application database availability and support.