Operations

Recovery processes are like “the elephant in the room.” Unplanned outages have wreaked havoc on businesses, and will again, wherever the right backup and recovery procedures aren’t established and followed, yet many organizations find the subject too difficult to discuss openly.
   
An unplanned outage can damage customer satisfaction, result in lost business, and hurt a company’s reputation. Here are some examples:

• An outage recently caused the Automated Teller Machine (ATM) network of a major bank to shut down for hours, and customers couldn’t withdraw money.
• A Web services company had a power outage that brought down its Websites, resulting in a significant loss of business for the sites’ customers.
• Another large company had an outage so severe it wiped out important files. The mirrored files weren’t backed up, so the company couldn’t recover key information. As a result, the employees had to manually re-enter about 18,000 jobs into a scheduling package based on documentation, institutional memory, and whatever they could find from prior runs to rebuild the schedule.
   
Fortunately, if you address recovery head-on and develop intelligent backup strategies and recovery processes, you can reduce or eliminate many unplanned outages. You can also mitigate some of the risks associated with outages by leveraging automation.

Evaluate Planned and Unplanned Downtime

Before updating the recovery processes for your organization, it’s important to understand the impact of both unplanned and planned downtime. Planned downtime includes activities such as database maintenance (performing copies and reorgs) and schema changes. It typically doesn’t result in a recovery event unless the reason for the downtime requires a back out (e.g., a structural change is abandoned). Recovery involves restoring the last image copy, then applying log data to it to reach a valid recovery point.
   
In a well-run organization, the back-out event is planned but rarely executed. If you’re doing a system upgrade, you can prepare in case the system upgrade fails. Then you can perform a back out to where the upgrade started and mitigate the impact of the outage.
   
Unplanned downtime, by its nature, is a surprise and can be caused by hardware failures, software errors, user errors, or poor maintenance. Many failures occur after a hardware or software upgrade. The recovery process depends on the cause of the outage, so you need to perform detection and analysis to determine the cause and scope of the failure. A program failure, for example, could affect a single record or corrupt multiple databases. If recovery is manual, the operator may not be familiar with the recovery options.
   
How much downtime, whether planned or unplanned, is acceptable? Most IT organizations have a recovery Service Level Agreement (SLA) or Recovery Time Objective (RTO) measured in hours for unplanned outages—two to six hours is common, depending on the application and its impact on the business. How long your systems are down depends on what caused the outage and what’s required to fix it. During a major outage, one enterprise company had to keep support staff on the phone all night for two nights; certain applications were down the whole time. The company also discovered months later they had data integrity problems in their databases because of the way they performed the recovery.

Get Smart

Smart backup strategies can mean the difference between a slow and a rapid recovery. They start with recovery SLAs: an RTO must be established, and it can differ by database or application, depending on business needs.

Making copies consumes many CPU cycles. Some unplanned outages occur when copies are made offline. Many IT organizations still schedule copies the way they were set up 20 years ago. A smart copy strategy answers the question, “What is the recovery objective for this particular application or this particular database within this application?” Some databases are more important than others. If the RTO for a particular database is only 15 minutes, look at the size of the object and the volume of log data generated for it, and incorporate that information into your backup strategy. You may have to run backups several times a day. It’s important to reset your thinking on what a backup strategy requires.
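
One way to turn the RTO question into a copy schedule is sketched below. It assumes a simple model in which recovery time is the time to restore the image copy plus the time to apply all log data written since that copy. The function name and every throughput figure are hypothetical; measure your own environment before relying on numbers like these.

```python
def max_hours_between_copies(rto_minutes, object_gb, restore_gb_per_min,
                             log_gb_per_hour, log_apply_gb_per_min):
    """Longest gap between image copies that still fits the RTO.

    Assumed model: recovery time = restore of the image copy
    + apply of all log data written since that copy.
    """
    restore_minutes = object_gb / restore_gb_per_min
    log_budget_minutes = rto_minutes - restore_minutes
    if log_budget_minutes <= 0:
        return 0.0  # the restore alone already exceeds the RTO
    max_log_gb = log_budget_minutes * log_apply_gb_per_min
    return max_log_gb / log_gb_per_hour


# Hypothetical figures: 15-minute RTO, 50 GB object restored at 10 GB/min,
# 4 GB of log written per hour, log applied at 2 GB/min.
hours = max_hours_between_copies(15, 50, 10, 4, 2)
print(f"Copy at least every {hours:.1f} hours")
```

With these made-up figures the object would need a copy roughly every five hours, which is consistent with running backups several times a day.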

Consider Performing Incremental Image Copies

It’s also smart to do the minimal backup required while maintaining recovery capabilities. This entails performing incremental image copying, in which you copy only the blocks or pages that have changed instead of the entire database. This approach can increase recovery speed. However, an incremental image copy reads the database less efficiently than a full copy because it must analyze each page or block to decide whether it needs to be copied. Generally, if more than 10 percent of your database is changing, it’s faster and less expensive to take a full copy.
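
The 10 percent rule of thumb can be expressed as a small decision function. This is a minimal sketch; the function name and the page counts are illustrative, and the threshold is a parameter so it can be tuned for a given database.

```python
FULL_COPY_THRESHOLD = 0.10  # rule of thumb from the text: 10 percent changed


def choose_copy_type(changed_pages, total_pages,
                     threshold=FULL_COPY_THRESHOLD):
    """Return "full" when the changed fraction exceeds the threshold,
    otherwise "incremental"."""
    if total_pages <= 0:
        raise ValueError("total_pages must be positive")
    changed_fraction = changed_pages / total_pages
    return "full" if changed_fraction > threshold else "incremental"


# Hypothetical page counts for two databases.
print(choose_copy_type(50, 1000))   # 5 percent changed
print(choose_copy_type(150, 1000))  # 15 percent changed
```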
   
You may also decide to run full copies weekly and make incremental copies at other times. If enough data has changed, you can always revert to making full copies more frequently. Some tools can perform the function automatically and dynamically adjust backups based on business needs at various times to meet seasonal demands.
   
You can also make copies while the database is online, which eliminates the outage that an offline copy requires. As long as log data is applied to the online copy to a consistent recovery point, data integrity is protected. Doing online copies may make it more attractive to copy multiple times a day.

Make Copies to Disk or Tape

Many companies keep at least the most current copy on disk. A copy can be made to tape for archival or Disaster Recovery (DR) without impacting the disk copy. This action can reduce recovery time significantly by eliminating tape mount time and allowing for more parallel recoveries.

Smart recovery processes include ensuring that plenty of log data is available on disk. For DB2, this means having large “active” logs. For IMS, it means having large Online Log Data Sets (OLDS). These files should be sized so that any recovery event can retrieve all the log data from disk that has been created since the most recent image copy.
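
The sizing guideline above amounts to simple arithmetic: log volume per hour, multiplied by the longest interval between image copies, plus some headroom. The sketch below assumes a steady log rate and uses a hypothetical safety factor; it is not a DB2 or IMS formula, and real log volume usually has peaks that should be measured.

```python
def required_log_gb(log_gb_per_hour, hours_between_copies, safety_factor=1.5):
    """Disk log capacity needed to hold all log data written since the
    most recent image copy, with headroom for peaks (assumed factor)."""
    return log_gb_per_hour * hours_between_copies * safety_factor


def logs_cover_recovery(disk_log_gb, log_gb_per_hour, hours_between_copies):
    """True if the on-disk logs can satisfy a recovery without going to
    archived log data."""
    return disk_log_gb >= required_log_gb(log_gb_per_hour, hours_between_copies)


# Hypothetical figures: 4 GB of log per hour, image copies every 24 hours.
print(required_log_gb(4, 24))           # capacity target in GB
print(logs_cover_recovery(100, 4, 24))  # is 100 GB of disk log enough?
```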
   
Disk-based copies and logs enable faster recovery, as well as parallel recoveries. If you’re doing an application recovery to a certain point in time, you’re probably recovering dozens of databases and hundreds of indexes, all to the same point. Performing those recoveries in parallel can reduce the overall outage. This capability depends on plenty of log data being available on disk.
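
The idea of parallel recovery can be illustrated with a worker pool that drives several recoveries to the same point at once. In this sketch `recover_object` is only a placeholder for whatever utility or job actually performs the recovery; the object names, the recovery point, and the degree of parallelism are all hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def recover_object(name, recovery_point):
    # Placeholder: a real job would restore the disk copy of the object
    # and apply log data up to recovery_point.
    return f"{name} recovered to {recovery_point}"


def recover_in_parallel(object_names, recovery_point, max_parallel=4):
    """Recover every object to the same point, several at a time.

    Results come back in the same order as object_names.
    """
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        return list(pool.map(
            lambda name: recover_object(name, recovery_point),
            object_names))


# Hypothetical databases and indexes belonging to one application.
for line in recover_in_parallel(["CUSTDB", "ORDERDB", "CUSTIX1"],
                                "2024-01-01T12:00"):
    print(line)
```

Tape-based copies limit this approach because each parallel stream needs its own mount; disk copies and disk-resident logs remove that constraint.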

Following Best Practices

DR is a special-case recovery event. DR plans typically are tested with some frequency, and the processes are documented to reduce recovery time. Local recoveries are typically caused by application program failures; however, local recovery is rarely practiced. Sometimes, an application program failure requires a recovery to a prior point in time. Smart recovery processes can identify all the application objects impacted by the failure and recover only those databases that require it. If a database hasn’t been updated since the specified recovery point, there’s no reason to recover it; it is already both physically and logically sound.
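
The selection step can be sketched as a filter over last-update timestamps. This assumes you can obtain, for each object, the time of its most recent update (for example, from log or catalog information); the dictionary layout and names here are hypothetical.

```python
from datetime import datetime


def objects_needing_recovery(last_update_by_object, recovery_point):
    """Return the objects updated after the recovery point.

    Objects with no update since that point are already sound at the
    recovery point and are left alone.
    """
    return sorted(name
                  for name, last_update in last_update_by_object.items()
                  if last_update > recovery_point)


# Hypothetical last-update times for three databases.
last_updates = {
    "CUSTDB": datetime(2024, 1, 1, 14, 30),
    "ORDERDB": datetime(2024, 1, 1, 9, 15),
    "PAYDB": datetime(2024, 1, 1, 13, 5),
}
point = datetime(2024, 1, 1, 12, 0)
print(objects_needing_recovery(last_updates, point))
```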
   
Application databases can be predefined into recovery groups that match the likely causes of an outage. For instance, a group may dynamically include all the databases on a particular volume or for a particular application. DR groups can be defined with mission-critical applications in mind.
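
A dynamic group is essentially a query over an inventory of databases rather than a fixed list. The sketch below assumes a hypothetical inventory in which each entry records the database name, its volume, and its owning application; real tools would draw this from the system catalog.

```python
from collections import defaultdict


def build_groups(catalog, key):
    """Group database names by the given attribute, e.g. "volume" or
    "application". Rebuilding the groups picks up new databases."""
    groups = defaultdict(list)
    for entry in catalog:
        groups[entry[key]].append(entry["name"])
    return dict(groups)


# Hypothetical inventory.
catalog = [
    {"name": "CUSTDB", "volume": "VOL001", "application": "ORDERS"},
    {"name": "ORDERDB", "volume": "VOL001", "application": "ORDERS"},
    {"name": "PAYDB", "volume": "VOL002", "application": "PAYROLL"},
]
print(build_groups(catalog, "volume"))       # for a volume failure
print(build_groups(catalog, "application"))  # for a program failure
```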
   
Application recovery can be practiced using production data without actually impacting production availability. This can reduce the “think time” required in an unplanned outage. Automation can reduce the time to detect, analyze, build, and execute the recovery process, ensuring all relevant databases are recovered to a consistent state.

You may have one SLA for local site recovery and another for DR. For example, a two-hour SLA for DR implies you’re doing remote replication, which is expensive. There are limitations associated with mirroring, too, depending on the location. If you require no data loss, then your mirror can be only so far away from the main location. Going beyond 180 miles, for example, can result in performance degradation in your local environment because the local environment is waiting for data to be replicated.
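
The distance limit follows from propagation delay: with no-data-loss (synchronous) mirroring, each write waits for a round trip to the remote site. The sketch below estimates only that delay, assuming light travels through fiber at roughly 200,000 km per second. That speed is an approximation, and the estimate ignores switching equipment, indirect cable routes, and protocol overhead, all of which add to the real figure.

```python
KM_PER_MILE = 1.609344
FIBER_KM_PER_SECOND = 200_000  # approximate speed of light in fiber (assumed)


def round_trip_ms(distance_miles):
    """Minimum round-trip propagation delay, in milliseconds, for a
    synchronous write to a mirror at the given distance."""
    distance_km = distance_miles * KM_PER_MILE
    return 2 * distance_km / FIBER_KM_PER_SECOND * 1000


for miles in (10, 60, 180):
    print(f"{miles:>3} miles: at least {round_trip_ms(miles):.1f} ms per write")
```

At 180 miles the floor is close to 3 ms for every mirrored write, which adds up quickly for update-intensive workloads.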

A mirror site within 180 miles of the main location, however, may not be far enough away to protect your environment from major storms. So it’s important to determine SLAs for the different locations. A local recovery SLA of two hours is reasonable.

How Can Technology Help?

Technology can help reduce or eliminate unplanned downtime. For example, you can offload copies to System z Integrated Information Processors (zIIPs) to reduce the CPU cost of backups and make the image copying process more efficient. The tools you choose should also let you exploit processor cache or intelligent storage devices.

You can improve recovery with technology in various other ways. Recoveries can be made more efficient by using tools that sort log data and merge it with copy input. This process can allow for back-out processing to a specified time. That point in time can then be used to build coordinated recoveries across DB2, IMS, and VSAM applications, which accelerates recovery.
   
You can also make recoveries more efficient by identifying the objects that have changed since the specified recovery point and recovering only those objects. Recovery can be avoided altogether if the application program failure impacts only certain records. If the updates were logged, the log records can be used to restore the data to its original state without requiring a recovery.
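
Log-based back out can be pictured as replaying the failing program's log records in reverse and putting each before-image back. The sketch below uses a hypothetical in-memory log format (program, key, before, after); it does not reflect actual DB2 or IMS log record layouts, and it assumes no other program updated the same records afterward.

```python
def back_out(records, log, bad_program):
    """Undo the updates made by bad_program, newest first.

    A before-image of None means the record was inserted by the program,
    so backing out removes it.
    """
    for entry in reversed(log):
        if entry["program"] != bad_program:
            continue
        if entry["before"] is None:
            records.pop(entry["key"], None)
        else:
            records[entry["key"]] = entry["before"]
    return records


# Hypothetical data after a faulty job (BADJOB) has run.
records = {"A": 150, "B": 80, "C": 10}
log = [
    {"program": "GOODJOB", "key": "A", "before": 100, "after": 120},
    {"program": "BADJOB", "key": "A", "before": 120, "after": 150},
    {"program": "BADJOB", "key": "B", "before": 50, "after": 75},
    {"program": "BADJOB", "key": "C", "before": None, "after": 10},
    {"program": "BADJOB", "key": "B", "before": 75, "after": 80},
]
print(back_out(records, log, "BADJOB"))
```

Only the records touched by the faulty job are changed; the update made by the other program is preserved, and nothing has to be restored from an image copy.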
   
A flexible recovery process enables searching for or specifying a recovery point and ensuring data consistency after a recovery. The recovery solution could determine that only a subset of the database objects actually requires recovery, eliminating the downtime for unnecessary recoveries. Some applications have components running in several database management systems, such as IMS and DB2. A point-in-time recovery for one side of the application may require a point-in-time recovery for the other, too. A flexible recovery process allows for consistent coordinated recovery to any point in time for all the DB2, IMS, and VSAM components of an application.
   
A user may inadvertently update more data than intended. When only a subset of the data is affected, a full recovery isn’t required or even desired. A flexible recovery process would allow for a back out of only the affected data, leaving the rest of the application available for update.

Summary

Backups should be scheduled with the RTO in mind. The cost and impact of backups can be reduced or eliminated with the right tools. Innovative recovery techniques and automation reduce the impact of unplanned outages while ensuring data consistency after recovery.
   
DR is a special-case recovery. It’s more likely, however, that an unplanned outage will affect the production database but not result in a declaration of disaster. Local recovery, therefore, should be a top priority for IT organizations.

Automation and smart processes can help reduce the impact of unplanned outages. The technology used should help you recover only what you need. Conduct a business impact analysis to identify strategic applications and databases. Focus on those databases and applications, determine what the recovery SLA should be, then develop the backup strategy to support it.