IT Management

Business Continuity (BC) is an integrated, enterprise-wide process that includes all activities, both internal and external to IT, that a business must perform to mitigate the impact of planned and unplanned downtime. This entails preparing for, responding to and recovering from a system outage that adversely impacts business operations. The goal of BC is to ensure the availability of information required to conduct essential business operations.

Information Availability (IA) refers to the ability of an IT infrastructure to function according to business expectations during its specified period of operation. When discussing IA, we need to make certain:

• Information is accessible at the right place to the right user (accessibility)
• Information is reliable and correct in all aspects (reliability)
• Information is accessible at the exact moment, or during the time window, in which it is required (timeliness).

Various planned and unplanned incidents result in information unavailability. Planned outages include installations, hardware maintenance, software upgrades and patches, restores and facility upgrades. Unplanned outages include failures caused by human error, database corruption and component failure. Other incidents that may cause information unavailability are natural and man-made disasters such as floods, hurricanes, fires, earthquakes and terrorist incidents. The majority of outages are planned; historical statistics show that unforeseen disasters account for less than 1 percent of information unavailability.

Information unavailability (downtime) results in loss of productivity and revenue, poor financial performance and damage to a business’s reputation. The Business Impact (BI) of downtime is the sum of all losses sustained as a result of a given disruption. One common metric used to measure BI is the average cost of downtime per hour. This is often used as a key estimate in determining the appropriate BC solution for an enterprise. Figure 1 shows the average cost of downtime per hour for several key industries.
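
The arithmetic behind these two measures can be sketched in a few lines of Python. This is only an illustration: the loss categories, the dollar amounts and the function names are hypothetical and do not come from Figure 1 or any industry survey.

```python
def business_impact(losses: dict[str, float]) -> float:
    """Business impact: the sum of all losses sustained in one disruption."""
    return sum(losses.values())


def downtime_cost(cost_per_hour: float, outage_hours: float) -> float:
    """Estimate the cost of an outage from an average hourly cost of downtime."""
    if cost_per_hour < 0 or outage_hours < 0:
        raise ValueError("cost_per_hour and outage_hours must be non-negative")
    return cost_per_hour * outage_hours


# Hypothetical losses from a single four-hour outage (illustrative values only).
example_losses = {
    "lost_revenue": 120_000.0,
    "lost_productivity": 45_000.0,
    "recovery_expenses": 15_000.0,
}
outage_hours = 4.0

total_impact = business_impact(example_losses)
average_cost_per_hour = total_impact / outage_hours
print(f"Business impact: {total_impact:,.0f}")
print(f"Average cost of downtime per hour: {average_cost_per_hour:,.0f}")
```

Averaging the hourly figure over many past outages gives the kind of per-hour estimate that can then be multiplied by an expected outage duration with `downtime_cost`.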

How Do We Measure IA?

IA relies on the availability of both physical and virtual components of a data center; failure of these components may disrupt IA. A failure is defined as the termination of a component’s capability to perform a required function. The component’s capability may be restored by performing some sort of manual, corrective action; for example, a reboot, repair or replacement of the failed component(s). By repair, we mean that a component is restored to a condition that enables it to perform its required function(s). Part of the BC planning process should include a proactive risk analysis that considers the component failure rate and average repair time:  

• Mean Time Between Failures (MTBF) is the average time a system or component is available to perform its normal operations between failures. It’s a measure of how reliable a hardware product, system or component is. For most components, the measure is typically in thousands or even tens of thousands of hours between failures.
• Mean Time To Repair (MTTR) is a basic measure of the maintainability of repairable items. It’s the average time required to repair a failed component. Calculations of MTTR assume that the fault responsible for the failure is correctly identified, and the required spare parts and personnel are available. 
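
As a rough illustration of how the two measures might be derived from an incident log, the Python sketch below averages the operating time before each failure and the time taken for each repair. The `Incident` record layout and the sample figures are assumptions made for this example, not part of any particular monitoring tool.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """One failure of a component (hypothetical record layout)."""
    hours_up_before_failure: float  # operating time since the previous repair
    hours_to_repair: float          # time taken to restore the component


def mtbf(incidents: list[Incident]) -> float:
    """Mean Time Between Failures: average operating time between failures."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return sum(i.hours_up_before_failure for i in incidents) / len(incidents)


def mttr(incidents: list[Incident]) -> float:
    """Mean Time To Repair: average time required to repair a failure."""
    if not incidents:
        raise ValueError("at least one incident is required")
    return sum(i.hours_to_repair for i in incidents) / len(incidents)


# Illustrative log of three failures for a single component.
incident_log = [
    Incident(hours_up_before_failure=900.0, hours_to_repair=2.0),
    Incident(hours_up_before_failure=1100.0, hours_to_repair=4.0),
    Incident(hours_up_before_failure=1000.0, hours_to_repair=3.0),
]

print(f"MTBF: {mtbf(incident_log):.1f} hours")
print(f"MTTR: {mttr(incident_log):.1f} hours")
```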

We can formally define IA as the period during which a system is in a condition to perform its intended function upon demand. IA can be expressed in terms of system uptime and system downtime, and measured as the amount or percentage of system uptime:

IA = system uptime / (system uptime + system downtime)

where system uptime is the period during which the system is in an accessible state, and system downtime is the period during which it isn’t. In terms of MTBF and MTTR, IA can be expressed as:

IA = MTBF / (MTBF + MTTR)
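
Both ways of expressing IA reduce to simple ratios. The following is a minimal Python sketch, assuming the commonly used relation IA = MTBF / (MTBF + MTTR) for the second form. The function names and sample figures are illustrative only.

```python
def availability_from_uptime(uptime_hours: float, downtime_hours: float) -> float:
    """IA as the fraction of the measured period in which the system was accessible."""
    total = uptime_hours + downtime_hours
    if uptime_hours < 0 or downtime_hours < 0 or total == 0:
        raise ValueError("hours must be non-negative and sum to more than zero")
    return uptime_hours / total


def availability_from_mtbf(mtbf_hours: float, mttr_hours: float) -> float:
    """IA from reliability (MTBF) and maintainability (MTTR) figures."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("mtbf_hours must be positive and mttr_hours non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


# Illustrative figures: a component that fails on average every 999 hours
# and takes 1 hour to repair is available 99.9 percent of the time.
ia_component = availability_from_mtbf(mtbf_hours=999.0, mttr_hours=1.0)

# A system that was down for 2 hours in a 1,000-hour period.
ia_system = availability_from_uptime(uptime_hours=998.0, downtime_hours=2.0)

print(f"Component availability: {ia_component:.3%}")
print(f"System availability: {ia_system:.3%}")
```

The two functions agree whenever the measured uptime and downtime reflect the same failure and repair averages; the MTBF form is useful for planning, before any outage history exists.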