Aug 9 ’10
Proactive IT Systems Management: The Time Is Now
Despite the rhetoric, it appears not many shops have completed the journey. Today, critical business services rely on mainframe systems and applications being available and performing non-stop to Service Level Agreements (SLAs). The failure of one of these crucial business services can have catastrophic effects on the enterprise—from decreased profit to outright cessation of business. Given these stakes, the requirement for proactive systems management transcends IT operating methodologies to become a business imperative. This article explores proactive systems management and its six stages of maturity; it also describes the processes and tooling required to implement a proactive operating rhythm. It concludes with some best practice “first steps” and the potential benefits of implementing proactive systems management.
Proactive IT Systems Management
Proactive systems management can be described as managing the service delivery performance of business applications so performance problems are identified and remediated either before they have an impact, or before the impact has an adverse business effect. It involves having in place a specific set of IT processes, tooling, and skills. In most IT organizations, getting to proactive systems management will require a change in one or more of these three components. Getting there isn’t typically a binary operation. Rather, it’s a phased effort, with the amount of change required depending on the stage, or level, of proactive systems management maturity at which the organization currently operates.
Proactive systems management might be said to have six levels:
- Level 1: Wait for user complaints of service problems and react
- Level 2: React to rule of thumb alerts, diagnose via “war-room” approaches, and instigate silo-based investigation
- Level 3: Automatically discover violations and enable more rapid remediation, eliminating most war-room convocations
- Level 4: Automatically discover violations and advise stakeholders of potential impact before remediation efforts
- Level 5: Automatically discover violations and automatically mitigate impact
- Level 6: Automatically determine future (impending) violations and automatically remediate before impact.
This synopsis of a more extensive topic gives a sense of the processes and tooling required to move from the lower maturity levels to the higher ones. On the process side, the key requirements include:
- Cross-silo, cross-platform systems management discipline. In addition to a process requirement, this may extend into organizational considerations, too. Being proactive in one silo won’t result in proactive systems management of the business services if a problem occurring in another technology the business service uses is managed in a reactive manner. Ideally, organizations that want to reap the benefits of proactive management should be committed to proactive efforts across all appropriate technologies. It’s possible to start raising the level of proactive maturity in one platform, such as the mainframe, and then extend the processes, motions, and lessons learned to other platforms.
- Business service constructs for use in implementing proactive management. Proactively managing service commitments for business services isn’t possible if there’s no understanding of the business service. Focusing proactive efforts on a CICS region, a DB2 database, or a specific Logical Partition (LPAR) won’t necessarily translate into proactive management of the business service. Without a business service construct for systems management, you may waste time in discovery and remediation efforts in one silo and even make the performance situation worse.
- Encapsulate “tribal knowledge” housed in individual technicians. Although many organizations are building run books of prescribed actions for a variety of operational situations, these may not capture the deep technical knowledge that technical experts possess. Capturing that knowledge has several benefits: 1) the “really smart people” know how to identify and resolve many problems in their sphere of knowledge, and they typically take the shortest path to service restoration; 2) their expertise, reasoning, and systems management processes may be applicable in areas outside their sphere of responsibility; 3) when their knowledge is captured via automation, it reduces the time spent on repetitive triage tasks and frees them up for higher-value activities; and 4) capturing their knowledge mitigates risk to the organization and enhances IT governance processes.
One of the biggest barriers to increasing proactive maturity lies in the tooling technicians have at their disposal for systems management. IT must recognize that yesterday’s status quo systems management may be insufficient for proactive systems management. The tools and processes implemented and used by the same technicians for 10 to 20 years may have sufficed so far, but that doesn’t mean the organization doesn’t need proactive management. IT should evaluate its current level of proactive maturity, the desired level of maturity, and then assess what’s needed to reach the desired maturity level and the associated business benefits to be gained.
Getting to proactive management requires a degree of innovation—in the processes, in the organization, and especially in the tooling used to support systems management. The following is a list of some of the key requirements for proactivity in systems management tooling:
- Solution breadth and depth. With business services executing across platforms and technology silos, systems management solutions must provide complementary breadth. Cross-platform solutions provide a common framework for management, common terminology, and common paradigms for alerting, drill-down analysis, and resolution, such as single-pane-of-glass displays.
- Constructs for business services that flow across mainframe silos (CICS, DB2, IMS, WebSphere Application Server, WebSphere MQ, etc.) and across platforms. If IT defines a process for managing cross-platform business services, there’s a need for tools that can synthesize the business constructs automatically and display triage and root cause analysis data within silos from a starting point of the business service transaction.
- Thresholds appropriate for proactive operations. Most threshold alerting is based on either rules of thumb or experiential thresholds. Both have resulted in either so many alerts that they’re ignored (and therefore have no value) or so few alerts that performance problems with real business impact aren’t raised as alerts until the business owner calls. Thresholds should be automatically determined and actively maintained and updated as business cycles and processing patterns change. Without intelligence in the threshold process, it’s difficult to raise systems management to the higher levels of proactive maturity.
- Effective, intelligent alarm management. Complex business applications must be able to alarm on a wide variety of measures and conditions from multiple technologies. Alarm management should be capable of dealing with the business service complexity without being complex itself. It also must differentiate single occurrences of exceptions from more serious multiple occurrences within specific time cycles.
- Triage and support for root cause analysis. When tooling for the first four items in this list is in place, the next step is to enable drill-down on the alert to find solutions as rapidly as possible. To accomplish this, solution-led, drill-down analysis and problem solving across multiple technologies and platforms must be provided.
- Intelligence and advice. The complexity and scope of business services and the underlying IT infrastructure interfere with the application of technicians’ experiential knowledge. Intelligence and advice in systems management solutions can deal with the complexity, while also addressing some organizations’ concerns about bridging the generation gap to maintain deep technical knowledge. Highly experienced technicians will find it difficult to achieve proactive maturity without the assistance of advisor technology. Less experienced ones will find it impossible. Intelligent advisor technology can quickly analyze large numbers of variables, identify problem sources, and recommend remedial or corrective actions.
- Predictive intelligence. To achieve the highest level of maturity, the systems management tooling must anticipate problems before they arise. This requires observing Key Performance Indicators (KPIs), charting their directions, correlating changes with other KPIs, and predicting when one of the KPIs will reach a critical point that could impact services.
- Automation. While all of these capabilities can enable more effective systems management, without automation there can be little truly proactive activity. Automation provides alerts that lead to resolutions and reduces manual errors; it should be sophisticated enough to act on minute technical indicators while wrapping that sophistication and complexity in simple, uncomplicated implementations. Automation must be able to take actions across a broad range of objects and conditions and reflect a linkage to service impact models, along with an automated feedback mechanism to those models and to other processes, such as service desks. Cross-platform business services will also demand cross-platform automation, as the operation, start-up, and shut-down of various platforms will be required as part of proactively managing the business service.
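To make the predictive intelligence requirement more concrete, the following is a minimal sketch of charting a KPI's direction and estimating when it will reach a critical point. It assumes a simple least-squares straight line through recent samples. That is an illustrative choice, not a method prescribed here, and real tooling would likely use richer models and would also correlate several KPIs, which this sketch omits. The names `fit_trend` and `time_to_breach` are hypothetical.

```python
from statistics import mean


def fit_trend(samples):
    """Fit a least-squares line through (time, value) pairs.

    Returns (slope, intercept). Assumes a linear trend, which is a
    simplification chosen for illustration only.
    """
    times = [t for t, _ in samples]
    values = [v for _, v in samples]
    t_bar, v_bar = mean(times), mean(values)
    denom = sum((t - t_bar) ** 2 for t in times)
    if denom == 0:
        raise ValueError("need samples at two or more distinct times")
    slope = sum((t - t_bar) * (v - v_bar) for t, v in samples) / denom
    return slope, v_bar - slope * t_bar


def time_to_breach(samples, critical):
    """Estimate how long after the latest sample the KPI reaches `critical`.

    Returns 0.0 if the latest value is already at or above the critical
    point, and None if the fitted trend is flat or falling.
    """
    slope, intercept = fit_trend(samples)
    last_time, last_value = samples[-1]
    if last_value >= critical:
        return 0.0
    if slope <= 0:
        return None
    breach_time = (critical - intercept) / slope
    return max(breach_time - last_time, 0.0)


# Hypothetical readings: (minute, CPU busy percent).
cpu_samples = [(0, 60.0), (5, 65.0), (10, 70.0), (15, 75.0)]
minutes_left = time_to_breach(cpu_samples, critical=90.0)
```

An estimate like `minutes_left` is what would let automation or a technician act before the service is affected, rather than after a threshold has already been crossed.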
Some First Steps
Arriving at proactive systems management will take organizational commitment for process change, organizational change, and possibly tooling changes, none of which happen quickly. As a starting point, consider implementing some of the following steps to increase the level of proactivity in your operations:
- Identify key business applications and the infrastructure pieces that support them.
- Know what conditions look like (resources used, work volume, etc.) when these key applications are meeting their SLAs.
- Set KPI thresholds based on compliant conditions that align with the applications and infrastructure.
- Use alarms to focus on thresholds that are being tripped.
- Set up pre-arranged automation for handling the conditions identified as problem situations.
- Integrate the alarming and event notification into Business Service Management (BSM) processes to access service catalogs, Configuration Management Database (CMDB)/Configuration Management System (CMS), event management, and service desks so the mainframe fully participates in any BSM processes.
- Evaluate the increased effectiveness of systems management efforts.
The ease with which your organization can take these steps will depend greatly on the nature of your systems management processes and tooling.
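As a rough illustration of the middle steps (characterizing compliant conditions, setting thresholds from them, alarming, and pre-arranged automation), here is a small sketch. Everything in it is an assumption made for the example: the threshold rule (mean plus three standard deviations of readings taken while the SLA was met), the window and repeat limit, and the names `baseline_threshold`, `Alarm`, and `handle`. Commercial systems management tooling will differ.

```python
from collections import deque
from statistics import mean, stdev


def baseline_threshold(compliant_samples, k=3.0):
    """Derive a threshold from readings taken while the SLA was being met.

    Uses mean + k sample standard deviations, an assumed rule for this
    sketch. Re-run it as business cycles and processing patterns change.
    """
    return mean(compliant_samples) + k * stdev(compliant_samples)


class Alarm:
    """Grade a KPI reading as 'ok', 'warning', or 'critical'.

    A single exception is a 'warning'; `repeat_limit` or more exceptions
    within `window` time units are 'critical'.
    """

    def __init__(self, threshold, window, repeat_limit):
        self.threshold = threshold
        self.window = window
        self.repeat_limit = repeat_limit
        self._hits = deque()  # times of recent exceptions

    def observe(self, time, value):
        # Forget exceptions that have aged out of the window.
        while self._hits and time - self._hits[0] > self.window:
            self._hits.popleft()
        if value <= self.threshold:
            return "ok"
        self._hits.append(time)
        if len(self._hits) >= self.repeat_limit:
            return "critical"
        return "warning"


def handle(severity, actions):
    """Run the pre-arranged action registered for a severity, if any."""
    action = actions.get(severity)
    return action() if action else None


# Hypothetical response times (ms) recorded while the SLA was met.
threshold = baseline_threshold([100, 110, 90, 105, 95])
alarm = Alarm(threshold, window=300, repeat_limit=3)
actions = {"critical": lambda: "open service desk ticket"}
result = handle(alarm.observe(0, 130), actions)
```

Keeping the action table separate from the alarm logic is one way to let technicians record their preferred responses without touching the detection code.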
When IT can meld the processes and tooling to increase proactive management maturity, payoffs to the enterprise can be significant. These are benefits that have real value to the business and to the management of IT itself. They include:
- Reducing the business impact of infrastructure and application issues
- Lowering the cost of the mainframe
- Reducing manual errors
- Minimizing firefighting and war-room time for IT staff, giving them more time to increase IT value to the business
- Creating a transition to a new generation of technicians
- Mitigating risk to both the business and IT.