IT Management

Mainframes throw off a lot of data that’s useful in monitoring performance and diagnosing problems. A natural next step for most businesses has been to try to use this data to be more proactive in managing mainframe performance. However, going from reactive to proactive has proved elusive, largely because of the lack of good mainframe analytics.

Fortunately, the road to mainframe analytics—a rocky one to date—is about to get smoother. New, more-predictive analytical techniques are coming to mainframe teams.

The big challenge for analytics is defining what constitutes “normal” for your mainframe environment. Just as the factory settings on your car aren’t tuned to the way you actually drive, the default settings that come with traditional mainframe tools aren’t tuned to the way your environment actually behaves, which makes them a weak foundation for analytics.

Normal for mainframes changes all the time based on business cycles and events. Keeping up with normal means analyzing a lot of data and performing many calculations to identify deviations from the norm (often a manual process carried out by spreadsheet-wielding IT specialists). Defining and maintaining optimal performance thresholds for the mainframe is a never-ending process that’s beyond many businesses.

Having more-predictive analytics readily available is critical for mainframes because the biggest, most expensive problems are the ones you don’t expect. Consider a runaway transaction that consumes extra CPU and causes a lock in a data-sharing environment. Not only is the transaction itself a problem, it’s probably delaying the entire application. What if you understood normal so well that you could detect these unpredictable events and automate their correction?

In the past, a fairly standard way of doing analytics was to use capacity metrics. For example, you set your CPU capacity level at 95 percent because that’s where your CPU started to thrash. Then you watched for traffic lights: Red meant a problem, yellow meant you were about to have a problem. If you had a recurring problem—for example, a long-running SQL query that regularly locked up DB2—you diagnosed it and then fixed it using events, alerts, and automation to ensure it didn’t happen again.
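As a rough sketch (not drawn from any particular monitoring product), a traffic-light check against a static capacity threshold might look like the following Python; the 95 and 85 percent levels and the sample values are assumptions for illustration.

    # Hypothetical traffic-light alerting on a fixed capacity threshold.
    # Threshold values and sample data are illustrative only.
    RED_THRESHOLD = 95.0     # CPU percent where the system starts to thrash
    YELLOW_THRESHOLD = 85.0  # early-warning level chosen by the operations team

    def traffic_light(cpu_percent: float) -> str:
        """Map a single CPU-utilization sample to a red/yellow/green status."""
        if cpu_percent >= RED_THRESHOLD:
            return "red"      # problem: act now
        if cpu_percent >= YELLOW_THRESHOLD:
            return "yellow"   # about to have a problem
        return "green"        # nothing to do

    for sample in (72.0, 88.5, 96.3):
        print(sample, traffic_light(sample))  # green, yellow, red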

Then you became more sophisticated by overlaying daytime and nighttime shifts. Around 6 p.m., you switched to a nighttime threshold because that was when utilization went down. This was better, but still not all that useful in an environment of 24x7 business cycles and unpredictable transaction workloads. It’s also a bottom-up approach that starts with resource capacity rather than with what the business really cares about: how the applications are performing.
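A minimal sketch of that shift-based refinement, assuming a 6 a.m. to 6 p.m. day shift and made-up threshold values, could look like this:

    # Hypothetical shift-based thresholds: a lower CPU limit applies after
    # 6 p.m. because utilization normally drops at night. Values are illustrative.
    from datetime import time

    DAY_THRESHOLD = 95.0    # percent, in effect roughly 6 a.m. to 6 p.m.
    NIGHT_THRESHOLD = 60.0  # percent, in effect overnight when load is lighter

    def threshold_for(sample_time: time) -> float:
        """Return the CPU threshold in effect at the given time of day."""
        if time(6, 0) <= sample_time < time(18, 0):
            return DAY_THRESHOLD
        return NIGHT_THRESHOLD

    print(threshold_for(time(14, 30)))  # 95.0 during the day shift
    print(threshold_for(time(23, 15)))  # 60.0 on the night shift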

The limitations of capacity-type metrics gave rise to core metrics. A core-metrics approach splits business applications into individual pieces, looks at their day-to-day transaction rates, and creates Key Performance Indicators (KPIs) based on the utilization levels of the most critical components.
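To make the idea concrete, here is a hypothetical sketch of a core-metrics KPI check; the component names, transaction rates, capacities, and KPI targets are all invented for illustration.

    # Hypothetical core-metrics approach: track per-component transaction rates
    # and flag any critical component whose utilization KPI exceeds its target.
    components = {
        # component: (transactions_per_sec, capacity_tps, kpi_target_utilization)
        "order-entry CICS region": (420.0, 500.0, 0.80),
        "DB2 data-sharing group":  (310.0, 450.0, 0.85),
        "batch settlement job":    (55.0,  200.0, 0.90),
    }

    for name, (tps, capacity_tps, target) in components.items():
        utilization = tps / capacity_tps          # KPI: share of capacity in use
        status = "BREACH" if utilization > target else "ok"
        print(f"{name}: {utilization:.0%} of capacity ({status})")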

Meanwhile, tools such as persistence checking helped sharpen utilization monitoring by measuring not only when utilization hit a threshold but also how long it stayed there. In other words: Is it a blip or a trend? Resetting thresholds to match actual trends helped reduce false alarms. But both these methods share a drawback: They work after the fact, on day-old or week-old data.
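A small sketch of persistence checking, with an assumed requirement of three consecutive over-threshold intervals before alarming, might look like this:

    # Hypothetical persistence check: raise an alarm only when a metric stays
    # above its threshold for several consecutive intervals, filtering out blips.
    def persistent_breach(samples, threshold: float, required_intervals: int) -> bool:
        """Return True if the metric exceeded the threshold for
        required_intervals consecutive samples (a trend, not a blip)."""
        streak = 0
        for value in samples:
            streak = streak + 1 if value >= threshold else 0
            if streak >= required_intervals:
                return True
        return False

    cpu_samples = [91, 97, 96, 82, 96, 97, 98, 99]   # one sample per interval
    print(persistent_breach(cpu_samples, threshold=95, required_intervals=3))  # True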

It’s now technically possible and cost-effective to collect, store, and perform real-time analytics on mainframe monitoring data. Much like the “Big Data” applications emerging in healthcare, science, and other fields, intelligent analytical processing provides the best and most accurate picture of normal to date. The system collects and analyzes monitoring data, learns system and application behaviors, and recommends ideal thresholds for critical metrics throughout business cycles.
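One way such a system could derive “normal” is sketched below, assuming samples grouped by hour of the week and a recommended threshold of the mean plus three standard deviations per slot; the grouping and the three-sigma band are assumptions for illustration, not a description of any specific product.

    # Hypothetical baseline learning: group historical samples by hour of the
    # week and recommend a threshold of mean + 3 standard deviations per slot.
    from collections import defaultdict
    from statistics import mean, stdev

    def recommend_thresholds(history):
        """history: iterable of (weekday, hour, value) samples."""
        by_slot = defaultdict(list)
        for weekday, hour, value in history:
            by_slot[(weekday, hour)].append(value)
        return {
            slot: mean(values) + 3 * stdev(values)
            for slot, values in by_slot.items()
            if len(values) > 1
        }

    # Example: Monday 9 a.m. CPU samples from several past weeks
    history = [(0, 9, v) for v in (61.0, 64.5, 58.9, 66.2, 63.1)]
    print(recommend_thresholds(history))   # {(0, 9): ~71}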

The system can detect deviations at system speed and say, “That’s not normal for this environment.” Intelligent alarms provide consistency by going straight to the problem and triggering the fix. Mainframe management becomes autonomic, or self-managing. You can’t get much more proactive than that.
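Continuing the sketch above, acting on a learned baseline could look something like the following; the automation hook is a stand-in for whatever event-driven automation a site already runs, not a real API.

    # Hypothetical real-time check: compare a live sample against the learned
    # threshold for its time slot and, on breach, hand off to automation.
    def automated_fix(metric: str, value: float) -> None:
        # Placeholder: in practice this would raise an event for existing
        # automation (cancel the runaway transaction, release the lock, etc.).
        print(f"automation triggered: {metric} at {value} is not normal here")

    def check_sample(metric: str, value: float, slot, thresholds) -> None:
        limit = thresholds.get(slot)
        if limit is not None and value > limit:
            automated_fix(metric, value)

    check_sample("CPU%", 88.0, (0, 9), {(0, 9): 72.0})  # fires the automation hook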

Intelligent analytical processing builds on the steadily improving mainframe monitoring technologies of the past few decades, including dynamically generated, codeless automation processes (1989); intelligent batch optimization (1999); dynamic data extraction and display of consolidated monitoring data to automate first-level support (2001); intelligent mainframe, event-driven management (2002); and persistence checking (2008).

Being proactive about mainframe performance depends on being increasingly scientific about how you use analytics to define normal for your business—and then managing in real time based on that norm. The next generation of mainframe analytics, intelligent analytical processing, will make this possible.