Feb 17 ’10

How End-User Information Can Enable More Proactive Service-Level Management

by Denise P. Kalm 

Exception-based monitoring is one way to watch for problems that may affect desktop service levels, but doing it effectively requires knowing what an exception is. You also may want to be even more proactive about meeting user expectations. Experienced systems and network management teams proactively optimize service levels by better understanding business users and applying that understanding to their systems monitoring strategies. This article explains how business-centric application performance management can drive business success by providing a holistic view of a business transaction.

The Reactive Manager

As business demands continue to grow and IT resources remain constrained, every IT organization must re-examine the maturity of its systems and network management processes. Often, increased responsibility and reduced staffing have prompted a return to reactive management. Such firefighting usually results in a frustrated IT staff, an unhappy user community, and poor business decision-making. Reactive IT organizations share several common characteristics.

This type of management relies heavily on the expertise of individual IT staff members and assumes their knowledge covers a wide variety of technologies. Enterprise technicians have acquired rich expertise and aren't easily replaced because the learning curve is too steep. They can provide important context that monitoring tools alone can't.

The effectiveness of service-level management depends on the expertise of the technician. This makes the organization highly vulnerable to the technician’s availability and also makes it difficult to hand over management tasks to younger or less experienced staff as seasoned veterans are promoted or retire. 

Service-Level Problems

In this era of cross-platform, multi-tier applications—with users continuously online—how can technicians actually determine when IT is delivering good service? If you're the UNIX administrator and the server farm appears to be operating well, does this mean user satisfaction will be high? Or could problems on the network exacerbate apparently minor issues with the servers?

These questions underscore the fact that "siloed" management (i.e., isolated as if working in a silo) remains a problem for IT organizations. Various resources along a transaction path may each appear to be delivering good service, but the sum of the response times may be unacceptable. So, unless IT has tools that provide end-to-end response time data in addition to component response times, IT may believe it's delivering the required service while users disagree. Siloed management makes it hard to see things from the user's perspective.

Siloed management doesn’t just erode the efficiency of IT organizations, it also undermines service levels to the user. A set of application servers, for example, may be down for only a few minutes here and there. But if the Web servers, network devices, back-end databases, and other components that support a complex, end-to-end transaction also are occasionally down, the cumulative result may be unacceptably low service availability. In an overly siloed management environment, this kind of low availability can become commonplace because each siloed management team focuses only on a specific infrastructure component, rather than the actual delivery of the end-to-end service.
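The arithmetic behind this is worth making concrete. In a minimal sketch (the component names and availability figures below are hypothetical), the availability of a serial transaction path is the product of the component availabilities, so components that each look healthy in isolation can still fall short end to end:

```python
# Illustrative sketch: end-to-end availability of a serial transaction path.
# Component names and availability figures are hypothetical.
components = {
    "web_server": 0.999,
    "app_server": 0.998,
    "network": 0.999,
    "database": 0.997,
}

# A transaction succeeds only if every component on its path is up.
end_to_end = 1.0
for name, availability in components.items():
    end_to_end *= availability

print(f"End-to-end availability: {end_to_end:.4f}")  # ~0.9930
```

Here no single silo dips below 99.7 percent availability, yet the end-to-end transaction is available only about 99.3 percent of the time, and that lower figure is what the user actually experiences.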

So, satisfactory performance and availability for any individual silo may have nothing to do with a positive perception by the users. Determining what is "good enough" for the user requires knowing what they're trying to do with the application or service and the actual impact of slow performance or availability issues. The technical parameters on the monitors don't tell us. Much of what we use today to gauge customer experience comes from service levels, application response time (where measurable), and calls to the help desk. But that doesn't necessarily translate into "alerts" or other meaningful, actionable information in any given silo.

Setting thresholds for those parameters is another problem. With static thresholds, we have two options: set them so high that when the alerts occur, the problems are already severe, or set them lower and get a lot of “false positives.”
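A tiny sketch makes the tradeoff concrete (the utilization samples and threshold values are hypothetical):

```python
# Hypothetical CPU-utilization samples (percent); the spike at the end
# is the real incident we want to catch.
samples = [62, 65, 70, 68, 72, 74, 71, 69, 96]

high_threshold = 95   # fires only when the problem is already severe
low_threshold = 70    # fires early, but on routine peaks too

high_alerts = [s for s in samples if s > high_threshold]
low_alerts = [s for s in samples if s > low_threshold]

print(f"High threshold fired {len(high_alerts)} time(s)")  # once, and too late
print(f"Low threshold fired {len(low_alerts)} time(s)")    # four times, mostly noise
```

Neither static setting is satisfactory: the high threshold catches only the severe spike, while the low one buries it among routine peaks.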

Then there are the idiosyncrasies of different platforms and devices. Mainframes, for example, can run at 100 percent capacity; other servers peak at much lower values. But there also can be times when a 100 percent capacity metric on a mainframe indicates a problem. Again, this is an example where the personal expertise of a single IT staff member can be critical to maintaining a healthy computing environment.

What about the valleys, where underutilized or nonperforming resources might be the issue? Most thresholds are set based on peak limit values. But, often, extremely low utilization may point to other problems. If a busy credit card authorization application suddenly plummets in volume and resource demand, for example, this issue may result from a network or middleware problem feeding the application. Depending on the nature of the incident, thresholds may not bring the problem to anyone’s attention until significant revenues are lost or customer goodwill is harmed.
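One way to catch the valleys is to alert on a floor derived from recent history rather than on peaks alone. The sketch below is illustrative; the volume figures are hypothetical:

```python
import statistics

# Hypothetical per-minute authorization volumes; the final value is the
# sudden drop described above (e.g., a failed network or middleware feed).
volumes = [480, 510, 495, 505, 520, 490, 35]

# Establish the recent norm from everything except the current reading.
mean = statistics.mean(volumes[:-1])
stdev = statistics.stdev(volumes[:-1])

current = volumes[-1]
# A peak-only threshold (e.g., "alert above 600/min") stays silent here.
peak_alert = current > 600
# A floor check against the recent norm catches the outage-like drop.
floor_alert = current < mean - 3 * stdev

print(f"Peak alert: {peak_alert}, floor alert: {floor_alert}")
```

The peak-based check never fires even though the application has effectively stopped doing business; the history-based floor does.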

This challenge increases with Service-Oriented Architecture (SOA), where dynamic pathing becomes the norm. Since transaction paths aren't static, correlating resource problems to service levels becomes even more problematic. There's also the matter of process. Performance problem handling is typically centralized in operational groups in a Network Operations Center (NOC). The operator or notification software informs the appropriate systems programmers or network administrators that a problem requires their attention, and the process of human intervention to maintain optimal performance levels begins.

Once an alert is observed, the flagged resources indicate its severity. The staff navigates through displays to observe performance statistics and data. Sometimes the problem is still occurring and the data reflects that, but often, performance may have already changed. The issue may be a performance spike that can be analyzed later or a warning of performance changes that will lead to more severe problems if they aren't adequately addressed. If the problem doesn't fall into either of these categories, analysis can lead the user through several more displays. After analyzing the information, commands can be issued to cancel a resource, bring additional resources online, or perform some other action to bring the performance of network and system resources back to acceptable levels.

This usually occurs without any actual knowledge of the importance of the problem to the business. Since information isn't available in the displays to tell the operators whether the problem affects a strategic application or part of a batch report for a peer, management teams can't determine the issue's relative importance. So IT can wind up spending inappropriate amounts of time on essentially trivial issues because it lacks an effective means of prioritizing problems that occur simultaneously.

Rules of thumb (ROTs) can help you assess relative importance. ROTs are rules of system and network behavior learned over the years, where you often run into "it depends" situations. They come into play when you need to redefine thresholds based on a meaningful measure of what good performance looks like. For example, mainframes can mostly run at 100 percent busy, while other servers peak at much lower values. You should thoroughly understand the transactions and processes IT is monitoring.

Gathering this information requires more effective communication and understanding between the line of business and IT than typically occurs. You can't monitor what you don't measure, and you can't measure what you don't understand. Traditionally, IT has been managed separately from the business. In some companies, IT is outsourced or is a separate business entity. The result is that users often think IT doesn't understand the business and isn't supporting it properly. More often than not, they're right.

Break Down the Silos

The solution is to break down the silos. This first requires understanding the architecture for the applications—which determines who is on the technical team for those applications. Working relationships on all platforms can then be set up between systems, network, database, and application managers.

The next step is to learn the business. IT needs to understand how the company makes money and how the systems it supports contribute to that revenue. We’re all familiar with the cost of IT, but we also should know what each transaction costs relative to the business. When IT understands how the business makes money, it can quickly start prioritizing applications such as credit card authorizations or stock transactions ahead of less strategic applications that nevertheless have strong internal advocates. “Loved ones” may have little to do with profitability; they’re more likely the favorite application of an internal fiefdom. Business and IT are natural allies even though language and organizational barriers can make this challenging.

Understanding the business also means IT staff knows what each application means in terms of a business function. IT staff members' understanding of applications often goes no further than the name of a batch stream, a CICS region, or a UNIX process. This simply isn't a viable condition. IT can't effectively manage capacity if it doesn't understand how real-world events can potentially impact its systems and its network.

If possible, every IT organization should map out business transactions to see how they traverse servers and networks. Network and systems monitoring tools should be able to help crystallize these definitions in the way monitoring data is viewed. This makes it easier to understand the impact of a problem and helps translate a user’s concern to related factors in the underlying system. Understanding that CICSREGA is called by SUNSRVRB and accesses a DB2 database makes it easier to solve problems when both the technician and user know that this arcane technology description is actually a loan application system.
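Such a mapping can start as something very simple. The sketch below is hypothetical in structure; the component names echo the example above, and the database name is an invented placeholder:

```python
# Minimal sketch of a business-transaction map: each business application
# is tied to the ordered list of technical components its transactions
# traverse. "DB2_LOANDB" is a hypothetical database identifier.
transaction_map = {
    "loan_application": {
        "business_name": "Loan Application System",
        "path": ["SUNSRVRB", "CICSREGA", "DB2_LOANDB"],
    },
}

def business_impact(component: str) -> list[str]:
    """Return the business applications whose path includes a component."""
    return [
        entry["business_name"]
        for entry in transaction_map.values()
        if component in entry["path"]
    ]

print(business_impact("CICSREGA"))  # ['Loan Application System']
```

With even this much in place, an alert on CICSREGA can immediately be translated into "the loan application system is at risk" rather than an arcane region name.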

Not every transaction or process is created equal. Looking at an application as a whole may give the impression that things are fine, or that things are worse than they actually are, for the user. IT staff needs to understand the application well enough to separate foreground and background workloads and focus attention on the online work: the transactions a user must wait on. The prerequisite for this is to understand the business and how users interact with an application.

An effective service-level management team also needs to include application programmers and architects. It may help to watch a user in action; one observation can teach IT people a great deal about what they’ll see later on their monitors. At one company, users did all their print work at 10 a.m. based on an erroneous idea about the best way to interact with the system. This caused delays in online transaction processing, which was remedied when the users were simply advised to print when they wanted to.

IT should know and talk to system users. When IT understands what business users do and how they interact with the system, the metrics make more sense. Such interaction also reveals much about what bothers users the most and what’s insignificant. Users can even drive tuning exercises. By making users, rather than systems, more efficient, corporate profitability improves. Network and system metrics have meaning only in the context of a person’s experience with them.

Once IT has gathered this intelligence, this knowledge can be incorporated into existing network and systems management software. Resources can be grouped and named as components of critical applications. Labels also can reflect the importance of applications. By taking the time to customize monitoring software in this way, IT can visually determine the impact of any problems that occur.
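Conceptually, this amounts to attaching business labels and priorities to monitored resources so that simultaneous alerts can be ranked at a glance. A minimal sketch, with hypothetical resource names and labels:

```python
# Hypothetical resource labels: each monitored resource carries its
# business application and a priority (1 = most critical).
resources = {
    "CICSAUTH": {"application": "Credit Card Authorization", "priority": 1},
    "BATCHRPT1": {"application": "Peer Batch Report", "priority": 3},
}

def rank_alerts(alerted: list[str]) -> list[str]:
    """Order simultaneous alerts by business priority."""
    return sorted(alerted, key=lambda r: resources[r]["priority"])

print(rank_alerts(["BATCHRPT1", "CICSAUTH"]))  # ['CICSAUTH', 'BATCHRPT1']
```

The credit card authorization alert surfaces first, even though both alerts arrived at the same time.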

Another useful area to customize is advanced alerting. IT can improve productivity by reducing false alerts and focusing attention on issues that are meaningful. Baseline alerting determines the normal utilization or performance for applications, then highlights and alerts staff when monitored performance metrics deviate from the norm. Such alerts provide the operations center and systems programmer or network administrator with the information they need to better understand the health of their sphere of management. Alerts also can be based on standard deviations from the norm: sudden changes that fall more than a chosen number of standard deviations outside it are highlighted. Alerts can flag many issues of concern, such as excessive system utilization, fewer jobs than expected, low transaction or message rates, or less network traffic than expected.
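A baseline alert of this kind can be sketched in a few lines (the message-rate history and the choice of k are hypothetical):

```python
import statistics

def baseline_alert(history: list[float], current: float, k: float = 2.0) -> bool:
    """Alert when the current value deviates more than k standard
    deviations from the historical norm, in either direction."""
    mean = statistics.mean(history)
    stdev = statistics.stdev(history)
    return abs(current - mean) > k * stdev

# Hypothetical message-rate history (per minute) and two current readings.
history = [100, 104, 98, 102, 96, 100]
print(baseline_alert(history, 101))  # within the norm: no alert
print(baseline_alert(history, 60))   # fewer messages than expected: alert
```

Because the check is two-sided, it flags both the spikes and the suspicious valleys discussed earlier, without anyone having to hand-pick a static threshold.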

Complex argument alerts let staff receive alerts only when multiple, correlated problem conditions occur simultaneously. Systems programmers, for example, may want to be alerted when batch jobs with a certain service class run longer than five minutes. Complex alerting arguments optimize staff productivity, keeping them on task by informing them only of problem conditions that require human intervention.
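A complex argument alert is essentially a conjunction of conditions. A minimal sketch, with hypothetical job records and field names, of the batch-job example above:

```python
# Hypothetical batch-job records; names and fields are invented for
# illustration.
jobs = [
    {"name": "NIGHTLY1", "service_class": "BATCHHI", "elapsed_min": 3.5},
    {"name": "NIGHTLY2", "service_class": "BATCHHI", "elapsed_min": 7.2},
    {"name": "ADHOC1", "service_class": "BATCHLO", "elapsed_min": 12.0},
]

# Alert only when BOTH conditions hold: the job is in the watched service
# class AND it has run longer than five minutes.
alerts = [
    job["name"]
    for job in jobs
    if job["service_class"] == "BATCHHI" and job["elapsed_min"] > 5
]

print(alerts)  # ['NIGHTLY2']
```

A long-running job in an unwatched service class stays silent, and a watched job that finishes quickly does too; only the genuinely actionable combination reaches a human.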

Business-centric application performance management is the key to business success. It transforms performance management from a simple focus on IT resources by silo into a holistic view grounded in a clear understanding of a business transaction. Only by working closely with users and developers, and then incorporating that knowledge into their management practices, can systems or network management specialists hope to achieve true service-level management. The investment in time is small compared to the improved ability to manage essential IT resources.