Exception-based monitoring is one way to watch for problems that may affect desktop service levels. To do this effectively requires knowing what an exception is. You also may want to be even more proactive about meeting user expectations. Experienced systems and network management teams proactively optimize service levels by better understanding business users and applying that understanding to their systems monitoring strategies. This article explains how business-centric application performance management can drive business success by providing a holistic view of a business transaction.
The Reactive Manager
As business demands continue to grow and IT resources remain constrained, every IT organization must re-examine the maturity of its systems and network management processes. Often, increased responsibility and reduced staffing have prompted a return to reactive management. Such firefighting usually results in a frustrated IT staff, an unhappy user community, and poor business decision-making. Reactive IT organizations share several common characteristics, including:
- Monitoring by exception: Management staff look for “outliers”—unusual peaks in resource utilization or performance spikes—on their network and systems monitors. They then respond to those exceptions based on their apparent severity.
- Alerting on static thresholds: Monitoring is done against static thresholds that are either set to standard defaults or to some earlier, internally determined baseline. Rules of thumb (ROT) are used to set thresholds in the monitors to help track problems. Because the monitors can’t apply dynamic thresholds, normal peaks and valleys in business activity trigger “false positive” alerts.
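The difference between a static threshold and a dynamic one can be sketched in a few lines. Below is a minimal illustration (the metric values, the 80% rule of thumb, and the 3-sigma band are all hypothetical examples, not recommended settings): the static check fires on a routine end-of-day peak, while a baseline built from samples taken at the same hour on previous days stays quiet.

```python
import statistics

def static_alert(value, threshold=80.0):
    """Rule-of-thumb check: alert whenever utilization exceeds a fixed number."""
    return value > threshold

def dynamic_alert(history, value, k=3.0):
    """Alert only when a sample deviates more than k standard deviations
    from the recent baseline for this time of day."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history) or 1.0  # avoid a zero-width band
    return abs(value - mean) > k * stdev

# CPU utilization at this same hour on recent days: ~85% is routine here.
same_hour_history = [82.0, 85.0, 84.0, 83.0, 86.0]

print(static_alert(85.0))                      # fires: a false positive
print(dynamic_alert(same_hour_history, 85.0))  # quiet: within the normal band
```

A genuine anomaly (say, 99% at an hour that normally runs near 84%) still trips the dynamic check, so sensitivity is not lost, only the noise from predictable business rhythms.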
- Generalized understanding of trends: Over time, management staff begins to get used to certain patterns. These include batch job streams that consume significant resources at the end of the month or peaks in transaction rates for certain applications at regular times throughout the day.
- “Loved one” applications: With hundreds (or perhaps thousands) of applications to watch, the ones that get the most attention tend to be those deemed most important by someone among, or with access to, the management team.
This type of management relies heavily on the expertise of the individual IT staff member and assumes their knowledge covers a wide variety of technologies. Enterprise technicians have acquired rich expertise and aren’t easily replaced because the learning curve is too steep. They can provide important information such as:
- How many retransmissions on the network are too many
- How busy a disk can be and still provide acceptable performance
- At what point the network is saturated
- When degradation in response time on a given server in a multi-server business application actually impacts the business
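One way to reduce the organization’s dependence on any single veteran is to capture such rules of thumb as data rather than tribal knowledge. A minimal sketch follows; every metric name and limit here is a hypothetical placeholder, not a recommended value.

```python
# Rules of thumb recorded as data, so they outlive the technician
# who discovered them. All names and limits are illustrative only.
RULES_OF_THUMB = {
    "net.retransmit_pct":  {"max": 2.0,   "unit": "%"},
    "disk.busy_pct":       {"max": 70.0,  "unit": "%"},
    "net.utilization_pct": {"max": 90.0,  "unit": "%"},
    "app.response_ms":     {"max": 500.0, "unit": "ms"},
}

def check(metric, value):
    """Return a human-readable verdict for one sampled metric."""
    rule = RULES_OF_THUMB.get(metric)
    if rule is None:
        return f"{metric}: no rule recorded"
    verdict = "OK" if value <= rule["max"] else "EXCEEDED"
    return f"{metric}: {value}{rule['unit']} ({verdict}, limit {rule['max']}{rule['unit']})"

print(check("disk.busy_pct", 65.0))
print(check("net.retransmit_pct", 3.5))
```

The point is not the code itself but the practice: once a threshold is written down with its units and rationale, a junior administrator can apply it without having lived through the incidents that produced it.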
The effectiveness of service-level management depends on the expertise of the technician. This makes the organization highly vulnerable to the technician’s availability and also makes it difficult to hand over management tasks to younger or less experienced staff as seasoned veterans are promoted or retire.
In this era of cross-platform, multi-tier applications—with users continuously online—how can technicians actually determine when IT is delivering good service? If you’re the UNIX administrator and the server farm appears to be operating well, does this mean user satisfaction will be high? Or could problems on the network exacerbate apparently minor issues with the servers?
These questions underscore the fact that “siloed” management (i.e., teams isolated as if working in separate silos) remains a problem for IT organizations. Each resource along a transaction path may appear to be delivering good service, yet the sum of the response times may be unacceptable. So, unless IT has tools that provide end-to-end response time data in addition to component response times, IT may believe it’s delivering the required service while users experience something quite different. Siloed management makes it hard to see things from the user’s perspective.
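The arithmetic behind this is simple but easy to miss when each team watches only its own dial. The sketch below uses hypothetical per-tier timings and targets: every silo is under its own 1-second limit, yet the transaction the user sees is well over an assumed 2-second end-to-end target.

```python
# Hypothetical per-tier response times (seconds) for one transaction,
# each comfortably under its silo's own target.
component_times = {
    "web_server": 0.8,
    "app_server": 0.9,
    "database":   0.7,
    "network":    0.6,
}

SILO_TARGET = 1.0        # what each team measures itself against
END_TO_END_TARGET = 2.0  # what the user actually experiences

all_silos_green = all(t <= SILO_TARGET for t in component_times.values())
total = sum(component_times.values())

print(f"every silo within its target: {all_silos_green}")
print(f"end-to-end time: {total:.1f}s "
      f"(target {END_TO_END_TARGET}s) -> {'OK' if total <= END_TO_END_TARGET else 'SLOW'}")
```

Every component dashboard shows green, but the user waits three seconds: without an end-to-end measurement, no silo ever sees the problem.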
Siloed management doesn’t just erode the efficiency of IT organizations; it also undermines service levels to the user. A set of application servers, for example, may be down for only a few minutes here and there. But if the Web servers, network devices, back-end databases, and other components that support a complex, end-to-end transaction also are occasionally down, the cumulative result may be unacceptably low service availability. In an overly siloed management environment, this kind of low availability can become commonplace because each siloed management team focuses only on a specific infrastructure component, rather than the actual delivery of the end-to-end service.
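The same compounding applies to availability: when a transaction depends on every component being up at once, end-to-end availability is the product of the individual figures, not the minimum. The worked example below assumes five serially dependent components, each at a hypothetical “three nines” (99.9%).

```python
# Hypothetical availabilities of serially dependent components,
# each at a respectable "three nines" on its own dashboard.
availabilities = {
    "web_server": 0.999,
    "app_server": 0.999,
    "network":    0.999,
    "database":   0.999,
    "storage":    0.999,
}

end_to_end = 1.0
for a in availabilities.values():
    end_to_end *= a  # the transaction needs every component up simultaneously

minutes_per_year = 365 * 24 * 60
downtime = (1.0 - end_to_end) * minutes_per_year

print(f"end-to-end availability: {end_to_end:.4f}")
print(f"expected downtime: {downtime:.0f} minutes/year")
```

Each silo alone accounts for under nine hours of downtime a year, yet the chained service loses roughly five times that, and no single team’s monitor reports anything amiss.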
So, satisfactory performance and availability for any individual silo may have nothing to do with a positive perception by the users. Determining what is “good enough” for the user requires knowing what they’re trying to do with the application or service and the actual business impact of slow performance or availability issues. The technical parameters on the monitors don’t tell us that. Much of what we measure today to gauge customer experience consists of service levels, application response time where measurable, and calls to the help desk. But that data doesn’t necessarily translate into alerts or other meaningful, actionable information in any given silo.