Smart CIOs recognize that it’s necessary to evaluate how well the metrics they use continue to work. Many CIOs use a set of basic metrics for IT performance, such as response time, percent availability, and ROI. These sometimes serve as a “dashboard” for IT management.
Occasionally, metrics that formerly provided what was needed become obsolete. This can result from technology changes or, more recently, from major reconfiguration of IT architecture. Sometimes it’s easy to see a fundamental change coming; other times it can take you by complete surprise.
One CIO tells how his end users complained the Internet system was down when the help desk swore it was up. To sort it out, he called a user he trusted, who confirmed the system was in fact down. Later, the CIO learned that the availability measures used by the help desk only looked at the company’s home page, which had been cached. However, end users were trying to get to deeper pages and over different connections. Whether a system is up can depend on where people are (internal or external), what communications pipe they have connecting them to the Website, and what pages they’re viewing.
Another manager’s Service Level Agreement (SLA) promised users an average online response time of less than 2 seconds. His reports told him each day the average was below 1.8 seconds. However, a few users complained he wasn’t meeting his SLA. They happened to execute some transactions that at one point took more than 3 seconds. Although his average was well within the SLA, those users’ average that day was unacceptable. He has since revised his SLA to promise something like “95 percent of all transactions under 2 seconds and no transaction greater than 5 seconds.” (The odds of lightning striking the same cow twice in one year are very low. But for that cow, it’s 100 percent.)
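To see how an average can pass while individual users suffer, here is a minimal sketch in Python that checks the same day's response times against both styles of SLA. The function name, the parameter defaults, and the sample data are illustrative assumptions, not the manager's actual reporting; only the "95 percent under 2 seconds, none over 5 seconds" wording comes from the revised SLA above.

```python
def check_sla(times, pct=95.0, target=2.0, ceiling=5.0):
    """Compare an average-based SLA with a percentile-based one.

    times   -- response times in seconds for one reporting period
    pct     -- required share (percent) of transactions under `target`
    target  -- response-time target in seconds
    ceiling -- hard limit that no single transaction may exceed
    """
    if not times:
        raise ValueError("no transactions to evaluate")
    mean = sum(times) / len(times)
    # Share of transactions that finished strictly under the target.
    share = 100.0 * sum(1 for t in times if t < target) / len(times)
    worst = max(times)
    return {
        "mean": mean,
        "share_under_target": share,
        "worst": worst,
        "meets_average_sla": mean < target,
        "meets_percentile_sla": share >= pct and worst <= ceiling,
    }


# Made-up day: 90 fast transactions and 10 slow ones.
sample = [1.5] * 90 + [3.5] * 10
report = check_sla(sample)
```

With this sample the mean is 1.7 seconds, so the average-based SLA is met, yet only 90 percent of transactions finish under 2 seconds, so the percentile-based SLA is not. The ten slow transactions are exactly the users who call to complain.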
Similar discrepancies can occur when you migrate applications from the mainframe to distributed servers. Reported first-shift CPU percent busy can be lower on the distributed platform than on the mainframe, yet complaints about response time increase. You’ve probably already recognized why: while one distributed server’s CPU sits idle, its neighbor’s can be overloaded, resulting in poor response time.
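A small Python sketch makes the point concrete: a single platform-wide CPU figure can look healthy while one server is saturated. The server names, the 85 percent threshold, and the readings are hypothetical values chosen for illustration.

```python
def cpu_summary(busy_by_server, overload_threshold=85.0):
    """Summarize CPU percent busy across a group of servers.

    busy_by_server     -- mapping of server name to CPU percent busy
    overload_threshold -- percent busy at or above which a server is
                          treated as overloaded (an assumed cutoff)
    """
    if not busy_by_server:
        raise ValueError("no servers to summarize")
    values = list(busy_by_server.values())
    average = sum(values) / len(values)
    # Report the hot spots individually instead of hiding them in the mean.
    overloaded = sorted(
        name for name, busy in busy_by_server.items()
        if busy >= overload_threshold
    )
    return {"average": average, "peak": max(values), "overloaded": overloaded}


# Hypothetical first-shift readings for three application servers.
readings = {"app1": 10.0, "app2": 95.0, "app3": 15.0}
summary = cpu_summary(readings)
```

Here the average is 40 percent busy, which would look comfortable on a dashboard, while `app2` runs at 95 percent. Reporting the peak and the list of overloaded servers alongside the average is one way to keep the old metric from misleading you.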
A different type of configuration change arises when you depend on vendor-supplied, interdependent services rather than standalone applications. E-commerce at one institution consisted of several vendors providing services; these were integrated, subscribed services, not just purchased software. The CIO recognized that any one of the vendors could fail, and users would see “the system is down.” He realized he needed to get the service providers to share information with each other to provide adequate management reporting.
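The arithmetic behind that concern can be sketched in a few lines of Python. This assumes every vendor service is required for a transaction to succeed and that the services fail independently of one another, which is a simplification; the vendor roles and availability figures are invented for illustration and do not come from the institution described.

```python
import math


def end_to_end_availability(availabilities):
    """Estimate availability of a chain of required services.

    availabilities -- fraction of time (0.0 to 1.0) each service is up.
    Assumes all services are needed and that failures are independent,
    so the combined figure is the product of the individual ones.
    """
    values = list(availabilities)
    if not values:
        raise ValueError("no services given")
    for a in values:
        if not 0.0 <= a <= 1.0:
            raise ValueError("availability must be between 0 and 1")
    return math.prod(values)


# Hypothetical vendors: payment, catalog, and shipping services.
vendors = {"payment": 0.999, "catalog": 0.995, "shipping": 0.99}
combined = end_to_end_availability(vendors.values())
```

Each vendor can truthfully report 99 percent or better, yet under these assumptions the user sees roughly 98.4 percent. No single vendor's report reveals that, which is why the combined view depends on the providers sharing their data.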
So, how can you ensure your metrics continue to work for you? Try to:
- Ask your problem management team to categorize problems by severity level and cause, and to note trends over time in each category.
- List a few incidents when your measurements suddenly didn’t give you what you needed.
- For each of those incidents, describe for yourself what change had occurred, whether with hindsight you might have been able to avoid being surprised, and what you wish you had done differently.
- Review the sidebar and ask yourself how each of these evolutions affects the way your operation provides service to your organization. Ask your best managers how they think this affects the metrics they use.
The most interesting problems are the ones whose metrics and solutions haven’t yet been worked out. They force you to address different disciplines, to develop non-obvious solutions, and to do more than just respond to some simple gauge. They often have side effects and turn out to be related to other interesting problems. The way a manager responds to these changes, when perfect metrics aren’t available, is what separates good managers from great ones. Do you agree? If so, let me know at email@example.com.