
Setting thresholds for those parameters is another problem. With static thresholds, we have two options: set them so high that by the time the alerts fire, the problems are already severe, or set them lower and generate a flood of “false positives.”
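
To make the trade-off concrete, here is a minimal Python sketch using an invented stream of CPU-utilization samples; the numbers and threshold values are assumptions for illustration only. The high threshold fires only after the situation is already severe, while the lower one also fires on routine peaks.

    cpu_samples = [62, 71, 68, 88, 74, 91, 69, 97, 73, 95]  # percent busy, invented values

    HIGH_THRESHOLD = 95   # fires only once the problem is already severe
    LOW_THRESHOLD = 85    # fires earlier, but also on routine peaks

    def alert_points(samples, threshold):
        # Return the sample indexes that would raise an alert.
        return [i for i, value in enumerate(samples) if value >= threshold]

    print("High threshold fires at:", alert_points(cpu_samples, HIGH_THRESHOLD))  # [7, 9]
    print("Low threshold fires at: ", alert_points(cpu_samples, LOW_THRESHOLD))   # [3, 5, 7, 9]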

Then there are the idiosyncrasies of different platforms and devices. Mainframes, for example, can run at 100 percent capacity, while other servers peak at much lower values. Yet there can also be times when a 100 percent capacity reading on a mainframe indicates a problem. Again, this is an example of how the personal expertise of a single IT staff member can be critical to maintaining a healthy computing environment.

What about the valleys, where underutilized or nonperforming resources might be the issue? Most thresholds are set based on peak limit values. But, often, extremely low utilization may point to other problems. If a busy credit card authorization application suddenly plummets in volume and resource demand, for example, this issue may result from a network or middleware problem feeding the application. Depending on the nature of the incident, thresholds may not bring the problem to anyone’s attention until significant revenues are lost or customer goodwill is harmed.
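
One way to catch such valleys is to compare current volume against a short trailing baseline rather than a fixed floor. The sketch below is only illustrative; the window size, drop ratio, and transaction volumes are assumed values, not settings from any real monitoring product.

    from statistics import mean

    def low_volume_alerts(samples, window=5, drop_ratio=0.5):
        # Flag points that fall far below the trailing rolling average; a sudden
        # drop may indicate an upstream network or middleware failure rather
        # than genuinely low demand.
        flagged = []
        for i in range(window, len(samples)):
            baseline = mean(samples[i - window:i])
            if baseline > 0 and samples[i] < baseline * drop_ratio:
                flagged.append(i)
        return flagged

    # Hypothetical transactions per minute for an authorization application.
    volume = [900, 940, 910, 955, 930, 925, 310, 290, 300, 920]
    print(low_volume_alerts(volume))  # [6, 7, 8]: the sudden plunge, not a peak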

This challenge increases with Service-Oriented Architecture (SOA), where dynamic pathing becomes the norm. Since transaction paths aren’t static, correlating resource problems to service levels becomes even more difficult. There’s also the matter of process. Performance problems are typically handled centrally by operational groups in a Network Operations Center (NOC). The operator or notification software informs the appropriate systems programmers or network administrators that a problem requires their attention, and the process of human intervention to maintain optimal performance levels begins.

Once an alert is observed, the affected resources indicate its severity. Staff navigate through displays to examine performance statistics and data. Sometimes the problem is still occurring and the data reflects that, but often performance has already changed by the time anyone looks. The issue may be a performance spike that can be analyzed later, or a warning of performance changes that will lead to more severe problems if they aren’t adequately addressed. If the problem doesn’t fall into either of these categories, analysis can lead the user through several more displays. After analyzing the information, commands can be issued to cancel a resource, bring additional resources online, or take some other action to bring network and system resource performance back to acceptable levels.

This usually occurs without any actual knowledge of the problem’s importance to the business. Since the displays don’t tell operators whether the problem affects a strategic application or merely part of a batch report for a peer, management teams can’t determine the issue’s relative importance. So IT can wind up spending inappropriate amounts of time on essentially trivial issues, because it lacks an effective means of prioritizing problems that occur simultaneously.
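
As a rough illustration of what business context would add, the sketch below ranks hypothetical alerts by an assumed business-impact score ahead of raw technical severity; the resource names and scores are invented for the example.

    # Hypothetical alerts enriched with an assumed business-impact score (1-5).
    alerts = [
        {"resource": "batch_report_queue", "severity": 4, "business_impact": 1},
        {"resource": "card_auth_service",  "severity": 3, "business_impact": 5},
        {"resource": "dev_test_server",    "severity": 5, "business_impact": 1},
    ]

    # Rank by business impact first, then by raw technical severity, so a
    # strategic application outranks noisier but trivial issues.
    work_queue = sorted(alerts,
                        key=lambda a: (a["business_impact"], a["severity"]),
                        reverse=True)

    for a in work_queue:
        print(a["resource"])
    # card_auth_service, then dev_test_server, then batch_report_queue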

Rules of thumb (ROTs) can help you assess relative importance. ROTs are rules of system and network behavior learned over the years, where you often run into “it depends” situations. They come into play when you need to redefine thresholds based on a meaningful measure of what good performance looks like. For example, mainframes can routinely run at 100 percent busy, while other servers peak at much lower values. You should thoroughly understand the transactions and processes IT is monitoring, and you should know (a simple sketch follows this list):

  • When 100 percent busy on a mainframe indicates a problem
  • Which transactions or processes directly serve a customer
  • Which ones are background processes, daemons, started tasks, etc.
  • Which File Transfer Protocol (FTP) actions can wait and which involve critical monetary exchanges
  • If and when some slow background transactions impact the ability of users and customers to interact with the system.  
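
Put into code, that knowledge amounts to a per-workload profile rather than a single global threshold. The following sketch is a simplified illustration; the workload names, flags, and alert percentages are assumptions, not values from any real installation.

    # Illustrative workload profiles; names, flags, and percentages are
    # assumptions for the example, not values from any real system.
    WORKLOAD_PROFILE = {
        "card_authorization":     {"customer_facing": True,  "cpu_alert_pct": 85,  "can_wait": False},
        "nightly_settlement_ftp": {"customer_facing": False, "cpu_alert_pct": 100, "can_wait": True},
        "started_task_logger":    {"customer_facing": False, "cpu_alert_pct": 100, "can_wait": True},
    }

    def needs_attention(workload, cpu_pct):
        # Apply a per-workload threshold instead of one static value.
        profile = WORKLOAD_PROFILE.get(workload)
        if profile is None:
            return True  # unknown workloads get a conservative default
        return cpu_pct >= profile["cpu_alert_pct"] and not profile["can_wait"]

    print(needs_attention("card_authorization", 90))       # True: customer-facing and hot
    print(needs_attention("nightly_settlement_ftp", 100))  # False: background work that can wait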

Gathering this information requires more effective communication and understanding between the line of business and IT than typically occurs. You can’t monitor what you don’t measure, and you can’t measure what you don’t understand. Traditionally, IT has been managed separately from the business; in some companies, IT is outsourced or is a separate business entity. The result is that users often believe IT doesn’t understand the business and isn’t supporting it properly. More often than not, they’re right.

Break Down the Silos

The solution is to break down the silos. This first requires understanding the architecture of the applications, which determines who belongs on the technical team for those applications. Working relationships can then be established across all platforms among systems, network, database, and application managers.
