
WebSphere MQ Monitoring and Alerting: A Review

5 Pages

Monitoring should be comprehensive: Using the previous example, 10,000 messages on a particular queue at 3 a.m. may not present a risk, but if another queue that maps to the same page set also holds 10,000 messages and the page set is 80 percent full, both applications are at risk. Monitoring only queue depth, or only one queue, may not be sufficient to avoid a failure.
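The cross-check described above can be sketched in a few lines of Python. This is a minimal illustration, not a WMQ interface: the `QueueStatus` record, the queue names, and both thresholds are assumptions made for the example, and the depth and page set figures are presumed to have been collected already by whatever query mechanism the site uses.

```python
from dataclasses import dataclass


@dataclass
class QueueStatus:
    name: str        # queue name (hypothetical)
    depth: int       # current number of messages on the queue
    page_set: int    # page set the queue maps to


def page_sets_at_risk(queues, page_set_pct_full,
                      depth_threshold=10_000, pct_threshold=80.0):
    """Return {page set id: [queue names]} for every page set that is
    nearly full AND holds at least one deep queue.

    Every queue on such a page set is listed, because a full page set
    puts all of its applications at risk, not only the deep queue.
    """
    at_risk = {}
    for ps_id, pct in page_set_pct_full.items():
        if pct < pct_threshold:
            continue  # page set still has room
        residents = [q for q in queues if q.page_set == ps_id]
        if any(q.depth >= depth_threshold for q in residents):
            at_risk[ps_id] = sorted(q.name for q in residents)
    return at_risk


# Illustrative snapshot: two deep queues share page set 4.
queues = [
    QueueStatus("APP1.REQUEST", 10_000, 4),
    QueueStatus("APP2.REQUEST", 10_000, 4),
    QueueStatus("APP3.REQUEST", 12, 5),
]
usage = {4: 80.0, 5: 35.0}
print(page_sets_at_risk(queues, usage))
```

The point of the sketch is that neither figure alone triggers the alert; the risk appears only when depth and page set usage are evaluated together.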

An alert must be sent to the proper place within a proper time frame: A monitor-generated alert sent to the wrong place, or sent late to the proper place, can lead to large-scale problems. Each alert should have a corresponding notification action, notification time frame, and feedback requirement.

The failure of an OTMA connection would not usually be directed to the same problem solver as the failure of a CICS trigger monitor. The buildup of messages on a particular queue may not affect the WMQ subsystem, but it may indicate a significant application problem.
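One way to make that pairing explicit is a routing table that records, for each alert type, who is notified, how, how quickly, and whether an acknowledgment is expected. The sketch below is an assumption-laden illustration: the event names, recipient groups, time frames, and the `AlertRoute` structure are all hypothetical and would be replaced by a site's own conventions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AlertRoute:
    recipient: str              # group that solves this class of problem
    action: str                 # notification action, e.g. "phone" or "email"
    notify_within_minutes: int  # notification time frame
    ack_required: bool          # feedback requirement


# Hypothetical routing table; every value here is illustrative.
ROUTES = {
    "OTMA_CONNECTION_FAILED": AlertRoute("ims-support", "phone", 5, True),
    "CICS_TRIGGER_MONITOR_DOWN": AlertRoute("cics-support", "phone", 5, True),
    "QUEUE_DEPTH_HIGH": AlertRoute("application-owner", "email", 30, False),
}

# Anything unrecognized goes to a staffed location rather than being dropped.
DEFAULT_ROUTE = AlertRoute("command-post", "console", 15, True)


def route_for(event_type):
    """Look up the route for an event type, falling back to the default."""
    return ROUTES.get(event_type, DEFAULT_ROUTE)


print(route_for("OTMA_CONNECTION_FAILED").recipient)
print(route_for("CICS_TRIGGER_MONITOR_DOWN").recipient)
```

Keeping the table as data rather than logic means a routing change is a one-line edit that does not touch the monitor itself.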

The type of notification should depend on how critical the risk is, since the monitor was designed to predict a potential problem, not announce that the problem has already occurred. Pager and e-mail alerts are asynchronous processes without a guarantee that the recipient will get them in time to analyze the situation.

An existing central “lights-on” command post would be a good recipient of alerts and can provide first-level analysis. Since command post personnel may not be familiar with WMQ, a comprehensive troubleshooting manual should be provided. An instruction that, “If a security violation message lands on the dead-letter queue between 5 p.m. and 8 a.m., send an e-mail to the support person; otherwise, phone him at his office number” would be welcomed more than, “Page the support person whenever a message lands on the dead-letter queue.”
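The first instruction quoted above is precise enough to be written down as a rule, which is one test of whether a troubleshooting manual entry is usable. A minimal sketch follows; the function name and return values are hypothetical, and treating exactly 5 p.m. as after hours and exactly 8 a.m. as office hours is an assumption the original instruction does not settle.

```python
from datetime import time


def dlq_security_notification(arrival: time) -> str:
    """Pick the notification action for a security violation message
    that arrived on the dead-letter queue at the given time of day."""
    # The 5 p.m. to 8 a.m. window wraps past midnight, so it is the
    # union of two ranges rather than a single comparison.
    after_hours = arrival >= time(17, 0) or arrival < time(8, 0)
    return "email" if after_hours else "office_phone"


print(dlq_security_notification(time(3, 0)))    # email
print(dlq_security_notification(time(10, 30)))  # office_phone
```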

How Can Monitoring Occur?

Monitoring systems can be either purchased from a third-party vendor or built locally. A purchased product can generally run on multiple platforms, provide a central control location, and provide standardized alerts for standard incidents. The “one-size-fits-all” approach of vendor systems may be customizable with scripts, but ultimately, those scripts may become as complex as a locally built monitor. A vendor product also carries licensing and annual maintenance costs.

A locally built monitor can be written to watch for both general conditions and conditions specific to the local environment. While there’s an initial development cost, the maintenance cost can be quite low if the system is properly designed, and there are no licensing issues. A platform-centric system may have limited flexibility and extensibility; that would have to be weighed against the benefit of being able to exploit platform-specific features.

Combined systems can also be used, and these may provide the most flexibility. A home-built system can feed custom alert messages to a vendor product, which would use generic and standard processes to properly distribute the alerts. Messages can also be fed to an archiving system for uniform logging. A properly designed system could be easily customized for specific events.
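As a rough sketch of the combined approach, the home-built monitor can produce one uniform alert record and hand the same serialized line to every downstream consumer. The field names, the JSON encoding, and the `fan_out` helper are assumptions for illustration; a real vendor product would dictate its own input format and delivery mechanism.

```python
import json
from datetime import datetime, timezone


def build_alert(source, event, severity, detail, when=None):
    """Build a uniform alert record (all field names are hypothetical)."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "source": source,
        "event": event,
        "severity": severity,
        "detail": detail,
    }


def fan_out(alert, sinks):
    """Serialize the alert once and pass the same line to every sink,
    so the distribution product and the archive see identical content."""
    line = json.dumps(alert, sort_keys=True)
    for sink in sinks:
        sink(line)
    return line


# Stand-ins for the vendor product's input and the archiving system.
forwarded, archived = [], []
alert = build_alert(
    "local-wmq-monitor", "PAGESET_NEARLY_FULL", "warning",
    "page set 4 is 80 percent full",
    when=datetime(2024, 1, 1, 3, 0, tzinfo=timezone.utc),
)
fan_out(alert, [forwarded.append, archived.append])
```

Here the sinks are plain lists; in practice each would be whatever callable delivers a line to the product in question.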

The monitor shouldn’t generate false alarms: A monitor that continuously “cries wolf” won’t be taken seriously for long. Users should be able to easily disable and re-enable monitoring of specific events without affecting the entire monitoring system. Continuously receiving alerts about messages on a dead-letter queue while determining what to do with them is an avoidable annoyance.
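Per-event disabling can be as simple as a suppression list that the monitor consults before raising an alert. The sketch below is one possible shape, not a feature of any particular product: the class, the event keys, and the optional automatic expiry (added so a forgotten suppression does not silence an event indefinitely) are all assumptions.

```python
import time


class SuppressionList:
    """Tracks which specific events are temporarily not alerted on."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock   # injectable so the logic can be tested
        self._until = {}      # event key -> time at which suppression ends

    def disable(self, event_key, seconds=None):
        """Suppress one event, indefinitely or for a limited period."""
        if seconds is None:
            self._until[event_key] = float("inf")
        else:
            self._until[event_key] = self._clock() + seconds

    def enable(self, event_key):
        """Re-enable alerting for one event; others are unaffected."""
        self._until.pop(event_key, None)

    def is_suppressed(self, event_key):
        expiry = self._until.get(event_key)
        if expiry is None:
            return False
        if self._clock() >= expiry:
            del self._until[event_key]  # suppression has lapsed
            return False
        return True


suppressed = SuppressionList()
suppressed.disable("DLQ_MESSAGE", seconds=3600)   # quiet for one hour
print(suppressed.is_suppressed("DLQ_MESSAGE"))     # True
print(suppressed.is_suppressed("CHANNEL_STOPPED")) # False
```

The monitor keeps checking the dead-letter queue throughout; only the notification is withheld, so nothing else is affected.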

A monitor must be reliable: If the monitor fails and the failure isn’t detected, an entire subsystem can be placed at risk. If it isn’t possible to monitor the monitor, redundant systems should be considered. Redundant systems also can be used to provide multiple monitoring frequencies. Most monitors function either by waiting for an event notification or by issuing queries against the subsystem. Additional systems can provide different query frequencies for different objects. A monitor also must not rely on what it’s monitoring to send alerts. Sending a “channel-stopped” alert across the stopped channel is a worthless exercise.
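Monitoring the monitor is often done with heartbeats: each monitor records a timestamp whenever it completes a cycle, and an independent watchdog flags any monitor that has been silent too long. The following is a minimal sketch under that assumption; the monitor names, the polling intervals, and the grace factor are hypothetical, and the watchdog is presumed to run outside the WMQ subsystem it indirectly protects.

```python
def stale_monitors(heartbeats, expected_interval, now, grace=2.0):
    """Return the names of monitors whose last heartbeat is overdue.

    heartbeats:        monitor name -> time of last heartbeat (seconds)
    expected_interval: monitor name -> normal seconds between cycles
    grace:             multiple of the interval tolerated before alerting
    """
    stale = []
    for name, last_seen in heartbeats.items():
        allowed = expected_interval[name] * grace
        if now - last_seen > allowed:
            stale.append(name)
    return sorted(stale)


# Two monitors polling at different frequencies (illustrative values).
intervals = {"queue-depth-poller": 60, "channel-status-poller": 300}
last_seen = {"queue-depth-poller": 800.0, "channel-status-poller": 700.0}
print(stale_monitors(last_seen, intervals, now=1000.0))
```

Because each monitor carries its own expected interval, a slow poller is not reported merely for being slower than a fast one.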

Conclusion

WMQ monitoring is manageable when broken into small tasks. z/OS operations groups have comprehensive tools to indicate platform problems. Those tools can be leveraged to provide alerts for easily identifiable mainframe issues and for cases where subsystems issue standard codes that standard products can capture.

Where more local environmental issues exist, a customized, home-built monitor can provide for specific conditions and assist with prioritized problem determination suggestions based on local experience. Local control of the monitor would provide a level of flexibility not generally available with a systemwide vendor monitor and would expedite modifications as the risk and failure-mode evaluation process matures.
