Dec 27 ’05
WebSphere MQ Monitoring and Alerting: A Review
A monitoring and alerting system is meant to provide notification of potential problems before they become catastrophic. A standard, consistent approach is required for monitoring WebSphere MQ (WMQ) for z/OS and alerting on potentially damaging events. For this to occur, the following questions must be answered:
- What objects are at risk?
- What are the failure scenarios?
- How is a failure occurrence identified?
- How can a failure occurrence be reported?
An effective monitoring and alerting system should also meet these criteria:
- Be reliable
- Not generate false alerts
- Notify, on a timely basis, the people who can resolve the problem.
Answering those questions, and the further questions they raise, then designing and building a system that meets the above criteria, provides a complete plan for monitoring and alerting. While this article is directed at the z/OS environment, the concepts are universally applicable.
The monitor process can watch only for predefined events. The events it looks for, in themselves, generally shouldn’t be damaging to the system. When an event occurs, an alert should be generated, so some action can be taken. That action can range from further monitoring to actively making changes to the system.
The alert must be sufficiently timely and directed to the proper resource so the appropriate response can be made before the initiating event results in damage to the system. Checking the depth of a queue every 10 minutes isn’t satisfactory when a process can fill the queue to its maximum defined depth in less than a minute. In addition, the monitor itself should be reliable, possibly self-checking or redundant, or a monitor for the monitor may be necessary.
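The relationship between polling frequency and fill rate can be made concrete. The following is a minimal sketch (the function name, the 80 percent warning fraction, and the numbers in the usage note are illustrative, not from the article):

```python
def max_polling_interval(max_depth, fill_rate_per_sec, warn_fraction=0.8):
    """Longest polling interval (in seconds) that still catches a queue
    before it passes warn_fraction of its maximum defined depth, assuming
    the worst case: the queue was empty at the last poll."""
    if fill_rate_per_sec <= 0:
        raise ValueError("fill rate must be positive")
    return (max_depth * warn_fraction) / fill_rate_per_sec
```

For a queue with a maximum depth of 5,000 that a runaway process can fill at 100 messages per second, polling must occur at least every 40 seconds; a 10-minute cycle would miss the event entirely.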
What Objects Are at Risk?
The WMQ objects at risk are the Master and Channel Initiator address spaces, queues, channels, pagesets, logs, and buffer pools. Each has its own failure scenario. For Queue Managers and Channel Initiators:
- The jobs could fail to start or start in the wrong order.
- The TCP/IP listener could fail.
- Object definitions could be incorrect, leading to define command failures.
- The CICS and IMS adapters and the Open Transaction Manager Access (OTMA) connector, a client/server protocol used to connect WMQ as an IMS client, could fail to start or connect to the Queue Manager.
- Local queues could reach maximum defined message depth and overflow to the dead-letter queue, which itself could reach maximum depth and cause channels to stop.
- Messages could accumulate on transmit queues due to channel stoppage.
- Channels could fail to start for many reasons, including the unavailability of the partner Queue Manager, network errors, message count synchronization differences, or a full dead-letter queue.
Pagesets, which are VSAM linear datasets, can run out of extents or volumes due to unexpected message volume, unexpectedly large messages, or processes that fail to retrieve the messages within the expected timeframe. The active logs, also VSAM linear datasets, can likewise run out of extents or volumes, either from an unexpectedly large volume of persistent messages (or unexpectedly large ones) or because the system cannot archive the logs as fast as the active logs fill. Too few buffers defined to the buffer pools can extend response time by forcing extra pageset I/O.
Identifying and Reporting a Failure Occurrence
Failure occurrences can be identified from the system messages and codes sent to the Queue Manager and Channel Initiator message logs or to the MVS console, from the event notifications issued by the Queue Manager to event queues, and by programmatically requesting information on the status of WMQ objects. Alerts can be sent to a control center or other continuously manned location for first-level support, and to pagers or telephones for second- and third-level support.
Individual members in a class of objects can have different failure scenarios and monitoring requirements. A queue containing messages that need to be stored for 10 hours would have a different risk attached to its failure mode than a queue that should be cleared of messages within 10 seconds, and would require different monitor criteria. Certain objects may require several levels of alerts, or escalation procedures, depending on the risk assessment.
Each object class should have a risk assessment made for each identifiable failure mode. Within that, each member of the object class should have a risk assessment made for the failure mode. Messages placed on a dead-letter queue due to an IMS security failure have a lower risk assessment than messages placed on the dead-letter queue because the real target queue is full.
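The per-failure-mode risk assessment can be captured as a simple lookup. In the sketch below, 2035 (MQRC_NOT_AUTHORIZED) and 2053 (MQRC_Q_FULL) are real WMQ reason codes such as appear in the dead-letter header; the risk rankings themselves are illustrative, mirroring the article’s example:

```python
# Hypothetical risk ranking for dead-letter queue arrivals, keyed by the
# reason code carried in the dead-letter header. The code values are real
# WMQ reason codes; the assigned risk levels are illustrative.
RISK_BY_REASON = {
    2035: "low",   # MQRC_NOT_AUTHORIZED: a security failure on one message
    2053: "high",  # MQRC_Q_FULL: the real target queue is full
}

def assess_dlq_risk(reason_code):
    """Return the risk level for a dead-lettered message, defaulting to
    'medium' for reason codes that haven't been assessed yet."""
    return RISK_BY_REASON.get(reason_code, "medium")
```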
For each failure mode, a corresponding list of causes should be identified for targeting, since it’s preferable to monitor for a cause rather than for a symptom. For example, messages could land on a dead-letter queue because a local queue filled up after the application that was expected to remove the messages failed to start when a batch trail prerequisite didn’t complete. If the relationship between the failed job and local queue activity had been identified in the design cycle, the failure of the job could have generated an alert to the potential for a queue-full event. Such a tie-in among events isn’t simple to identify and often isn’t recognized until the first actual failure occurs.
There’s a twofold exposure that planning should minimize. The first is the effect on the system of the large number of messages on the dead-letter queue. The second is determining what to do with those messages. A post-mortem investigation would likely lead to a requirement for additional batch job monitoring and documentation, tying specific batch trail issues to queue depth issues, which may result in earlier identification of potential problems.
Non-expiring persistent messages would be at risk if the queue’s backing pageset fills up or if there’s a log failure. The risk for relatively quickly expiring, non-persistent messages, such as request messages, would be much less for any type of failure. Placing disparate messages such as these on the same message queue would create a new failure scenario, where a sudden increase of the request messages would affect the storage requirements of the long-term persistent messages. Or, a sudden increase of the quickly expiring, non-persistent messages would cause a queue depth alert that would only be important if the messages were non-expiring or persistent. To accurately monitor an object, it’s necessary to know what it will be used for. A queue-depth-high alert on a queue for long-term persistent messages is worthless if the buffers and pagesets would fill before the event occurs.
A monitor should recognize an actual risk before issuing an alert. To adequately assess risk, object usage needs to be identified and documented. For example:
- How many messages are expected on a queue?
- How large are the messages?
- When will the messages be placed on the queue and by what method?
- How long are the messages expected to remain on the queue?
- When will the messages expire?
- How are the messages removed from the queue?
- What are the consequences of losing one or more messages?
- Can the contents of the message be recovered if there’s a loss?
With this information, the number of buffers and pageset space can be determined to reduce the risk of not being able to store the messages, and a monitoring process can be evaluated and implemented. For example, if it’s known that 10,000 messages must remain on a queue for processing during daytime business hours, an alert that there are 10,000 messages on the queue at 3 a.m. would be unnecessary, but an alert that 10,000 messages were on the queue at 10 a.m. may indicate an application failure.
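The time-of-day example can be expressed as a check that compares observed depth against what is expected for the hour. A minimal sketch, with the business-hours window and message text as assumptions:

```python
def depth_alert(depth, hour, business_hours=range(9, 18), expected=10000):
    """Flag a depth that is abnormal for the time of day: 10,000 messages
    awaiting daytime processing are normal at 3 a.m., but the same depth
    at 10 a.m. suggests the consuming application has failed."""
    if hour in business_hours and depth >= expected:
        return "possible application failure: messages are not being consumed"
    return None  # expected accumulation outside business hours
```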
A monitor should recognize levels of risk: Most help desk functions have several levels of support. An ideal monitoring system should be able to alert at several levels of danger. Using the aforementioned example, if 10,000 messages are expected on a queue during a certain period, the appearance of 11,000 messages on the queue may generate a level-one warning that something abnormal may be occurring and the queue should be watched more closely. The appearance of 50,000 messages on the queue may generate an error alert and require third-level investigation to determine if there’s a risk to the system. This also indicates the need for the application generating the messages to communicate changes in processing. The risks for known events should be mitigated when identified.
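The graduated thresholds in this example can be sketched directly, using integer arithmetic to keep the 10 percent warning margin exact (the threshold multipliers are taken from the article’s numbers; everything else is illustrative):

```python
def alert_level(depth, expected=10000):
    """Map queue depth to an alert severity: roughly 10 percent above the
    expected depth raises a level-one warning; five times the expected
    depth raises an error requiring third-level investigation."""
    if depth >= expected * 5:
        return "error"    # e.g. 50,000: investigate risk to the system
    if depth >= expected + expected // 10:
        return "warning"  # e.g. 11,000: watch the queue more closely
    return "ok"
```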
Monitoring should be comprehensive: Using the previous example, 10,000 messages on a particular queue at 3 a.m. may not present a risk, but if there are 10,000 messages on another queue that maps to the same pageset and the pageset is 80 percent full, there’s a risk to both applications. Monitoring just for queue depth or just one queue may not be sufficient to avoid a failure.
An alert must be sent to the proper place within a proper timeframe: A monitor-generated alert sent to the wrong place or sent late to the proper place can lead to large-scale problems. Each alert should have a corresponding notification action, notification time-frame, and feedback requirement.
The failure of an OTMA connection would not usually be directed to the same problem solver as the failure of a CICS trigger monitor. The buildup of messages on a particular queue may not affect the WMQ subsystem, but it may indicate a significant application problem.
The type of notification should depend on how critical the risk is, since the monitor was designed to predict a potential problem, not announce that the problem has already occurred. Pager and e-mail alerts are asynchronous processes without a guarantee that the recipient will get them in time to analyze the situation.
An existing central “lights-on” command post would be a good recipient of alerts and can provide first-level analysis. Since command post personnel may not be familiar with WMQ, a comprehensive troubleshooting manual should be provided. An instruction that, “If a security violation message lands on the dead-letter queue between 5 p.m. and 8 a.m., send an e-mail to the support person; otherwise, phone him at his office number” would be welcomed more than, “Page the support person whenever a message lands on the dead-letter queue.”
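The article’s routing instruction is itself a small decision rule, and encoding it keeps the command post’s troubleshooting manual and the monitor in agreement. A sketch of that one rule (the function name and return values are assumptions):

```python
def route_dlq_security_alert(hour):
    """The routing rule from the example: a security-violation message on
    the dead-letter queue between 5 p.m. and 8 a.m. warrants an e-mail to
    the support person; during office hours, phone the office instead."""
    off_hours = hour >= 17 or hour < 8
    return "email" if off_hours else "phone"
```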
How Can Monitoring Occur?
Monitoring systems can be either purchased from a third-party vendor or built locally. A purchased product can generally run on multiple platforms, provide a central control location, and provide standardized alerts for standard incidents. The “one-size-fits-all” approach of vendor systems may be customizable with scripts, but ultimately, those scripts may become as complex as a locally built monitor. A vendor product also carries licensing costs and annual maintenance fees.
A locally built monitor can be written to watch for both general conditions and conditions specific to the local environment. While there’s an initial development cost, the maintenance cost can be quite low if the system is properly designed, and there are no licensing issues. A platform-centric system may have limited flexibility and extensibility; that would have to be weighed against the advantages of being able to take advantage of platform-specific features.
Combined systems can also be used, and these may provide the most flexibility. A home-built system can feed custom alert messages to a vendor product, which would use generic and standard processes to properly distribute the alerts. Messages can also be fed to an archiving system for uniform logging. A properly designed system could be easily customized for specific events.
The monitor shouldn’t generate false alarms: A monitor that continuously “cries wolf” won’t be taken seriously for long. Users should be able to easily disable and re-enable monitoring of specific events without affecting the entire monitoring system. Continuously receiving alerts about messages on a dead-letter queue while determining what to do with them should be an avoidable annoyance.
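Per-event enable/disable can be isolated in one small component so that silencing a known condition never touches the rest of the monitor. A minimal sketch (the class and event names are illustrative):

```python
class EventSwitchboard:
    """Lets users disable alerting for one specific event (e.g. while
    deciding what to do with messages already on the dead-letter queue)
    without affecting any other monitored event."""

    def __init__(self):
        self._disabled = set()

    def disable(self, event):
        self._disabled.add(event)

    def enable(self, event):
        self._disabled.discard(event)

    def should_alert(self, event):
        # Every event alerts unless it has been explicitly disabled.
        return event not in self._disabled
```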
A monitor must be reliable: If the monitor fails and the failure isn’t detected, an entire subsystem can be placed at risk. If it isn’t possible to monitor the monitor, redundant systems should be considered. Redundant systems also can be used to provide multiple monitoring frequencies. Most monitors function by either waiting for an event notification or by issuing queries against the subsystem. Additional systems can provide different query frequencies for different objects. A monitor also must not rely on what it’s monitoring to send alerts. Sending a “channel-stopped” alert across the stopped channel is a worthless exercise.
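A monitor for the monitor can be as simple as a heartbeat check run by a redundant watcher. A sketch, with the 60-second interval and two-interval grace period as assumed values:

```python
def monitor_is_alive(last_heartbeat, now, interval=60, grace=2):
    """The primary monitor records a heartbeat timestamp (seconds) each
    polling cycle; a redundant watcher treats the monitor as failed if
    more than `grace` intervals pass without one."""
    return (now - last_heartbeat) <= interval * grace
```

Critically, the watcher must report through a path independent of whatever the primary monitor watches, for the same reason a “channel-stopped” alert can’t travel across the stopped channel.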
WMQ monitoring is manageable when broken into small tasks. z/OS operations groups have comprehensive tools to indicate platform problems. Those tools can be leveraged to provide alerts for easily identifiable mainframe issues and for cases where subsystems issue standard codes that standard products can capture.
Where more local environmental issues exist, a customized, home-built monitor can provide for specific conditions and assist with prioritized problem determination suggestions based on local experience. Local control of the monitor would provide a level of flexibility not generally available with a systemwide vendor monitor and would expedite modifications as the risks and failure mode evaluation process matures.