WebSphere MQ Monitoring and Alerting: A Review

5 Pages

- Channels could fail to start for many reasons, including the unavailability of the partner Queue Manager, network errors, message count synchronization differences, or a full dead-letter queue.

Pagesets, which are VSAM linear datasets, can run out of extents or volumes because of unexpected message volume, unexpectedly large messages, or processes that are unavailable to retrieve the messages within the expected time frame. The active logs, also VSAM linear datasets, can run out of extents or volumes as well, either because persistent messages arrive in unexpectedly large numbers or sizes, or because the system can’t archive the logs as fast as the active logs grow. Too few buffers defined to the buffer pools could lengthen response time by requiring extra pageset I/O.

Reporting a Failure Occurrence

Failure occurrences can be identified from the system messages and codes sent to the Queue Manager and Channel Initiator message logs or to the MVS console, from the event notifications issued by the Queue Manager to event queues, and by programmatically requesting information on the status of WMQ objects. Alerts can be sent to a control center or other continuously manned location for first-level support, and to pagers or telephones for second- and third-level support.

Individual members in a class of objects can have different failure scenarios and monitoring requirements. A queue containing messages that need to be stored for 10 hours would have a different risk attached to its failure mode than a queue that should be cleared of messages within 10 seconds, and would require different monitor criteria. Certain objects may require several levels of alerts, or escalation procedures, depending on the risk assessment.
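Per-queue criteria of this kind can be kept in a small table keyed by queue name. The following is a minimal sketch rather than a prescribed design: the queue names, field names, and thresholds are all hypothetical, and the oldest-message age and current depth are assumed to come from whatever status query the site already uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QueueCriteria:
    """Monitoring criteria for one queue (all values are hypothetical)."""
    max_message_age_s: int  # how long a message may legitimately stay on the queue
    max_depth: int          # depth above which an alert is raised


# Hypothetical queues: one stores messages for up to 10 hours,
# the other should be cleared within 10 seconds.
CRITERIA = {
    "BATCH.HOLD.QUEUE": QueueCriteria(max_message_age_s=10 * 3600, max_depth=50_000),
    "ONLINE.REQUEST.QUEUE": QueueCriteria(max_message_age_s=10, max_depth=200),
}


def breaches(queue, oldest_message_age_s, depth):
    """Return the names of the criteria this queue currently violates."""
    criteria = CRITERIA[queue]
    found = []
    if oldest_message_age_s > criteria.max_message_age_s:
        found.append("message-age")
    if depth > criteria.max_depth:
        found.append("depth")
    return found
```

With these assumed values, a 60-second-old message is a breach on the request queue but normal on the 10-hour queue, which is the distinction the monitor has to encode.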

Each object class should have a risk assessment made for each identifiable failure mode. Within that, each member of the object class should have a risk assessment made for the failure mode. Messages placed on a dead-letter queue due to an IMS security failure have a lower risk assessment than messages placed on the dead-letter queue because the real target queue is full.
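The two-level assessment described above (class first, then individual member) could be recorded as a lookup with member-level overrides. This is only a sketch: the class names, failure-mode labels, member name, and risk levels are hypothetical placeholders, not values taken from WMQ.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


# Default risk per (object class, failure mode). Labels are hypothetical.
CLASS_RISK = {
    ("dead-letter-queue", "ims-security-failure"): Risk.LOW,
    ("dead-letter-queue", "target-queue-full"): Risk.HIGH,
    ("local-queue", "queue-full"): Risk.MEDIUM,
}

# Overrides for individual members of a class (hypothetical queue name).
MEMBER_RISK = {
    ("PAYMENTS.INPUT", "queue-full"): Risk.HIGH,
}


def assess(object_class, member, failure_mode):
    """Use the member-level assessment if one exists, else the class default."""
    override = MEMBER_RISK.get((member, failure_mode))
    if override is not None:
        return override
    return CLASS_RISK[(object_class, failure_mode)]
```

Keeping the class default and the member override in separate tables makes it visible which objects were assessed individually and which merely inherited a rating.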

For each failure mode, a corresponding list of causes should be identified, since it’s preferable to monitor for a cause rather than for a symptom. For example, messages could end up on a dead-letter queue because a local queue filled up, the queue filled up because the application expected to remove the messages failed to start, and the application failed to start because a batch trail prerequisite didn’t complete. If the relationship between the failed job and local queue activity had been identified in the design cycle, the failure of the job could have generated an alert warning of a potential queue-full event. Such a tie-in among events isn’t simple to identify and often isn’t recognized until the first actual failure occurs.
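Monitoring for the cause rather than the symptom can be sketched as a dependency map from batch jobs to the queues whose consumers depend on them. The job name, queue name, and message wording below are hypothetical, and the list of failed jobs is assumed to come from the site's scheduler.

```python
# Hypothetical dependency map: batch job -> queues whose consuming
# application cannot start until that job completes.
JOB_TO_QUEUES = {
    "NIGHTLY.EXTRACT": ["ORDERS.LOCAL.QUEUE"],
}


def early_warnings(failed_jobs):
    """Turn failed batch jobs into queue-full warnings before any depth event fires."""
    warnings = []
    for job in failed_jobs:
        for queue in JOB_TO_QUEUES.get(job, []):
            warnings.append(f"{queue}: queue-full risk, job {job} failed")
    return warnings
```

A job with no entry in the map produces no warning, so the map doubles as documentation of which batch dependencies have been identified so far.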

There’s a twofold exposure that planning should minimize. The first is the effect on the system of a large number of messages on the dead-letter queue. The second is the need to determine what to do with those messages. A post-mortem investigation would likely lead to a requirement for additional batch job monitoring and documentation, tying specific batch trail issues to queue depth issues, which may result in earlier identification of potential problems.

Non-expiring persistent messages would be at risk if the queue’s backing pageset fills up or if there’s a log failure. Quickly expiring, non-persistent messages, such as request messages, would be at much lower risk from any type of failure. Placing disparate messages such as these on the same message queue would create a new failure scenario, in which a sudden increase in request messages would affect the storage available to the long-term persistent messages. Alternatively, a sudden increase in the quickly expiring, non-persistent messages would cause a queue depth alert that would matter only if the messages were non-expiring or persistent. To accurately monitor an object, it’s necessary to know what it will be used for. A queue-depth-high alert on a queue for long-term persistent messages is worthless if the buffers and pagesets would fill before the event occurs.
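Whether a queue-depth-high alert can fire before the pageset fills is a matter of simple arithmetic. The sketch below is a rough estimate under stated assumptions: it treats the threshold as a percentage of the maximum depth (as the queue's MAXDEPTH and QDEPTHHI attributes do), uses an assumed average message size, ignores per-message storage overhead, and assumes the queue has the pageset's free space to itself. The function name and the figures are hypothetical.

```python
def depth_alert_fires_before_pageset_full(max_depth, depth_high_pct,
                                          avg_msg_bytes, pageset_free_bytes):
    """Return True if the depth-high threshold is reached before space runs out.

    Rough estimate only: ignores per-message overhead and assumes no other
    queue is consuming the same pageset.
    """
    threshold_messages = max_depth * depth_high_pct // 100
    bytes_at_threshold = threshold_messages * avg_msg_bytes
    return bytes_at_threshold <= pageset_free_bytes


FOUR_GB = 4 * 1024 ** 3

# 80% of 1,000,000 messages at 10,000 bytes each needs about 8 GB,
# so a pageset with 4 GB free fills long before the alert can fire.
too_late = depth_alert_fires_before_pageset_full(1_000_000, 80, 10_000, FOUR_GB)

# 80% of 100,000 messages at the same size needs about 0.8 GB,
# so the alert fires with space to spare.
in_time = depth_alert_fires_before_pageset_full(100_000, 80, 10_000, FOUR_GB)
```

In the first case the depth threshold would need to be lowered, or the pageset enlarged, before the alert is of any use.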

A monitor should recognize an actual risk before issuing an alert. To adequately assess risk, object usage needs to be identified and documented. For example:
