The immediate cause of the DLQ messages may not be the root cause of the problem. You should explore the following questions:
- Was the “objectname unknown” error caused by someone typing the name incorrectly? Who is checking the administrators? Or was the object unknown because someone deleted it? Where are the security and oversight?
- Was the “not authorized” error because the security request was still in the pipeline or did someone try to hack into the system?
- Did the target queue fill up because it was sized incorrectly or did the application removing the messages fail?
- Were the submitters aware that IMS transactions and databases have an offline period and that they shouldn’t send messages to them during that period?
The amount of time available to make these determinations depends on the business impact. No one really wants to do an impact analysis while the messages are sitting in the DLQ; that should have been done during the application development phase. When the page set full problem occurred, we knew the receiving CICS application could handle the larger messages without any ill effects. This enabled us to move the messages back to the original target queue in stages to avoid filling the page set again, permitting normal business to continue. The program was later corrected with the proper copy member. If the larger messages couldn’t have been safely accommodated, resending them to the original target wouldn’t have been possible. A side issue this uncovered was the danger of backing the DLQ with a page set used by other queues. If possible, the DLQ should also have its own buffer pool.
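The staged move described above can be sketched roughly as follows. This is only a simulation under stated assumptions: the queues are in-memory deques rather than real MQ queues, and the names `move_in_stages`, `batch_size`, `max_target_depth`, and `drain` are hypothetical stand-ins for whatever requeue tooling a shop actually uses.

```python
from collections import deque
from typing import Callable, Deque


def move_in_stages(
    dlq: Deque[bytes],
    target: Deque[bytes],
    batch_size: int,
    max_target_depth: int,
    drain: Callable[[Deque[bytes]], None],
    max_rounds: int = 1000,
) -> int:
    """Move messages from dlq to target in batches, never letting the
    target depth exceed max_target_depth. Returns the number moved."""
    moved = 0
    for _ in range(max_rounds):
        if not dlq:
            break
        # Room left on the target before the agreed ceiling is reached.
        room = max_target_depth - len(target)
        # range() of a non-positive number is empty, so a full target
        # simply skips this round and waits for the consumer.
        for _ in range(min(batch_size, room, len(dlq))):
            target.append(dlq.popleft())
            moved += 1
        # Simulated: give the receiving application time to consume.
        drain(target)
    return moved


if __name__ == "__main__":
    dlq = deque(b"msg%d" % i for i in range(10))
    target: Deque[bytes] = deque()
    move_in_stages(dlq, target, batch_size=3, max_target_depth=4,
                   drain=lambda q: q.clear())
    print(len(dlq), "messages left on the DLQ")
```

The depth ceiling is the important design choice here: each batch is sized against the room left on the target, so a slow consumer slows the move down instead of filling the page set again.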
Many companies don’t have dedicated 24x7 production support teams for every application. Even though messages belong to a business group, an operations team with only general knowledge is often the only off-hours, first-level support available. For this reason, a failure and impact analysis, which should be part of the development stage, must cover the management of dead letter messages, and its results must be included in the production support documentation. This isn’t such a tremendous undertaking.
Document and Automate
You need to document the following:
- The source and the destination of messages: If a problem exists or is imminent, can the sender of the messages be stopped?
- The timeliness of the messages: Can the messages be expired without intervention?
- The resources needed to process the messages: When can a retry be initiated? Restoration of Service (ROS) activity for a failed application, platform, or service (a different topic) must be completed first.
- The escalation process: At what point do data specialists get involved, and how are they contacted?
With that information, handling DLQ messages could become an automated process. IBM provides a dead letter handler utility, CSQUDLQH, described in the System Administration Guide. This utility performs some action, such as retrying or forwarding messages in the DLQ, based on a rules-matching process against fields in the Dead Letter Header (DLH) and Message Descriptor (MQMD). A home-grown equivalent could be created for platforms that don’t have one. Since putting messages in the DLQ could be considered a processing failure, a supplemental, home-grown utility could be written to provide reports for audit and additional data gathering. We found such a utility useful in identifying the source of messages that expired before someone could manually review them. It provides the information necessary to identify the messages while maintaining data confidentiality. Figure 1 shows one such sample report.
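For a home-grown equivalent, the rules-matching idea might look something like the sketch below. It is not the CSQUDLQH rules-table syntax or its implementation. The `DeadLetter` and `Rule` classes, the action names, and the queue names are all hypothetical; only the three reason code values are taken from the MQ headers. The first matching rule wins, and an unmatched message is left on the DLQ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# MQ reason codes as defined in the MQ headers.
MQRC_NOT_AUTHORIZED = 2035
MQRC_Q_FULL = 2053
MQRC_UNKNOWN_OBJECT_NAME = 2085


@dataclass(frozen=True)
class DeadLetter:
    """Only the fields this sketch matches on (hypothetical subset)."""
    reason: int      # Reason from the dead letter header
    dest_q: str      # DestQName from the dead letter header
    put_appl: str    # PutApplName from the message descriptor


@dataclass(frozen=True)
class Rule:
    action: str                       # hypothetical: RETRY, FORWARD, LEAVE
    reason: Optional[int] = None      # None matches any reason
    dest_q: Optional[str] = None      # None matches any destination
    forward_to: Optional[str] = None  # used only by FORWARD

    def matches(self, msg: DeadLetter) -> bool:
        return (self.reason in (None, msg.reason)
                and self.dest_q in (None, msg.dest_q))


def choose_action(msg: DeadLetter, rules: Iterable[Rule]) -> Rule:
    """Return the first matching rule, or a default that leaves the message."""
    for rule in rules:
        if rule.matches(msg):
            return rule
    return Rule(action="LEAVE")


if __name__ == "__main__":
    rules = [
        Rule("RETRY", reason=MQRC_Q_FULL),
        Rule("FORWARD", reason=MQRC_UNKNOWN_OBJECT_NAME,
             forward_to="OPS.REVIEW"),
    ]
    msg = DeadLetter(MQRC_Q_FULL, "PAYMENTS.IN", "PAYAPP")
    print(choose_action(msg, rules).action)
```

Defaulting to "leave it alone" means a message nobody anticipated stays on the DLQ for a person to review and is never discarded silently.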
When the messages are identified, the appropriate documentation can be referenced on what action to take. If the messages will expire, no action is needed. Other actions can be programmed into the DLQ handler as necessary. Such software would be useful, and may even be necessary, for the “needle in the haystack” scenario where a single critical message is buried in the queue among thousands of less-critical messages. The software, automatically initiated by a trigger process, could rapidly browse the messages, categorize them based on previously defined rules, and identify those it doesn’t recognize. First-level alerts should go to a continuously monitored console at the previously agreed-upon criticality. Shops that have many batch processes usually have skilled support teams to handle failures. A similar, or even the same, team can handle DLQ alerts and initiate escalation procedures for time-sensitive issues.
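The “needle in the haystack” triage could be approximated along these lines. This is a sketch only: the `Browsed` record, the `CRITICALITY` table, and the queue names are invented for illustration, and a real utility would browse the DLQ through the MQ API rather than iterate over a list.

```python
from collections import Counter
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Browsed(NamedTuple):
    """What the sketch keeps from each browsed message (no payload data)."""
    msg_id: str
    dest_q: str


# Hypothetical table from the support documentation:
# destination queue -> previously agreed criticality.
CRITICALITY: Dict[str, str] = {
    "PAYMENTS.IN": "critical",
    "REPORTS.IN": "low",
}


def triage(
    messages: Iterable[Browsed],
    criticality: Dict[str, str] = CRITICALITY,
) -> Tuple[Counter, List[str]]:
    """Count messages per category and list the ids that need attention:
    anything critical, plus anything the rules do not recognize."""
    counts: Counter = Counter()
    flagged: List[str] = []
    for m in messages:
        level = criticality.get(m.dest_q, "unrecognized")
        counts[level] += 1
        if level in ("critical", "unrecognized"):
            flagged.append(m.msg_id)
    return counts, flagged


if __name__ == "__main__":
    backlog = [Browsed(f"R{i}", "REPORTS.IN") for i in range(1000)]
    backlog.insert(500, Browsed("P1", "PAYMENTS.IN"))
    counts, flagged = triage(backlog)
    print(dict(counts), flagged)
```

Because only identifiers and destination names are kept, the resulting report can go to a general operations console without exposing message contents.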
Some administrators panic when messages are directed to the DLQ. The DLQ is just another system object that needs to be managed, like any other object. By setting up a process and requiring adherence to the input parameters, managing messages in the DLQ could be a simple, even automated, task.