Aug 19 ’10
Managing the WebSphere MQ Dead Letter Queue
Among the stated objectives of WebSphere MQ (WMQ) is assured delivery, once and only once, but there are instances when messages can’t be delivered to intended recipients. Causes for this include:
- Non-existing recipient (no queue defined to the queue manager)
- Incorrect recipient (queue name misspelled)
- Recipient’s mailbox full (max queue depth reached)
- Recipient unavailable—Open Transaction Manager Access (OTMA) is unable to deliver a message to an IMS queue.
WMQ provides for these occurrences through the Dead Letter Queue (DLQ), the repository of almost last resort, similar to a post office’s dead letter office. Messages landing in that queue require special, usually manual, attention.
There are tools that can be used to delete, copy, and move messages around, but the policies surrounding their use are often murky. Are the contents of the messages private, personal, secret, or otherwise restricted from general viewing? Could the contents be subject to Sarbanes-Oxley (SOX) or other legal regulations? With only one DLQ per subsystem, how do you manage it without breaking any of the unique rules that may exist around the multitude of different messages? Just rerouting the message automatically to its intended destination is insufficient. If the message was a request-reply across platforms, the sender may no longer be waiting for the reply, which may then land in the DLQ of the other queue manager. Obviously, a more comprehensive solution is required.
Message data is “owned” by a business project, just like file records are “owned” by a business project. During creation of that project, the ownership of the message data must be established at a level that can make decisions on disposition of the messages if they land in the DLQ. That level also must be established so it will be unaffected by personnel changes or departmental reorganizations. If the data is sent by one project, processed by a second, and delivered to a third, management becomes more complex, especially if one of those projects is in a different company. Often, decisions made during development to get the project started are carried forward to production. How many developers really have the authority to decide if it’s legal to allow some stranger access to DLQ messages containing a customer’s unencrypted financial information? Without responsible message ownership, it’s impossible to establish a DLQ management process that will satisfy the rules to which the message content may be subject.
There’s an implied assumption that the message data owner can be linked to a specific message or type of message. The link can be “any message destined for a particular queue,” “any message that will execute a specific transaction,” “any message with a specific origin,” “any message with a particular data string in a certain location,” or some other identification method. It may seem obvious, but the same data may have several owners with different requirements; publish-subscribe (pub-sub) messages are one example. Even though WMQ is a time-independent process, the message sender may require timely notification of a break or delay in the delivery path. Stopping the sender to prevent additional messages could help recovery and cleanup when the problem is resolved.
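The identification methods listed above can be sketched as a small ownership registry. This is only an illustration: the `DeadMessage` fields, the rule list, and every queue, transaction, and owner name are hypothetical, and no WMQ API is involved. The lookup returns a list because, as with pub-sub, one message may have several owners.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class DeadMessage:
    """Hypothetical, simplified view of a message found on the DLQ."""
    dest_queue: str    # queue the message was originally destined for
    origin: str        # sending application or queue manager
    transaction: str   # transaction the message would execute, if any
    payload: bytes     # message data


@dataclass(frozen=True)
class OwnershipRule:
    owner: str
    matches: Callable[[DeadMessage], bool]


# Hypothetical rules, one per identification method named in the text.
OWNERSHIP_RULES: List[OwnershipRule] = [
    # "any message destined for a particular queue"
    OwnershipRule("payments-team", lambda m: m.dest_queue == "PAY.REQUEST"),
    # "any message that will execute a specific transaction"
    OwnershipRule("ims-orders", lambda m: m.transaction == "ORD1"),
    # "any message with a specific origin"
    OwnershipRule("partner-feed", lambda m: m.origin.startswith("PARTNERQM")),
    # "any message with a particular data string in a certain location"
    OwnershipRule("claims", lambda m: m.payload[0:3] == b"CLM"),
]


def find_owners(msg: DeadMessage) -> List[str]:
    """Return every owner whose rule matches; an empty list means unowned."""
    return [rule.owner for rule in OWNERSHIP_RULES if rule.matches(msg)]


sample = DeadMessage("PAY.REQUEST", "BATCHJOB1", "", b"CLM0001")
owners = find_owners(sample)
```

An empty result is itself useful: it flags a message for which no ownership was established during the project.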
Root Cause Determination
Once a message is in the DLQ, the options are to discard it, retry delivery, or redirect it. If the message can be discarded, the simplest method is to set an expiry period and let a scavenger program remove the messages. But even then, you should try to determine the reason the messages landed in the DLQ. If delivery is to be retried, further analysis is required, because a retry is useful only after the receiving process is restored. Redirecting a message is useful as an interim step preceding redelivery, either to clear the DLQ while recovery is under way or to give the owner access to the message to assist in problem determination, restoration of service, and data recovery. However, available resources may not permit alternate queues for each application; security and change procedures may not permit “on the fly” creation of production objects for message redirection. What’s acceptable at one shop may not fly at another.
The most important piece of information is the reason the message was placed in the DLQ. Anyone who has administered a WMQ system for any significant time knows that Murphy can be at his most creative here. We once carefully calculated the requirements for a new batch application initial feed to a CICS processing application and added a 10 percent safety margin. When the job ran, the pageset behind the queue was quickly filled up, and half the messages landed in the DLQ. The investigation uncovered that the developer used the wrong copy member for the message—one that was three times longer than anyone was told—and, of course, used the COBOL “LENGTH OF” special register specification. It escaped notice in three levels of testing.
There’s also no reason why all the messages in the DLQ would be from the same source or have the same target. Consider messages designated for stopped IMS transactions using the OTMA-IMS Bridge. Unable to deliver the messages to the IMS input queues, OTMA will send all of them to the same DLQ with the notoriously generic code 00000146, “OTMA X'1A' IMS detected error.”
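The relationship between the code 00000146 and the X'1A' sense code can be worked out with a little arithmetic. The sketch below assumes the code is a hexadecimal feedback value and that the WMQ constants MQFB_IMS_ERROR (300), MQFB_IMS_FIRST (301), and MQFB_IMS_LAST (399) apply, so the sense code is the feedback value minus 300. Check the constants against your own WMQ headers before relying on it. The function name is hypothetical.

```python
from typing import Optional

# Assumed values of the WMQ feedback constants for IMS bridge errors.
MQFB_IMS_ERROR = 300
MQFB_IMS_FIRST = 301
MQFB_IMS_LAST = 399


def ims_sense_code(code_hex: str) -> Optional[int]:
    """Return the IMS-OTMA sense code carried in a hex feedback code,
    or None if the code is outside the IMS range."""
    feedback = int(code_hex, 16)
    if MQFB_IMS_FIRST <= feedback <= MQFB_IMS_LAST:
        return feedback - MQFB_IMS_ERROR
    return None


sense = ims_sense_code("00000146")   # hex 146 is decimal 326; 326 - 300 = 26
sense_label = f"X'{sense:02X}'"      # decimal 26 is hex 1A
```

The sense code only says that IMS rejected the message, so the messages still have to be separated by destination and origin to find out which stopped transaction each one was meant for.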
The immediate cause of the DLQ messages may not be the root cause of the problem. You should explore the following questions:
- Was the “objectname unknown” because someone typed it incorrectly? Who is checking the administrators? Or was it unknown because someone deleted it? Where is the security and oversight?
- Was the “not authorized” error because the security request was still in the pipeline or did someone try to hack into the system?
- Did the target queue fill up because it was sized incorrectly or did the application removing the messages fail?
- Were the submitters aware there’s an offline period for IMS transactions and databases and they shouldn’t be sending to them during that period?
The amount of time available to make these determinations depends on the business impact. No one really wants to do an impact analysis while the messages are sitting in the DLQ; that should have been done during the application development phase. When the pageset full problem occurred, we knew the receiving CICS application was able to handle the larger messages without any ill effects. This enabled us to move the messages back to the original target queue in stages to avoid filling the pageset again, so normal business could continue. The program was later corrected with the proper copy member. If the larger messages couldn’t have been safely accommodated, resending them to the original target wouldn’t have been possible. A side issue that this uncovered was the danger of backing the DLQ with a pageset used by other queues. If possible, the DLQ should also have its own bufferpool.
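Moving messages back in stages can be sketched as a simple loop that never lets the target queue exceed a chosen depth. This is a simulation using in-memory queues, not the procedure we actually ran. The function name, the depth limit, the batch size, and the `drain` callback (standing in for the receiving application) are all assumptions for illustration.

```python
from collections import deque
from typing import Callable, Deque, List


def requeue_in_stages(dlq: Deque, target: Deque, max_depth: int,
                      batch_size: int, drain: Callable[[Deque], None]) -> int:
    """Move every message from dlq to target, in order, without letting
    the target depth exceed max_depth. Returns the number of stages."""
    stages = 0
    while dlq:
        room = max_depth - len(target)
        if room <= 0:
            # Target is full: let the receiving application catch up.
            depth_before = len(target)
            drain(target)
            if len(target) >= depth_before:
                raise RuntimeError("target is not draining; stop and investigate")
            continue
        for _ in range(min(batch_size, room, len(dlq))):
            target.append(dlq.popleft())
        stages += 1
    return stages


# Usage: ten dead-lettered messages, a consumer that removes two at a time.
consumed: List[int] = []


def consumer(queue: Deque) -> None:
    for _ in range(min(2, len(queue))):
        consumed.append(queue.popleft())


dead_letters = deque(range(10))
target_queue: Deque = deque()
stage_count = requeue_in_stages(dead_letters, target_queue,
                                max_depth=4, batch_size=3, drain=consumer)
```

The loop stops with an error when the target stops draining, because continuing would only send the messages back to the DLQ.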
Many companies don’t have dedicated 24x7 production support teams for every application. Even though messages belong to a business group, an operations team with only general knowledge is often the only off-hours, first-level support available. For this reason, a failure and impact analysis, which should be part of the development stage, must cover the management of dead letter messages, and the results must be included with the production support documentation. This isn’t such a tremendous undertaking.
Document and Automate
You need to document the following:
- The source and the destination of messages: If a problem exists or is imminent, can the sender of the messages be stopped?
- The timeliness of the messages: Can the messages be expired without intervention?
- The resources needed to process the messages: When can a retry be initiated? Restoration of Service (ROS) activity for a failed application, platform, or service (a different topic) must be completed first.
- The escalation process: At what point do data specialists get involved, and how are they contacted?
With that information, handling DLQ messages could become an automated process. IBM provides a dead letter handler utility, CSQUDLQH, described in the System Administration Guide. This utility performs an action, such as retrying or forwarding messages in the DLQ, based on a rules-matching process against fields in the Dead Letter Header (DLH) and Message Descriptor (MQMD). A home-grown equivalent could be created for platforms that don’t have one. Since putting messages in the DLQ could be considered a processing failure, a supplemental, home-grown utility could be written to provide reports for audit and additional data gathering. We found such a utility useful in identifying the source of messages that expired before someone could manually review them. It provides the information necessary to identify the messages while maintaining data confidentiality. Figure 1 shows one such sample report.
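The rules-matching idea can be sketched in a few lines for a home-grown handler. This does not reproduce the CSQUDLQH rules table syntax or call any WMQ API. The `DeadLetter` and `Rule` types, the first-match-wins policy, and the queue names are assumptions made for illustration; the reason names are standard MQRC_ constants.

```python
from dataclasses import dataclass
from fnmatch import fnmatchcase
from typing import List, Optional


@dataclass(frozen=True)
class DeadLetter:
    """Hypothetical subset of the fields a handler would read."""
    reason: str       # why the message was dead-lettered (from the DLH)
    dest_queue: str   # original destination queue (from the DLH)
    put_appl: str     # application that put the message (from the MQMD)


@dataclass(frozen=True)
class Rule:
    action: str                    # "RETRY", "FORWARD", or "IGNORE"
    reason: str = "*"              # patterns default to "match anything"
    dest_queue: str = "*"
    put_appl: str = "*"
    forward_to: Optional[str] = None

    def matches(self, msg: DeadLetter) -> bool:
        return (fnmatchcase(msg.reason, self.reason)
                and fnmatchcase(msg.dest_queue, self.dest_queue)
                and fnmatchcase(msg.put_appl, self.put_appl))


# Hypothetical rules; order matters because the first match wins.
RULES: List[Rule] = [
    Rule("RETRY", reason="MQRC_Q_FULL"),
    Rule("FORWARD", dest_queue="PAYROLL.*", forward_to="PAYROLL.HOLD"),
    Rule("IGNORE"),  # catch-all: leave the message for manual review
]


def choose_action(msg: DeadLetter) -> Rule:
    """Return the first rule whose patterns all match the message."""
    for rule in RULES:
        if rule.matches(msg):
            return rule
    raise LookupError("no rule matched; add a catch-all rule")


decision = choose_action(DeadLetter("MQRC_Q_FULL", "PAYROLL.IN", "BATCH1"))
```

Because the rules read only header fields, the handler never has to look at the message data, which helps maintain confidentiality.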
When the messages are identified, the appropriate documentation can be referenced on what action to take. If the messages will expire, no action is needed. Other actions can be programmed into the DLQ handler as necessary. Such software would be useful, and may even be necessary, for identifying the “needle in the haystack” scenario where a single critical message is buried in the queue among thousands of less-critical messages. The software, automatically initiated by a trigger process, could rapidly browse the messages, categorize them based on previously defined rules, and identify those it doesn’t recognize. First-level alerts should go to a continuously monitored console at the previously agreed-upon criticality. Shops that have many batch processes usually have skilled support teams to handle failures. A similar, or even the same, team can handle DLQ alerts and initiate escalation procedures for time-sensitive issues.
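The “needle in the haystack” browse can be sketched as a single pass that counts messages by category and sets aside the critical and unrecognized ones for alerting. The `Header` type, the category table, and the queue names are hypothetical. A real implementation would browse the DLQ through the WMQ API rather than iterate over a list.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Header:
    """Hypothetical header-only view of a DLQ message (no message data)."""
    msg_id: str
    dest_queue: str
    reason: str


# Hypothetical categories agreed with the message owners in advance.
CATEGORY_BY_QUEUE: Dict[str, str] = {
    "PAY.REQUEST": "critical",
    "STATS.FEED": "expirable",
}


def triage(headers: Iterable[Header]) -> Tuple[Counter, List[str], List[str]]:
    """Count messages per category; list critical and unrecognized IDs."""
    counts: Counter = Counter()
    critical: List[str] = []
    unrecognized: List[str] = []
    for header in headers:
        category = CATEGORY_BY_QUEUE.get(header.dest_queue, "unrecognized")
        counts[category] += 1
        if category == "critical":
            critical.append(header.msg_id)
        elif category == "unrecognized":
            unrecognized.append(header.msg_id)
    return counts, critical, unrecognized


# Usage: one critical message buried among a thousand expirable ones.
backlog = [Header(f"M{i:04d}", "STATS.FEED", "MQRC_Q_FULL") for i in range(1000)]
backlog.insert(500, Header("M-PAY", "PAY.REQUEST", "MQRC_Q_FULL"))
backlog.append(Header("M-ODD", "TYPO.QUEUE", "MQRC_UNKNOWN_OBJECT_NAME"))
counts, critical, unrecognized = triage(backlog)
```

The critical and unrecognized lists are what would feed the first-level alerts, while the counts give the support team a sense of scale.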
Some administrators panic when messages are directed to the DLQ. The DLQ is just another system object that needs to be managed, like any other object. By setting up a process and requiring adherence to the input parameters, managing messages in the DLQ could be a simple, even automated, task.