
Managing the WebSphere MQ Dead Letter Queue


Among the stated objectives of WebSphere MQ (WMQ) is assured delivery, once and only once, but there are instances when messages can’t be delivered to intended recipients. Causes for this include:

  • Non-existing recipient (no queue defined to the queue manager)
  • Incorrect recipient (queue name misspelled)
  • Recipient’s mailbox full (max queue depth reached)
  • Recipient unavailable (Open Transaction Manager Access, or OTMA, can’t deliver a message to an IMS queue)

WMQ provides for these occurrences through the Dead Letter Queue (DLQ), the repository of almost last resort, similar to a post office’s dead letter office. Messages landing in that queue require special, usually manual, attention.

Murky Policies

There are tools that can be used to delete, copy, and move messages around, but the policies surrounding their use are often murky. Are the contents of the messages private, personal, secret, or otherwise restricted from general viewing? Could the contents be subject to Sarbanes-Oxley (SOX) or other legal regulations? With only one DLQ per subsystem, how do you manage it without breaking any of the unique rules that may exist around the multitude of different messages? Just rerouting the message automatically to its intended destination is insufficient. If the message was a request-reply across platforms, the sender may no longer be waiting for the reply, which may then land in the DLQ of the other queue manager. Obviously, a more comprehensive solution is required.

Message data is “owned” by a business project, just as file records are. When the project is created, ownership of the message data must be established at a level that can decide how to dispose of messages that land in the DLQ. That level must also be chosen so that it is unaffected by personnel changes or departmental reorganizations. If the data is sent by one project, processed by a second, and delivered to a third, management becomes more complex, especially if one of those projects is in a different company. Often, decisions made during development to get the project started are carried forward to production. How many developers really have the authority to decide if it’s legal to allow some stranger access to DLQ messages containing a customer’s unencrypted financial information? Without responsible message ownership, it’s impossible to establish a DLQ management process that will satisfy the rules to which the message content may be subject.

There’s an implied assumption that the message data owner can be linked to a specific message or type of message. The link can be “any message destined for a particular queue,” “any message that will execute a specific transaction,” “any message with a specific origin,” “any message with a particular data string in a certain location,” or some other identification method. That may seem obvious, but ownership isn’t always one-to-one: with publish-subscribe (pub-sub) messages, for example, the same data may have several owners with different requirements. Even though WMQ is a time-independent process, the message sender may require timely notification of a break or delay in the delivery path. Stopping the sender to prevent additional messages could help recovery and cleanup when the problem is resolved.
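The identification methods listed above could be expressed as a simple rule table. The Python sketch below is illustrative only: the field names (`dest_queue`, `transaction`, `origin`, `body`), the queue and transaction names, and the owner labels are all hypothetical, and a real site would take them from its own message headers and naming standards. It returns every matching owner rather than the first, because the same data may have more than one.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class DeadMessage:
    # Hypothetical fields; a real handler would fill these from the
    # dead-letter header and the message itself.
    dest_queue: str
    transaction: str
    origin: str
    body: bytes


@dataclass
class OwnershipRule:
    owner: str
    matches: Callable[[DeadMessage], bool]


# One rule per identification method. Every name here is made up.
RULES: List[OwnershipRule] = [
    OwnershipRule("payments-project", lambda m: m.dest_queue == "PAY.REQUEST"),
    OwnershipRule("claims-project", lambda m: m.transaction == "CLM1"),
    OwnershipRule("partner-gateway", lambda m: m.origin == "QMEXT1"),
    OwnershipRule("audit-project", lambda m: m.body[0:4] == b"AUDT"),
]


def find_owners(msg: DeadMessage, rules: List[OwnershipRule] = RULES) -> List[str]:
    """Return all owners whose rule matches; an empty list means unowned."""
    return [rule.owner for rule in rules if rule.matches(msg)]
```

An empty result is itself useful information: it identifies a message for which no one has accepted responsibility, which is the gap the ownership discussion is meant to close.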

Root Cause Determination

For a message in the DLQ, the options are to discard it, retry delivery, or redirect it. If the message can be discarded, the simplest method is to set an expiry period and let a scavenger program remove the messages. But even then, you should try to determine why the messages landed in the DLQ. If delivery is to be retried, further analysis is required, because retrying is useful only after the receiving process is restored. Redirecting a message is useful as an interim step before redelivery, either to clear the DLQ while recovery is under way or to give the owner access to the message to assist in problem determination, restoration of service, and data recovery. However, available resources may not permit alternate queues for each application, and security and change procedures may not permit “on the fly” creation of production objects for message redirection. What’s acceptable at one shop may not fly at another.
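The choice among these options can be written down as a small decision function. This is a minimal sketch under stated assumptions, not a prescribed policy: the three input flags are hypothetical values an owner's policy would supply, and the `HOLD` outcome (leave the message in the DLQ for manual attention) is added here for the case where none of the three options applies.

```python
from enum import Enum


class Disposition(Enum):
    DISCARD = "discard"    # let expiry and a scavenger remove it
    RETRY = "retry"        # put it back on its intended queue
    REDIRECT = "redirect"  # move it to an interim holding queue
    HOLD = "hold"          # assumption: leave in the DLQ for manual attention


def choose_disposition(discardable: bool,
                       receiver_restored: bool,
                       holding_queue_available: bool) -> Disposition:
    """Pick a disposition from hypothetical owner-policy flags."""
    if discardable:
        return Disposition.DISCARD
    # Retrying is only useful once the receiving process is back.
    if receiver_restored:
        return Disposition.RETRY
    # Redirection needs a pre-approved queue; many shops will not
    # allow one to be created on the fly in production.
    if holding_queue_available:
        return Disposition.REDIRECT
    return Disposition.HOLD
```

Whatever the outcome, the reason the message reached the DLQ should still be recorded before the message is moved or removed.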

The most important piece of information is the reason the message was placed in the DLQ. Anyone who has administered a WMQ system for any significant time knows that Murphy can be at his most creative here. We once carefully calculated the requirements for a new batch application initial feed to a CICS processing application and added a 10 percent safety margin. When the job ran, the pageset behind the queue was quickly filled up, and half the messages landed in the DLQ. The investigation uncovered that the developer used the wrong copy member for the message—one that was three times longer than anyone was told—and, of course, used the COBOL “LENGTH OF” special register specification. It escaped notice in three levels of testing.

There’s also no reason why all the messages in the DLQ would be from the same source or have the same target. Consider messages designated for stopped IMS transactions using the OTMA-IMS Bridge. Unable to deliver the messages to the IMS input queues, OTMA will send all of them to the same DLQ with the notoriously generic code 00000146, “OTMA X’1A’ IMS detected error.”
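Because one DLQ can hold messages with different reasons and targets, a first triage step is to read the dead-letter header (MQDLH) that prefixes each message and count messages by reason and destination queue. The Python sketch below assumes the 172-byte version 1 header layout, big-endian integers, and EBCDIC code page 037 for character fields; the actual encoding depends on the platform and the message descriptor, so treat the `codec` default as an assumption. The sense-code helper assumes the convention that feedback values 301 through 399 carry the IMS-OTMA sense code plus 300, which agrees with the example above: hexadecimal 146 is decimal 326, and 326 minus 300 is 26, or X'1A'.

```python
import struct
from collections import Counter
from typing import Iterable, NamedTuple, Optional

# StrucId(4) Version Reason DestQName(48) DestQMgrName(48) Encoding
# CodedCharSetId Format(8) PutApplType PutApplName(28) PutDate(8) PutTime(8)
_DLH = struct.Struct(">4sii48s48sii8si28s8s8s")  # 172 bytes, no padding


class DeadLetterHeader(NamedTuple):
    reason: int
    dest_queue: str
    dest_qmgr: str
    put_appl_name: str


def parse_dlh(data: bytes, codec: str = "cp037") -> DeadLetterHeader:
    """Extract the triage fields from the start of a DLQ message."""
    if len(data) < _DLH.size:
        raise ValueError("buffer shorter than a dead-letter header")
    (struc_id, _version, reason, dest_q, dest_qmgr, _encoding, _ccsid,
     _fmt, _appl_type, appl_name, _date, _time) = _DLH.unpack_from(data)
    if struc_id.decode(codec) != "DLH ":
        raise ValueError("not a dead-letter header")

    def text(raw: bytes) -> str:
        # Fixed-length fields are padded with blanks or nulls.
        return raw.decode(codec).rstrip(" \x00")

    return DeadLetterHeader(reason, text(dest_q), text(dest_qmgr), text(appl_name))


def ims_sense_code(reason: int) -> Optional[int]:
    """Return the IMS-OTMA sense code if the reason is in the 301-399 range."""
    return reason - 300 if 301 <= reason <= 399 else None


def summarize(headers: Iterable[DeadLetterHeader]) -> Counter:
    """Count messages per (reason, destination queue) pair."""
    return Counter((h.reason, h.dest_queue) for h in headers)
```

A summary keyed on reason and destination makes it clear at a glance whether the DLQ holds one problem or several, which matters before any bulk retry or redirect is attempted.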
