Written by Martin Führlinger, Software Engineer Backend
In my previous posts about the message bus, I wrote about using RabbitMQ for decoupling our services and how we defined our message content. The last one was about how we keep our receivers fast and resilient. Now I want to provide more insights into dead letter handling, which was mentioned in my earlier posts.
A message can become a dead letter for various reasons. As mentioned in my previous blog post about keeping receivers fast, a message with invalid content is not accepted by the receiver. In this case, the message is negatively acknowledged (NACKed). There are other reasons for a NACK as well, for example when enqueuing the job into Sidekiq fails because the underlying Redis is not available.
No matter which reason led to the NACK, the message is handed back to the message bus. When NACKing, you can pass a requeue parameter, which defines whether the message will be redelivered (the default) or not.
If a message is NACKed with the requeue option set, it is redelivered immediately. This can cause pretty high redelivery rates, which in the best case slow down the whole RabbitMQ and in the worst case kill it, if too many messages are redelivered to too many consumers. Our internal implementation of dead letter handling therefore retries NACKed messages only once. In case of invalid content, the message would obviously be NACKed again. If there was another reason, it might work on the second try on another consumer machine. If the processing still does not work, the message is dropped.
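The retry-once behaviour can be sketched as follows. This is only an illustration, not our actual implementation: it assumes the retry decision is based on the `redelivered` flag that RabbitMQ sets on requeued messages, and that `channel` and `method` look like the objects a pika consumer callback receives. The names `should_requeue`, `handle_delivery` and `process` are made up for this example.

```python
def should_requeue(redelivered: bool) -> bool:
    """Requeue only on the first failure.

    A message that arrives with redelivered=True has already been
    handed out once, so it does not get another retry.
    """
    return not redelivered


def handle_delivery(channel, method, body, process):
    """Ack on success; on failure NACK, requeuing only the first time.

    `process` is a hypothetical handler that raises on any failure
    (invalid content, Redis not reachable, ...).
    Returns True if the message was processed, False if it was NACKed.
    """
    try:
        process(body)
    except Exception:
        channel.basic_nack(
            delivery_tag=method.delivery_tag,
            requeue=should_requeue(method.redelivered),
        )
        return False
    channel.basic_ack(delivery_tag=method.delivery_tag)
    return True
```

One caveat of this approach: RabbitMQ also marks a message as redelivered when it is requeued because a consumer connection dropped, so such a message would be dropped on its first real failure.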
Dead Letter Exchange
Dropped messages are basically lost forever, unless you have defined a dead-letter exchange (dlx). This exchange is created to collect all dead letters and pass them into a special queue, which we named “all.dead-letters”. This queue is created manually, too, and just “stores” all the lost messages.
To create the dead letter exchange we took the following steps:
- Create new FANOUT exchange with the name “all.dead-letters” and the following options:
  - durable: true
  - internal: true
- Create new queue “all.dead-letters”
- Bind “all.dead-letters” queue to “all.dead-letters” exchange
- Define the queues to use the dead-letter exchange. This can be done separately for each queue or using a policy (see dlx documentation).
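The steps above could be scripted roughly like this. It is a sketch under assumptions: `channel` is expected to behave like a pika `BlockingChannel`, the queues are declared durable (the steps above do not say so), and the last call shows the per-queue variant using the `x-dead-letter-exchange` queue argument. The function name and the `work_queue` parameter are made up for this example.

```python
DEAD_LETTER_NAME = "all.dead-letters"


def setup_dead_letter_topology(channel, work_queue):
    """Declare the dead letter exchange and queue, and point one queue at them."""
    # Fanout exchange; internal means clients cannot publish to it directly.
    channel.exchange_declare(
        exchange=DEAD_LETTER_NAME,
        exchange_type="fanout",
        durable=True,
        internal=True,
    )
    # Queue that just stores the dead letters (durable is an assumption).
    channel.queue_declare(queue=DEAD_LETTER_NAME, durable=True)
    # A fanout exchange ignores routing keys, so a plain binding is enough.
    channel.queue_bind(queue=DEAD_LETTER_NAME, exchange=DEAD_LETTER_NAME)
    # Per-queue variant: rejected messages of this queue go to the dlx.
    channel.queue_declare(
        queue=work_queue,
        durable=True,
        arguments={"x-dead-letter-exchange": DEAD_LETTER_NAME},
    )
```

With pika installed, the channel could come from `pika.BlockingConnection(pika.ConnectionParameters("localhost")).channel()`. Note that redeclaring an existing queue with different arguments fails, which is one reason to prefer a policy for queues that already exist.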
Once we set up the dlx correctly, all NACKed messages that are not requeued end up in the defined queue. This leads to a higher amount of memory/storage used, which needs to be monitored. Our OPS team has some alerts on the size of that queue. As soon as they get alerted, they make sure that the backend team cleans up those invalid messages.
To clean up that queue, we wrote a small script. The script connects to RabbitMQ, attaches to that particular queue, and behaves like any other consumer from RabbitMQ’s point of view. As long as this consumer is running, it receives the messages. It then reads the message content and its metadata and decides what to do. This can be one of the following things:
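A size check like the one behind such an alert might look like the sketch below. It is not the OPS team's actual monitoring: it assumes a pika-style channel, where a passive `queue_declare` returns the number of ready messages, and the threshold of 1000 is an arbitrary example value.

```python
def dead_letter_count(channel, queue="all.dead-letters"):
    """Return the number of ready messages in the dead letter queue.

    passive=True only inspects the queue and never creates it.
    """
    result = channel.queue_declare(queue=queue, passive=True)
    return result.method.message_count


def needs_cleanup(count, threshold=1000):
    """True once the queue has grown to the (hypothetical) alert threshold."""
    return count >= threshold
```

The count only covers messages that are ready for delivery, so messages currently held unacknowledged by a running cleanup script are not included.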
- Requeue to the original queue, which means another retry.
- Drop the message, e.g. it has invalid data.
- Requeue it into the dead-letters queue.
Our metadata contains, among other things, the target queue name. You can use this information to process only the messages which failed at queue XY during one run of the script, and ignore all the others. Retrying makes sense if the issue was not message related, like out-of-memory on the consumers or an unreachable Redis. Dropping makes sense if the message cannot be read because it has invalid data, or if the consumer of the message would just drop it anyway since it is already outdated. If the message is put back into the dead-letters queue, it is delivered to the script again for as long as the script keeps running.
To decide what to do, you have to know what happens in the system and what the purpose of that particular consumer is.
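The decision logic of such a cleanup script could be sketched like this. All specifics are assumptions for illustration: the metadata keys `target_queue` and `created_at` (a Unix timestamp), JSON as the message format, and a maximum age of one day are not taken from our real script.

```python
import json
from enum import Enum


class Action(Enum):
    RETRY = "retry"  # publish to the original queue again
    DROP = "drop"    # acknowledge and forget the message
    KEEP = "keep"    # put it back into the dead-letters queue


def decide(metadata, body, queue_under_repair, now, max_age_seconds=86400):
    """Pick an action for one dead letter.

    Only messages that failed at `queue_under_repair` are handled in
    this run; everything else stays in the dead-letters queue.
    """
    if metadata.get("target_queue") != queue_under_repair:
        return Action.KEEP
    try:
        json.loads(body)
    except (ValueError, TypeError):
        # Unreadable content will never succeed, so a retry is pointless.
        return Action.DROP
    created_at = metadata.get("created_at")
    if created_at is not None and now - created_at > max_age_seconds:
        # The consumer would discard an outdated message anyway.
        return Action.DROP
    return Action.RETRY
```

For example, `decide({"target_queue": "orders"}, b'{"id": 1}', "orders", now=0)` returns `Action.RETRY`, while the same message returns `Action.KEEP` during a run that targets a different queue.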
It would also be possible to implement some more sophisticated solutions, like automatically retrying or dropping a message after a certain amount of time, or storing them in another database.
To prevent endless redeliveries of invalid messages, messages should be NACKed with requeuing disabled. To prevent relevant messages from being dropped, they need to be collected and reprocessed somehow. Whether the reprocessing is done automatically or manually, and what to do with individual messages, highly depends on the use case and type of message, and also on the number of dead letters collected over time.