MQ Pitfall Avoidance Guide
Li Wei
Message Backlog
Problem Category: Message Backlog
Related Description: During MQ usage, various issues can cause messages to be consumed late, leading to a large backlog.
Root Causes
- Consumer process gets stuck.
- Message consumption takes too long.
- Consumer‑group client fails to start.
- Too few consumer threads, insufficient processing capacity.
- After a consumption failure the client returns CONSUME_FAILURE; if it cannot recover, it will keep retrying indefinitely.
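The capacity-related causes above come down to simple arithmetic: a backlog grows whenever messages arrive faster than the consumer threads can finish them. The following is a rough sketch of that estimate, assuming each thread handles one message at a time and that the average processing time is known. The function names and the numbers are illustrative, not taken from any MQ client.

```python
import math


def consume_capacity(consumer_threads, avg_ms_per_message):
    """Messages per second the consumers can sustain (one message per thread at a time)."""
    return consumer_threads * 1000 / avg_ms_per_message


def backlog_growth(produce_rate, consumer_threads, avg_ms_per_message):
    """Messages added to the backlog per second; zero or less means consumers keep up."""
    return produce_rate - consume_capacity(consumer_threads, avg_ms_per_message)


def threads_needed(produce_rate, avg_ms_per_message):
    """Smallest thread count whose capacity matches the produce rate."""
    return math.ceil(produce_rate * avg_ms_per_message / 1000)


# Illustrative numbers only: 500 msg/s produced, 8 threads, 50 ms per message.
print(backlog_growth(500, 8, 50))  # 340.0
print(threads_needed(500, 50))     # 25
```

With these made-up figures the backlog grows by 340 messages every second, so either the per-message time has to drop or the thread count has to rise to about 25.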
Best Practices
- Keep business logic in the consumer short; if there is long‑running work, handle it asynchronously.
- Minimize interactions with external services to avoid their problems throttling your consumption rate.
- Classify exceptions in consumer threads and handle them appropriately; do not let a simple exception terminate the consumer or deregister the node.
- When a backlog occurs, use a shovel (message forwarding) for emergency handling to prevent loss.
- For single‑partition consumption, enable parallel processing when ordering is not required.
- Detect issues early and scale out partitions and consumer machines as needed.
- Optimize consumption logic; make anything that can be processed asynchronously do so.
- On consumption failure, avoid CONSUME_FAILURE; use RECONSUME_LATER and implement proper fallback logic.
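The advice on classifying exceptions and returning RECONSUME_LATER with a fallback can be sketched as follows. This is a minimal illustration, not a real client API: RECONSUME_LATER comes from the text above, while CONSUME_SUCCESS, TransientError, MAX_RETRIES, and the dead_letters list are assumptions introduced for the example. The idea is that transient failures are retried a bounded number of times, and everything else is parked in a fallback store so that one bad message cannot block the consumer forever.

```python
from enum import Enum


class ConsumeStatus(Enum):
    CONSUME_SUCCESS = "CONSUME_SUCCESS"    # assumed name for the success status
    RECONSUME_LATER = "RECONSUME_LATER"    # status named in the guide


class TransientError(Exception):
    """Hypothetical marker for failures worth retrying, e.g. a downstream timeout."""


MAX_RETRIES = 3      # assumed cap; tune per business requirement
dead_letters = []    # stand-in for a dead-letter topic or a fallback table


def consume(message, retry_count, handler):
    """Run handler on message and translate the outcome into a consume status."""
    try:
        handler(message)
        return ConsumeStatus.CONSUME_SUCCESS
    except TransientError:
        if retry_count < MAX_RETRIES:
            # Ask the broker to redeliver later instead of failing hard.
            return ConsumeStatus.RECONSUME_LATER
        # Retries exhausted: park the message and move on.
        dead_letters.append(message)
        return ConsumeStatus.CONSUME_SUCCESS
    except Exception:
        # Permanent failure (bad payload, bug): retrying cannot help.
        dead_letters.append(message)
        return ConsumeStatus.CONSUME_SUCCESS
```

Messages that land in the fallback store still need monitoring and manual or scheduled reprocessing; the sketch only keeps them from stalling the queue.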
Message Loss
Problem Category: Message Loss
Related Description: Message loss can occur due to MQ system failures or misuse.
Root Causes
- Kafka partition leader election issues causing loss.
- Reliability level not set to ack=-1 (in Kafka, the producer setting acks=-1, equivalent to acks=all).
- During a machine restart, asynchronous sends have not completed before the client is destroyed.
- Oversized messages cause send failures.
- Send failures are not monitored promptly.
- Large‑scale cluster outages.
- Some business logic discards messages after a timeout.
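The restart case in the list above is easy to reproduce with a toy model. The sketch below is not a real MQ client: AsyncProducer, its transport callable, and the failures list are hypothetical. It shows why an asynchronous send is only a promise until the background sender has drained its queue, and why close() has to wait for that drain before the process exits.

```python
import queue
import threading


class AsyncProducer:
    """Toy stand-in for an MQ client with asynchronous sends (hypothetical)."""

    def __init__(self, transport):
        self._transport = transport      # callable that delivers one message
        self._pending = queue.Queue()    # messages accepted but not yet delivered
        self.failures = []               # (message, exception) pairs to monitor
        # Daemon thread: if the process exits without close(), queued messages vanish.
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def send_async(self, message):
        """Return immediately; delivery happens later on the worker thread."""
        self._pending.put(message)

    def _run(self):
        while True:
            message = self._pending.get()
            if message is None:          # sentinel: everything before it was handled
                return
            try:
                self._transport(message)
            except Exception as exc:
                self.failures.append((message, exc))

    def close(self, timeout=5.0):
        """Flush pending sends; return True if the queue drained within the timeout."""
        self._pending.put(None)
        self._worker.join(timeout)
        return not self._worker.is_alive()
```

In an application, close() would typically be called from a shutdown hook (for example via atexit.register or a signal handler) before the process exits.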
Best Practices
- Do not acknowledge consumption unless the business processing has succeeded.
- Gracefully shut down consumers and producers before the application exits.
- If zero tolerance for loss is required, set client ack=-1.
- Implement robust cluster disaster‑recovery; for Kafka, strive for an even distribution of partitions across all brokers.
- Avoid sending messages larger than 1 MB.
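The first practice above, acknowledging only after the business processing has succeeded, can be illustrated with an in-memory log and an offset. This is a simplified sketch under stated assumptions: the list stands in for a partition, committed_offset is the index of the next message to read, and poll_and_process is a hypothetical helper rather than any client's API.

```python
def poll_and_process(messages, committed_offset, handler):
    """Process messages from committed_offset onward; return the new committed offset.

    The offset advances only after handler returns without raising, so a failed
    message is redelivered on the next call instead of being skipped.
    """
    offset = committed_offset
    for position in range(committed_offset, len(messages)):
        try:
            handler(messages[position])
        except Exception:
            # Do not acknowledge: stop here and retry from this message later.
            break
        offset = position + 1
    return offset
```

Because a message whose handler partly ran may be delivered again, this approach gives at-least-once delivery and relies on idempotent consumption on the business side.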
Duplicate Consumption
Problem Category: Duplicate Consumption
Related Description: Duplicate consumption is a common issue in MQ usage; if not handled correctly it can cause production problems.
Root Causes
- Most message middleware cannot guarantee exactly‑once delivery.
- Producers may publish the same message multiple times.
Best Practices
- Enforce strict idempotency in consumption. There are many ways to achieve it, such as using distributed locks to serialize parallel processing, leveraging database transactions, or employing a state machine that tracks record status in the database.
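One of the database-transaction approaches mentioned above can be sketched with SQLite from the Python standard library. The table names, the message_id key, and the deposit scenario are assumptions made for the example. The message ID is recorded in the same transaction as the business update, so a redelivered message hits the primary-key constraint and the whole transaction rolls back.

```python
import sqlite3


def open_store():
    """Create an in-memory database with a dedup table and a sample account."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE processed (message_id TEXT PRIMARY KEY)")
    conn.execute("CREATE TABLE account (id TEXT PRIMARY KEY, balance INTEGER NOT NULL)")
    conn.execute("INSERT INTO account VALUES ('acct-1', 0)")
    conn.commit()
    return conn


def apply_deposit(conn, message_id, account_id, amount):
    """Apply a deposit at most once per message_id; return True if it was applied."""
    try:
        with conn:  # one transaction: commit on success, roll back on error
            conn.execute(
                "INSERT INTO processed (message_id) VALUES (?)", (message_id,)
            )
            conn.execute(
                "UPDATE account SET balance = balance + ? WHERE id = ?",
                (amount, account_id),
            )
        return True
    except sqlite3.IntegrityError:
        # Duplicate message_id: the message was already handled, so skip it.
        return False
```

Delivering message "m-1" twice changes the balance only once; the second call returns False and can be acknowledged safely.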
Message Send Failure
Problem Category: Message Send Failure
Related Description: Improper usage or system faults can lead to failed message sends.
Root Causes
- Improper client usage, such as repeatedly creating instances, consumes excessive system resources.
- System anomalies are not monitored, leading to uncontrolled traffic without throttling or fallback plans.
- Send results are ignored.
Best Practices
- Create clients according to best practices, e.g., configure them as Spring beans to ensure a single instance per consumer group or producer.
- Check the result of every send and handle failures (log, retry, or alert) instead of ignoring them.
- Build effective traffic monitoring and emergency response plans.
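The first two practices above can be combined in a small sketch: one shared client per group, and a send wrapper that never ignores the result. The guide suggests Spring beans for the single-instance part; the version below shows the same idea in plain Python. Producer, get_producer, send_checked, and the size-based rejection are all hypothetical stand-ins, not a real client library.

```python
import threading

MAX_BYTES = 1024 * 1024  # assumed broker limit, matching the 1 MB advice


class Producer:
    """Hypothetical client; a real one holds connections and threads."""

    instances = 0  # counts constructions, to show that only one is created

    def __init__(self, group):
        Producer.instances += 1
        self.group = group

    def send(self, payload):
        """Pretend broker: accept the payload unless it is oversized."""
        return len(payload) <= MAX_BYTES


_producers = {}
_lock = threading.Lock()


def get_producer(group):
    """Return the single shared Producer for this group, creating it once."""
    with _lock:
        if group not in _producers:
            _producers[group] = Producer(group)
        return _producers[group]


def send_checked(group, payload, on_failure):
    """Send and report failures to on_failure instead of dropping them silently."""
    ok = get_producer(group).send(payload)
    if not ok:
        on_failure(payload)  # e.g. log, alert, or persist for a later retry
    return ok
```

The lock keeps two threads from creating separate clients for the same group at startup, which is the resource leak described in the root causes.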
Originally written by Li Wei (李唯_) and published in Chinese on 后端技术栈全书 (Full-Stack Backend Engineering). Translated and adapted for DriftSeas with permission.