Picture this: you're working on a massive e-commerce application that relies on Kafka for messaging. Everything is running smoothly until one day, you check your metrics dashboard and notice a sudden increase in the number of deliveries compared to the orders placed. This anomaly throws your team into a frenzy as you try to pinpoint the problem. After extensive debugging sessions powered by endless cups of coffee, you finally discover the culprit isn't a coding bug, but an influx of duplicate messages overwhelming your topics. In this blog, we're going to thoroughly explore the scenarios that can lead to these duplicates and examine various strategies to prevent them.
Let's take, for example, an order service that sends out messages to an order topic. The happy path looks like this:
1. The order service sends a message to the order topic.
2. Kafka appends the message to the topic's log.
3. Kafka sends an acknowledgement back to the order service.
In step 3, transient network issues, like a brief loss of network connectivity, might cause the acknowledgement from Kafka to be lost. Consequently, the order service will continue to resend the same message until it gets a successful acknowledgement, potentially creating a duplicate.
Resolution: Idempotent Producer
Using an idempotent producer addresses this effectively. For it to work, the order service is assigned a unique producer ID (PID), and every message it sends carries an increasing sequence number. Kafka treats the combination of PID and sequence number as the message's unique identifier, so if the producer retries a message that has already been written, Kafka acknowledges the retry without appending the message to the log again.
To enable idempotency in a Kafka producer, set the `enable.idempotence` configuration property to `true`, set `retries` to a value greater than 0, and set `acks` to `all`.
`enable.idempotence` — Setting this to `true` lets the producer retry on certain retryable errors without writing the same message to the log more than once. These errors are typically transient, like when the leader is unavailable or there aren't enough in-sync replicas.
`acks` – When set to `all`, the leader waits until at least the minimum number of in-sync replicas have acknowledged the message before it sends an acknowledgement back to the producer.
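As a rough illustration, that configuration might look like the following with the Java producer client; the broker address, topic name, and serializers here are placeholders rather than anything prescribed by the setup above.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class IdempotentOrderProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092"); // placeholder broker
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // The three settings discussed above: idempotence on, retries > 0, acks=all.
        props.put(ProducerConfig.ENABLE_IDEMPOTENCE_CONFIG, "true");
        props.put(ProducerConfig.RETRIES_CONFIG, Integer.toString(Integer.MAX_VALUE)); // any value > 0
        props.put(ProducerConfig.ACKS_CONFIG, "all");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Kafka assigns this producer a PID and sequences each message, so a
            // retried send is acknowledged but not appended to the log a second time.
            producer.send(new ProducerRecord<>("order-topic", "order-123", "{\"orderId\":\"order-123\"}"));
            producer.flush();
        }
    }
}
```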
Now, let's take a look at a fulfillment service that does the following (a code sketch of this loop follows the list):
1. Reads messages from the Order Topic.
2. Makes a POST request to the Audit Service.
3. Adds a fresh entry to the Fulfillment Table.
4. Sends a message to the Fulfillment Topic.
5. Commits the offset back to Kafka.
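To make the ordering of those steps concrete, here is a minimal sketch of the per-message loop; `callAuditService`, `insertFulfillmentRow`, and `publishFulfillmentEvent` are hypothetical helpers standing in for the HTTP call, the database write, and the Kafka publish.

```java
import java.time.Duration;
import java.util.List;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class FulfillmentServiceLoop {

    // Hypothetical helpers standing in for the real integrations.
    static void callAuditService(String orderEvent) { /* step 2: POST to the Audit Service */ }
    static void insertFulfillmentRow(String orderEvent) { /* step 3: insert into the Fulfillment Table */ }
    static void publishFulfillmentEvent(String orderEvent) { /* step 4: produce to the Fulfillment Topic */ }

    public static void run(KafkaConsumer<String, String> consumer) {
        consumer.subscribe(List.of("order-topic"));
        while (true) {
            // Step 1: read messages from the Order Topic.
            ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
            for (ConsumerRecord<String, String> record : records) {
                callAuditService(record.value());        // step 2
                insertFulfillmentRow(record.value());    // step 3
                publishFulfillmentEvent(record.value()); // step 4
            }
            // Step 5: commit the offsets only after the batch has been processed.
            consumer.commitSync();
        }
    }
}
```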
If the service instance fails to complete steps 2–5 within the designated time frame (the consumer's `max.poll.interval.ms`), Kafka will consider it unresponsive. As a result, the service instance will be removed from the consumer group and its partitions will be rebalanced, so the same message is handed to another consumer in the group for processing.
Resolution: Idempotent Consumer
Keeping a record of all successfully processed messages prevents a redelivered message from being processed twice. This can be managed by giving each message produced by the order service a unique ID and tracking those IDs on the consumer side (fulfillment service), saving each one in a database table called the Message ID Tracking Table. When a message arrives with an ID that is already in the tracking table, it is recognized as a duplicate: the offset is simply committed and any further processing is skipped.
Furthermore, the inserts into the Tracking and Fulfillment Tables should be executed within a single DB transaction. This guarantees that both operations are rolled back if any issue arises.
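As a minimal sketch of that check, assuming a JDBC connection and illustrative table names (`message_id_tracking`, `fulfillment`), the duplicate test and both inserts might be wired up roughly like this:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

public class IdempotentFulfillmentHandler {

    /**
     * Processes one order message; returns false if the message ID was already
     * seen, in which case the message is skipped as a duplicate.
     */
    public static boolean process(Connection conn, String messageId, String orderPayload) throws SQLException {
        conn.setAutoCommit(false);
        try {
            // 1. Duplicate check against the Message ID Tracking Table.
            try (PreparedStatement check = conn.prepareStatement(
                    "SELECT 1 FROM message_id_tracking WHERE message_id = ?")) {
                check.setString(1, messageId);
                try (ResultSet rs = check.executeQuery()) {
                    if (rs.next()) {
                        conn.rollback();
                        return false; // already processed: caller just commits the offset
                    }
                }
            }

            // 2. Record the message ID and the fulfillment row in ONE DB transaction,
            //    so both are rolled back together if anything fails.
            try (PreparedStatement track = conn.prepareStatement(
                    "INSERT INTO message_id_tracking (message_id) VALUES (?)")) {
                track.setString(1, messageId);
                track.executeUpdate();
            }
            try (PreparedStatement fulfill = conn.prepareStatement(
                    "INSERT INTO fulfillment (message_id, payload) VALUES (?, ?)")) {
                fulfill.setString(1, messageId);
                fulfill.setString(2, orderPayload);
                fulfill.executeUpdate();
            }

            conn.commit();
            return true;
        } catch (SQLException e) {
            conn.rollback();
            throw e;
        }
    }
}
```

Putting a unique constraint on `message_id` in the tracking table adds a second line of defence if two consumers ever race on the same message.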
There is still a gap, though: the publish to the fulfillment topic happens outside the database transaction. If the transaction fails after the message has been published, the retry will publish it again, producing a duplicate in the fulfillment topic. The idempotent consumer alone does not resolve this.
Resolution: Idempotent Consumer + Transactional Outbox
Having a distributed transaction that covers both the database and Kafka is not feasible since Kafka doesn't support XA transactions. To address this, you can use an outbox table to capture events that need to be published. The data written to this table should be part of the same database transaction that updates the Tracking and Fulfillment tables.
This guarantees that the database writes and the intent to publish are committed as a single, atomic operation. A Change Data Capture (CDC) tool such as Debezium, running on Kafka Connect, then reads the outbox table and publishes the event to the fulfillment topic.
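Continuing the same illustrative schema, the only change on the consumer side is that the event is staged in an outbox table inside that same transaction instead of being produced to Kafka directly; the table and column names below are assumptions, and a connector such as Debezium would then relay committed outbox rows to the fulfillment topic.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class OutboxWriter {

    /** Stages a fulfillment event in the outbox as part of the caller's open transaction. */
    public static void stageFulfillmentEvent(Connection conn, String messageId, String eventPayload)
            throws SQLException {
        // Runs on the same connection/transaction as the tracking and fulfillment inserts,
        // so the event only becomes visible to the CDC connector if that transaction commits.
        try (PreparedStatement outbox = conn.prepareStatement(
                "INSERT INTO outbox (id, aggregate_id, event_type, payload) VALUES (?, ?, ?, ?)")) {
            outbox.setString(1, UUID.randomUUID().toString());
            outbox.setString(2, messageId);
            outbox.setString(3, "FulfillmentCreated");
            outbox.setString(4, eventPayload);
            outbox.executeUpdate();
        }
    }
}
```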
Despite using this method, duplicate POST calls to the audit service can still occur: if the transaction fails and rolls back after the POST has already been made, the retry will repeat the call. This can happen regardless of where in the sequence the call is placed. The only way to address it is to make the POST call idempotent on the receiver side.
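On the receiver side, that might look like the audit service keying its writes on an idempotency key derived from the message ID; the `audit_log` table and the PostgreSQL-style conflict clause below are illustrative, not part of the original design.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class AuditServiceReceiver {

    /**
     * Handles the audit POST idempotently: replaying the same request
     * (same idempotency key) records nothing new and still succeeds.
     */
    public static void recordAuditEntry(Connection conn, String idempotencyKey, String body) throws SQLException {
        // Relies on a unique constraint on audit_log.idempotency_key (illustrative schema);
        // the "ON CONFLICT DO NOTHING" form shown here is PostgreSQL syntax.
        try (PreparedStatement insert = conn.prepareStatement(
                "INSERT INTO audit_log (idempotency_key, body) VALUES (?, ?) " +
                "ON CONFLICT (idempotency_key) DO NOTHING")) {
            insert.setString(1, idempotencyKey);
            insert.setString(2, body);
            insert.executeUpdate();
        }
    }
}
```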
These methods add a number of moving parts that increase complexity and maintenance effort. It is therefore prudent to adopt them gradually, and only when your metrics show that duplicates are a real problem worth the added cost.