Fatal error "out of order sequence number" makes idempotent producer not functional anymore #3584
Comments
Some of my own investigation leads me to believe that this is an issue with librdkafka. I followed the same steps to reproduce as listed above, producing 8 messages successfully before the Kafka broker downtime. I've attached a producer log snippet (producer_log_1.txt), and I also found this single error message on the Kafka broker:

After the ~20 minutes of downtime, it seems that librdkafka triggered some internal timer, indicating that the PID should be invalidated on the next producer message. When producing after the downtime, the log snippet shows that librdkafka managed to change the PID to 5000 and assumed that the message sequence number had been reset to 1:

However, PID=5000 was also assigned to this producer before the downtime (not shown in the log snippets), and the Kafka broker error log above clearly shows that the broker still expected the sequence numbers to continue from 8 instead of 0 for PID=5000. So librdkafka, presumably using some internal timer, kept the old PID but reset the sequence number, while the broker retained the original value after the downtime. I'm not really familiar with the librdkafka source code, so I'm struggling to follow this flow there.

Edit: When the Kafka broker downtime is shorter (~1 minute), librdkafka reconnects successfully, obtaining the same PID and continuing from the sequence number where it left off.

Some digging through the source code makes me believe that it falls into this if statement when it shouldn't, leading to a call to rd_kafka_toppar_reset_base_msgid, which resets the MsgId. The only real way the condition can be true with the same PID is if the epochs are different: https://github.com/edenhill/librdkafka/blob/158e83aea9416f7bd665c9429a262378f6b65ec1/src/rdkafka_proto.h#L586-L589
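To make the suspected failure mode concrete, here is a minimal, simplified model in C of the behaviour described in this comment. It is not librdkafka or broker code: every name in it (`pid_sketch_t`, `pid_sketch_eq`, `broker_state_sketch_t`, `broker_accept`) is hypothetical, and the broker-side rule is an assumption reduced to a single partition. It only illustrates why restarting the sequence at 0 is rejected when the id and epoch are unchanged, and why a different epoch makes two PIDs with the same id compare as unequal.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for a producer ID: a numeric id plus an epoch. */
typedef struct {
    int64_t id;
    int16_t epoch;
} pid_sketch_t;

/* Two PIDs are equal only if both the id and the epoch match. */
static bool pid_sketch_eq(pid_sketch_t a, pid_sketch_t b) {
    return a.id == b.id && a.epoch == b.epoch;
}

/* Hypothetical broker-side state for one partition (simplified model). */
typedef struct {
    pid_sketch_t pid;  /* last PID seen on this partition */
    int32_t next_seq;  /* sequence number expected next for that PID */
} broker_state_sketch_t;

/*
 * Returns true if the message is accepted, false if it would be
 * rejected as out of order.
 * Assumption: a PID that differs in id or epoch must start at 0,
 * while a known PID must continue from the expected sequence number.
 */
static bool broker_accept(broker_state_sketch_t *st, pid_sketch_t pid,
                          int32_t seq) {
    if (!pid_sketch_eq(st->pid, pid)) {
        if (seq != 0)
            return false;
        st->pid = pid;
        st->next_seq = 1;
        return true;
    }
    if (seq != st->next_seq)
        return false;
    st->next_seq++;
    return true;
}
```

In this model, after 8 accepted messages (sequence numbers 0 through 7) for id 5000 with an unchanged epoch, the broker state expects sequence 8. A producer that resets to 0 while keeping the same id and epoch is rejected, whereas the same id with a different epoch is treated as a new PID and may start from 0 again.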
Read the FAQ first: https://github.com/edenhill/librdkafka/wiki/FAQ
Description
We hit a fatal error with the idempotent producer after the Kafka cluster went down; from then on, the producer could not send any more messages.
The problem still exists even with the latest librdkafka release (v1.8.2), so it is not the same problem as #3577.
Here's the log from when the error occurred:
(However, there seems to be no related log on the broker side.)
How to reproduce
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
librdkafka version: 1.8.2
Apache Kafka version: kafka_2.13-2.8.0
librdkafka client configuration: enable.idempotence=true
Operating system: ubuntu20.04 x64
Logs (with debug=.. as necessary) from librdkafka