Consumers should not be able to commit during a rebalance #4059

Open
3 of 7 tasks
roxelo opened this issue Nov 9, 2022 · 13 comments · May be fixed by #4908
Comments

@roxelo

roxelo commented Nov 9, 2022

Description

Now that librdkafka supports the cooperative-sticky partition assignment strategy, we should ensure that consumers that commit offsets manually cannot commit during a rebalance, as doing so triggers a follow-up rebalance.

I don’t think there is any valid use case for allowing this behavior:

  1. This behavior can lead to an unbounded number of rebalances: they only stop once no consumer attempts to commit offsets during a rebalance.
  2. The Java Kafka consumer library does not appear to allow clients to commit offsets during a rebalance.
  3. The auto-commit callback in librdkafka already skips committing offsets while a rebalance is ongoing.

I believe this issue with manual commits is isolated to the cooperative-sticky strategy. With the eager strategy, a rebalance starts with all partitions being revoked and ends when new partitions have been assigned, so the consumer never attempts to commit during the rebalance: it owns no partitions whose offsets could be committed. With cooperative-sticky, by contrast, the consumer may still own partitions during a rebalance, so we cannot guarantee it won’t attempt a commit. Furthermore, clients have no way of knowing whether a rebalance is ongoing, so they cannot prevent their consumers from committing offsets when necessary.
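The ownership difference between the two protocols can be sketched with a toy model. This is hypothetical illustration code, not the librdkafka API: `ToyConsumer`, its partition names, and its methods are all invented for the example.

```python
# Toy model (not librdkafka): contrasts partition ownership during a
# rebalance under the eager vs. cooperative-sticky protocols.

class ToyConsumer:
    def __init__(self, protocol):
        self.protocol = protocol          # "eager" or "cooperative"
        self.owned = {"t-0", "t-1"}       # partitions currently owned
        self.rebalancing = False

    def start_rebalance(self, revoked):
        self.rebalancing = True
        if self.protocol == "eager":
            # Eager: *all* partitions are revoked up front, so there is
            # nothing left to commit while the rebalance is in flight.
            self.owned.clear()
        else:
            # Cooperative: only the reassigned subset is revoked; the
            # consumer may still own (and try to commit for) the rest.
            self.owned -= revoked

eager = ToyConsumer("eager")
eager.start_rebalance(revoked={"t-1"})
coop = ToyConsumer("cooperative")
coop.start_rebalance(revoked={"t-1"})

print(sorted(eager.owned))  # []
print(sorted(coop.owned))   # ['t-0']
```

The cooperative consumer still owning `t-0` mid-rebalance is exactly the window in which a manual commit can slip through.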

I see three potential solutions to this problem, but I think the first one makes the most sense to implement:

  1. Add the same check that exists in the auto-commit callback to rd_kafka_commit, and return an error to the client so they know why we did not attempt to commit offsets.
  2. Provide a new API that lets librdkafka clients check whether a rebalance is ongoing, so they can hold off commits on their end.
  3. Provide an option to clients (a flag of some sort) to decide whether rd_kafka_commit should skip committing offsets during rebalances.
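Option 1 above amounts to a guard in the commit path. A minimal pure-Python sketch of that idea, assuming an invented `GuardedConsumer` class and error string (this is not rd_kafka_commit or its real error codes):

```python
# Hypothetical sketch of option 1: refuse manual commits while a rebalance
# is in progress and surface an error, mirroring the check the auto-commit
# path already performs. Not the real librdkafka commit implementation.

ERR_REBALANCE_IN_PROGRESS = "ERR__STATE: rebalance in progress"

class GuardedConsumer:
    def __init__(self):
        self.rebalancing = False
        self.committed = []               # offsets accepted so far

    def commit(self, offsets):
        if self.rebalancing:
            # Fail fast instead of sending a doomed OffsetCommit request
            # that would come back with ILLEGAL_GENERATION.
            return ERR_REBALANCE_IN_PROGRESS
        self.committed.append(offsets)
        return None

c = GuardedConsumer()
assert c.commit({"t-0": 42}) is None      # normal commit succeeds
c.rebalancing = True
err = c.commit({"t-0": 43})               # guarded commit is rejected
print(err)  # ERR__STATE: rebalance in progress
```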

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): v1.8.2
  • Apache Kafka version: v2.7.1
  • librdkafka client configuration:
"partition.assignment.strategy": "cooperative-sticky",
"enable.auto.commit": false
  • Operating system: <REPLACE with e.g., Centos 5 (x64)>
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@edenhill
Contributor

edenhill commented Nov 9, 2022

I think the root of the problem lies in the OffsetCommit response error handling: we're probably getting an ERR_ILLEGAL_GENERATION back from the coordinator at this point, which causes us to rejoin the group and thus triggers the rebalance.
Even if we add a rebalancing check to commit(), it is still possible to receive this error due to timing, so I think it might be better to also improve the error handling here and not trigger a rejoin if the group is rebalancing.
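The improved error handling being suggested can be written down as a small decision function. This is a hypothetical sketch of the policy, not librdkafka internals; the function name, the string error codes, and the action names are all invented:

```python
# Hypothetical decision table for OffsetCommit response error handling:
# only rejoin on ILLEGAL_GENERATION when no rebalance is already running,
# otherwise surface the error to the caller instead of cascading rebalances.

def on_offset_commit_error(error, rebalance_in_progress):
    """Return the action to take for an OffsetCommit response error."""
    if error == "ILLEGAL_GENERATION":
        if rebalance_in_progress:
            # A rebalance is already in flight; rejoining again would just
            # trigger yet another rebalance. Report the error instead.
            return "propagate_error"
        return "rejoin_group"
    # REBALANCE_IN_PROGRESS and other errors: report to the caller.
    return "propagate_error"

print(on_offset_commit_error("ILLEGAL_GENERATION", rebalance_in_progress=True))
# propagate_error
```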

@roxelo
Author

roxelo commented Nov 9, 2022

Agreed, improving the error handling would make a great difference as well. I haven't been able to confirm it yet, but I believe a new rebalance is also triggered when the error is ERR_REBALANCE_IN_PROGRESS.

@mironovdm

It would be great if this behaviour were documented somewhere in the wiki or in the config description. We spent considerable effort understanding what was going on and why the cooperative-sticky rebalance was not working as expected and led to endless rebalances.

We use confluent_kafka for Python. I tried to implement a workaround using the on_assign/on_revoke callbacks to block manual commits for some time after they are triggered, but I noticed the group generation id can change even before these callbacks fire. As I understand it, with confluent_kafka for Python there is currently no way to know for sure that a rebalance is in progress, so we can't use manual commits with cooperative-sticky. I also noticed that with auto commit enabled, librdkafka can send OffsetCommit requests while a rebalance is in progress without hitting an invalid group generation id error; that commit happens between the on_revoke callback firing in one consumer and on_assign firing in another.

so I think it might be better to also improve the error handling here and not trigger a rejoin if the group is rebalancing.

Doesn't this contradict the behavior described in this KIP document? It says:

If received UNKNOWN_MEMBER_ID or ILLEGAL_GENERATION from join-group / sync-group / commit / heartbeat response: reset generation / clear member-id correspondingly, call rebalance listener's onPartitionsLost for all the partition and then re-join group with empty assigned partition.

@wmorgan6796

Just adding an additional data point: we see this exact behavior with our Kafka setup. I can try to take a crack at fixing this in my off time, but I'm wondering whether anyone on the Confluent side is working on this (@milindl @emasab), since if you're using manual offset commits, you cannot use cooperative rebalancing today in any context.

@scanterog

Is there any ongoing work for this?

@wmorgan6796

@scanterog see #4220, but in short, I think this issue has dropped off the radar. Unfortunately the fix that @roxelo created (and I attempted to get merged) broke other behavior in librdkafka. @milindl said the librdkafka team was looking into this internally, but there is little I can do from my side: my place of work doesn't pay for Confluent support, so we have no sway on their roadmap. If anyone interested in getting this more attention is a paying Confluent customer, I highly recommend pushing via their TAMs to get traction on this.

@massimeddu-sj

Same issue here. Hopefully this will be prioritized by Confluent, because it makes cooperative-sticky unusable with manual commits.

@emasab
Contributor

emasab commented Aug 27, 2024

Commits are possible during a rebalance: before a partition is revoked, the user can commit offsets for the revoked partitions in the rebalance callback, as is possible in the Java client too.
What is needed is to avoid a re-join when the ILLEGAL_GENERATION error happens, as proposed by Magnus.

@massimeddu-sj

Commits are possible during a rebalance: before a partition is revoked, the user can commit offsets for the revoked partitions in the rebalance callback, as is possible in the Java client too. What is needed is to avoid a re-join when the ILLEGAL_GENERATION error happens, as proposed by Magnus.

Thanks for looking into this @emasab. Any hope to see this sorted out soon?

Thank you!

@emasab
Contributor

emasab commented Aug 28, 2024

@massimeddu-sj at the moment there's a strategy that should prevent this error during commit and the subsequent rebalance.
It requires that you have a rebalance callback set, but that is often the case when using manual offset management. The consumer should commit only in the consume loop, and only when a message or a partition EOF is returned.
Between the first and second phase of the cooperative rebalance the revoke callback is called, and the consume call then returns with no message. Next, the second rejoin starts, which increases the generation id; there is a second assignment callback, again with no message returned from the consume call, and finally messages start arriving again. If you commit manually only when you have a new message or receive a partition EOF, that error should be avoided.
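The loop shape described above can be sketched with a stubbed consumer. This is a pure-Python model, not the confluent_kafka API: `StubConsumer` and its event list are invented, with `None` poll results standing in for the two rebalance phases.

```python
# Pure-Python sketch of the "commit only after a message or EOF" strategy:
# poll() returning None stands in for the two cooperative rebalance phases,
# during which no manual commit is issued.

EOF = "EOF"

class StubConsumer:
    def __init__(self, events):
        self.events = iter(events)
        self.commits = 0

    def poll(self):
        # Returns the next simulated event, or None when nothing arrives.
        return next(self.events, None)

    def commit(self):
        self.commits += 1

# messages, then two rebalance-phase gaps, then a message and an EOF
consumer = StubConsumer(["m1", None, None, "m2", EOF])
for _ in range(5):
    msg = consumer.poll()
    if msg is None:
        continue              # rebalance phase in flight: do not commit
    consumer.commit()         # safe: a message or EOF was just returned

print(consumer.commits)  # 3
```

Only the three "real" poll results produce commits; the two rebalance-phase gaps pass through silently, which is what keeps the commit away from the stale generation id.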

@massimeddu-sj

@massimeddu-sj at the moment there's a strategy that should prevent this error during commit and the subsequent rebalance. It requires that you have a rebalance callback set, but that is often the case when using manual offset management. The consumer should commit only in the consume loop, and only when a message or a partition EOF is returned. Between the first and second phase of the cooperative rebalance the revoke callback is called, and the consume call then returns with no message. Next, the second rejoin starts, which increases the generation id; there is a second assignment callback, again with no message returned from the consume call, and finally messages start arriving again. If you commit manually only when you have a new message or receive a partition EOF, that error should be avoided.

Thank you very much for the additional information @emasab. Unfortunately I'm not too familiar with the rebalance/revoke callbacks, and I'm not confident about overriding the default implementation. It would be great if you could share a snippet or implementation example.

My current implementation is actually quite simple:

from confluent_kafka import Consumer, KafkaError

consumer_conf = {
    [...]
    'enable.auto.commit': False,
    'partition.assignment.strategy': 'cooperative-sticky'
}

kafka_consumer = Consumer(consumer_conf, logger=logger)

while True:
    message = kafka_consumer.poll()

    if message is None:
        continue

    if message.error():
        if message.error().code() == KafkaError._PARTITION_EOF:
            # End of partition event
            logger.error('%% %s [%d] reached end at offset %d\n' %
                        (message.topic(), message.partition(), message.offset()))
        else:
            raise KafkaConsumerException(Exception(message.error()))
        # Error "messages" must not reach the handler or be committed
        continue

    message_handler(message)

    kafka_consumer.commit(asynchronous=False)

golanbz added a commit to golanbz/kafkaflow that referenced this issue Sep 2, 2024
…world" scenarios (e.g., cooperative sticky).

* Enabled automatic committing with `confluent auto commit: true` instead of relying solely on manual commits, but only when the consumer strategy is cooperative sticky. (Refer to the open librdkafka issue at confluentinc/librdkafka#4059).
golanbz added a commit to golanbz/kafkaflow that referenced this issue Sep 2, 2024
…world" scenarios (e.g., cooperative sticky). Fixes issue Farfetch#557 and Fixes issue Farfetch#456

* Enabled automatic committing with `confluent auto commit: true` instead of relying solely on manual commits, but only when the consumer strategy is cooperative sticky. (Refer to the open librdkafka issue at confluentinc/librdkafka#4059).
@sjportalatin

Hi team, for anyone still looking for a workaround: we have been able to get around this issue by using Kafka auto commit, as suggested, while keeping control over when offsets are committed by using store_offsets() to mark them ready for commit. This requires setting enable.auto.offset.store to False and enable.auto.commit to True. In our code, anywhere we would use commit() we now use store_offsets([offsets]) and let Kafka commit automatically on the auto.commit.interval.ms timer.

This way, the cooperative-sticky assignment does not trigger endless rebalances and we also have control over which offsets to commit. Hope this helps!
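The store-then-autocommit flow above can be modeled in a few lines. This is a pure-Python sketch of the semantics, not the confluent_kafka API: `OffsetStore` and `auto_commit_tick` are invented names standing in for store_offsets() and the auto.commit.interval.ms timer.

```python
# Toy model of the workaround: store offsets on every processed message
# and let a simulated auto-commit timer flush the latest stored offsets,
# so no commit request is tied to message handling.

class OffsetStore:
    def __init__(self):
        self.stored = {}      # partition -> next offset to commit
        self.committed = {}   # partition -> last committed offset

    def store_offsets(self, partition, offset):
        # Stand-in for store_offsets(): mark the offset ready, commit later.
        self.stored[partition] = offset + 1

    def auto_commit_tick(self):
        # Stand-in for the auto.commit.interval.ms timer firing: the client
        # library decides when it is safe to actually send the commit.
        self.committed.update(self.stored)

store = OffsetStore()
for offset in range(5):           # processing messages 0..4
    store.store_offsets("t-0", offset)
store.auto_commit_tick()

print(store.committed)  # {'t-0': 5}
```

The key design point is that the application only ever records progress; the actual OffsetCommit request is issued by the library's own timer, which already knows to avoid committing mid-rebalance.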

@emasab
Contributor

emasab commented Nov 20, 2024

If you commit manually only when you have a new message or receive a partition EOF, that error should be avoided.

TL;DR: manual commits during a rebalance will be sent but they won't cause an additional rebalance or losing the assignment.

It can happen that the consumer receives a message after the first rejoin of the cooperative incremental assignment: in that case the partitions are resumed, and only later is the second rejoin done to redistribute the revoked partitions.

The message can then be committed manually, causing an ILLEGAL_GENERATION error. That is possible, but it must not cause the consumer to lose its assignment while the rebalance is in progress, as in this Java client code.

The generation id was incorrectly reset to -1 on an OffsetCommit ILLEGAL_GENERATION error and, immediately after that, the second SyncGroup failed with the same error because of the wrong generation id, this time causing lost partitions.

Differences from the Java client: it avoids sending the commit, and the exception is a RebalanceInProgressException instead. As users are already expecting an ILLEGAL_GENERATION error, in librdkafka we leave this change to the error code for 3.x.
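The failure chain described above can be reproduced with a toy model. This is illustrative pure Python, not librdkafka code; the function names and the generation numbers are invented for the example.

```python
# Toy reproduction of the bug: if the OffsetCommit ILLEGAL_GENERATION
# handler resets the generation id to -1, the subsequent SyncGroup carries
# an invalid generation and fails too, losing the assignment. Keeping the
# generation id during an ongoing rebalance lets the SyncGroup succeed.

def sync_group(generation, broker_generation):
    return "ok" if generation == broker_generation else "ILLEGAL_GENERATION"

def handle_commit_illegal_generation(generation, buggy):
    # Buggy behavior: reset to -1; fixed behavior: keep the current id.
    return -1 if buggy else generation

broker_generation = 7
buggy_gen = handle_commit_illegal_generation(7, buggy=True)
fixed_gen = handle_commit_illegal_generation(7, buggy=False)

print(sync_group(buggy_gen, broker_generation))  # ILLEGAL_GENERATION
print(sync_group(fixed_gen, broker_generation))  # ok
```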

@emasab emasab added the bug label Nov 20, 2024
airlock-confluentinc bot pushed a commit that referenced this issue Nov 20, 2024