Issues with "partition.assignment.strategy=cooperative-sticky" #3306

Closed
4 of 7 tasks
shanson7 opened this issue Mar 19, 2021 · 7 comments

Comments

@shanson7
Contributor

Description

We are making changes to support cooperative/incremental rebalance. We are seeing errors during rebalance and what seems to be sub-optimal rebalancing (both when using the default rebalance_cb and our custom one, which just logs the partitions that are passed in and calls incremental_(un)assign).
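
For illustration, a minimal sketch of such a callback using the librdkafka C API (simplified; our real callback also does application-specific logging):

```c
#include <stdio.h>
#include <librdkafka/rdkafka.h>

/* Sketch: log the partitions passed in, then hand them straight to the
 * incremental assign/unassign calls. This shape is only valid when the
 * group uses the COOPERATIVE protocol, as with cooperative-sticky. */
void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                  rd_kafka_topic_partition_list_t *partitions,
                  void *opaque) {
        rd_kafka_error_t *error = NULL;
        int i;

        for (i = 0; i < partitions->cnt; i++)
                fprintf(stderr, "rebalance %s: %s [%d]\n",
                        rd_kafka_err2name(err),
                        partitions->elems[i].topic,
                        (int)partitions->elems[i].partition);

        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS)
                error = rd_kafka_incremental_assign(rk, partitions);
        else /* RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS */
                error = rd_kafka_incremental_unassign(rk, partitions);

        if (error) {
                fprintf(stderr, "incremental_(un)assign failed: %s\n",
                        rd_kafka_error_string(error));
                rd_kafka_error_destroy(error);
        }
}
```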

During rebalance it is not uncommon to get errors like:

19MAR2021_12:25:56.621 289 139893795714816 INFO ipbkafka_consumer.cpp 387 IPBKAFKA.CONSUMER KAFKA LOG: severity = 4 fac = COMMITFAIL event = [thrd:main]: Offset commit (unassigned partitions) failed for 2/5 partition(s) in join-state wait-incr-unassign-to-complete: Broker: Specified group generation id is not valid: apdx[5]@581544535(Broker: Specified group generation id is not valid), apdx[11]@418573170(Broker: Specified group generation id is not valid)

Additionally, it seems that when a consumer leaves the group, partitions are shuffled between the remaining consumers (I would expect cooperative rebalancing to only reassign the departed consumer's partitions, without revoking any others).

How to reproduce

Bring up a consumer and let it start consuming. Bring up a second and let it rebalance. Bring up a third. Bring down the third.

When the third consumer is added, we see the above error. When the third consumer leaves, we see partitions shuffled unexpectedly. I can reproduce consistently, so I can upload logs with whatever debug settings are relevant (at least "cgrp", I imagine).
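
A consumer configured roughly like this is enough to reproduce; the broker address and group id are placeholders, the topic name is taken from the log above, and error checking on rd_kafka_conf_set() is omitted for brevity:

```c
#include <librdkafka/rdkafka.h>

/* The rebalance_cb sketched in the description above. */
void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                  rd_kafka_topic_partition_list_t *partitions,
                  void *opaque);

rd_kafka_t *make_consumer(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "coop-sticky-repro",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "partition.assignment.strategy",
                          "cooperative-sticky", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "debug", "cgrp", errstr, sizeof(errstr));
        rd_kafka_conf_set_rebalance_cb(conf, rebalance_cb);

        /* rd_kafka_new() takes ownership of conf on success. */
        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        if (rk) {
                rd_kafka_poll_set_consumer(rk);

                rd_kafka_topic_partition_list_t *topics =
                        rd_kafka_topic_partition_list_new(1);
                rd_kafka_topic_partition_list_add(topics, "apdx",
                                                  RD_KAFKA_PARTITION_UA);
                rd_kafka_subscribe(rk, topics);
                rd_kafka_topic_partition_list_destroy(topics);
        }
        return rk;
}
```

Run three instances of such a consumer, then stop one, while watching the cgrp debug output.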

Checklist

IMPORTANT: We will close issues where the checklist has not been completed.

Please provide the following information:

  • librdkafka version (release number or git tag): 1.6.1
  • Apache Kafka version: kafka_2.13-2.6.0
  • librdkafka client configuration: partition.assignment.strategy=cooperative-sticky
  • Operating system: rhel7
  • Provide logs (with debug=.. as necessary) from librdkafka
  • Provide broker log excerpts
  • Critical issue
@mhowlett
Contributor

thanks @shanson7 - can reproduce both, investigating.

@mhowlett
Copy link
Contributor

mhowlett commented Mar 19, 2021

i believe the shuffling is happening because prevAssignment, as initialized here:

https://github.com/edenhill/librdkafka/blob/master/src/rdkafka_sticky_assignor.c#L1189

is not populated from rkgm_owned, so it's always empty.

@mhowlett
Contributor

the COMMITFAIL is apparently causing the partitions to be lost as well, triggering an additional rebalance. i don't have a good idea how/why this could be happening. will investigate more Tuesdayish.

@edenhill
Contributor

edenhill commented Apr 6, 2021

Also possibly related: 1b40aad

edenhill added a commit that referenced this issue Apr 19, 2021
edenhill added a commit that referenced this issue Apr 19, 2021
edenhill added a commit that referenced this issue Apr 20, 2021
rkgm_assignment, which is the member assignment after running the assignor, was mixed up with rkgm_owned, which is the current member assignment before running the assignor. This resulted in the sticky assignor not taking the current assignment into consideration on rebalance and thus not being able to provide the stickiness.
edenhill added a commit that referenced this issue Apr 20, 2021
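
To make the mix-up concrete, here is a simplified stand-in (not librdkafka's actual structures or code) for what the fix changes:

```c
#include <stdio.h>

/* Simplified stand-in for a group member as the assignor sees it. */
typedef struct {
        const char **owned;     /* like rkgm_owned: held before the assignor runs */
        int owned_cnt;
        const char **assigned;  /* like rkgm_assignment: assignor output, empty here */
        int assigned_cnt;
} member_t;

/* The sticky assignor seeds its notion of each member's previous
 * assignment. Seeding it from the post-assignor list (the bug) always
 * yields an empty set, so nothing is sticky; seeding it from the owned
 * list (the fix) lets the assignor keep partitions where they are. */
void seed_prev_assignment(const member_t *m) {
        int i;
        for (i = 0; i < m->owned_cnt; i++)  /* fix: use owned, not assigned */
                printf("prev assignment: %s\n", m->owned[i]);
}

int main(void) {
        const char *owned[] = { "apdx[5]", "apdx[11]" };
        member_t m = { owned, 2, NULL, 0 };
        seed_prev_assignment(&m);
        return 0;
}
```
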
@shanson7
Contributor Author

Is this fixed in 1.7.0?

@edenhill
Contributor

@shanson7 Yes!

@pratikthakkar24

pratikthakkar24 commented Jul 4, 2022

hey @edenhill @mhowlett, is the issue with partitions being shuffled between the remaining consumers, mentioned here, still open?
I am facing a similar issue and have reported it; kindly find the link below.
#3891

I am using librdkafka version 1.9.0.1
