Issues with "partition.assignment.strategy=cooperative-sticky" #3306
Comments
Thanks @shanson7 - can reproduce both, investigating.
I believe the shuffling is happening because the sticky assignor is not using the members' current (owned) assignment.
The COMMITFAIL is apparently causing the partitions to be lost as well, triggering an additional rebalance. I don't have a good idea how/why this could be happening; will investigate more Tuesdayish.
Also possibly related: 1b40aad
rkgm_assignment, which is the member assignment after running the assignor, was mixed up with rkgm_owned, which is the current member assignment before running the assignor. This resulted in the sticky assignor not taking the current assignment into consideration on rebalance and thus not being able to provide the stickiness.
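To make the distinction concrete, here is a simplified sketch. The field names match librdkafka's internal rd_kafka_group_member_s, but the struct and comments are illustrative, not the library's actual code:

```c
#include <librdkafka/rdkafka.h>

/* Illustrative sketch only: the per-member partition lists as seen by
 * an assignor. Field names follow librdkafka's rd_kafka_group_member_s. */
struct group_member {
        /* Input to the assignor: the partitions this member currently
         * owns, i.e. its assignment before the rebalance. */
        rd_kafka_topic_partition_list_t *rkgm_owned;
        /* Output of the assignor: the partitions this member receives
         * in the new assignment. */
        rd_kafka_topic_partition_list_t *rkgm_assignment;
};

/* A sticky assignor must seed its decision from rkgm_owned so that
 * existing ownership is preserved. With the two fields swapped, the
 * assignor sees no current ownership and produces a fresh,
 * non-sticky assignment on every rebalance. */
```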
Is this fixed in 1.7.0?
@shanson7 Yes!
Description
We are making changes to support cooperative/incremental rebalance. We are seeing errors during rebalance and what seems to be sub-optimal rebalancing, both when using the default `rebalance_cb` and our custom one, which just logs the partitions that are passed in and calls `incremental_(un)assign`. During rebalance it is not uncommon to get errors such as the COMMITFAIL error referenced in the comments above.
Additionally, it seems that when a consumer leaves the group, partitions are shuffled between the remaining consumers (I would expect a cooperative rebalance to only assign the partitions owned by the consumer that left, without revoking any from the remaining members).
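For reference, a cooperative-aware rebalance callback with the librdkafka C API typically looks like the sketch below; this is a minimal illustration of the pattern, not our exact callback (ours additionally logs the partition lists):

```c
#include <stdio.h>
#include <string.h>
#include <librdkafka/rdkafka.h>

static void rebalance_cb(rd_kafka_t *rk, rd_kafka_resp_err_t err,
                         rd_kafka_topic_partition_list_t *partitions,
                         void *opaque) {
        rd_kafka_error_t *error = NULL;
        (void)opaque;

        if (err == RD_KAFKA_RESP_ERR__ASSIGN_PARTITIONS) {
                /* Under the COOPERATIVE protocol the incremental calls
                 * must be used instead of rd_kafka_assign(). */
                if (!strcmp(rd_kafka_rebalance_protocol(rk), "COOPERATIVE"))
                        error = rd_kafka_incremental_assign(rk, partitions);
                else
                        rd_kafka_assign(rk, partitions);
        } else if (err == RD_KAFKA_RESP_ERR__REVOKE_PARTITIONS) {
                if (!strcmp(rd_kafka_rebalance_protocol(rk), "COOPERATIVE"))
                        error = rd_kafka_incremental_unassign(rk, partitions);
                else
                        rd_kafka_assign(rk, NULL);
        }

        if (error) {
                fprintf(stderr, "incremental rebalance failed: %s\n",
                        rd_kafka_error_string(error));
                rd_kafka_error_destroy(error);
        }
}
```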
How to reproduce
Bring up a consumer and let it start consuming. Bring up a second and let it rebalance. Bring up a third. Bring down the third.
When the third consumer is added, we see the above error. When the third consumer leaves, we see partitions shuffled unexpectedly. I can reproduce this consistently, so I can upload logs with whatever debug settings are relevant (at least "cgrp", I imagine).
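A minimal consumer skeleton for the repro; the broker address, group id, and topic name are placeholders:

```c
#include <librdkafka/rdkafka.h>

int main(void) {
        char errstr[512];
        rd_kafka_conf_t *conf = rd_kafka_conf_new();

        rd_kafka_conf_set(conf, "bootstrap.servers", "localhost:9092",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "group.id", "coop-repro",
                          errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "partition.assignment.strategy",
                          "cooperative-sticky", errstr, sizeof(errstr));
        rd_kafka_conf_set(conf, "debug", "cgrp", errstr, sizeof(errstr));
        /* Optionally install the rebalance_cb sketched above:
         * rd_kafka_conf_set_rebalance_cb(conf, rebalance_cb); */

        rd_kafka_t *rk = rd_kafka_new(RD_KAFKA_CONSUMER, conf,
                                      errstr, sizeof(errstr));
        rd_kafka_poll_set_consumer(rk);

        rd_kafka_topic_partition_list_t *topics =
                rd_kafka_topic_partition_list_new(1);
        rd_kafka_topic_partition_list_add(topics, "test-topic",
                                          RD_KAFKA_PARTITION_UA);
        rd_kafka_subscribe(rk, topics);
        rd_kafka_topic_partition_list_destroy(topics);

        /* Poll forever; run several instances of this binary and
         * start/stop them to trigger rebalances. */
        while (1) {
                rd_kafka_message_t *msg = rd_kafka_consumer_poll(rk, 1000);
                if (msg)
                        rd_kafka_message_destroy(msg);
        }
}
```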
Checklist
IMPORTANT: We will close issues where the checklist has not been completed.
Please provide the following information:
- librdkafka version (release number or git tag): 1.6.1
- Apache Kafka version: kafka_2.13-2.6.0
- librdkafka client configuration: partition.assignment.strategy=cooperative-sticky
- Operating system: rhel7
- Provide logs (with debug=.. as necessary) from librdkafka