Fix to remove fetch queue messages that blocked the destroy of rdkafka instances #4724

emasab · 2024-05-21T08:25:26Z

A circular dependency between a partition and the messages in its own fetch queue blocked the destroy of an instance. This happened when the partition was removed from the cluster while it was being consumed. Solved by purging the internal partition queue after the partition has been stopped and removed, which allows the reference count to reach zero and triggers the destroy.
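To illustrate the cycle described above, here is a minimal toy model in C. It is only a sketch: `toy_toppar_t`, `toy_op_t` and all `toy_*` functions are hypothetical stand-ins invented for this example and are not librdkafka's types or API. Each queued op holds a reference on the partition that owns the queue, so the partition's reference count cannot reach zero until the queue is purged.

```c
#include <assert.h>
#include <stdlib.h>

/* Toy model only: hypothetical stand-ins, not librdkafka types. */
typedef struct toy_op_s toy_op_t;

typedef struct toy_toppar_s {
        int refcnt;       /* references currently held on the partition */
        int destroyed;    /* set to 1 once refcnt drops to zero */
        toy_op_t *fetchq; /* head of the internal fetch queue */
} toy_toppar_t;

struct toy_op_s {
        toy_toppar_t *rktp; /* back-reference that keeps the partition alive */
        toy_op_t *next;
};

static void toy_toppar_keep(toy_toppar_t *rktp) {
        rktp->refcnt++;
}

static void toy_toppar_release(toy_toppar_t *rktp) {
        assert(rktp->refcnt > 0);
        if (--rktp->refcnt == 0)
                rktp->destroyed = 1;
}

/* Enqueue a fetched message: the op takes a reference on its partition. */
static int toy_fetchq_enq(toy_toppar_t *rktp) {
        toy_op_t *op = malloc(sizeof(*op));
        if (!op)
                return -1;
        toy_toppar_keep(rktp);
        op->rktp     = rktp;
        op->next     = rktp->fetchq;
        rktp->fetchq = op;
        return 0;
}

/* Purge: drop every queued op and the reference it holds.
 * Returns the number of ops removed. */
static int toy_fetchq_purge(toy_toppar_t *rktp) {
        int cnt = 0;
        toy_op_t *op;
        while ((op = rktp->fetchq) != NULL) {
                rktp->fetchq = op->next;
                toy_toppar_release(op->rktp);
                free(op);
                cnt++;
        }
        return cnt;
}
```

In this model, releasing the owner's own reference while ops are still queued leaves `destroyed` at 0. Only after `toy_fetchq_purge()` runs does the count reach zero, which mirrors the behaviour the fix relies on.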

emasab · 2024-05-21T08:27:41Z

How to reproduce: it happens sporadically with test 0113, subtest n_wildcard. Run it with TEST_DEBUG=all and until-fail.sh to see the refcnt not reaching zero.

zuellig · 2024-08-26T19:22:59Z

FWIW I've confirmed that this branch also fixes an intermittent but fairly frequent issue I've been observing. I reproduced it by stressing the client node's swap while also stressing the broker node's CPU. It took a few restart cycles, but within several hours the deadlocked destroy call occurred.

antaljanosbenjamin · 2024-09-11T17:21:02Z

Is there any expected date on this to be merged?

confluent-cla-assistant · 2024-09-24T13:16:45Z

🎉 All Contributor License Agreements have been signed. Ready to merge.
Please push an empty commit if you would like to re-run the checks to verify CLA status for all contributors.

antaljanosbenjamin · 2024-09-25T14:47:55Z

All Contributor License Agreements have been signed. Ready to merge.

One step closer! 🎉

pranavrth · 2024-10-28T05:36:39Z

src/rdkafka_broker.c

@@ -3407,6 +3407,8 @@ rd_kafka_broker_op_serve(rd_kafka_broker_t *rkb, rd_kafka_op_t *rko) {
                                : (topic_err
                                       ? topic_err
                                       : RD_KAFKA_RESP_ERR__UNKNOWN_PARTITION));
+
+                        rd_kafka_toppar_purge_internal_fetch_queue_maybe(rktp);


Shall we do this only for the Consumer?

Yes, it's better to avoid it for the producer: rktp_fetchq is allocated there too, but the queue isn't forwarded, so this way we avoid looping through an empty queue.
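A rough sketch of what a consumer-only guard could look like, under stated assumptions: `toy_client_type_t`, `toy_part_t` and `toy_purge_fetchq_maybe()` are hypothetical names made up for this illustration, and the body is not the actual implementation of `rd_kafka_toppar_purge_internal_fetch_queue_maybe()`. The queue is modelled as a simple counter, with each queued op assumed to hold one partition reference.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not librdkafka types. */
typedef enum { TOY_PRODUCER, TOY_CONSUMER } toy_client_type_t;

typedef struct toy_part_s {
        toy_client_type_t type; /* type of the client owning the partition */
        int fetchq_len;         /* ops queued on the internal fetch queue */
        int refcnt;             /* one reference per queued op plus owners */
} toy_part_t;

/* Purge the fetch queue only for consumers.
 * Returns the number of ops purged, 0 when the purge is skipped. */
static int toy_purge_fetchq_maybe(toy_part_t *p) {
        int purged;

        assert(p != NULL);

        /* Producer: the queue exists but is never filled, so skip it. */
        if (p->type != TOY_CONSUMER)
                return 0;

        purged        = p->fetchq_len;
        p->refcnt    -= purged; /* each purged op drops its reference */
        p->fetchq_len = 0;
        return purged;
}
```

The early return keeps the producer path free of any queue traversal, which is the point raised in the review comment.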

emasab · 2024-10-29T14:50:35Z

Addressed comment, updated CHANGELOG and rebased

ydsun90 · 2024-10-30T09:48:07Z

Hi, is there an estimated time to have this fix in a release? I have got the same issue when closing kafka consumer #4885.

pranavrth

LGTM!

ydsun90 · 2024-11-05T18:39:38Z

Hi, is there an estimated time to have this fix in a release? I have got the same issue when closing kafka consumer #4885.

Any plan to release the fix in a new version?

ydsun90 · 2024-11-13T10:58:40Z

Hi, is there an estimated time to have this fix in a release? I have got the same issue when closing kafka consumer #4885.

Any plan to release the fix in a new version?

@pranavrth @emasab Hi, is there any expected time to have this fix in a new release?

…a instances (#4724) Circular dependencies from a partition fetch queue message to the same partition blocked the destroy of an instance, that happened in case the partition was removed from the cluster while it was being consumed. Solved by purging internal partition queue, after being stopped and removed, to allow reference count to reach zero and trigger a destroy. Purging internal fetch queue on removing the partition only for the consumer.

Related PR: confluentinc/librdkafka#4724

emasab force-pushed the dev_fix_op_toppar_blocking_destroy branch from 108046f to 82080db Compare May 21, 2024 08:28

emasab mentioned this pull request May 21, 2024

Starting with 8e20e1ee, after broker goes down and back up, rd_kafka_destroy of groupconsumer hangs #4674

Open

emasab force-pushed the dev_fix_op_toppar_blocking_destroy branch from 82080db to 9b9d001 Compare June 12, 2024 13:06

emasab requested a review from a team as a code owner June 12, 2024 13:06

filimonov mentioned this pull request Jun 14, 2024

Kafka Engine improvements RFC ClickHouse/ClickHouse#65297

Open

ilejn mentioned this pull request Jul 12, 2024

ThreadSanitizer: lock-order-inversion (potential deadlock) #4782

Open

4 tasks

emasab mentioned this pull request Oct 15, 2024

ThreadSanitizer: data race + hang in rd_kafka_destroy (or rd_kafka_destroy_flags) #4811

Open

6 tasks

pranavrth reviewed Oct 28, 2024

emasab added 2 commits October 29, 2024 15:48

Purge internal fetch queue

7ce0d72

on removing the partition only for the consumer

airlock-confluentinc bot force-pushed the dev_fix_op_toppar_blocking_destroy branch from 9b9d001 to 7ce0d72 Compare October 29, 2024 14:49

emasab requested a review from pranavrth October 29, 2024 14:50

pranavrth approved these changes Oct 30, 2024

emasab merged commit 9199335 into master Oct 30, 2024
2 checks passed

emasab deleted the dev_fix_op_toppar_blocking_destroy branch October 30, 2024 12:44

sollhui mentioned this pull request Dec 3, 2024

[fix](third party) fix hang when destroy of rdkafka instances apache/doris#44913

Merged

16 tasks

liaoxin01 pushed a commit to apache/doris that referenced this pull request Dec 3, 2024

[fix](third party) fix hang when destroy of rdkafka instances (#44913)

802ba71

Related PR: confluentinc/librdkafka#4724

dataroaring pushed a commit to apache/doris that referenced this pull request Dec 5, 2024

[fix](third party) fix hang when destroy of rdkafka instances (#44913)

983edb6

Related PR: confluentinc/librdkafka#4724
