-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[fix][broker] Fix the behavior of delayed message in Key_Shared mode #20233
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
By the way, pulsar didn't support delayed messages in Key_Shared mode before, right?
/pulsarbot run-failure-checks |
@poorbarcode The document said
The key-shared subscription can allow out-of-order delivery, or we can say a loose ordering guarantee. So it still makes sense to let them work together. But we should have clear documentation for the expected behavior of the consumption of the delayed message under a key-shared subscription. |
Codecov Report
@@ Coverage Diff @@
## master #20233 +/- ##
=============================================
+ Coverage 34.48% 72.93% +38.44%
- Complexity 12537 31964 +19427
=============================================
Files 1614 1868 +254
Lines 126170 138591 +12421
Branches 13771 15246 +1475
=============================================
+ Hits 43509 101076 +57567
+ Misses 77053 29470 -47583
- Partials 5608 8045 +2437
Flags with carried forward coverage won't be shown. Click here to find out more.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work
…pache#20233) (cherry picked from commit b07ae00)
@poorbarcode Did you end up documenting the change of behavior? |
Could we removeConsumersFromRecentJoinedConsumers to fix it? |
As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label |
…ed mode (apache#20233)" This reverts commit 00f17e8.
Doc is WIP, waiting for review comments from @codelipenghui and @coderzc on the following plans:
|
Fixes #8703
Motivation
Fixed the unexpected consumer stuck issue of Key_Shared subscription by delayed messages.
The issue can be reproduced by the test provided in #8703.
Background
The expected behavior
if there are delayed messages in [0, 4] and [5,9]. All the messages from [5, 9] can be dispatched to consumer-B until all the messages from [0, 4] are acked. If there are messages with 1 hour delay in [0,4], consumer-B will be stuck for 1 hour. But maybe there are messages only with 10 sec delay in [5, 9]. This is still a very strange behavior for users, but it is expected for the current implementation
The unexpected behavior
The consumers can not able to receive all the messages even if all the delayed messages are ready to deliver to consumers. The newly added test can reproduce the issue. The logs of the test will help us to understand the issue.
But the delayed messages are delivered to the first consumer when the test fails. The receive timeout is 30 seconds, and the messages are with 10 seconds. So the delayed messages for the first consumer should be delivered.
The root cause
The broker first replays messages from the replay queue without looking at the delayed tracker.
pulsar/pulsar-broker/src/main/java/org/apache/pulsar/broker/service/persistent/PersistentDispatcherMultipleConsumers.java
Lines 1072 to 1084 in c4aec66
So, if all the messages in the relay queue are for the second consumer, the dispatch will run into an infinite loop(it also should be fixed, but not in this PR) because the messages are not allowed to be delivered at that time due to the first consumer hadn't acked the messages before the second consumer joined.
Modifications
Change the getMessagesToReplayNow method to dispatch messages based on the message ID (FIFO). So that the first consumer can receive the delayed messages, and then the second consumer can start to consume messages after the first consumer acked the delayed messages.
Verifying this change
New test is added
Does this pull request potentially affect one of the following parts:
If the box was checked, please highlight the changes
Documentation
doc
doc-required
doc-not-needed
doc-complete