Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix] [broker] Msg delivery is stuck due to items in the collection recentlyJoinedConsumers are out-of-order #23795

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Dec 31, 2024

Motivation

Version: 3.0.7, thanks for @shibd investigating the issue together and mentioning the key clue that items are out of order.

Issue 1: concurrentcy access a un-thread-safety collection

  • The variable recentlyJoinedConsumers of PersistentStickyKeyDispatcherMultipleConsumers is used to guarantee the delivery order of Key_Shared subscriptions, which is not a thread-safety collection.
  • Almost all operations that access this collection are in a lock block except in one place that the current PR fixes

Issue 2: the items in recentlyJoinedConsumers are out of order

  • First node 104142:122103
  • ...
  • middle node 1 104142:122103
  • middle node 2 104046:283984
  • ...
  • tail node 104046:284275

But we have more than one place that relies on the order rule, such as follows

        if (readType == ReadType.Replay) {
            // broker assumes the first node is the eldest one
            PositionImpl minReadPositionForRecentJoinedConsumer = recentlyJoinedConsumers.values().iterator().next();
            if (minReadPositionForRecentJoinedConsumer != null
                    && minReadPositionForRecentJoinedConsumer.compareTo(maxReadPosition) < 0) {
                maxReadPosition = minReadPositionForRecentJoinedConsumer;
            }
        }
            PositionImpl nextPositionOfTheMarkDeletePosition =
                    ((ManagedLedgerImpl) cursor.getManagedLedger()).getNextValidPosition(mdp);
            while (itr.hasNext()) {
                Map.Entry<Consumer, PositionImpl> entry = itr.next();
                if (entry.getValue().compareTo(nextPositionOfTheMarkDeletePosition) <= 0) {
                    itr.remove();
                    hasConsumerRemovedFromTheRecentJoinedConsumers = true;
                } else {
                    // break if one node does not match the check
                    break;
                }
            }
 "consumersAfterMarkDeletePosition" : {
        "rudder-src-router-19-2" : "104142:122103",
        "rudder-src-router-22-3" : "104142:122103",
        "rudder-src-router-22-2" : "104142:122103",
        "rudder-src-router-12-3" : "104142:122103",
        "rudder-src-router-2-0" : "104142:122103",
        "rudder-src-router-4-1" : "104142:122103",
        "rudder-src-router-18-3" : "104142:122103",
        "rudder-src-router-29-2" : "104046:283984",
        "rudder-src-router-13-1" : "104046:283984",
        "rudder-src-router-13-0" : "104046:283984",
        "rudder-src-router-21-0" : "104046:283984",
        "rudder-src-router-21-1" : "104046:283984",
        "rudder-src-router-17-1" : "104046:283984",
        "rudder-src-router-1-1" : "104046:283984",
        "rudder-src-router-24-1" : "104046:283990",
        "rudder-src-router-3-1" : "104046:283990",
        "rudder-src-router-3-3" : "104046:283990",
        "rudder-src-router-13-2" : "104046:283990",
        "rudder-src-router-7-1" : "104046:283990",
        "rudder-src-router-7-3" : "104046:283990",
        "rudder-src-router-12-2" : "104046:283990",
        "rudder-src-router-2-1" : "104046:284068",
        "rudder-src-router-17-2" : "104046:284068",
        "rudder-src-router-17-0" : "104046:284068",
        "rudder-src-router-6-0" : "104046:284123",
        "rudder-src-router-1-2" : "104046:284123",
        "rudder-src-router-1-0" : "104046:284123",
        "rudder-src-router-18-0" : "104046:284123",
        "rudder-src-router-0-0" : "104046:284123",
        "rudder-src-router-16-0" : "104046:284123",
        "rudder-src-router-27-3" : "104046:284123",
        "rudder-src-router-15-1" : "104046:284124",
        "rudder-src-router-27-2" : "104046:284124",
        "rudder-src-router-15-2" : "104046:284124",
        "rudder-src-router-22-1" : "104046:284138",
        "rudder-src-router-12-0" : "104046:284138",
        "rudder-src-router-24-3" : "104046:284150",
        "rudder-src-router-4-0" : "104046:284150",
        "rudder-src-router-28-0" : "104046:284150",
        "rudder-src-router-28-1" : "104046:284150",
        "rudder-src-router-23-2" : "104046:284150",
        "rudder-src-router-6-1" : "104046:284154",
        "rudder-src-router-5-1" : "104046:284154",
        "rudder-src-router-6-3" : "104046:284154",
        "rudder-src-router-6-2" : "104046:284157",
        "rudder-src-router-5-2" : "104046:284157",
        "rudder-src-router-5-0" : "104046:284157",
        "rudder-src-router-5-3" : "104046:284161",
        "rudder-src-router-0-3" : "104046:284164",
        "rudder-src-router-28-3" : "104046:284164",
        "rudder-src-router-26-3" : "104046:284164",
        "rudder-src-router-11-3" : "104046:284164",
        "rudder-src-router-14-2" : "104046:284169",
        "rudder-src-router-4-2" : "104046:284177",
        "rudder-src-router-29-1" : "104046:284177",
        "rudder-src-router-2-3" : "104046:284177",
        "rudder-src-router-22-0" : "104046:284177",
        "rudder-src-router-23-1" : "104046:284177",
        "rudder-src-router-20-0" : "104046:284182",
        "rudder-src-router-10-2" : "104046:284182",
        "rudder-src-router-20-2" : "104046:284182",
        "rudder-src-router-25-3" : "104046:284182",
        "rudder-src-router-8-3" : "104046:284182",
        "rudder-src-router-8-2" : "104046:284182",
        "rudder-src-router-26-0" : "104046:284182",
        "rudder-src-router-26-2" : "104046:284182",
        "rudder-src-router-4-3" : "104046:284182",
        "rudder-src-router-9-1" : "104046:284182",
        "rudder-src-router-18-1" : "104046:284182",
        "rudder-src-router-0-2" : "104046:284182",
        "rudder-src-router-21-2" : "104046:284182",
        "rudder-src-router-7-2" : "104046:284193",
        "rudder-src-router-27-0" : "104046:284193",
        "rudder-src-router-27-1" : "104046:284193",
        "rudder-src-router-16-3" : "104046:284193",
        "rudder-src-router-21-3" : "104046:284193",
        "rudder-src-router-8-1" : "104046:284193",
        "rudder-src-router-10-1" : "104046:284193",
        "rudder-src-router-10-3" : "104046:284193",
        "rudder-src-router-15-0" : "104046:284193",
        "rudder-src-router-11-1" : "104046:284193",
        "rudder-src-router-11-2" : "104046:284193",
        "rudder-src-router-17-3" : "104046:284193",
        "rudder-src-router-25-0" : "104046:284193",
        "rudder-src-router-19-1" : "104046:284200",
        "rudder-src-router-24-0" : "104046:284200",
        "rudder-src-router-20-1" : "104046:284200",
        "rudder-src-router-14-0" : "104046:284200",
        "rudder-src-router-9-3" : "104046:284200",
        "rudder-src-router-8-0" : "104046:284204",
        "rudder-src-router-14-3" : "104046:284206",
        "rudder-src-router-9-0" : "104046:284206",
        "rudder-src-router-23-0" : "104046:284206",
        "rudder-src-router-23-3" : "104046:284211",
        "rudder-src-router-19-0" : "104046:284211",
        "rudder-src-router-3-0" : "104046:284221",
        "rudder-src-router-3-2" : "104046:284221",
        "rudder-src-router-14-1" : "104046:284223",
        "rudder-src-router-9-2" : "104046:284223",
        "rudder-src-router-29-0" : "104046:284223",
        "rudder-src-router-25-1" : "104046:284223",
        "rudder-src-router-25-2" : "104046:284223",
        "rudder-src-router-13-3" : "104046:284247",
        "rudder-src-router-10-0" : "104046:284251",
        "rudder-src-router-20-3" : "104046:284259",
        "rudder-src-router-18-2" : "104046:284266",
        "rudder-src-router-28-2" : "104046:284266",
        "rudder-src-router-26-1" : "104046:284266",
        "rudder-src-router-0-1" : "104046:284266",
        "rudder-src-router-7-0" : "104046:284266",
        "rudder-src-router-2-2" : "104046:284266",
        "rudder-src-router-15-3" : "104046:284268",
        "rudder-src-router-11-0" : "104046:284268",
        "rudder-src-router-19-3" : "104046:284268",
        "rudder-src-router-1-3" : "104046:284268",
        "rudder-src-router-24-2" : "104046:284275",
        "rudder-src-router-12-1" : "104046:284275",
        "rudder-src-router-16-1" : "104046:284275",
        "rudder-src-router-16-2" : "104046:284275"
}

Screenshot 2024-12-31 at 18 04 16

Modifications

  • fix the concurrency issue
  • sort recentlyJoinedConsumers when adding/removing consumers, I have not found the root cause of items in recentlyJoinedConsumers being out of order, I will continue working on it. Before that let us fix the issue fast

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository: x

… un-thread-safety collection recentlyJoinedConsumers
@poorbarcode poorbarcode self-assigned this Dec 31, 2024
@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Dec 31, 2024
@poorbarcode poorbarcode added the type/bug The PR fixed a bug or issue reported a bug label Dec 31, 2024
@poorbarcode
Copy link
Contributor Author

/pulsarbot rerun-failure-checks

@poorbarcode poorbarcode changed the title [fix] [broker] Msg delivery is stuck due to concurrency modifying the un-thread-safety collection recentlyJoinedConsumers [fix] [broker] Msg topic stats failed due to concurrency modifying&reading the un-thread-safety collection recentlyJoinedConsumers Dec 31, 2024
… un-thread-safety collection recentlyJoinedConsumers
… un-thread-safety collection recentlyJoinedConsumers
@poorbarcode poorbarcode changed the title [fix] [broker] Msg topic stats failed due to concurrency modifying&reading the un-thread-safety collection recentlyJoinedConsumers [fix] [broker] Msg delivery is stuck due to items in the collection recentlyJoinedConsumers are out-of-order Dec 31, 2024
@poorbarcode poorbarcode requested a review from lhotari January 1, 2025 12:48
@poorbarcode poorbarcode requested a review from Technoboy- January 2, 2025 02:03
@codecov-commenter
Copy link

codecov-commenter commented Jan 2, 2025

Codecov Report

Attention: Patch coverage is 96.29630% with 1 line in your changes missing coverage. Please review.

Project coverage is 74.14%. Comparing base (bbc6224) to head (5a190f4).
Report is 824 commits behind head on master.

Files with missing lines Patch % Lines
...ntStickyKeyDispatcherMultipleConsumersClassic.java 96.29% 0 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #23795      +/-   ##
============================================
+ Coverage     73.57%   74.14%   +0.56%     
+ Complexity    32624    31761     -863     
============================================
  Files          1877     1853      -24     
  Lines        139502   143412    +3910     
  Branches      15299    16286     +987     
============================================
+ Hits         102638   106326    +3688     
+ Misses        28908    28711     -197     
- Partials       7956     8375     +419     
Flag Coverage Δ
inttests 26.70% <0.00%> (+2.12%) ⬆️
systests 23.18% <0.00%> (-1.14%) ⬇️
unittests 73.64% <96.29%> (+0.79%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...ntStickyKeyDispatcherMultipleConsumersClassic.java 88.04% <96.29%> (ø)

... and 1026 files with indirect coverage changes

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@lhotari lhotari merged commit 4a01423 into apache:master Jan 2, 2025
52 checks passed
lhotari pushed a commit that referenced this pull request Jan 2, 2025
…nsumers are out-of-order (#23795)

(cherry picked from commit 4a01423)
lhotari pushed a commit that referenced this pull request Jan 2, 2025
…nsumers are out-of-order (#23795)

(cherry picked from commit 4a01423)
lhotari pushed a commit that referenced this pull request Jan 2, 2025
…centlyJoinedConsumers are out-of-order (#23795)

(cherry picked from commit 4a01423)
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Jan 3, 2025
…nsumers are out-of-order (apache#23795)

(cherry picked from commit 4a01423)
(cherry picked from commit ca535a2)
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Jan 3, 2025
…nsumers are out-of-order (apache#23795)

(cherry picked from commit 4a01423)
(cherry picked from commit ca535a2)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants