Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][meta] Bookie Info lost by notification race condition. #20642

Merged
merged 28 commits into from
Jun 30, 2023
Merged

[fix][meta] Bookie Info lost by notification race condition. #20642

merged 28 commits into from
Jun 30, 2023

Conversation

mattisonchao
Copy link
Member

@mattisonchao mattisonchao commented Jun 25, 2023

Motivation

This fixes #20643.

Investigating

Why is the exception Bookie handle is not available?

We found this problem in the broker's continued to report Bookie handle is not available exception longer than 6 hours. The call chain should be as follows:

  1. Bk client is trying to obtain a new channel for connection.
  2. The channel was disconnected, So it is trying to connect again.
  3. Bk Client is trying to get the bookie info by bookie id.
  4. Bk client uses pulsar implemented PulsarRegistrationClient to get bookie info, but the local cache doesn't have it, and then throws the BKBookieHandleNotAvailableException

Why did cache lose Bookie's info?

The following is an analysis that has yet to be rigorously tested. (Since this will consume a relatively long time)

  1. Check out our implementation. We will update the local cache async/sync(Depending on the internal metadata cache) and invalidate the local cache synchronously.
  2. In Bookie ZKRegistrationManager's implementation, transit bookie mode from writable to read-only should create a read-only bookie node and then delete the writable bookie node.
  3. writable -> read-only notification should be as follow:
  • CREATED [read-only path]:bookieId
  • DELETE [writable path]: bookieId
  1. Consider that metadata has a cache that invokes the CREATEED notification logic synchronously. So, we may create this bookie info first and then delete it. (bookie info lost here)

Modifications

  • Make the notification process sequentially.
  • Support bookie mode in the local cache.

Verifying this change

  • Make sure that the change passes the CI checks.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions
Copy link

@mattisonchao Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

@github-actions github-actions bot added doc-not-needed Your PR changes do not impact docs and removed doc-label-missing labels Jun 25, 2023
@mattisonchao mattisonchao changed the title [fix][meta] Bookie Info lost by notification race condition. [DON'T MERGE][fix][meta] Bookie Info lost by notification race condition. Jun 27, 2023
Copy link
Contributor

@hangc0276 hangc0276 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice catch!

@mattisonchao mattisonchao changed the title [DON'T MERGE][fix][meta] Bookie Info lost by notification race condition. [fix][meta] Bookie Info lost by notification race condition. Jun 29, 2023
@mattisonchao mattisonchao reopened this Jun 29, 2023
@mattisonchao mattisonchao requested a review from JooHyukKim June 29, 2023 09:11
@codecov-commenter
Copy link

Codecov Report

Merging #20642 (40d326a) into master (43b3622) will increase coverage by 1.06%.
The diff coverage is 81.11%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #20642      +/-   ##
============================================
+ Coverage     72.05%   73.12%   +1.06%     
- Complexity    31718    32061     +343     
============================================
  Files          1855     1867      +12     
  Lines        138376   138782     +406     
  Branches      15198    15266      +68     
============================================
+ Hits          99703   101479    +1776     
+ Misses        30656    29262    -1394     
- Partials       8017     8041      +24     
Flag Coverage Δ
inttests 24.11% <37.77%> (+0.01%) ⬆️
systests 24.97% <25.55%> (?)
unittests 72.41% <81.11%> (+0.60%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
.../metadata/bookkeeper/PulsarRegistrationClient.java 80.85% <81.11%> (-4.11%) ⬇️

... and 243 files with indirect coverage changes

@mattisonchao mattisonchao merged commit 43dc123 into apache:master Jun 30, 2023
@mattisonchao mattisonchao deleted the bk-registration/fix branch June 30, 2023 01:21
@tisonkun
Copy link
Member

Great.

@mattisonchao one suggestion is that you can link to a permalink instead of blob/main where the line number will quickly be invalid.

@liangyepianzhou
Copy link
Contributor

@hangc0276 @mattisonchao The zookeeper version quoted in this P person is 3.8.1(master) instead of 3.6.4 (branch-2.10).
We should not upgrade the version in branch 2.10, according to #19425.
Could you please help take a look at whether we need to cherry-pick this PR to branch 2.10?

nicoloboschi pushed a commit to datastax/pulsar that referenced this pull request Jul 17, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug] Bookie handle is not available for a long time.