Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][ml] There are two same-named managed ledgers in the one broker #18688

Merged
merged 5 commits into from
Jun 7, 2023

Conversation

poorbarcode
Copy link
Contributor

@poorbarcode poorbarcode commented Nov 30, 2022

Motivation

In PR #17526, we know that a topic can be closed multiple times and that it is possible to have two same-named objects of class Persistenttopic in the same broker instance.

We know that closing the topic triggers the closure of the ManagedLedger. The topic object can be closed multiple times which means the ManagedLedger can be closed multiple times. This PR is used to prove: If a ManagedLedger is closed more than once, and switched ledgerHandle operation of ManagedLedger and method closed executed concurrently, there will be two of the same-named Managedledger in the same broker, possibly with different numbers of cursors.

If both Managedledgers are available and there are different numbers of cursors, this can cause the operation trimLedgers to delete too many ledgers from the meta of ManagedLedger.

Here is the process:

managedLedger_1.close switch ledgerHandle(managedLedger_1) create managedLedger_2 create managedLedger_3
close current LedgerHandle
async create new LedgerHandle
do close
set the state to Closed
remove managedLedger_1 from ManagedFactory.ledgers
set current ledger to the new LedgerHandle
set the state to LedgerOpened
do close the second time
create managedLedger_2
put managedLedger_2 to ManagedFactory.ledgers
remove managedLedger_2 from ManagedFactory.ledgers
create cursor_1 into managedLedger_2
create managedLedger_3
put managedLedger_3 to ManagedFactory.ledgers
recover cursor_1 from meta
create cursor_2 into managedLedger_3
has one cursor cursor_1 has two cursor: cursor_1, cursor_2

Modifications

  • Use Method remove(k,v) instead of remove(k) when deleting ManagedLedger from ManagedLedgerFactroy.ledgers.
  • fix the wrong state of the closed managed ledger.
  • release the ledgerHandle, which is created after the ML is closed

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

Matching PR in forked repository

PR in forked repository:

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Nov 30, 2022
@poorbarcode poorbarcode force-pushed the fix/managed_ledger_duplicate branch from 9089c62 to c3bf336 Compare November 30, 2022 18:59
@poorbarcode poorbarcode force-pushed the fix/managed_ledger_duplicate branch from c3bf336 to b439bcd Compare December 15, 2022 20:39
@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Jan 15, 2023
@poorbarcode poorbarcode force-pushed the fix/managed_ledger_duplicate branch from b439bcd to e491665 Compare February 13, 2023 05:24
@Technoboy- Technoboy- added this to the 3.0.0 milestone Feb 13, 2023
@Technoboy- Technoboy- changed the title [fix] [ml] There are two same-named managed ledgers in the one broker [fix][ml] There are two same-named managed ledgers in the one broker Feb 13, 2023
@Technoboy- Technoboy- closed this Feb 13, 2023
@Technoboy- Technoboy- reopened this Feb 13, 2023
@codecov-commenter
Copy link

Codecov Report

Merging #18688 (f64173d) into master (5c8f929) will increase coverage by 0.60%.
The diff coverage is 53.05%.

Impacted file tree graph

@@             Coverage Diff              @@
##             master   #18688      +/-   ##
============================================
+ Coverage     63.30%   63.90%   +0.60%     
+ Complexity    26123     3490   -22633     
============================================
  Files          1836     1843       +7     
  Lines        134416   135163     +747     
  Branches      14772    14859      +87     
============================================
+ Hits          85087    86371    +1284     
+ Misses        41649    40949     -700     
- Partials       7680     7843     +163     
Flag Coverage Δ
inttests 24.70% <5.95%> (-0.18%) ⬇️
systests 25.36% <0.49%> (-0.20%) ⬇️
unittests 61.21% <53.05%> (+0.58%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...g/apache/pulsar/client/cli/AbstractCmdConsume.java 20.45% <20.45%> (ø)
.../java/org/apache/pulsar/client/cli/CmdConsume.java 40.44% <25.00%> (+8.45%) ⬆️
...ain/java/org/apache/pulsar/client/cli/CmdRead.java 41.13% <41.13%> (ø)
...r/impl/SnapshotSegmentAbortedTxnProcessorImpl.java 62.79% <62.79%> (ø)
...kkeeper/mledger/impl/ManagedLedgerFactoryImpl.java 78.86% <71.42%> (-0.57%) ⬇️
...ransaction/buffer/impl/TopicTransactionBuffer.java 74.27% <85.71%> (+11.77%) ⬆️
...r/service/SystemTopicTxnBufferSnapshotService.java 78.26% <100.00%> (ø)
...er/systopic/NamespaceEventsSystemTopicFactory.java 85.71% <100.00%> (+12.38%) ⬆️
...ransactionBufferSnapshotBaseSystemTopicClient.java 72.36% <100.00%> (+4.36%) ⬆️
...er/impl/SingleSnapshotAbortedTxnProcessorImpl.java 77.65% <100.00%> (+7.00%) ⬆️
... and 217 more

@github-actions
Copy link

The pr had no activity for 30 days, mark with Stale label.

@github-actions github-actions bot added the Stale label Mar 16, 2023
@poorbarcode poorbarcode modified the milestones: 3.0.0, 3.1.0 Apr 10, 2023
@lhotari lhotari requested review from hangc0276 and dlg99 May 10, 2023 10:42
@lhotari
Copy link
Member

lhotari commented May 10, 2023

@poorbarcode I assume this PR is still relevant? Please merge latest changes from master and check that tests still pass.

@poorbarcode poorbarcode force-pushed the fix/managed_ledger_duplicate branch from f64173d to 35a073e Compare May 10, 2023 10:54
@poorbarcode
Copy link
Contributor Author

@lhotari

I assume this PR is still relevant?

Yes.

Please merge latest changes from master and check that tests still pass.

Done.

Could you please help review this PR?

@lhotari
Copy link
Member

lhotari commented May 10, 2023

Could you please help review this PR?

@poorbarcode The change to production code itself looks good, but I don't like the test. Addressing that would require more changes to ML factory to make it possible to sub class it for tests with test hooks instead of relying on Mockito.

One way would be to extract a protected method for the logic here:

final ManagedLedgerImpl newledger = config.getShadowSource() == null
? new ManagedLedgerImpl(this, bk, store, config, scheduledExecutor, name, mlOwnershipChecker)
: new ShadowManagedLedgerImpl(this, bk, store, config, scheduledExecutor, name,
mlOwnershipChecker);

and here to be able to use a test specific ManagedLedgerFactoryImpl subclass where it's possible to create a custom ManagerLedgerImpl subclass with the required hooks for testing:

factory = new ManagedLedgerFactoryImpl(metadataStore, bkc);

Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch @poorbarcode. The production code change is good. I've suggested some changes to the way the test is implemented.

Comment on lines 105 to 134
private ManagedLedgerImpl makeManagedLedgerWorksWithStrictlySequentially(ManagedLedgerImpl originalManagedLedger,
ProcessCoordinator processCoordinator)
throws Exception {
ManagedLedgerImpl sequentiallyManagedLedger = spy(originalManagedLedger);
// step-1.
doAnswer(invocation -> {
synchronized (originalManagedLedger) {
// step-3.
// Wait for `managedLedger.close`, then do task: "asyncCreateLedger()".
// Because the thread selector in "managedLedger.executor" is random logic, so it is possible to fail.
// Adding 1000 tasks to stuck the executor gives a high chance of success.
for (int i = 0; i < 1000; i++) {
originalManagedLedger.getExecutor().execute(() -> {
processCoordinator.waitPreviousAndSetStep(3);
});
}
LedgerHandle lh = (LedgerHandle) invocation.getArguments()[0];
processCoordinator.waitPreviousAndSetStep(1);
originalManagedLedger.ledgerClosed(lh);
}
return null;
}).when(sequentiallyManagedLedger).ledgerClosed(any(LedgerHandle.class));
// step-2.
doAnswer(invocation -> {
processCoordinator.waitPreviousAndSetStep(2);
originalManagedLedger.close();
return null;
}).when(sequentiallyManagedLedger).close();
return sequentiallyManagedLedger;
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this looks like a hack, especially the loop of adding 1000 tasks to the executor. The "ProcessCoordinator" implementation looks like something that could be handled with java.util.concurrent.Phaser.
It would be better to modify ManagedLedgerFactoryImpl so that it's possible to override a method that creates the ledger instance. That way it would be possible to have a way to override the method for tests and inject test logic without relying on Mockito, which isn't thread safe. That itself could cause issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari

Thanks for your suggestion. I have rewritten the test to make it simpler. Could you take a look again?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@lhotari

And I added other changes[1]:

  • fix the wrong state of the closed managed ledger.
  • release the ledgerHandle, which is created after the ML is closed

[1]: https://github.com/apache/pulsar/pull/18688/files#diff-f6a849bd8fdb782ef6c17a2e07a2c54c3dc7d1655c00ec3546d5f3b3fc61e970R1531

@lhotari lhotari requested a review from codelipenghui May 10, 2023 11:31
@poorbarcode poorbarcode force-pushed the fix/managed_ledger_duplicate branch from 35a073e to 40298db Compare May 10, 2023 17:41
@github-actions github-actions bot removed the Stale label May 11, 2023
Copy link
Member

@lhotari lhotari left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Good work @poorbarcode .

@michaeljmarshall
Copy link
Member

As discussed on the mailing list https://lists.apache.org/thread/w4jzk27qhtosgsz7l9bmhf1t7o9mxjhp, there is no plan to release 2.9.6, so I am going to remove the release/2.9.6 label

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants