Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[fix][broker] Fix unload operation stuck when use ExtensibleLoadManager #21332

Merged

Conversation

Demogorgon314
Copy link
Member

@Demogorgon314 Demogorgon314 commented Oct 9, 2023

Fix #20157

Motivation

When enabling the topic level policies in ExtensibleLoadManagerImplTest, the unload operation will get stuck

The root cause is when unloading the bundle, it will close the topic, and if UpdateCursorInfo fails, it will call the ledger.mlOwnershipChecker.get().

// it means previous owner of the ml might have updated the version incorrectly. So, check
// the ownership and refresh the version again.
if (ledger.mlOwnershipChecker != null && ledger.mlOwnershipChecker.get()) {

The mlOwnershipChecker will check the topic owner again.

}, () -> isTopicNsOwnedByBroker(topicName), null);

Since the current state is Releasing, this feature will not be complete since the closeServiceUnit is incomplete, and it will not send the Free state to the channel.

The stack log.

"bookkeeper-ml-scheduler-OrderedScheduler-9-0" #45 prio=5 os_prio=31 cpu=31.51ms elapsed=39.28s tid=0x0000000118db2600 nid=0xb303 waiting on condition  [0x0000000176b79000]
   java.lang.Thread.State: TIMED_WAITING (parking)
	at jdk.internal.misc.Unsafe.park([email protected]/Native Method)
	- parking to wait for  <0x0000200030154288> (a java.util.concurrent.CompletableFuture$Signaller)
	at java.util.concurrent.locks.LockSupport.parkNanos([email protected]/LockSupport.java:252)
	at java.util.concurrent.CompletableFuture$Signaller.block([email protected]/CompletableFuture.java:1866)
	at java.util.concurrent.ForkJoinPool.unmanagedBlock([email protected]/ForkJoinPool.java:3463)
	at java.util.concurrent.ForkJoinPool.managedBlock([email protected]/ForkJoinPool.java:3434)
	at java.util.concurrent.CompletableFuture.timedGet([email protected]/CompletableFuture.java:1939)
	at java.util.concurrent.CompletableFuture.get([email protected]/CompletableFuture.java:2095)
	at org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:1153)
	at java.lang.invoke.LambdaForm$DMH/0x00000008010a1000.invokeVirtual([email protected]/LambdaForm$DMH)
	at java.lang.invoke.LambdaForm$MH/0x0000000801570000.invoke([email protected]/LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/0x0000000800f4a400.invoke([email protected]/LambdaForm$MH)
	at java.lang.invoke.LambdaForm$MH/0x0000000800c14400.invokeExact_MT([email protected]/LambdaForm$MH)
	at java.lang.invoke.MethodHandle.invokeWithArguments([email protected]/MethodHandle.java:732)
	at org.mockito.internal.util.reflection.InstrumentationMemberAccessor$Dispatcher$ByteBuddy$ILgTSeGv.invokeWithArguments(Unknown Source)
	at org.mockito.internal.util.reflection.InstrumentationMemberAccessor.invoke(InstrumentationMemberAccessor.java:239)
	at org.mockito.internal.util.reflection.ModuleMemberAccessor.invoke(ModuleMemberAccessor.java:55)
	at org.mockito.internal.creation.bytebuddy.MockMethodAdvice.tryInvoke(MockMethodAdvice.java:333)
	at org.mockito.internal.creation.bytebuddy.MockMethodAdvice.access$500(MockMethodAdvice.java:60)
	at org.mockito.internal.creation.bytebuddy.MockMethodAdvice$RealMethodCall.invoke(MockMethodAdvice.java:253)
	at org.mockito.internal.invocation.InterceptedInvocation.callRealMethod(InterceptedInvocation.java:142)
	at org.mockito.internal.stubbing.answers.CallsRealMethods.answer(CallsRealMethods.java:45)
	at org.mockito.Answers.answer(Answers.java:99)
	at org.mockito.internal.handler.MockHandlerImpl.handle(MockHandlerImpl.java:110)
	at org.mockito.internal.handler.NullResultGuardian.handle(NullResultGuardian.java:29)
	at org.mockito.internal.handler.InvocationNotifierHandler.handle(InvocationNotifierHandler.java:34)
	at org.mockito.internal.creation.bytebuddy.MockMethodInterceptor.doIntercept(MockMethodInterceptor.java:82)
	at org.mockito.internal.creation.bytebuddy.MockMethodAdvice.handle(MockMethodAdvice.java:151)
	at org.apache.pulsar.broker.namespace.NamespaceService.isServiceUnitOwned(NamespaceService.java:1153)
	at org.apache.pulsar.broker.service.BrokerService.isTopicNsOwnedByBroker(BrokerService.java:2141)
	at org.apache.pulsar.broker.service.BrokerService.lambda$createPersistentTopic$64(BrokerService.java:1751)
	at org.apache.pulsar.broker.service.BrokerService$$Lambda$871/0x00000008013f22d0.get(Unknown Source)
	at org.apache.bookkeeper.mledger.impl.ManagedCursorImpl$25.operationFailed(ManagedCursorImpl.java:2691)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl.lambda$asyncUpdateCursorInfo$17(MetaStoreImpl.java:271)
	at org.apache.bookkeeper.mledger.impl.MetaStoreImpl$$Lambda$1924/0x0000000801540818.run(Unknown Source)
	at org.apache.bookkeeper.common.util.OrderedExecutor$TimedRunnable.run(OrderedExecutor.java:201)
	at org.apache.bookkeeper.common.util.SingleThreadSafeScheduledExecutorService$SafeRunnable.run(SingleThreadSafeScheduledExecutorService.java:46)
	at java.util.concurrent.Executors$RunnableAdapter.call([email protected]/Executors.java:539)
	at java.util.concurrent.FutureTask.run([email protected]/FutureTask.java:264)
	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run([email protected]/ScheduledThreadPoolExecutor.java:304)
	at java.util.concurrent.ThreadPoolExecutor.runWorker([email protected]/ThreadPoolExecutor.java:1136)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run([email protected]/ThreadPoolExecutor.java:635)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run([email protected]/Thread.java:833)

Modifications

  • Set inFlightStateWaitingTimeInMillis and ownershipMonitorDelayTimeInSecs to 5 sec to avoid unloading stuck in unit test.
  • Handle exception events in the unload manager.

Documentation

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

@github-actions github-actions bot added the doc-not-needed Your PR changes do not impact docs label Oct 9, 2023
@lhotari
Copy link
Member

lhotari commented Oct 9, 2023

Great work @Demogorgon314!

@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/Fix-unload-get-stuck branch from 552dda6 to 965ae9c Compare October 10, 2023 01:25
@Demogorgon314 Demogorgon314 self-assigned this Oct 10, 2023
@Demogorgon314 Demogorgon314 added type/bug The PR fixed a bug or issue reported a bug area/broker labels Oct 10, 2023
@codecov-commenter
Copy link

codecov-commenter commented Oct 10, 2023

Codecov Report

Merging #21332 (0b6670d) into master (a7acdf6) will increase coverage by 36.56%.
Report is 14 commits behind head on master.
The diff coverage is 82.35%.

Impacted file tree graph

@@              Coverage Diff              @@
##             master   #21332       +/-   ##
=============================================
+ Coverage     36.73%   73.30%   +36.56%     
- Complexity    12220    32591    +20371     
=============================================
  Files          1711     1888      +177     
  Lines        130588   140261     +9673     
  Branches      14227    15414     +1187     
=============================================
+ Hits          47971   102812    +54841     
+ Misses        76264    29389    -46875     
- Partials       6353     8060     +1707     
Flag Coverage Δ
inttests 24.19% <0.00%> (+0.01%) ⬆️
systests 24.79% <11.76%> (+0.11%) ⬆️
unittests 72.58% <82.35%> (+40.68%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files Coverage Δ
...dbalance/extensions/ExtensibleLoadManagerImpl.java 77.75% <100.00%> (+75.87%) ⬆️
...xtensions/channel/ServiceUnitStateChannelImpl.java 85.57% <100.00%> (+85.03%) ⬆️
...ache/pulsar/broker/namespace/NamespaceService.java 71.87% <ø> (+28.10%) ⬆️
...rg/apache/pulsar/broker/service/BrokerService.java 80.95% <100.00%> (+24.69%) ⬆️
.../loadbalance/extensions/manager/UnloadManager.java 81.63% <40.00%> (+81.63%) ⬆️

... and 1446 files with indirect coverage changes

@Demogorgon314 Demogorgon314 marked this pull request as draft October 10, 2023 03:43
@lhotari
Copy link
Member

lhotari commented Oct 10, 2023

Did the exception now change to this one?

https://github.com/apache/pulsar/actions/runs/6463552567/job/17547185928#step:10:1542

   --- An unexpected error occurred in the server ---
  
  Message: Namespace is being unloaded, cannot add topic persistent://public/test/__change_events
  
  Stacktrace:
  
  org.apache.pulsar.broker.service.BrokerServiceException$ServiceUnitNotReadyException: Namespace is being unloaded, cannot add topic persistent://public/test/__change_events
  	at org.apache.pulsar.broker.service.BrokerService.lambda$checkOwnershipAndCreatePersistentTopic$58(BrokerService.java:1596)
  	at java.base/java.util.concurrent.CompletableFuture.uniAcceptNow(CompletableFuture.java:757)
  	at java.base/java.util.concurrent.CompletableFuture.uniAcceptStage(CompletableFuture.java:735)
  	at java.base/java.util.concurrent.CompletableFuture.thenAccept(CompletableFuture.java:2182)
  	at org.apache.pulsar.broker.service.BrokerService.checkOwnershipAndCreatePersistentTopic(BrokerService.java:1573)
  	at org.apache.pulsar.broker.service.BrokerService.lambda$loadOrCreatePersistentTopic$51(BrokerService.java:1520)
  	at java.base/java.util.concurrent.CompletableFuture.uniRunNow(CompletableFuture.java:819)
  	at java.base/java.util.concurrent.CompletableFuture.uniRunStage(CompletableFuture.java:803)
  	at java.base/java.util.concurrent.CompletableFuture.thenRun(CompletableFuture.java:2195)
  	at org.apache.pulsar.broker.service.BrokerService.loadOrCreatePersistentTopic(BrokerService.java:1516)
  	at org.apache.pulsar.broker.service.BrokerService.lambda$getTopic$28(BrokerService.java:1059)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap$Section.put(ConcurrentOpenHashMap.java:438)
  	at org.apache.pulsar.common.util.collections.ConcurrentOpenHashMap.computeIfAbsent(ConcurrentOpenHashMap.java:257)
  	at org.apache.pulsar.broker.service.BrokerService.getTopic(BrokerService.java:1047)
  	at org.apache.pulsar.broker.service.BrokerService.getTopic(BrokerService.java:1012)
  	at org.apache.pulsar.broker.service.BrokerService.getTopic(BrokerService.java:1007)

I think I've seen that before when investigating a flaky test.

@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/Fix-unload-get-stuck branch from 875d933 to 089befc Compare October 11, 2023 03:27
@Demogorgon314 Demogorgon314 marked this pull request as ready for review October 11, 2023 03:36
@lhotari
Copy link
Member

lhotari commented Oct 11, 2023

How about this test failure in the CI builds for this PR:

  Error:  Tests run: 108, Failures: 1, Errors: 0, Skipped: 107, Time elapsed: 21.518 s <<< FAILURE! - in org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImplTest
  Error:  org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImplTest.testGetOwnedServiceUnitsAndGetOwnedNamespaceStatus  Time elapsed: 0.101 s  <<< FAILURE!
  java.lang.AssertionError: expected [3] but found [4]
  	at org.testng.Assert.fail(Assert.java:110)
  	at org.testng.Assert.failNotEquals(Assert.java:1577)
  	at org.testng.Assert.assertEqualsImpl(Assert.java:149)
  	at org.testng.Assert.assertEquals(Assert.java:131)
  	at org.testng.Assert.assertEquals(Assert.java:1418)
  	at org.testng.Assert.assertEquals(Assert.java:1382)
  	at org.testng.Assert.assertEquals(Assert.java:1428)
  	at org.apache.pulsar.broker.loadbalance.extensions.ExtensibleLoadManagerImplTest.testGetOwnedServiceUnitsAndGetOwnedNamespaceStatus(ExtensibleLoadManagerImplTest.java:1124)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
  	at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  	at java.base/java.lang.reflect.Method.invoke(Method.java:568)
  	at org.testng.internal.invokers.MethodInvocationHelper.invokeMethod(MethodInvocationHelper.java:139)
  	at org.testng.internal.invokers.InvokeMethodRunnable.runOne(InvokeMethodRunnable.java:47)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:76)
  	at org.testng.internal.invokers.InvokeMethodRunnable.call(InvokeMethodRunnable.java:11)
  	at java.base/java.util.concurrent.FutureTask.run(FutureTask.java:264)
  	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
  	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
  	at java.base/java.lang.Thread.run(Thread.java:833)

@Demogorgon314 Could you please check it? Is it flaky?

@Demogorgon314
Copy link
Member Author

@lhotari Yes, I pushed a commit to fix this issue.

@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/Fix-unload-get-stuck branch from 777254e to 0ef57bb Compare October 14, 2023 15:41
@Demogorgon314 Demogorgon314 force-pushed the Demogorgon314/Fix-unload-get-stuck branch from c65e34c to 7da7057 Compare October 16, 2023 12:48
@Technoboy- Technoboy- merged commit ecd40e4 into apache:master Oct 19, 2023
Demogorgon314 added a commit that referenced this pull request Oct 19, 2023
nikhil-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
srinath-ctds pushed a commit to datastax/pulsar that referenced this pull request Dec 20, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky-test: ExtensibleLoadManagerImplTest.initializeState
5 participants