HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026

hemantk-12 · 2024-01-18T05:52:27Z

What changes were proposed in this pull request?

There is a deadlock between checkpointing creation for Bootstrapping and RocksDBCheckpointDiffer#pruneSstFiles.

Bootstrapping takes the BootstrapStateHandler#lock before checkpointing creation and then takes lock on RocksDBCheckpointDiffer instance to unpause the compaction thread/s. On the other hand RocksDBCheckpointDiffer#pruneSstFiles is synchronized function which first takes lock on RocksDBCheckpointDiffer instance and takes BootstrapStateHandler#lock before removing any files.
Hence both threads, checkpointing creation for Bootstrapping and pruneSstFiles background task, are blocking each other.

In this change, synchronized block to notify RocksDBCheckpointDiffer to unpause the compaction has been removed. Now heckpointing creation for Bootstrapping and pruneSstFiles background threads won't block each other.

To fix the deadlock, removed the nested locks in RocksDBCheckpointDiffer#pruneSstFiles. RocksDBCheckpointDiffer's instance lock is only needed to know which sstFiles can be removed. Once that's know, actual removal has to be inside BootstrapStateHandler#lock.

What is the link to the Apache JIRA

HDDS-9486

How was this patch tested?

Ran test 10x10 times (run) on a059ae6 and it passed.

1. Ran the test 10x10 times on local branch: https://github.com/hemantk-12/ozone/actions/runs/7575604154. One test failed and couple of tests got stuck. Test failed with as issue same as in HDDS-9455.
2. Fixed HDDS-9455 issue by replacing RandomStringUtils with counter and ran the test again 25x10 times: https://github.com/hemantk-12/ozone/actions/runs/7576736447. Didn't see any failure but few tests are still running.

adoroszlai

Thanks @hemantk-12 for the patch.

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java

hemantk-12 · 2024-01-18T19:17:57Z

@mladjan-gadzic please review it.

…BackgroundServices to eliminate the key collision

mladjan-gadzic · 2024-01-19T19:06:33Z

@hemantk-12 thanks for the patch!

I tried running this locally - it passed for about 40 times and then there was java.util.concurrent.TimeoutException. It might be an issue with resources available on my machine because after that there was java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Direct buffer memory - not sure.

However, I had a look at CI that you've run https://github.com/hemantk-12/ozone/actions/runs/7576736447, and it seems that few of those timed out. Might be worth checking out why.

hemantk-12 · 2024-01-19T19:12:55Z

Thanks @mladjan-gadzic for the review.

However, I had a look at CI that you've run https://github.com/hemantk-12/ozone/actions/runs/7576736447, and it seems that few of those timed out. Might be worth checking out why.

I'm looking into this if it is still a deadlock issue or something else.

It might be an issue with resources available on my machine because after that there was java.util.concurrent.CompletionException: java.lang.OutOfMemoryError: Direct buffer memory.

I faced the same error on my local machine.

hemantk-12 · 2024-01-19T23:14:23Z

I, with Attila's help, run test again and was able to get thread dump of a failure iteration. https://github.com/hemantk-12/ozone/actions/runs/7588632499/job/20671595413

Key creation is getting stuck.
Test thread:

main" 
   java.lang.Thread.State: WAITING
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:502)
        at org.apache.hadoop.util.concurrent.AsyncGet$Util.wait(AsyncGet.java:65)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1555)
        at org.apache.hadoop.ipc.Client.call(Client.java:1513)
        at org.apache.hadoop.ipc.Client.call(Client.java:1410)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:250)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
        at com.sun.proxy.$Proxy54.submitRequest(Unknown Source)
        at sun.reflect.GeneratedMethodAccessor45.invoke(Unknown Source)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
        at org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
        at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
        at com.sun.proxy.$Proxy54.submitRequest(Unknown Source)
        at org.apache.hadoop.ozone.om.protocolPB.Hadoop3OmTransport.submitRequest(Hadoop3OmTransport.java:80)
        at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.submitRequest(OzoneManagerProtocolClientSideTranslatorPB.java:330)
        at org.apache.hadoop.ozone.om.protocolPB.OzoneManagerProtocolClientSideTranslatorPB.openKey(OzoneManagerProtocolClientSideTranslatorPB.java:729)
        at org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:1363)
        at org.apache.hadoop.ozone.client.rpc.RpcClient.createKey(RpcClient.java:1322)
        at org.apache.hadoop.ozone.client.OzoneBucket.createKey(OzoneBucket.java:446)
        at org.apache.hadoop.ozone.om.TestOzoneManagerHA.createKey(TestOzoneManagerHA.java:226)
        at org.apache.hadoop.ozone.om.TestSnapshotBackgroundServices.writeKeys(TestSnapshotBackgroundServices.java:653)
        at org.apache.hadoop.ozone.om.TestSnapshotBackgroundServices.createSnapshotsEachWithNewKeys(TestSnapshotBackgroundServices.java:361)
        at org.apache.hadoop.ozone.om.TestSnapshotBackgroundServices.testSnapshotAndKeyDeletionBackgroundServices(TestSnapshotBackgroundServices.java:211)

Request handler thread:

"IPC Server handler 7 on default port 15133" 
   java.lang.Thread.State: WAITING
        at sun.misc.Unsafe.park(Native Method)
        at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
        at java.util.concurrent.CompletableFuture$Signaller.block(CompletableFuture.java:1707)
        at java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3323)
        at java.util.concurrent.CompletableFuture.waitingGet(CompletableFuture.java:1742)
        at java.util.concurrent.CompletableFuture.get(CompletableFuture.java:1908)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatisImpl(OzoneManagerRatisServer.java:320)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.lambda$3(OzoneManagerRatisServer.java:292)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer$$Lambda$1508/1834248801.get(Unknown Source)
        at org.apache.hadoop.util.MetricUtil.captureLatencyNs(MetricUtil.java:45)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequestToRatis(OzoneManagerRatisServer.java:290)
        at org.apache.hadoop.ozone.om.ratis.OzoneManagerRatisServer.submitRequest(OzoneManagerRatisServer.java:262)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestToRatis(OzoneManagerProtocolServerSideTranslatorPB.java:250)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.internalProcessRequest(OzoneManagerProtocolServerSideTranslatorPB.java:229)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:165)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$1498/1352936337.apply(Unknown Source)
        at org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:89)
        at org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:156)
        at org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server.processCall(ProtobufRpcEngine.java:484)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:595)
        at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1227)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1094)
        at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1017)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:422)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3048)

Need to check more if it is Ozone issue or Ratis.

Full thread dump: 2024-01-19T20-26-24_615-jvmRun1.dump.txt

adoroszlai · 2024-01-20T09:32:53Z

@hemantk-12 fork timeout in TestOMRatisSnapshots was resolved by #5412. TestSnapshotBackgroundServices may have the same problem (among others). It passed 10x10 by applying the same logic: adoroszlai@c20d3df

We need to consider whether the problem is specific to integration tests. If it affects production usage, the default value may need to be changed in the config.

CC @szetszwo

mladjan-gadzic · 2024-01-20T09:54:54Z

AFAIK that value is set to 0ms or 1ms in production. Ideally, tests should match that.

hemantk-12 · 2024-01-22T17:48:07Z

AFAIK that value is set to 0ms or 1ms in production. Ideally, tests should match that.

I see default value is 0 code. I don't see any override in the code. Don't know if it means no wait? or indefinite wait?

For now, I fixed the integration test by overriding the setting as in adoroszlai@c20d3df.

Created a follow up jira: HDDS-10185 to look into OzoneManagerRatisServerConfig#LogAppenderWaitTimeMin in more details.

szetszwo · 2024-01-22T21:23:26Z

I see default value is 0 code. I don't see any override in the code. Don't know if it means no wait? or indefinite wait?

0 means no wait.

There is a deadlock between checkpointing creation for Bootstrapping and RocksDBCheckpointDiffer#pruneSstFiles.

If there is a deadlock, changing wait-time probably is only changing the probability but not really fixing the problem.

hemantk-12 · 2024-01-22T21:41:25Z

If there is a deadlock, changing wait-time probably is only changing the probability but not really fixing the problem.

Deadlock is fixed by 19ce672 but we are still seeing random long waits which eventually timeout as mentioned in previous comment: #6026 (comment).

Attila pointed out that similar issue was fixed in #5412. So I made the same change in TestSnapshotBackgroundServices to fix the long pause/timeout.

I'm unsure how this config (=0) is causing the timeout. Hence created a follow up task, HDDS-10185, to investigate the issue if it is a test code issue or source code. And we can merge this PR with deadlock fix independently.

adoroszlai · 2024-01-27T13:58:46Z

Thanks @hemantk-12 for the patch, @mladjan-gadzic for the review.

…shotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)

… in TestSnapshotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)

hemantk-12 added the snapshot https://issues.apache.org/jira/browse/HDDS-6517 label Jan 18, 2024

hemantk-12 force-pushed the HDDS-9486 branch from 662ee63 to 5ea7ae0 Compare January 18, 2024 05:58

adoroszlai reviewed Jan 18, 2024

View reviewed changes

hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java Outdated Show resolved Hide resolved

hemantk-12 force-pushed the HDDS-9486 branch from 09bbc58 to ff214c3 Compare January 18, 2024 18:23

hemantk-12 added 4 commits January 18, 2024 10:38

Removed synchronized block to notify compaction threads

22fa8d0

Flattened the nested locks in pruneSstFiles to remove the deadlock.

19ce672

Fixed checkstyle

7b34cfd

Fixed findbugs

881c84f

hemantk-12 force-pushed the HDDS-9486 branch from ff214c3 to 881c84f Compare January 18, 2024 18:38

hemantk-12 requested review from GeorgeJahad and smengcl January 18, 2024 19:16

hemantk-12 marked this pull request as ready for review January 18, 2024 19:17

Replaced RandomStringUtils with AtomicInteger counter in TestSnapshot…

18fc1ec

…BackgroundServices to eliminate the key collision

Fixed the test timeout by setting LogAppenderWaitTimeMin for Ratis

a059ae6

hemantk-12 added the test label Jan 23, 2024

adoroszlai removed the test label Jan 27, 2024

adoroszlai changed the title ~~HDDS-9486. Fix for intermittent fork timeout in TestSnapshotBackgroundServices~~ HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices Jan 27, 2024

adoroszlai approved these changes Jan 27, 2024

View reviewed changes

adoroszlai merged commit 2c0580d into apache:master Jan 27, 2024

adoroszlai pushed a commit to adoroszlai/ozone that referenced this pull request Feb 22, 2024

HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnap…

b6a1226

…shotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)

ptlrs pushed a commit to ptlrs/ozone that referenced this pull request Mar 8, 2025

CDPD-62620. HDDS-9486. Fix deadlock causing intermittent fork timeout…

3871058

… in TestSnapshotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026

HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026

Uh oh!

hemantk-12 commented Jan 18, 2024 •

edited

Loading

Uh oh!

adoroszlai left a comment

Uh oh!

Uh oh!

hemantk-12 commented Jan 18, 2024

Uh oh!

mladjan-gadzic commented Jan 19, 2024

Uh oh!

hemantk-12 commented Jan 19, 2024 •

edited

Loading

Uh oh!

hemantk-12 commented Jan 19, 2024 •

edited

Loading

Uh oh!

adoroszlai commented Jan 20, 2024

Uh oh!

mladjan-gadzic commented Jan 20, 2024

Uh oh!

hemantk-12 commented Jan 22, 2024 •

edited

Loading

Uh oh!

szetszwo commented Jan 22, 2024

Uh oh!

hemantk-12 commented Jan 22, 2024 •

edited

Loading

Uh oh!

adoroszlai commented Jan 27, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026

HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026

Uh oh!

Conversation

hemantk-12 commented Jan 18, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What changes were proposed in this pull request?

What is the link to the Apache JIRA

How was this patch tested?

Uh oh!

adoroszlai left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

hemantk-12 commented Jan 18, 2024

Uh oh!

mladjan-gadzic commented Jan 19, 2024

Uh oh!

hemantk-12 commented Jan 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

hemantk-12 commented Jan 19, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adoroszlai commented Jan 20, 2024

Uh oh!

mladjan-gadzic commented Jan 20, 2024

Uh oh!

hemantk-12 commented Jan 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

szetszwo commented Jan 22, 2024

Uh oh!

hemantk-12 commented Jan 22, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

adoroszlai commented Jan 27, 2024

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

hemantk-12 commented Jan 18, 2024 •

edited

Loading

hemantk-12 commented Jan 19, 2024 •

edited

Loading

hemantk-12 commented Jan 19, 2024 •

edited

Loading

hemantk-12 commented Jan 22, 2024 •

edited

Loading

hemantk-12 commented Jan 22, 2024 •

edited

Loading