-
Notifications
You must be signed in to change notification settings - Fork 589
HDDS-9486. Fix deadlock causing intermittent fork timeout in TestSnapshotBackgroundServices #6026
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
adoroszlai
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @hemantk-12 for the patch.
hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/OMDBCheckpointServlet.java
Outdated
Show resolved
Hide resolved
|
@mladjan-gadzic please review it. |
…BackgroundServices to eliminate the key collision
|
@hemantk-12 thanks for the patch! I tried running this locally - it passed for about 40 times and then there was However, I had a look at CI that you've run https://github.com/hemantk-12/ozone/actions/runs/7576736447, and it seems that few of those timed out. Might be worth checking out why. |
|
Thanks @mladjan-gadzic for the review.
I'm looking into this if it is still a deadlock issue or something else.
I faced the same error on my local machine. |
|
I, with Attila's help, run test again and was able to get thread dump of a failure iteration. https://github.com/hemantk-12/ozone/actions/runs/7588632499/job/20671595413 Key creation is getting stuck. Request handler thread: Need to check more if it is Ozone issue or Ratis. Full thread dump: 2024-01-19T20-26-24_615-jvmRun1.dump.txt |
|
@hemantk-12 fork timeout in We need to consider whether the problem is specific to integration tests. If it affects production usage, the default value may need to be changed in the config. CC @szetszwo |
|
AFAIK that value is set to 0ms or 1ms in production. Ideally, tests should match that. |
I see default value is 0 code. I don't see any override in the code. Don't know if it means no wait? or indefinite wait? For now, I fixed the integration test by overriding the setting as in adoroszlai@c20d3df. Created a follow up jira: HDDS-10185 to look into |
0 means no wait.
If there is a deadlock, changing wait-time probably is only changing the probability but not really fixing the problem. |
Deadlock is fixed by 19ce672 but we are still seeing random long waits which eventually timeout as mentioned in previous comment: #6026 (comment). Attila pointed out that similar issue was fixed in #5412. So I made the same change in TestSnapshotBackgroundServices to fix the long pause/timeout. I'm unsure how this config (=0) is causing the timeout. Hence created a follow up task, HDDS-10185, to investigate the issue if it is a test code issue or source code. And we can merge this PR with deadlock fix independently. |
|
Thanks @hemantk-12 for the patch, @mladjan-gadzic for the review. |
…shotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)
… in TestSnapshotBackgroundServices (apache#6026) (cherry picked from commit 2c0580d)
What changes were proposed in this pull request?
There is a deadlock between checkpointing creation for Bootstrapping and RocksDBCheckpointDiffer#pruneSstFiles.
Bootstrapping takes the BootstrapStateHandler#lock before checkpointing creation and then takes lock on RocksDBCheckpointDiffer instance to unpause the compaction thread/s. On the other hand RocksDBCheckpointDiffer#pruneSstFiles is synchronized function which first takes lock on RocksDBCheckpointDiffer instance and takes BootstrapStateHandler#lock before removing any files.
Hence both threads, checkpointing creation for Bootstrapping and pruneSstFiles background task, are blocking each other.
In this change, synchronized block to notify RocksDBCheckpointDiffer to unpause the compaction has been removed. Now heckpointing creation for Bootstrapping and pruneSstFiles background threads won't block each other.To fix the deadlock, removed the nested locks in RocksDBCheckpointDiffer#pruneSstFiles. RocksDBCheckpointDiffer's instance lock is only needed to know which sstFiles can be removed. Once that's know, actual removal has to be inside BootstrapStateHandler#lock.
What is the link to the Apache JIRA
HDDS-9486
How was this patch tested?
Ran test 10x10 times (run) on a059ae6 and it passed.
1. Ran the test 10x10 times on local branch: https://github.com/hemantk-12/ozone/actions/runs/7575604154. One test failed and couple of tests got stuck. Test failed with as issue same as in HDDS-9455.2. Fixed HDDS-9455 issue by replacing RandomStringUtils with counter and ran the test again 25x10 times: https://github.com/hemantk-12/ozone/actions/runs/7576736447. Didn't see any failure but few tests are still running.