HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots #5022
What changes were proposed in this pull request?
This patch fixes issues with three methods in TestOMRatisSnapshots. About 30% of the time, this class was timing out due to its configuration setup. The issue comes from the config key ozone.om.ratis.snapshot.auto.trigger.threshold, which refers to the number of unappended logs before a snapshot installation is triggered. All tests in this class depend on frequent Ratis snapshot installations. #4770 increased the number of writes performed by testInstallSnapshot(), which, combined with the frequent Ratis snapshot installations, led roughly 30% of runs to a system freeze that caused a timeout. To fix the issue, only testInstallSnapshot() now runs configured with a higher threshold.
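As a rough illustration of the per-test threshold (not the actual patch), a test setup built on an OzoneConfiguration could override the key only for testInstallSnapshot(); the threshold values and the helper method name below are purely illustrative:

```java
// Minimal sketch, assuming the test builds its cluster from an OzoneConfiguration.
// The config key name comes from the description above; the constants and the
// confWithThreshold(...) helper are hypothetical.
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

class RatisSnapshotTestSetupSketch {
  // Frequent snapshot installations for most tests in the class.
  private static final long SNAPSHOT_THRESHOLD_DEFAULT = 100L;
  // Higher threshold used only by testInstallSnapshot(), which performs many writes.
  private static final long SNAPSHOT_THRESHOLD_LARGE = 1000L;

  static OzoneConfiguration confWithThreshold(long threshold) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Number of logs before a Ratis snapshot installation is auto-triggered.
    conf.setLong("ozone.om.ratis.snapshot.auto.trigger.threshold", threshold);
    return conf;
  }
}
```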
Furthermore, testInstallIncrementalSnapshot and testInstallIncrementalSnapshotWithFailure install a Ratis snapshot on the follower and then check that the follower's key table contains the keys from the Ratis snapshot. Sometimes this check takes place sooner than expected, making the methods flaky. Adding a wait check around reading the key table fixes the issue (see the sketch below).
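A minimal sketch of the wait-based check, assuming Hadoop's GenericTestUtils.waitFor; the followerHasKeys(...) helper is hypothetical and stands in for reading the follower OM's key table:

```java
import java.util.List;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.test.GenericTestUtils;

class FollowerKeyTableCheckSketch {
  static void assertFollowerHasKeys(List<String> expectedKeys)
      throws TimeoutException, InterruptedException {
    // Poll every 100 ms, give up after 30 s: the follower may finish applying
    // the installed Ratis snapshot slightly after the test starts checking.
    GenericTestUtils.waitFor(() -> followerHasKeys(expectedKeys), 100, 30_000);
  }

  private static boolean followerHasKeys(List<String> expectedKeys) {
    // Placeholder: in the real test this reads the follower OM's key table
    // and verifies that every expected key is present.
    return true;
  }
}
```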
There is an existing issue with testInstallIncrementalSnapshotWithFailure that hasn't been fixed. At some point we expect a new checkpoint to be generated on the leader OM, and we check that the metrics are updated accordingly. Rarely, this doesn't happen and the test fails. I added the @Flaky annotation above the method until I address the issue in HDDS-8876.

This class is still flaky, but I'm creating this PR to remove the @Disabled annotation and get this test class running along with the rest of the CI. Should we add @Flaky above the class as well?
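For illustration only, the method-level annotation would look roughly like this; the package path of the project's @Flaky test tag is assumed here:

```java
import org.apache.ozone.test.tag.Flaky;  // assumed package for the project's @Flaky tag
import org.junit.jupiter.api.Test;

class TestOMRatisSnapshotsSketch {

  @Test
  @Flaky("HDDS-8876")  // known remaining flakiness, tracked in the linked JIRA
  void testInstallIncrementalSnapshotWithFailure() {
    // ... test body elided ...
  }
}
```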
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8880
https://issues.apache.org/jira/browse/HDDS-8952
How was this patch tested?
CI on my fork: https://github.com/xBis7/ozone/actions/runs/5452838798
CI running TestOMRatisSnapshots in 100 iterations:
https://github.com/xBis7/ozone/actions/runs/5447212761
https://github.com/xBis7/ozone/actions/runs/5447866879
https://github.com/xBis7/ozone/actions/runs/5447868491
We are still getting a fork timeout in 1 or 2 out of 100 runs:
https://github.com/xBis7/ozone/actions/runs/5447868491/jobs/9910384871#step:5:4472