HDDS-8355. Intermittent failure in TestOMRatisSnapshots#testInstallSnapshot #4592
Thanks @GeorgeJahad for the patch. Checkstyle reports some indentation problem:
Other than that it looks good so far.
@GeorgeJahad can you please share the link to repeated CI runs?
@adoroszlai Why do you ask? Are you seeing a problem somewhere? The links are at https://github.com/GeorgeJahad/ozone/actions/workflows/post-commit.yml (labeled "trigger new CI check 1-25"). Some of those runs failed for other reasons, and I canceled the rest once the om tests passed (because I didn't see the need to wait), but in all of them the om integration tests passed.
In this one, "trigger new CI check 26" (https://github.com/GeorgeJahad/ozone/actions/runs/4740249336), the om integration test failed for a different reason, but the TestOMRatisSnapshots test did pass.
adoroszlai
left a comment
Why do you ask? Are you seeing a problem somewhere?
No, but I think it's useful to document it.
Anyway, I triggered a repeated run with the patch, and it passed in all runs (except one split was cancelled at the very start).
(BTW, when running repeated tests in CI, disabling unnecessary checks helps save time, example.)
* master: (440 commits)
  HDDS-8445. Move PlacementPolicy back to SCM (apache#4588)
  HDDS-8335. ReplicationManager: EC Mis and Under replication handlers should handle overloaded exceptions (apache#4593)
  HDDS-8355. Intermittent failure in TestOMRatisSnapshots#testInstallSnapshot (apache#4592)
  HDDS-8444. Increase timeout of CI build (apache#4586)
  HDDS-8446. Selective checks: handle change in ci.yaml (apache#4587)
  HDDS-8440. Ozone Manager crashed with ClassCastException when deleting FSO bucket. (apache#4582)
  HDDS-7309. Enable by default GRPC between S3G and OM (apache#3820)
  HDDS-8458. Mark TestBlockDeletion#testBlockDeletion as flaky
  HDDS-8385. Ozone can't process snapshot when service UID > 2097151 (apache#4580)
  HDDS-8424: Preserve legacy bucket getKeyInfo behavior (apache#4576)
  HDDS-8453. Mark TestDirectoryDeletingServiceWithFSO#testDirDeletedTableCleanUpForSnapshot as flaky
  HDDS-8137. [Snapshot] SnapDiff to use tombstone entries in SST files (apache#4376)
  HDDS-8270. Measure checkAccess latency for Ozone objects (apache#4467)
  HDDS-8109. Seperate Ratis and EC MisReplication Handling (apache#4577)
  HDDS-8429. Checkpoint is not closed properly in OMDBCheckpointServlet (apache#4575)
  HDDS-8253. Set ozone.metadata.dirs to temporary dir if not defined in S3 Gateway (apache#4455)
  HDDS-8400. Expose rocksdb last sequence number through metrics (apache#4557)
  HDDS-8333. ReplicationManager: Allow partial EC reconstruction if insufficient nodes available (apache#4579)
  HDDS-8147. Introduce latency metrics for S3 Gateway operations (apache#4383)
  HDDS-7908. Support OM Metadata operation Generator in `Ozone freon` (apache#4251)
  ...
What changes were proposed in this pull request?
This test was flaky because one thread tries to read the log output from another thread before the log has been generated.
I confirmed this by hitting a breakpoint on the assertion failure, then looking at the log and seeing that it had subsequently been updated by another thread.
The fix is to poll the log for a few seconds until it is updated to the expected contents.
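For readers unfamiliar with the pattern, the polling approach can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code from this patch: the class `LogPollingSketch`, the helper `waitFor`, the `StringBuffer` standing in for captured log output, and the message text are all hypothetical.

```java
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

public class LogPollingSketch {

  // Hypothetical helper: re-check the condition every intervalMs
  // until it holds, or fail once timeoutMs has elapsed.
  public static void waitFor(BooleanSupplier check, long intervalMs,
      long timeoutMs) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "condition not met within " + timeoutMs + " ms");
      }
      Thread.sleep(intervalMs);
    }
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for captured log output (StringBuffer is synchronized).
    StringBuffer log = new StringBuffer();

    // Simulates the other thread, which writes its log line late.
    Thread writer = new Thread(() -> {
      try {
        Thread.sleep(200);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      log.append("expected log line");
    });
    writer.start();

    // Asserting on the log immediately would race with the writer.
    // Polling waits until the line appears or the timeout expires.
    waitFor(() -> log.toString().contains("expected log line"), 100, 5000);
    writer.join();
    System.out.println("log contains expected message");
  }
}
```

The timeout keeps a genuinely missing log line from hanging the test: the helper throws instead of waiting forever, so a real regression still surfaces as a failure.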
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8355
How was this patch tested?
Without the fix, I was able to reproduce the problem by running the test class 5 times in a row in IntelliJ.
With the fix, I was unable to reproduce the problem after running it in IntelliJ 50 times.
I also ran the CI 25 times without this test failing.