Skip to content

Conversation

@GeorgeJahad
Copy link
Contributor

@GeorgeJahad GeorgeJahad commented Apr 19, 2023

What changes were proposed in this pull request?

This test was flaky because one thread trys to read the log output from another thread before the log has been generated.

I confirmed this by hitting a breakpoint on the assertion failure, then looking at the log and seeing that it had subsequently been updated by another thread.

The fix is to poll the log for a few seconds until it is updated to the expected contents.

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8355

How was this patch tested?

Without the fix, I was able to reproduce the problem by running the test class 5 times repeatedly in intellij.

With the fix, I was unable to reproduce the problem after running it in intellij 50 times.

I also ran the CI 25 times without this failing.

@adoroszlai
Copy link
Contributor

Thanks @GeorgeJahad for the patch. Checkstyle reports some indentation problem:

hadoop-ozone/integration-test/src/test/java/org/apache/hadoop/ozone/om/TestOMRatisSnapshots.java
 679: 'method def' child has incorrect indentation level 6, expected level should be 4.
 680: 'block' child has incorrect indentation level 10, expected level should be 8.

@Flaky can be removed completely.

Other than that it looks good so far.

@adoroszlai
Copy link
Contributor

@GeorgeJahad can you please share the link to repeated CI runs?

@GeorgeJahad
Copy link
Contributor Author

GeorgeJahad commented Apr 19, 2023

@adoroszlai Why do you ask? Are you seeing a problem somewhere?

The links are here, https://github.com/GeorgeJahad/ozone/actions/workflows/post-commit.yml , labeled "trigger new CI check 1-25"

Some of those runs failed for other reasons and the rest I canceled once the om tests passed, (because I didn't see the need to wait), but in all of them the om integration tests passed.

@GeorgeJahad
Copy link
Contributor Author

GeorgeJahad commented Apr 19, 2023

In this one: "trigger new CI check 26" https://github.com/GeorgeJahad/ozone/actions/runs/4740249336 the om integration test failed for a different reason but the TestOMRatisSnapshots test did pass.

Copy link
Contributor

@adoroszlai adoroszlai left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do you ask? Are you seeing a problem somewhere?

No, but I think it's useful to document it.

Anyway, I triggered a repeated run with the patch, and it passed in all runs (except one split was cancelled at the very start).

(BTW, when running repeated tests in CI, disabling unnecessary checks helps save time, example.)

@GeorgeJahad
Copy link
Contributor Author

GeorgeJahad commented Apr 20, 2023

Thanks @adoroszlai
For the record these were my runs. (The om integration tests including the TestOmRatisSnapshot class, passed on each.):

https://github.com/GeorgeJahad/ozone/actions/runs/4740037723
https://github.com/GeorgeJahad/ozone/actions/runs/4739882033
https://github.com/GeorgeJahad/ozone/actions/runs/4739688108
https://github.com/GeorgeJahad/ozone/actions/runs/4739541532
https://github.com/GeorgeJahad/ozone/actions/runs/4739375783
https://github.com/GeorgeJahad/ozone/actions/runs/4739240425
https://github.com/GeorgeJahad/ozone/actions/runs/4739066965
https://github.com/GeorgeJahad/ozone/actions/runs/4738911442
https://github.com/GeorgeJahad/ozone/actions/runs/4738729784
https://github.com/GeorgeJahad/ozone/actions/runs/4738570900
https://github.com/GeorgeJahad/ozone/actions/runs/4738393612
https://github.com/GeorgeJahad/ozone/actions/runs/4738223920
https://github.com/GeorgeJahad/ozone/actions/runs/4738034314
https://github.com/GeorgeJahad/ozone/actions/runs/4737887420
https://github.com/GeorgeJahad/ozone/actions/runs/4737696448
https://github.com/GeorgeJahad/ozone/actions/runs/4737534048
https://github.com/GeorgeJahad/ozone/actions/runs/4737320242
https://github.com/GeorgeJahad/ozone/actions/runs/4737131814
https://github.com/GeorgeJahad/ozone/actions/runs/4736904674
https://github.com/GeorgeJahad/ozone/actions/runs/4736904216
https://github.com/GeorgeJahad/ozone/actions/runs/4736378602
https://github.com/GeorgeJahad/ozone/actions/runs/4736064711
https://github.com/GeorgeJahad/ozone/actions/runs/4735675034
https://github.com/GeorgeJahad/ozone/actions/runs/4735250360
https://github.com/GeorgeJahad/ozone/actions/runs/4734973217

@adoroszlai adoroszlai merged commit 9d066ea into apache:master Apr 20, 2023
errose28 added a commit to errose28/ozone that referenced this pull request Apr 20, 2023
* master: (440 commits)
  HDDS-8445. Move PlacementPolicy back to SCM (apache#4588)
  HDDS-8335. ReplicationManager: EC Mis and Under replication handlers should handle overloaded exceptions (apache#4593)
  HDDS-8355. Intermittent failure in TestOMRatisSnapshots#testInstallSnapshot (apache#4592)
  HDDS-8444. Increase timeout of CI build (apache#4586)
  HDDS-8446. Selective checks: handle change in ci.yaml (apache#4587)
  HDDS-8440. Ozone Manager crashed with ClassCastException when deleting FSO bucket. (apache#4582)
  HDDS-7309. Enable by default GRPC between S3G and OM (apache#3820)
  HDDS-8458. Mark TestBlockDeletion#testBlockDeletion as flaky
  HDDS-8385. Ozone can't process snapshot when service UID > 2097151 (apache#4580)
  HDDS-8424: Preserve legacy bucket getKeyInfo behavior (apache#4576)
  HDDS-8453. Mark TestDirectoryDeletingServiceWithFSO#testDirDeletedTableCleanUpForSnapshot as flaky
  HDDS-8137. [Snapshot] SnapDiff to use tombstone entries in SST files (apache#4376)
  HDDS-8270. Measure checkAccess latency for Ozone objects (apache#4467)
  HDDS-8109. Seperate Ratis and EC MisReplication Handling (apache#4577)
  HDDS-8429. Checkpoint is not closed properly in OMDBCheckpointServlet (apache#4575)
  HDDS-8253. Set ozone.metadata.dirs to temporary dir if not defined in S3 Gateway (apache#4455)
  HDDS-8400. Expose rocksdb last sequence number through metrics (apache#4557)
  HDDS-8333. ReplicationManager: Allow partial EC reconstruction if insufficient nodes available (apache#4579)
  HDDS-8147. Introduce latency metrics for S3 Gateway operations (apache#4383)
  HDDS-7908. Support OM Metadata operation Generator in `Ozone freon` (apache#4251)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants