@xBis7 xBis7 commented Jul 4, 2023

What changes were proposed in this pull request?

This patch fixes issues with three test methods in TestOMRatisSnapshots. The class timed out in about 30% of runs because of its configuration setup.

The issue comes from the config key ozone.om.ratis.snapshot.auto.trigger.threshold, which sets the number of unappended logs allowed before a snapshot installation is triggered. All tests in this class depend on frequent Ratis snapshot installations.

#4770 increased the number of writes performed by testInstallSnapshot(). Combined with the frequent Ratis snapshot installations, this led to a system freeze, and therefore a timeout, in about 30% of runs. To fix the issue, testInstallSnapshot() alone now runs with a higher threshold.
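A minimal sketch of choosing the threshold per test method. Only the config key and the test method names come from this description. The class `SnapshotThresholdConfig`, the `configFor` helper, the plain `Map` standing in for the real configuration object, and both numeric values are hypothetical and are not the code in this patch.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for per-test configuration; not the actual test code. */
public class SnapshotThresholdConfig {
  // Key name as given in the PR description.
  public static final String THRESHOLD_KEY =
      "ozone.om.ratis.snapshot.auto.trigger.threshold";

  // Illustrative values only; the PR description does not state the real numbers.
  public static final long LOW_THRESHOLD = 50L;
  public static final long HIGH_THRESHOLD = 5000L;

  /** Builds the settings for one test method, selected by its name. */
  public static Map<String, String> configFor(String testMethodName) {
    Map<String, String> conf = new HashMap<>();
    // Only the write-heavy test gets the higher threshold; the other tests
    // keep the low value because they rely on frequent snapshot installs.
    long threshold = "testInstallSnapshot".equals(testMethodName)
        ? HIGH_THRESHOLD
        : LOW_THRESHOLD;
    conf.put(THRESHOLD_KEY, Long.toString(threshold));
    return conf;
  }

  public static void main(String[] args) {
    System.out.println(configFor("testInstallSnapshot"));
    System.out.println(configFor("testInstallIncrementalSnapshot"));
  }
}
```

In a JUnit 5 test the method name would typically come from a `TestInfo` parameter in the setup method, which is one way to keep a single setup path while varying one key.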

Furthermore, testInstallIncrementalSnapshot and testInstallIncrementalSnapshotWithFailure install a Ratis snapshot on the follower and then check that the follower's key table contains the keys from the Ratis snapshot. Sometimes this check takes place too early, which makes the methods flaky. Adding a wait check around the key table read fixes the issue.
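The wait check can be sketched as a polling helper. This is a self-contained stand-in, assuming nothing about the helper actually used in the patch: the class `WaitCheck`, its `waitFor` signature, and the in-memory set that plays the role of the follower's key table are all hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.TimeoutException;
import java.util.function.BooleanSupplier;

/** Hypothetical polling helper; a stand-in for a test utility, not Ozone code. */
public class WaitCheck {

  /**
   * Re-evaluates the check every intervalMillis until it returns true,
   * or throws TimeoutException once timeoutMillis has elapsed.
   */
  public static void waitFor(BooleanSupplier check, long intervalMillis,
      long timeoutMillis) throws TimeoutException, InterruptedException {
    long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
    while (!check.getAsBoolean()) {
      if (System.nanoTime() - deadline >= 0) {
        throw new TimeoutException(
            "Condition not met within " + timeoutMillis + " ms");
      }
      Thread.sleep(intervalMillis);
    }
  }

  public static void main(String[] args) throws Exception {
    // Simulated follower key table that is filled in asynchronously,
    // the way keys appear some time after a snapshot is installed.
    Set<String> followerKeyTable = ConcurrentHashMap.newKeySet();
    Thread installer = new Thread(() -> {
      try {
        Thread.sleep(100);
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
      }
      followerKeyTable.add("key-1");
    });
    installer.start();

    // Asserting immediately could fail; polling waits until the key is readable.
    waitFor(() -> followerKeyTable.contains("key-1"), 10, 5000);
    System.out.println("key-1 present: " + followerKeyTable.contains("key-1"));
    installer.join();
  }
}
```

The timeout bounds the wait, so a genuinely missing key still fails the test instead of hanging the fork.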

There is an existing issue with testInstallIncrementalSnapshotWithFailure that hasn't been fixed. At some point the test expects a new checkpoint to have been generated in the leader OM and checks that the metrics are updated accordingly. Rarely, this doesn't happen and the test fails. I added the @Flaky annotation to the method until I address the issue in HDDS-8876.

This class is still flaky but I'm creating this PR to remove the @Disabled annotation and get this test class running along with the rest of the CI.

Should we add

@Flaky("HDDS-8876")

above the class as well?

What is the link to the Apache JIRA

https://issues.apache.org/jira/browse/HDDS-8880

https://issues.apache.org/jira/browse/HDDS-8952

How was this patch tested?

CI on my fork: https://github.com/xBis7/ozone/actions/runs/5452838798

CI running TestOMRatisSnapshots in 100 iterations

https://github.com/xBis7/ozone/actions/runs/5447212761

https://github.com/xBis7/ozone/actions/runs/5447866879

https://github.com/xBis7/ozone/actions/runs/5447868491

We are getting a fork timeout in 1/100 or 2/100 runs.

https://github.com/xBis7/ozone/actions/runs/5447868491/jobs/9910384871#step:5:4472

xBis7 commented Jul 4, 2023

@adoroszlai @GeorgeJahad Can you please take a look at this PR?

@adoroszlai adoroszlai changed the title from "HDDS-8880. [disabled] Intermittent fork timeout in TestOMRatisSnapshots" to "HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots" Jul 4, 2023

@adoroszlai adoroszlai left a comment

Thanks @xBis7 for the patch and extensive testing. 👍

xBis7 commented Jul 4, 2023

@adoroszlai Thanks for the review.

xBis7 commented Jul 4, 2023

@adoroszlai Should we mark the class as flaky? Or adding the annotation to a method makes the whole class run under the flaky CI?

@adoroszlai

> adding the annotation to a method makes the whole class run under the flaky CI?

Only the methods tagged as @Flaky are run in the flaky split.

> Should we mark the class as flaky?

The class should be tagged if all (or most) methods may fail intermittently (and the problem is such that repeated attempts may succeed).
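The tagging rule can be illustrated with a small reflection sketch. Everything here is hypothetical: the `Flaky` annotation below is a local stand-in rather than the project's annotation, and `flakyMethods` only mimics the selection described above. It additionally assumes that a class-level tag applies to every method in the class, as JUnit 5 tags do.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical illustration of method-level versus class-level tagging. */
public class FlakySplitSketch {

  /** Local stand-in annotation; the value names the tracking issue. */
  @Retention(RetentionPolicy.RUNTIME)
  @Target({ElementType.TYPE, ElementType.METHOD})
  public @interface Flaky {
    String value();
  }

  /** Example class where a single method is tagged. */
  public static class ExampleTests {
    public void testInstallSnapshot() { }

    @Flaky("HDDS-8876")
    public void testInstallIncrementalSnapshotWithFailure() { }
  }

  /** Returns the sorted names of methods that would land in the flaky split. */
  public static List<String> flakyMethods(Class<?> testClass) {
    // Assumption: a tag on the class covers all of its methods.
    boolean classTagged = testClass.isAnnotationPresent(Flaky.class);
    List<String> names = new ArrayList<>();
    for (Method m : testClass.getDeclaredMethods()) {
      if (m.isSynthetic()) {
        continue; // skip compiler- or tool-generated methods
      }
      if (classTagged || m.isAnnotationPresent(Flaky.class)) {
        names.add(m.getName());
      }
    }
    Collections.sort(names);
    return names;
  }

  public static void main(String[] args) {
    System.out.println(flakyMethods(ExampleTests.class));
  }
}
```

With only the one method tagged, the remaining methods stay in the regular run, which matches the choice made for this class.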

@adoroszlai adoroszlai merged commit 857491c into apache:master Jul 5, 2023

xBis7 commented Jul 5, 2023

@adoroszlai Thanks for all the help and the info.

> The class should be tagged if all (or most) methods may fail intermittently (and the problem is such that repeated attempts may succeed).

That's not the case here; we might get 1 or 2 failures every 100 runs. I'll try to address that in HDDS-8876.

@xBis7 xBis7 deleted the HDDS-8880 branch July 5, 2023 21:16
errose28 added a commit to errose28/ozone that referenced this pull request Jul 10, 2023
* master: (36 commits)
  HDDS-8990. Intermittent timeout waiting on datanode4 9856 to become available (apache#5039)
  Revert "HDDS-7750. Incorrect WRITE ACL check. (apache#4992)"
  HDDS-7750. Incorrect WRITE ACL check. (apache#4992)
  HDDS-8985. Intermittent timeout exiting safe mode in HA secure tests (apache#5033)
  HDDS-8593. Add RootCARotationPoller to CertClient (apache#5030)
  HDDS-7645. Kubernetes check should fail fast if cluster cannot start (apache#5028)
  HDDS-8981. TestRootedOzoneFileSystem runs out of disk space (apache#5029)
  HDDS-8592. Fetch and save all root certificates during service's certificate rotation. (apache#5025)
  HDDS-8981. Disable TestRootedOzoneFileSystem#testSafeMode
  HDDS-8591. Create scheduler to check for new root ca certificates (apache#4961)
  HDDS-8979. error validating kustomization.yaml (apache#5024)
  HDDS-8973. Ozone SCM HA should not allocates duplicate IDs when transferring leadership (apache#5018)
  HDDS-8970. Snapshot Diff should return path relative to bucket root (apache#5015)
  HDDS-8975. Clarify SCM HA auto-bootstrap doc (apache#5021)
  HDDS-8689. Rotate Root CA and Sub CA in SCM. (apache#4943)
  HDDS-8436. Support setSafeMode(), isFileClosed() FileSystem API (apache#4825)
  HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots (apache#5022)
  HDDS-8962. Ensure docker env is stopped (apache#5011)
  HDDS-7794. [snapshot] SnapshotDiff should throw better error messages for exception handling (apache#5007)
  HDDS-7922. [FSO] S3G folder support fso layout filestatus s3A compatibility (apache#4448)
  ...