HDDS-8880. Intermittent fork timeout in TestOMRatisSnapshots #5022
What changes were proposed in this pull request?
This patch fixes issues with three methods in TestOMRatisSnapshots. About 30% of the time, this class was timing out due to its configuration setup. The issue comes from the config key ozone.om.ratis.snapshot.auto.trigger.threshold, which refers to the number of unappended logs before a snapshot installation is triggered. All tests in this class depend on frequent Ratis snapshot installations. #4770 increased the number of writes performed by testInstallSnapshot(), which, combined with the frequent Ratis snapshot installations, led roughly 30% of runs to a system freeze that caused a timeout. To fix the issue, only testInstallSnapshot() now runs configured with a higher threshold.
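As a rough illustration of the per-test threshold (not the actual patch), a test setup built on an OzoneConfiguration could override the key only for testInstallSnapshot(); the threshold values and the helper method name below are purely illustrative:

```java
// Minimal sketch, assuming the test builds its cluster from an OzoneConfiguration.
// The config key name comes from the description above; the constants and the
// confWithThreshold(...) helper are hypothetical.
import org.apache.hadoop.hdds.conf.OzoneConfiguration;

class RatisSnapshotTestSetupSketch {
  // Frequent snapshot installations for most tests in the class.
  private static final long SNAPSHOT_THRESHOLD_DEFAULT = 100L;
  // Higher threshold used only by testInstallSnapshot(), which performs many writes.
  private static final long SNAPSHOT_THRESHOLD_LARGE = 1000L;

  static OzoneConfiguration confWithThreshold(long threshold) {
    OzoneConfiguration conf = new OzoneConfiguration();
    // Number of logs before a Ratis snapshot installation is auto-triggered.
    conf.setLong("ozone.om.ratis.snapshot.auto.trigger.threshold", threshold);
    return conf;
  }
}
```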
Furthermore, testInstallIncrementalSnapshot and testInstallIncrementalSnapshotWithFailure install a Ratis snapshot on the follower and then check that the follower's key table contains the keys from the Ratis snapshot. Sometimes this check takes place sooner than expected, making the methods flaky. Adding a wait check around reading the key table fixes the issue (see the sketch below).
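A minimal sketch of the wait-based check, assuming Hadoop's GenericTestUtils.waitFor; the followerHasKeys(...) helper is hypothetical and stands in for reading the follower OM's key table:

```java
import java.util.List;
import java.util.concurrent.TimeoutException;
import org.apache.hadoop.test.GenericTestUtils;

class FollowerKeyTableCheckSketch {
  static void assertFollowerHasKeys(List<String> expectedKeys)
      throws TimeoutException, InterruptedException {
    // Poll every 100 ms, give up after 30 s: the follower may finish applying
    // the installed Ratis snapshot slightly after the test starts checking.
    GenericTestUtils.waitFor(() -> followerHasKeys(expectedKeys), 100, 30_000);
  }

  private static boolean followerHasKeys(List<String> expectedKeys) {
    // Placeholder: in the real test this reads the follower OM's key table
    // and verifies that every expected key is present.
    return true;
  }
}
```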
There is an existing issue with testInstallIncrementalSnapshotWithFailure that hasn't been fixed. At some point we expect a new checkpoint to be generated on the leader OM, and we check that the metrics are updated accordingly. Rarely, this doesn't happen and the test fails. I added the @Flaky annotation above the method until I address the issue in HDDS-8876.

This class is still flaky, but I'm creating this PR to remove the @Disabled annotation and get this test class running along with the rest of the CI. Should we add @Flaky above the class as well?
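For illustration only, the method-level annotation would look roughly like this; the package path of the project's @Flaky test tag is assumed here:

```java
import org.apache.ozone.test.tag.Flaky;  // assumed package for the project's @Flaky tag
import org.junit.jupiter.api.Test;

class TestOMRatisSnapshotsSketch {

  @Test
  @Flaky("HDDS-8876")  // known remaining flakiness, tracked in the linked JIRA
  void testInstallIncrementalSnapshotWithFailure() {
    // ... test body elided ...
  }
}
```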
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-8880
https://issues.apache.org/jira/browse/HDDS-8952
How was this patch tested?
CI on my fork: https://github.com/xBis7/ozone/actions/runs/5452838798
CI running TestOMRatisSnapshots in 100 iterations:
https://github.com/xBis7/ozone/actions/runs/5447212761
https://github.com/xBis7/ozone/actions/runs/5447866879
https://github.com/xBis7/ozone/actions/runs/5447868491
We are still getting a fork timeout in 1 or 2 out of 100 runs:
https://github.com/xBis7/ozone/actions/runs/5447868491/jobs/9910384871#step:5:4472