Skip to content

fix(server/etcdserver/api/v3compactor): fix flaky TestPeriodicHourly and others with Retry Mechanism#19748

Merged
ahrtr merged 1 commit intoetcd-io:mainfrom
amosehiguese:test/19499-flaky-testperiodichourly
Apr 29, 2025
Merged

fix(server/etcdserver/api/v3compactor): fix flaky TestPeriodicHourly and others with Retry Mechanism#19748
ahrtr merged 1 commit intoetcd-io:mainfrom
amosehiguese:test/19499-flaky-testperiodichourly

Conversation

@amosehiguese
Copy link
Contributor

This commit fixes the flaky TestPeriodicHourly test (issue #19499) and similar test cases by introducing an exponential backoff retry mechanism for waiting on actions rather than relying on a fixed timeout.

The root cause of the flakiness was timing inconsistencies in CI environments, where actions sometimes take slightly longer than the expected 10ms timeout set when creating a compactable object.

The new waitWithRetry function attempts to get the expected actions multiple times with exponentially increasing wait periods (10ms, 20ms, 40ms, etc.) before ultimately failing.

The solution was verified by reproducing the issue with an artificial delay and stress-testing the fix with 1000 consecutive runs.

The following test functions were updated to use the new retry mechanism:

  • TestPeriodicHourly
  • TestPeriodicMinutes
  • TestPeriodicPause
  • TestPeriodicSkipRevNotChange

Fixes #19499

test

Please read https://github.com/etcd-io/etcd/blob/main/CONTRIBUTING.md#contribution-flow.

…with retry mechanism

This commit fixes the flaky TestPeriodicHourly test (issue etcd-io#19499) and similar
test cases by introducing an exponential backoff retry mechanism for waiting
on actions rather than relying on a fixed timeout.

The root cause of the flakiness was timing inconsistencies in CI environments,
where actions sometimes take slightly longer than the expected 10ms timeout
set when creating a compactable object.

The new waitWithRetry function attempts to get the expected actions multiple
times with exponentially increasing wait periods (10ms, 20ms, 40ms, etc.)
before ultimately failing.

The solution was verified by reproducing the issue with an artificial
delay and stress-testing the fix with 1000 consecutive runs.

The following test functions were updated to use the new retry mechanism:
- TestPeriodicHourly
- TestPeriodicMinutes
- TestPeriodicPause
- TestPeriodicSkipRevNotChange

Fixes etcd-io#19499

Signed-off-by: amosehiguese <amosehiguese@gmail.com>
@k8s-ci-robot
Copy link

Hi @amosehiguese. Thanks for your PR.

I'm waiting for a etcd-io member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@siyuanfoundation
Copy link
Contributor

/ok-to-test

@codecov
Copy link

codecov bot commented Apr 15, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 68.72%. Comparing base (d3cebae) to head (650232c).
Report is 60 commits behind head on main.

Additional details and impacted files

see 25 files with indirect coverage changes

@@            Coverage Diff             @@
##             main   #19748      +/-   ##
==========================================
- Coverage   68.80%   68.72%   -0.09%     
==========================================
  Files         421      421              
  Lines       35857    35857              
==========================================
- Hits        24672    24642      -30     
- Misses       9752     9780      +28     
- Partials     1433     1435       +2     

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d3cebae...650232c. Read the comment docs.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@ahrtr ahrtr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

cc @fuweid

}
// Exponential backoff
backoffTime := time.Duration(10*(1<<retry)) * time.Millisecond
t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: %s is more human readable

Suggested change
t.Logf("Retry %d: waiting %v before next attempt (last error: %v)", retry+1, backoffTime, lastErr)
t.Logf("Retry %d: waiting %s before next attempt (last error: %v)", retry+1, backoffTime, lastErr)

@k8s-ci-robot
Copy link

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: ahrtr, amosehiguese

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@amosehiguese
Copy link
Contributor Author

You are welcome @ahrtr and thanks for the correction. I guess I can move on to other issues. 😊

@ahrtr ahrtr requested review from fuweid and ivanvc April 17, 2025 12:23
@ahrtr ahrtr merged commit 51e38f4 into etcd-io:main Apr 29, 2025
34 checks passed
@amosehiguese amosehiguese deleted the test/19499-flaky-testperiodichourly branch May 1, 2025 14:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

Flaky go.etcd.io/etcd/server/v3/etcdserver/api/v3compactor: TestPeriodicHourly

4 participants