Reduce Flakiness of ERS/PRS e2e Tests Using Retries With a Timeout #10720
Merged
GuptaManan100 merged 3 commits into vitessio:main on Jul 18, 2022
Signed-off-by: Matt Lord <mattalord@gmail.com>
mattlord added a commit to planetscale/vitess that referenced this pull request on Jul 17, 2022
This is a test of the fix in: vitessio#10720
Signed-off-by: Matt Lord <mattalord@gmail.com>
GuptaManan100 approved these changes on Jul 18, 2022
GuptaManan100 left a comment:
This is great! I had seen this failure a couple of times before too and had this on my TODO list for a while now. Thank you for making this change.
mattlord added a commit that referenced this pull request on Jul 22, 2022
* Don't allow resume if VDiff not completed.
An uncompleted vdiff should only be retried, not resumed.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Record and display VDiff errors per shard
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Better align errors in text based output
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Work with underlying database errors
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Retry when appropriate on vdiff engine startup
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Address bugs in resume/retry logic after adding support for both
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Make auto-retry the default
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Ensure report is valid json before unmarshaling
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Auto retry error'd VDiffs
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Don't retry error'd VDiffs on engine start anymore
We now have a goroutine that will do this periodically
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Limit withDDL usage to entry points
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add e2e test and fix bugs it exposed
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Ensure we always signal the retry goroutine to stop on engine.Close()
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use vdiff engine mutex during retries
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Close retry goroutine on vde ctx cancel w/o done channel
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Open & Close of VDiff engine controls retry goroutine
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Rely on sync.Once to apply VDiff schema
And avoid withDDL usage anywhere else.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor change after self review
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Tidy up vdiff retry goroutine mgmt
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Improving goroutine mgmt -- trying to eliminate flakes
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Moar safety
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Making more tweaks to exec and term more quickly
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Aye dios mio
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Go back to 30s ticker
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Make engine open more efficient to improve PRS times
Otherwise PRS was timing out too often waiting for the replicas to
catch up with the new promoted primary.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* isOpen=true before initControllers to ensure proper Close
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Do lazy init in 2 entry points when doing actual work.
Check for cancelled context in the controller.run code path.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Bug fixes and improvements
TODO: figure out how we should be tracking progress for merges where
there's a single target shard -- and thus a single
_vt.vdiff_table record -- but > 1 source shard.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Final (🤞) set of bug fixes
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add progress reporting
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor changes after self review
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Improved template for error handling in text format
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Bug fixes around progress calculation
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Comment improvement
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Ensure we get correct DB name from vreplication workflow.
And quickly timeout the receive attempted on the controller's
done channel when checking if it's active.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Bug fixes and improvements to progress reporting
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor changes after self review
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Remove unnecessary case in select
We'll never hit that as the first one will receive immediately as
the channel was closed or we'll hit the default.
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Add unit test for progress reporting
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Be more realistic in half way unit test
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use more descriptive var names in unit test to make logic clear
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Protect all access to vde.controllers
Add comment about safe usage of vde.addController()
Signed-off-by: Matt Lord <mattalord@gmail.com>
* De-flake ers_prs tests using state waits with timeouts
This is a test of the fix in:
#10720
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor improvement -- also to exec the test suite again
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use simpler defer stmt
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Correct VDiff final state handling
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Improve WorkflowDiffer logging & error handling
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Use withDDL.ExecIgnore during engine startup
The VReplication engine does the same.
Signed-off-by: Matt Lord <mattalord@gmail.com>
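Two of the commit notes above ("quickly timeout the receive attempted on the controller's done channel" and "Remove unnecessary case in select") rest on a standard Go idiom: a select with a default case never blocks, and a receive from a closed channel returns immediately. The sketch below illustrates that idiom only. The isDone helper is a hypothetical name for illustration and is not taken from the VDiff engine code.

```go
package main

import "fmt"

// isDone is a hypothetical helper (not from the Vitess codebase).
// It reports whether the done channel has been closed, without blocking:
// a closed channel makes the receive case ready immediately, and an open
// channel with nothing to receive falls through to default.
func isDone(done <-chan struct{}) bool {
	select {
	case <-done:
		return true
	default:
		return false
	}
}

func main() {
	done := make(chan struct{})
	fmt.Println("before close:", isDone(done))
	close(done)
	fmt.Println("after close:", isDone(done))
}
```

Because one of the two cases is always ready, any additional case (such as a timer) in the same select could only be chosen when it happens to be ready at the same instant, which is why such a case adds nothing.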
rsajwani pushed a commit to planetscale/vitess that referenced this pull request on Aug 1, 2022
…itessio#10720) (vitessio#880)
* Reduce flakiness of ers/prs e2e tests using retries with a timeout
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Minor improvement -- will also exercise the tests again
Signed-off-by: Matt Lord <mattalord@gmail.com>
* Improve failure message
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>
Co-authored-by: Matt Lord <mattalord@gmail.com>
Description
The ers_prs_heavy workflow has been very flaky in this PR: #10639, often needing 4+ runs to succeed. I suspect that all of the new DDL executed at tablet init when the VDiff engine opens is making replication slightly slower, and thus we're getting this basic error from various individual tests (the ID and specific test differing on the failed runs):

This patch was tested and verified here and it passed on the first try. 🥳 It was then re-tested and re-verified after this minor follow-up and it passed on the first try again. 🥳 🥳
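The general technique named in the title, retrying a check until it passes or a timeout expires instead of asserting once, can be sketched roughly as follows. This is a minimal illustration under stated assumptions and not the code from this PR: waitForCondition and the simulated replicaCaughtUp check are hypothetical names, and the timeout and polling interval are arbitrary.

```go
package main

import (
	"fmt"
	"time"
)

// waitForCondition is a hypothetical helper (not the Vitess test utility).
// It calls check every interval until check returns true, and returns an
// error if the condition is still false once timeout has elapsed.
func waitForCondition(check func() bool, timeout, interval time.Duration) error {
	deadline := time.Now().Add(timeout)
	for {
		if check() {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("condition not met within %v", timeout)
		}
		time.Sleep(interval)
	}
}

func main() {
	// Simulated stand-in for "the replica has the expected row":
	// it only succeeds on the third poll, as a slow replica might.
	polls := 0
	replicaCaughtUp := func() bool {
		polls++
		return polls >= 3
	}

	err := waitForCondition(replicaCaughtUp, 2*time.Second, 10*time.Millisecond)
	if err != nil {
		fmt.Println("test would fail:", err)
		return
	}
	fmt.Printf("condition met after %d polls\n", polls)
}
```

A single immediate assertion fails whenever replication lags by even a moment. Polling with a bounded timeout tolerates that lag while still failing, with a descriptive error, if the replica never catches up.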
Related Issue(s)
Checklist