Skip to content

Reduce Flakiness of ERS/PRS e2e Tests Using Retries With a Timeout#10720

Merged
GuptaManan100 merged 3 commits intovitessio:mainfrom
planetscale:ers_prs_flakes
Jul 18, 2022
Merged

Reduce Flakiness of ERS/PRS e2e Tests Using Retries With a Timeout#10720
GuptaManan100 merged 3 commits intovitessio:mainfrom
planetscale:ers_prs_flakes

Conversation

@mattlord
Copy link
Member

@mattlord mattlord commented Jul 17, 2022

Description

The ers_prs_heavy workflow has been very flaky in this PR: #10639. Often having to run 4+ times to succeed. I suspect that all of the new DDL executed at tablet init when the VDiff engine opens is causing replication to be slightly slower and thus we're getting this basic error from various individual tests (the ID and specific test differing on the failed runs):

2022-07-17T21:32:58.8852569Z     utils.go:348: 
2022-07-17T21:32:58.8852912Z         	Error Trace:	utils.go:354
2022-07-17T21:32:58.8853472Z         	            				utils.go:348
2022-07-17T21:32:58.8853942Z         	            				utils.go:526
2022-07-17T21:32:58.8854402Z         	            				utils.go:460
2022-07-17T21:32:58.8854875Z         	            				ers_test.go:453
2022-07-17T21:32:58.8855872Z         	Error:      	Expected nil, but got: &mysql.SQLError{Num:1146, State:"42S02", Message:"Table 'vt_ks.vt_insert_test' doesn't exist", Query:"select msg from vt_insert_test where id=20"}

This patch was tested and verified here and it passed on the first try. 🥳 It was then re-tested and re-verified after this minor follow-up and it passed on the first try again. 🥳 🥳

Related Issue(s)

Checklist

  • "Backport me!" label has been added if this change should be backported
  • Tests were added or are not required
  • Documentation was added or is not required

Signed-off-by: Matt Lord <mattalord@gmail.com>
@vitess-bot
Copy link
Contributor

vitess-bot bot commented Jul 17, 2022

Review Checklist

Hello reviewers! 👋 Please follow this checklist when reviewing this Pull Request.

General

  • Ensure that the Pull Request has a descriptive title.
  • If this is a change that users need to know about, please apply the release notes (needs details) label so that merging is blocked unless the summary release notes document is included.
  • If a new flag is being introduced, review whether it is really needed. The flag names should be clear and intuitive (as far as possible), and the flag's help should be descriptive. Additionally, flag names should use dashes (-) as word separators rather than underscores (_).
  • If a workflow is added or modified, each items in Jobs should be named in order to mark it as required. If the workflow should be required, the GitHub Admin should be notified.

Bug fixes

  • There should be at least one unit or end-to-end test.
  • The Pull Request description should either include a link to an issue that describes the bug OR an actual description of the bug and how to reproduce, along with a description of the fix.

Non-trivial changes

  • There should be some code comments as to why things are implemented the way they are.

New/Existing features

  • Should be documented, either by modifying the existing documentation or creating new documentation.
  • New features should have a link to a feature request issue or an RFC that documents the use cases, corner cases and test cases.

Backward compatibility

  • Protobuf changes should be wire-compatible.
  • Changes to _vt tables and RPCs need to be backward compatible.
  • vtctl command output order should be stable and awk-able.

mattlord added a commit to planetscale/vitess that referenced this pull request Jul 17, 2022
This is a test of the fix in:
  vitessio#10720

Signed-off-by: Matt Lord <mattalord@gmail.com>
@mattlord mattlord marked this pull request as ready for review July 18, 2022 00:02
@mattlord mattlord requested a review from GuptaManan100 July 18, 2022 00:02
@mattlord mattlord changed the title Reduce flakiness of ers/prs e2e tests using retries with a timeout Reduce Flakiness of ERS/PRS e2e Tests Using Retries With a Timeout Jul 18, 2022
mattlord added 2 commits July 17, 2022 20:11
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Copy link
Contributor

@GuptaManan100 GuptaManan100 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is great! I had seen this failure a couple of times before too and had this on my TODO list for a while now. Thank you for making this change.

@GuptaManan100 GuptaManan100 merged commit 1d435f2 into vitessio:main Jul 18, 2022
@GuptaManan100 GuptaManan100 deleted the ers_prs_flakes branch July 18, 2022 09:52
@mattlord mattlord mentioned this pull request Jul 19, 2022
3 tasks
mattlord added a commit that referenced this pull request Jul 22, 2022
* Don't allow resume if VDiff not completed.

An uncompleted vdiff should only be retried, not resumed.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Record and display VDiff errors per shard

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Better align errors in text based output

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Work with underlying database errors

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Retry when appropriate on vdiff engine startup

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Address bugs in resume/retry logic after adding support for both

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Make auto-retry the default

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Ensure report is valid json before unmarshaling

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Auto retry error'd VDiffs

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Don't retry error'd VDiffs on engine start anymore

We now have a goroutine that will do this periodically

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Limit withDDL usage to entry points

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add e2e test and fix bugs it exposed

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Ensure we always signal the retry goroutine to stop on engine.Close()

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use vdiff engine mutex during retries

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Close retry goroutine on vde ctx cancel w/o done channel

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Open & Close of VDiff engine controls retry goroutine

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Rely on sync.Once to apply VDiff schema

And avoid withDDL usage anywhere else.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor change after self review

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Tidy up vdiff retry goroutine mgmt

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Improving gorouting mgmt -- trying to eliminate flakes

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Moar safety

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Making more tweaks to exec and term more quickly

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Aye dios mio

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Go back to 30s ticker

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Make engine open more efficient to improve PRS times

Otherwise PRS was timing out too often waiting for the replicas to
catch up with the new promoted primary.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* isOpen=true before initControllers to ensure proper Close

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Do lazy init in 2 entry points when doing actual work.

Check for cancelled context in the controller.run code path.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Bug fixes and improvements

TODO: figure out how we should be tracking progress for merges where
      there's a single target shard -- and thus a single
      _vt.vdiff_table record -- but > 1 source shard.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Final (🤞) set of bug fixes

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add progress reporting

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor changes after self review

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Improved template for error handling in text format

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Bug fixes around progress calculation

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Comment improvement

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Ensure we get correct DB name from vreplication workflow.

And quickly timeout the receive attempted on the controller's
done channel when checking if it's active.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Bug fixes and improvements to progress reporting

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor changes after self review

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Remove unnecessary case in select

We'll never hit that as the first one will receive immediately as
the channel was closed or we'll hit the default.

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Add unit test for progress reporting

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Be more realistic in half way unit test

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use more descriptive var names in unit test to make logic clear

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Protect all access to vde.controllers

Add comment about safe usage of vde.addController()

Signed-off-by: Matt Lord <mattalord@gmail.com>

* De-flake ers_prs tests using state waits with timeouts

This is a test of the fix in:
  #10720

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor improvement -- also to exec the test suite again

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use simpler defer stmt

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Correct VDiff final state handling

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Improve WorkflowDiffer logging & error handling

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Use withDDL.ExecIgnore during engine startup

The VReplication engine does the same.

Signed-off-by: Matt Lord <mattalord@gmail.com>
rsajwani pushed a commit to planetscale/vitess that referenced this pull request Aug 1, 2022
…itessio#10720) (vitessio#880)

* Reduce flakiness of ers/prs e2e tests using retries with a timeout

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Minor improvement -- will also exercise the tests again

Signed-off-by: Matt Lord <mattalord@gmail.com>

* Improve failure message

Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Manan Gupta <manan@planetscale.com>

Co-authored-by: Matt Lord <mattalord@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants