Flakes: Use new healthy shard check in vreplication e2e tests#12502
Merged
mattlord merged 7 commits intovitessio:mainfrom Mar 1, 2023
Merged
Flakes: Use new healthy shard check in vreplication e2e tests#12502mattlord merged 7 commits intovitessio:mainfrom
mattlord merged 7 commits intovitessio:mainfrom
Conversation
This is needed because checking that there's a primary tablet for the shard in vtgate's healtcheck is no longer a reliable indicator that the shard has a healthy serving primary, because now a primary needs to initialize its sidecar database and wait for that to replicate via semi-sync before it becomes serving and can proceed to perform normal functions. So this delay could cause test flakiness if you required a healthy shard before continuing with the test. Signed-off-by: Matt Lord <mattalord@gmail.com>
Contributor
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
If a new flag is being introduced:
If a workflow is added or modified:
Bug fixes
Non-trivial changes
New/Existing features
Backward compatibility
|
They looked like this:
WARNING: DATA RACE
Write at 0x000005bf9b60 by goroutine 27141:
github.com/spf13/pflag.newUint64Value()
/home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:9 +0x5a
github.com/spf13/pflag.(*FlagSet).Uint64Var()
/home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:45 +0x55
vitess.io/vitess/go/vt/log.RegisterFlags()
/home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64
vitess.io/vitess/go/vt/servenv.GetFlagSetFor()
/home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183
vitess.io/vitess/go/vt/servenv.ParseFlags()
/home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49
...
Previous read at 0x000005bf9b60 by goroutine 27136:
1744
github.com/golang/glog.(*syncBuffer).Write()
...
And they most often occurred in the wrangler unit tests, which makes sense
because it creates a log of loggers.
Signed-off-by: Matt Lord <mattalord@gmail.com>
This reverts commit 51992b8. Signed-off-by: Matt Lord <mattalord@gmail.com>
d61154a to
a85c129
Compare
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
Signed-off-by: Matt Lord <mattalord@gmail.com>
…althy Signed-off-by: Matt Lord <mattalord@gmail.com>
deepthi
approved these changes
Mar 1, 2023
Collaborator
|
Looks like a good change. Do we have a sense for how long the schema init is taking? IIRC, in local testing it was only 1-2 seconds. |
frouioui
approved these changes
Mar 1, 2023
Member
Author
Yeah, 1-3 seconds seems normal locally. But that is plenty of time for races. And sometimes we encounter very interesting timings on Actions runners. 🥲 |
Contributor
|
I was unable to backport this Pull Request to the following branches: |
GuptaManan100
pushed a commit
to planetscale/vitess
that referenced
this pull request
Mar 28, 2023
…io#12502) * Use new healthy shard check in vreplication e2e tests This is needed because checking that there's a primary tablet for the shard in vtgate's healtcheck is no longer a reliable indicator that the shard has a healthy serving primary, because now a primary needs to initialize its sidecar database and wait for that to replicate via semi-sync before it becomes serving and can proceed to perform normal functions. So this delay could cause test flakiness if you required a healthy shard before continuing with the test. Signed-off-by: Matt Lord <mattalord@gmail.com> * Try to address unit test race flakes around log size They looked like this: WARNING: DATA RACE Write at 0x000005bf9b60 by goroutine 27141: github.com/spf13/pflag.newUint64Value() /home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:9 +0x5a github.com/spf13/pflag.(*FlagSet).Uint64Var() /home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:45 +0x55 vitess.io/vitess/go/vt/log.RegisterFlags() /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64 vitess.io/vitess/go/vt/servenv.GetFlagSetFor() /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183 vitess.io/vitess/go/vt/servenv.ParseFlags() /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49 ... Previous read at 0x000005bf9b60 by goroutine 27136: 1744 github.com/golang/glog.(*syncBuffer).Write() ... And they most often occurred in the wrangler unit tests, which makes sense because it creates a log of loggers. Signed-off-by: Matt Lord <mattalord@gmail.com> * Revert "Try to address unit test race flakes around log size" This reverts commit 51992b8. Signed-off-by: Matt Lord <mattalord@gmail.com> * Use external cluster vtctld in TestMigrate Signed-off-by: Matt Lord <mattalord@gmail.com> * Use subshell vs command output interpolation Signed-off-by: Matt Lord <mattalord@gmail.com> * Ingnore any config files in mysql alias Signed-off-by: Matt Lord <mattalord@gmail.com> --------- Signed-off-by: Matt Lord <mattalord@gmail.com>
frouioui
pushed a commit
that referenced
this pull request
Mar 28, 2023
#12740) * Use new healthy shard check in vreplication e2e tests This is needed because checking that there's a primary tablet for the shard in vtgate's healtcheck is no longer a reliable indicator that the shard has a healthy serving primary, because now a primary needs to initialize its sidecar database and wait for that to replicate via semi-sync before it becomes serving and can proceed to perform normal functions. So this delay could cause test flakiness if you required a healthy shard before continuing with the test. * Try to address unit test race flakes around log size They looked like this: WARNING: DATA RACE Write at 0x000005bf9b60 by goroutine 27141: github.com/spf13/pflag.newUint64Value() /home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:9 +0x5a github.com/spf13/pflag.(*FlagSet).Uint64Var() /home/runner/go/pkg/mod/github.com/spf13/pflag@v1.0.5/uint64.go:45 +0x55 vitess.io/vitess/go/vt/log.RegisterFlags() /home/runner/work/vitess/vitess/go/vt/log/log.go:81 +0x64 vitess.io/vitess/go/vt/servenv.GetFlagSetFor() /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:347 +0x183 vitess.io/vitess/go/vt/servenv.ParseFlags() /home/runner/work/vitess/vitess/go/vt/servenv/servenv.go:326 +0x49 ... Previous read at 0x000005bf9b60 by goroutine 27136: 1744 github.com/golang/glog.(*syncBuffer).Write() ... And they most often occurred in the wrangler unit tests, which makes sense because it creates a log of loggers. * Revert "Try to address unit test race flakes around log size" This reverts commit 51992b8. * Use external cluster vtctld in TestMigrate * Use subshell vs command output interpolation * Ingnore any config files in mysql alias --------- Signed-off-by: Matt Lord <mattalord@gmail.com> Co-authored-by: Matt Lord <mattalord@gmail.com>
Closed
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description
I have noticed that e.g. the
vreplication_basicCI workflow has become less reliable since the SidecarDB init work was merged. The seen failures were always on theTestVreplicationCopyThrottlingtest. The reason is because we now needed a more reliable check for a healthy shard.This is needed because checking that there's a primary tablet for the shard in vtgate's healtcheck — which is what
vtgate.WaitForStatusOfTabletInShard()does — is no longer a reliable indicator that the shard has a healthy serving primary, because now after a primary is elected it needs to initialize its sidecar database and wait for those DDLs to replicate via semi-sync replication before it becomes serving and can proceed to perform normal functions. So this delay could cause test flakiness if you required a healthy shard before continuing with the test as was the case inTestVreplicationCopyThrottling.This PR migrates all of the places in the
vreplicationendtoend tests where we were using this to wait for a healthy shard:vtgate.WaitForStatusOfTabletInShard(fmt.Sprintf("%s.%s.primary", keyspace, shard), 1)to instead using the newWaitForHealthyShardcluster package helper:Related Issue(s)
Checklist