Address flakiness of vtgate_vindex.prefixfanout tests #10216
mattlord merged 15 commits into vitessio:main
Conversation
Commits (all signed off by Matt Lord <mattalord@gmail.com>); messages of note:
- "Sometimes GitHub Actions is *super* slow and our tests should still be able to pass."
- "And get related files aligned"
- "We were waiting for 1 replica tablet when the cluster defined for the test did not have any replica tablets."
frouioui
left a comment
It looks good to me, I left a comment and a question.
However, the new vtgate_vindex_heavy test has a timeout. Starting the VTGate seems to make the test time out; this is the last log line before the test is interrupted:

```
I0504 23:20:14.082316 16539 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_1342812822/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_1342812822/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_1342812822/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3 --health_check_interval=2s
```

I am unsure whether or not this timeout is linked to the changes made to the port range and fd limit in the workflow.
The timeout can possibly come from:

```go
if err := clusterInstance.WaitForTabletsToHealthyInVtgate(); err != nil {
	return 1
}
```

as well.
Just noting that this was discussed in Slack. I'm not sure why. We might be waiting in here. Perhaps another bug to fix in the future. 🙂
Once the test passes again for the 7th time in a row (hopefully) and I confirm the correct log messages (should be …), this should be ready.
Thanks for doing this @mattlord! It is amazing ❤️
Description
The `vtgate_vindex` -> `prefixfanout` tests have been flaky, particularly when GitHub Actions is slower or has more resource contention than usual. In this PR we make the following changes:

- Wait for the `vttablets` to be seen as healthy and serving (in `TestMain`) by the `vtgate` before executing any tests
- Fixed a bug in the `WaitForTabletsToHealthyInVtgate()` function: it was always waiting for 1 `replica` tablet in each shard to be seen as healthy and serving in the `vtgate`, but `replica` tablets are optional and we have none of them in the cluster used for the `prefixfanout` test
- Marked the `vtgate_vindex` test as heavy, since even with a long wait we still seemed unable to have mysqld start at times
- Even with the `WaitForTabletsToHealthyInVtgate()` bug fixed, this is fairly heavy so leaving this in place (can remove though if others prefer)
- `Cluster_17` flakiness was seen here too (hitting the 10 min time limit); renamed that to `vtgate_general_heavy`
- Renamed `Cluster_20` to `xb_backup`, as I missed the opportunity to do that in Temp: Pin XtraBackup version used at 2.4.24 for 5.7 tests #10194 and was sad

ℹ️ NOTE: marking a CI workflow as heavy causes us to increase some key OS limits (e.g. local ephemeral port range, AIO slots, open files, etc.) while also decreasing the resource usage of each mysqld
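For concreteness, "increasing some key OS limits" for a heavy workflow typically means kernel tunables like the ones below. This is an illustrative sketch only; the exact sysctl keys match standard Linux tunables, but the values are assumptions, not the ones the Vitess workflows necessarily use.

```shell
# Illustrative "heavy workflow" OS limit bumps (values are assumptions):
# widen the local ephemeral port range available to outbound connections
sudo sysctl -w net.ipv4.ip_local_port_range="22768 65535"
# raise the system-wide cap on concurrent async I/O (AIO) requests
sudo sysctl -w fs.aio-max-nr=1048576
# raise the system-wide open file handle limit
sudo sysctl -w fs.file-max=2097152
# raise the per-process open file descriptor limit for this shell
ulimit -n 65535
```

Pairing these bumps with a lower per-mysqld footprint (e.g. smaller buffer pools) is what lets several MySQL instances coexist on a constrained CI runner.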
Related Issue(s)
Checklist