Address flakiness of vtgate_vindex.prefixfanout tests #10216
mattlord merged 15 commits into vitessio:main
Conversation
Commits (all signed off by Matt Lord <mattalord@gmail.com>); messages of note:
- "Sometimes GitHub Actions is *super* slow and our tests should still be able to pass."
- "And get related files aligned"
- "We were waiting for 1 replica tablet when the cluster defined for the test did not have any replica tablets."
frouioui
left a comment
It looks good to me, I left a comment and a question.
However, the new vtgate_vindex_heavy test has a timeout. Starting the VTGate seems to make the test time out; this is the last log line before the test is interrupted:

```
I0504 23:20:14.082316 16539 vtgate_process.go:109] Running vtgate with command: vtgate --topo_implementation etcd2 --topo_global_server_address localhost:16002 --topo_global_root /vitess/global --log_dir /tmp/vt_1342812822/vtroot_16001/tmp_16003 --log_queries_to_file /tmp/vt_1342812822/vtroot_16001/tmp_16003/vtgate_querylog.txt --port 16031 --grpc_port 16032 --mysql_server_port 16033 --mysql_server_socket_path /tmp/vt_1342812822/vtroot_16001/tmp_16003/mysql.sock --cell zone1 --cells_to_watch zone1 --tablet_types_to_wait PRIMARY,REPLICA --service_map grpc-tabletmanager,grpc-throttler,grpc-queryservice,grpc-updatestream,grpc-vtctl,grpc-vtworker,grpc-vtgateservice --mysql_auth_server_impl none --planner_version Gen4CompareV3 --health_check_interval=2s
```

I am unsure whether or not this timeout is linked to the changes made to the port range and fd limit in the workflow.
The timeout can possibly come from:

```go
if err := clusterInstance.WaitForTabletsToHealthyInVtgate(); err != nil {
	return 1
}
```

as well.
Just noting that this was discussed in Slack. I'm not sure why. We might be waiting in here. Perhaps another bug to fix in the future. 🙂
Once the test passes again for the 7th time in a row (hopefully) and I confirm the correct log messages (should be …), this should be ready.
Thanks for doing this @mattlord! It is amazing ❤️
Description
The `vtgate_vindex` -> `prefixfanout` tests have been flaky, particularly when GitHub Actions is slower or has more resource contention than usual. In this PR we make the following changes:

- Wait for the `vttablets` to be seen as healthy and serving (in `TestMain`) by the `vtgate` before executing any tests
- Fixed a bug in the `WaitForTabletsToHealthyInVtgate()` function: it was always waiting for 1 `replica` tablet in each shard to be seen as healthy and serving in the `vtgate`, but `replica` tablets are optional and we have none of them in the cluster used for the `prefixfanout` test
- Marked the `vtgate_vindex` test as heavy, since even with a long wait we still seemed unable to have mysqld start at times
- Even with the `WaitForTabletsToHealthyInVtgate()` bug fixed, this is fairly heavy so leaving this in place (can remove though if others prefer)
- `Cluster_17` flakiness was seen here too (hitting the 10 min time limit); renamed that to `vtgate_general_heavy`
- Renamed `Cluster_20` to `xb_backup`, as I missed the opportunity to do that in Temp: Pin XtraBackup version used at 2.4.24 for 5.7 tests #10194 and was sad

ℹ️ NOTE: marking a CI workflow as heavy causes us to increase some key OS limits (e.g. local ephemeral port range, AIO slots, open files, etc.) while also decreasing the resource usage of each mysqld
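For concreteness, "increasing some key OS limits" for a heavy workflow typically means kernel tunables like the ones below. This is an illustrative sketch only; the exact sysctl keys match standard Linux tunables, but the values are assumptions, not the ones the Vitess workflows necessarily use.

```shell
# Illustrative "heavy workflow" OS limit bumps (values are assumptions):
# widen the local ephemeral port range available to outbound connections
sudo sysctl -w net.ipv4.ip_local_port_range="22768 65535"
# raise the system-wide cap on concurrent async I/O (AIO) requests
sudo sysctl -w fs.aio-max-nr=1048576
# raise the system-wide open file handle limit
sudo sysctl -w fs.file-max=2097152
# raise the per-process open file descriptor limit for this shell
ulimit -n 65535
```

Pairing these bumps with a lower per-mysqld footprint (e.g. smaller buffer pools) is what lets several MySQL instances coexist on a constrained CI runner.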
Related Issue(s)
Checklist