fix flaky test on mysqlshell backup engine#17037
Conversation
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Review ChecklistHello reviewers! 👋 Please follow this checklist when reviewing this Pull Request. General
Tests
Documentation
New flags
If a workflow is added or modified:
Backward compatibility
|
Codecov ReportAttention: Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #17037 +/- ##
==========================================
- Coverage 67.41% 67.39% -0.02%
==========================================
Files 1574 1574
Lines 253202 253204 +2
==========================================
- Hits 170686 170647 -39
- Misses 82516 82557 +41 ☔ View full report in Codecov by Sentry. |
|
This PR is being marked as stale because it has been open for 30 days with no activity. To rectify, you may do any of the following:
If no action is taken within 7 days, this PR will be closed. |
|
@rvrangel Hi! This test is flaky and impacting us in a bunch of places, do you have some idea on when you can fix it? |
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
rvrangel
left a comment
There was a problem hiding this comment.
re-reading the exec docs, Cmd.Wait() closes all pipes as soon as the program exits, so there is no way to control and avoid eventual between bufio's Scan() running and the pipe being closed, making it not very suitable for this use case.
So I changed approaches to use Cmd.Stdout and Cmd.Stderr and control the closing manually, as well as giving the 3 goroutines additional time to finish reading before we close the pipes, this seems to have eliminated the issue.
| // after that we can process if an error has happened or not. | ||
| err = cmd.Run() | ||
|
|
||
| time.Sleep(time.Millisecond * 100) // give the goroutines some time to read any remaining logs before closing pipes. |
There was a problem hiding this comment.
I can consistently run this from the main branch and hit that edge case:
$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=1000
--- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock (0.01s)
--- FAIL: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired (0.00s)
mysqlshellbackupengine_test.go:379:
Error Trace: /home/rrangel/vitess_rvrangel/go/vt/mysqlctl/mysqlshellbackupengine_test.go:379
Error: "I1126 15:18:30.734269 mysqlshellbackupengine.go:117] Starting ExecuteBackup in test\nI1126 15:18:30.734448 mysqlshellbackupengine.go:150] acquiring a global read lock before fetching the executed GTID sets\nI1126 15:18:30.734470 mysqlshellbackupengine.go:169] running /tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock892506315/001/test.sh --defaults-file=/dev/null --js -h localhost -e util.dumpInstance(\"logical/tmp/TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLocklock_released_if_we_see_the_mysqlsh_lock_being_acquired531729948/001\", {\"threads\": 4})\nW1126 15:18:30.737258 backup.go:524] error scanning lines from mysqlshell stdout: read |0: file already closed\nW1126 15:18:30.737293 backup.go:524] error scanning lines from mysqlshell stderr: read |0: file already closed\nE1126 15:18:30.737314 mysqlshellbackupengine.go:534] could not release global lock earlier\nI1126 15:18:30.737335 mysqlshellbackupengine.go:203] Writing backup MANIFEST\nI1126 15:18:30.737503 mysqlshellbackupengine.go:243] Backup completed\nI1126 15:18:30.737512 xtrabackupengine.go:164] Closing backup file MANIFEST\n" does not contain "global read lock released after"
Test: TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock/lock_released_if_we_see_the_mysqlsh_lock_being_acquired
Messages: failed to release the global lock after mysqlsh
FAIL
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
exit status 1
FAIL vitess.io/vitess/go/vt/mysqlctl 12.238s
with this PR's branch, I reduced the sleep time to 10ms to allow me to runs it thousands of times and haven't seen the issue once the Sleep has been added:
$ go test -run=TestMySQLShellBackupEngine_ExecuteBackup_ReleaseLock -race -count=10000
PASS
cleaning up "/tmp/50bd7cc81b5edaf5bdb887a7c4ab3ba5.dat"
ok vitess.io/vitess/go/vt/mysqlctl 448.584s
I have tested this with 10ms on thousands of tests and didn't see it fail, 100ms should be safer to avoid the flaky test.
There was a problem hiding this comment.
I don't think we should use sleep, why wouldn't it work to wait for the group then? Then the goroutines are done and would have read it all?
In general, a sleep is kinda an anti pattern for concurrent work. If it's needed, there's usually a better way to solve the problem.
There was a problem hiding this comment.
the problem is that closing it too fast once the program finishes doesn't give enough time for bufio to Scan the last line, so the test doesn't see the lock being released. and if we close it immediately, then there is this race between Scan() trying to read it while the file is closed and it doesn't report the last line.
In the real world this should never really happen as the line we are looking for is printed in the beginning, but it shows up immediately on the mocked test. I can move the sleep to the inside the mocked program, which would avoid having this in the code
There was a problem hiding this comment.
Going to merge this since it affects CI so significantly. I still think there's something here fundamentally broken and it needs some rework to remove the actual race, but not going to hold this back here.
Signed-off-by: Renan Rangel <rrangel@slack-corp.com>
| // this simulates mysql shell completing without any issues. | ||
| generateTestFile(t, mysqlShellBackupBinaryName, fmt.Sprintf("#!/bin/bash\n>&2 echo %s", mysqlShellLockMessage)) | ||
| generateTestFile(t, be.binaryName, fmt.Sprintf( | ||
| "#!/bin/bash\n>&2 echo %s; echo \"backup completed\"; sleep 0.01", mysqlShellLockMessage)) |
There was a problem hiding this comment.
@rvrangel This feels still like it works around the race? Is there a way to more fundamentally fix the race? There's afaik no guarantee that here that the actual implementation will take some time before it exits after the last line outputted and we'd lose that last line in logging etc?
There was a problem hiding this comment.
not that I know, but let me know if you have other ideas we could try. the main problem seems to happen when the pipe/file is closed during the Scan() is taking place in the other goroutine and I haven't found a way of avoiding it
There was a problem hiding this comment.
@rvrangel Fwiw, I've tried running it without the sleep 0.01 in the test program and it doesn't seem to fail? So is that really needed?
There was a problem hiding this comment.
did you run the test as I did here (so it runs thousands of times with -race -count=1000)? If I remove the sleep I do see the errors pop up every now and then when I run it. If you don't, maybe try it with 10k instead.
Signed-off-by: Renan Rangel <rrangel@slack-corp.com> Signed-off-by: Florent Poinsard <florent.poinsard@outlook.fr>
Description
This is an attempt to fix the flakiness of the test introduced in #17000
Related Issue(s)
Related to the fix for #17011
https://vitess.slack.com/archives/CFBGH8GUV/p1729457569165799?thread_ts=1729264090.367559&cid=CFBGH8GUV
Checklist
Deployment Notes