Fix flaky test: TestBrutalShutdown.testQueryRetryOnShutdown #24808
When debugging this test's failures, I found that some queries were failing with an error that is not marked as retriable, causing this test to fail. After changing the respective error code to one that is retriable, this test no longer seems to fail. The failures may have started due to changes in the way the task event loops are scheduled, but I am not 100% sure. Either way, I think queries that fail with this error code can be retried, so the solution seemed suitable.
Nit suggestion for the release note entry to follow the Order of changes in the Release Notes Guidelines:
I think my "testing" is a little vague. If you have a better phrasing, please do!
I don't think we should add REMOTE_BUFFER_CLOSE_FAILED as a retriable error just for unit tests. We can add it if it's indeed needed in prod too.
That being said, I don't see many errors
Another solution is to rewrite the test assertion so that it checks that every query that initially failed with a retriable error eventually succeeds. However, a run can produce 0 queries with retriable errors, which would defeat the purpose of the test in those runs. The query failure type is non-deterministic when the worker(s) shut down, which seems to be the main cause of flakiness. I am open to other ideas.
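To make that trade-off concrete, here is a minimal sketch of the alternative assertion. It is not taken from the Presto test suite: `QueryOutcome`, `RetryAssertionSketch`, and the error code `HYPOTHETICAL_RETRIABLE_ERROR` are made-up names for illustration. The helper returns how many queries it actually checked, which shows how the assertion can pass without verifying anything.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model of one query run during the shutdown test; not a Presto class.
final class QueryOutcome {
    final String firstErrorCode;       // null when the first attempt succeeded
    final boolean eventuallySucceeded; // true when the query passed, possibly after retries

    QueryOutcome(String firstErrorCode, boolean eventuallySucceeded) {
        this.firstErrorCode = firstErrorCode;
        this.eventuallySucceeded = eventuallySucceeded;
    }
}

final class RetryAssertionSketch {
    // Assumed set of retriable codes, for illustration only.
    static final Set<String> RETRIABLE =
            new HashSet<>(Arrays.asList("HYPOTHETICAL_RETRIABLE_ERROR"));

    // Fails if a query that first hit a retriable error never recovered.
    // Returns the number of queries that were actually checked.
    static int assertRetriableFailuresRecovered(List<QueryOutcome> outcomes) {
        int checked = 0;
        for (QueryOutcome outcome : outcomes) {
            boolean retriableFailure = outcome.firstErrorCode != null
                    && RETRIABLE.contains(outcome.firstErrorCode);
            if (!retriableFailure) {
                continue; // successes and non-retriable failures are ignored
            }
            if (!outcome.eventuallySucceeded) {
                throw new AssertionError(
                        "retriable failure did not recover: " + outcome.firstErrorCode);
            }
            checked++;
        }
        return checked;
    }
}
```

If every failure in a run carries a code outside the retriable set, the method returns 0 and the test passes vacuously, which is the weakness described above.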
rschlussel
left a comment
I think if this is an error code that you see after a forced shutdown, it makes sense to mark it as retriable.
Description
Set error code REMOTE_BUFFER_CLOSE_FAILED as a retriable error.
(Hopefully) Fixes #22125
Motivation and Context
I set this test's invocationCount to 10 or 15 in order to run it on repeat and trigger the failure. I found that in all failures, the query had failed with REMOTE_BUFFER_CLOSE_FAILED. This caused the test to fail because that error code is not marked as retriable. After changing the error code to one that is retriable, this test no longer seems to fail. I was not able to reproduce the failures on my local machine. The failures may have started due to changes in the way the task event loops are scheduled, but I am not 100% sure. Either way, I think queries that fail with this error code can be retried, so the solution seemed suitable.
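The effect of a retriable flag on an error code can be sketched as follows. This is a simplified illustration, not Presto's actual StandardErrorCode or retry logic: `SketchErrorCode`, `SketchQueryException`, `RetrySketch`, and `HYPOTHETICAL_FATAL_ERROR` are invented names, and where the real retry takes place is not shown here.

```java
import java.util.function.Supplier;

// Hypothetical, simplified stand-in for an error-code enum with a retriable flag.
enum SketchErrorCode {
    REMOTE_BUFFER_CLOSE_FAILED(true), // treated as retriable, as this PR proposes
    HYPOTHETICAL_FATAL_ERROR(false);  // made-up non-retriable code

    final boolean retriable;

    SketchErrorCode(boolean retriable) {
        this.retriable = retriable;
    }
}

// Hypothetical exception that carries the error code of a failed query.
final class SketchQueryException extends RuntimeException {
    final SketchErrorCode code;

    SketchQueryException(SketchErrorCode code) {
        super(code.name());
        this.code = code;
    }
}

final class RetrySketch {
    // Runs the query, retrying only when the failure carries a retriable code.
    static <T> T runWithRetries(Supplier<T> query, int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        SketchQueryException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.get();
            } catch (SketchQueryException e) {
                if (!e.code.retriable) {
                    throw e; // non-retriable: fail immediately
                }
                last = e; // retriable: try again if attempts remain
            }
        }
        throw last; // every attempt failed with a retriable error
    }
}
```

Under this model, a query that hits REMOTE_BUFFER_CLOSE_FAILED once during a shutdown gets another attempt, whereas before the change the same failure would have surfaced immediately and failed the test.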
Impact
StandardErrorCode#REMOTE_BUFFER_CLOSE_FAILED is now a retriable query error.
Test Plan
Existing tests
Contributor checklist
Release Notes