Skip to content

Conversation

@ZacBlanco
Copy link
Contributor

@ZacBlanco ZacBlanco commented Mar 26, 2025

Description

Set error code REMOTE_BUFFER_CLOSE_FAILED as a retriable error.

(Hopefully) Fixes #22125

Motivation and Context

I set this test's invocationCount to 10 or 15 in order to run this on repeat and trigger the failure. I found that for all failures, it was because the query failed with REMOTE_BUFFER_CLOSE_FAILED. This caused the test to fail because it is not marked as a retriable error. By changing the respective error code to one that is retriable this test no longer seems to fail. I was not able to reproduce the failures on my local machine

This may have occurred due to changes in the way the task event loops are scheduled, but I am not 100% sure. Either way, I think this error code is capable of query retries, so the solution seemed suitable.

Impact

  • StandardErrorCode#REMOTE_BUFFER_CLOSE_FAILED is now a retriable query error.

Test Plan

Existing tests

Contributor checklist

  • Please make sure your submission complies with our contributing guide, in particular code style and commit standards.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

SPI Changes
* REMOTE_BUFFER_CLOSE_FAILED is now a retriable error

@prestodb-ci prestodb-ci added the from:IBM PR from IBM label Mar 26, 2025
@ZacBlanco ZacBlanco force-pushed the upstream-flaky-test-brutal-shutdown branch 2 times, most recently from 725f419 to 4624aa4 Compare March 26, 2025 22:48
When debugging this tests' failures, I found that some queries were
failing with an error that is not marked as retriable, causing this test
to fail. By changing the respective error code to one that is retriable
this test no longer seems to fail.

This may have occurred due to changes in the way the task event
loops are scheduled, but I am not 100% sure. Either way, I think this
error code is capable of query retries, so the solution seemed suitable.
@ZacBlanco ZacBlanco force-pushed the upstream-flaky-test-brutal-shutdown branch from 4624aa4 to 29362f5 Compare March 26, 2025 22:50
@ZacBlanco ZacBlanco marked this pull request as ready for review March 27, 2025 01:00
@prestodb-ci prestodb-ci requested a review from a team March 27, 2025 01:01
@ZacBlanco ZacBlanco requested a review from presto-oss March 27, 2025 01:01
@prestodb-ci prestodb-ci requested review from bibith4 and pratyakshsharma and removed request for a team March 27, 2025 01:01
@steveburnett
Copy link
Contributor

Nit suggestion for the release note entry to follow the Order of changes in the Release Notes Guidelines:

== RELEASE NOTES ==

SPI Changes
* Improve testing by making REMOTE_BUFFER_CLOSE_FAILED a retriable error

I think my "testing" is a little vague. If you have a better phrasing, please do!

@jaystarshot
Copy link
Member

Don't think we should add REMOTE_BUFFER_CLOSE_FAILED as an retryble error just for unit tests. We can add it if its indeed needed in prod too.
cc: @rschlussel

@jaystarshot
Copy link
Member

That being said i don't see many errors REMOTE_BUFFER_CLOSE_FAILED internally so its probably fine

@ZacBlanco
Copy link
Contributor Author

ZacBlanco commented Mar 27, 2025

Another solution is to try and re-write the test assertion to assert that all queries which initially failed with retriable errors all eventually successfully pass - but it's possible you can get 0 queries with retriable errors, so it would defeat the purpose of the test in some runs. The query failure type is non-deterministic when the worker(s) shut down, which seems to be the main cause of flakiness.

I am open to other ideas

Copy link
Contributor

@rschlussel rschlussel left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if this is an error code that you see after a forced shutdown, it makes sense to mark it as retriable.

@ZacBlanco ZacBlanco merged commit 6c24d44 into prestodb:master Mar 27, 2025
93 checks passed
@prestodb-ci prestodb-ci mentioned this pull request Mar 28, 2025
30 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

from:IBM PR from IBM

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky TestBrutalShutdown.testQueryRetryOnShutdown

5 participants