[Data] Attempt to deflake test_backpressure_e2e
#59014
Signed-off-by: Balaji Veeramani <[email protected]>
Code Review
This pull request addresses a flaky test, test_backpressure_e2e, by removing an explicit del statement. This change is a good mitigation for the observed flakiness. The premature deletion of the dataset and iterator objects likely triggered their __del__ methods, causing the Ray Data executor to shut down while subsequent Ray API calls were still pending. This race condition appears to be the root cause of the core_worker has already been shutdown error. Removing the del statement correctly defers cleanup until the end of the function's scope, resolving the issue.
it = iter(ds.iter_batches(batch_size=None, prefetch_batches=0))
next(it)
time.sleep(3)
del it, ds
This explicit del statement can trigger garbage collection and the __del__ methods of the ds and it objects prematurely. This appears to cause a race condition where the Ray Data executor is shut down before the subsequent ray.get() call, leading to the flaky test failure. Removing this line is the correct approach, as it defers cleanup until the objects naturally go out of scope at the end of the function.
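The ordering described in this comment can be sketched without Ray. The snippet below is a minimal illustration, assuming CPython's reference counting; `FakeExecutor` and the `events` list are hypothetical stand-ins, not Ray Data classes. It only shows that an explicit `del` on the last reference runs `__del__` before later calls in the same function, while omitting it defers `__del__` until the function returns.

```python
events = []


class FakeExecutor:
    """Hypothetical stand-in for an object whose finalizer shuts something down."""

    def __init__(self, name):
        self.name = name

    def __del__(self):
        # Record when the finalizer runs relative to other calls.
        events.append(f"shutdown:{self.name}")


def with_explicit_del():
    ex = FakeExecutor("a")
    del ex  # Last reference dropped: under CPython, __del__ runs here.
    events.append("later_call:a")  # Runs after the shutdown.


def without_del():
    ex = FakeExecutor("b")
    events.append("later_call:b")  # Runs while ex is still alive.
    # ex is released when the function returns, so __del__ runs afterwards.


with_explicit_del()
without_del()
print(events)
```

Other Python implementations may run finalizers at different times, so this is only an illustration of the ordering concern, not a reproduction of the test failure.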
iamjustinhsu
left a comment
Why does del make it non-flaky? I feel like that's counterintuitive, since we are releasing resources.
@iamjustinhsu -- Zach recommended testing this out.
## Description

`test_backpressure_e2e` occasionally fails with an error like this:

```
[2025-11-26T17:33:36Z] PASSED[2025-11-26 17:27:35,172 E 550 12058] core_worker_process.cc:986: The core worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
```

This PR attempts to deflake it by removing an unnecessary `del`.

(Long-term, we should rewrite or remove this test. This PR is a mitigation.)

Signed-off-by: Balaji Veeramani <[email protected]>