Skip to content

Conversation

@bveeramani
Copy link
Member

@bveeramani bveeramani commented Nov 26, 2025

Description

test_backpressure_e2e occasionally fails with a bug like this:

[2025-11-26T17:33:36Z] PASSED[2025-11-26 17:27:35,172 E 550 12058] core_worker_process.cc:986: The core worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit

This PR attempt to deflake it by removing an unnecessary del

(Long-term, we should rewrite or remove this test. This PR is a mitigation)

Signed-off-by: Balaji Veeramani <[email protected]>
@bveeramani bveeramani requested a review from a team as a code owner November 26, 2025 20:51
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request addresses a flaky test, test_backpressure_e2e, by removing an explicit del statement. This change is a good mitigation for the observed flakiness. The premature deletion of the dataset and iterator objects likely triggered their __del__ methods, causing the Ray Data executor to shut down while subsequent Ray API calls were still pending. This race condition appears to be the root cause of the core_worker has already been shutdown error. Removing the del statement correctly defers cleanup until the end of the function's scope, resolving the issue.

it = iter(ds.iter_batches(batch_size=None, prefetch_batches=0))
next(it)
time.sleep(3)
del it, ds
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

This explicit del statement can trigger garbage collection and the __del__ methods of the ds and it objects prematurely. This appears to cause a race condition where the Ray Data executor is shut down before the subsequent ray.get() call, leading to the flaky test failure. Removing this line is the correct approach, as it defers cleanup until the objects naturally go out of scope at the end of the function.

Copy link
Contributor

@iamjustinhsu iamjustinhsu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why does del make it non-flakey? I feel like that's counter intuitive since we are releasing resources

@bveeramani bveeramani enabled auto-merge (squash) November 26, 2025 22:42
@github-actions github-actions bot added the go add ONLY when ready to merge, run all tests label Nov 26, 2025
@bveeramani
Copy link
Member Author

I... think this is happening because you delete it and ds while workers are mid execution.
Can you swap the lines where you get the result for launched and delete it and ds later or is that somehow important to this test?
Also I guess I'm hazy on how lifecycle works for the instrumentation in this test.

@iamjustinhsu -- Zach recommended testing this out.

@bveeramani bveeramani merged commit 30fe289 into master Nov 26, 2025
7 checks passed
@bveeramani bveeramani deleted the deflake-test-backpressure-e2e branch November 26, 2025 23:22
SheldonTsen pushed a commit to SheldonTsen/ray that referenced this pull request Dec 1, 2025
## Description

`test_backpressure_e2e` occasionally fails with a bug like this:
```
[2025-11-26T17:33:36Z] PASSED[2025-11-26 17:27:35,172 E 550 12058] core_worker_process.cc:986: The core worker has already been shutdown. This happens when the language frontend accesses the Ray's worker after it is shutdown. The process will exit
```

This PR attempt to deflake it by removing an unnecessary `del`

(Long-term, we should rewrite or remove this test. This PR is a
mitigation)

Signed-off-by: Balaji Veeramani <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

go add ONLY when ready to merge, run all tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants