Conversation
Adds skyrl.utils.worker_setup.worker_setup_fn which sets the multiprocessing start method to 'spawn', and wires it into ray.init via the worker_process_setup_hook runtime_env key. Includes unit tests verifying the hook is applied in Ray workers. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The set_start_method call in main_base.py only affected the driver process. Now that worker_process_setup_hook handles this for all Ray workers, the driver-side call is redundant. Move the explanatory comment to worker_setup_fn where it belongs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tting The worker_process_setup_hook only runs in Ray workers, not the driver. Call worker_setup_fn() at module level in main_base (which main_generate also imports) so the driver process also uses spawn. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Code Review
This pull request introduces a worker_process_setup_hook to consistently set the multiprocessing start method to spawn for all Ray workers, which is a good practice to avoid issues with fork in a Ray environment. The changes are well-implemented, including a new utility function, its integration into ray.init, and corresponding unit tests. My feedback focuses on improving the robustness of the new tests to ensure they don't have side effects on other tests.
| def test_worker_setup_fn_sets_spawn(): | ||
| """Test that worker_setup_fn sets the mp start method to spawn.""" | ||
| # Reset to default first | ||
| multiprocessing.set_start_method("fork", force=True) | ||
| worker_setup_fn() | ||
| assert multiprocessing.get_start_method() == "spawn" | ||
|
|
||
|
|
||
| def test_worker_setup_fn_idempotent(): | ||
| """Test that calling worker_setup_fn twice doesn't raise.""" | ||
| multiprocessing.set_start_method("spawn", force=True) | ||
| worker_setup_fn() # should not raise | ||
| assert multiprocessing.get_start_method() == "spawn" |
There was a problem hiding this comment.
These tests modify the global multiprocessing start method but don't restore its original state. This can lead to side effects and flaky tests, as the state change persists for subsequent tests in the same process. To improve test isolation, it's best practice to save the original state before the test and restore it in a finally block. This also makes the tests more portable across different operating systems.
def test_worker_setup_fn_sets_spawn():
"""Test that worker_setup_fn sets the mp start method to spawn."""
if 'fork' not in multiprocessing.get_all_start_methods():
pytest.skip("fork start method not supported")
original_method = multiprocessing.get_start_method(allow_none=True)
try:
# Reset to a different state first
multiprocessing.set_start_method("fork", force=True)
worker_setup_fn()
assert multiprocessing.get_start_method() == "spawn"
finally:
if original_method:
multiprocessing.set_start_method(original_method, force=True)
def test_worker_setup_fn_idempotent():
"""Test that calling worker_setup_fn twice doesn't raise."""
original_method = multiprocessing.get_start_method(allow_none=True)
try:
multiprocessing.set_start_method("spawn", force=True)
worker_setup_fn() # should not raise
assert multiprocessing.get_start_method() == "spawn"
finally:
if original_method:
multiprocessing.set_start_method(original_method, force=True)…rt method to `spawn`" (#1344) Reverts #1333 Fixes #1342 and #1343 . It looks like we hit the same issue as ray-project/ray#61350 when dealing with worker process setup hook and vllm with the ray backend. The long term fix is actually in the ray repo - the bug has been fixed in ray-project/ray#61473 and we should be able to make use of the setup hook after upgrading to the next ray release. Until then, I've just reverted the changes and added `spawn` for the mp context for our dataloader I did a quick smoke test by running the gsm8k example and the script enters the first step successfully <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> --------- Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
…awn` (#1333) # What does this PR do? Adds `skyrl.utils.worker_setup.worker_setup_fn` which sets the multiprocessing start method to 'spawn', and wires it into `ray.init` via the `worker_process_setup_hook`runtime_env key. Includes unit tests verifying the hook is applied in Ray workers. We've previously seen many examples where ray <> fork interact in weird ways. There's code in the `skyrl_entrypoint` task (torch dataloader) as well as in other ray worker processes (ex: megatron workers) that rely on multiprocessing. Using worker setup hook provides us with a consistent way to handle this for all Ray worker processes. Test plan: - [x] CPU tests - [x] One GPU test that uses ray : `test_model_wrapper.py` - [x] New CPU tests for worker setup function <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1333" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> --------- Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
…rt method to `spawn`" (#1344) Reverts #1333 Fixes #1342 and #1343 . It looks like we hit the same issue as ray-project/ray#61350 when dealing with worker process setup hook and vllm with the ray backend. The long term fix is actually in the ray repo - the bug has been fixed in ray-project/ray#61473 and we should be able to make use of the setup hook after upgrading to the next ray release. Until then, I've just reverted the changes and added `spawn` for the mp context for our dataloader I did a quick smoke test by running the gsm8k example and the script enters the first step successfully <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> --------- Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
What does this PR do?
Adds
skyrl.utils.worker_setup.worker_setup_fnwhich sets the multiprocessing start method to 'spawn', and wires it intoray.initvia theworker_process_setup_hookruntime_env key. Includes unit tests verifying the hook is applied in Ray workers.We've previously seen many examples where ray <> fork interact in weird ways. There's code in the
skyrl_entrypointtask (torch dataloader) as well as in other ray worker processes (ex: megatron workers) that rely on multiprocessing. Using worker setup hook provides us with a consistent way to handle this for all Ray worker processes.Test plan:
test_model_wrapper.py