Skip to content

[train] Add worker_process_setup_hook to set mp start method to spawn#1333

Merged
SumanthRH merged 3 commits intomainfrom
spawn
Mar 17, 2026
Merged

[train] Add worker_process_setup_hook to set mp start method to spawn#1333
SumanthRH merged 3 commits intomainfrom
spawn

Conversation

@SumanthRH
Copy link
Copy Markdown
Member

@SumanthRH SumanthRH commented Mar 17, 2026

What does this PR do?

Adds skyrl.utils.worker_setup.worker_setup_fn which sets the multiprocessing start method to 'spawn', and wires it into ray.init via the worker_process_setup_hookruntime_env key. Includes unit tests verifying the hook is applied in Ray workers.

We've previously seen many examples where ray <> fork interact in weird ways. There's code in the skyrl_entrypoint task (torch dataloader) as well as in other ray worker processes (ex: megatron workers) that rely on multiprocessing. Using worker setup hook provides us with a consistent way to handle this for all Ray worker processes.

Test plan:

  • CPU tests
  • One GPU test that uses ray : test_model_wrapper.py
  • New CPU tests for worker setup function

Open with Devin

SumanthRH and others added 3 commits March 17, 2026 20:13
Adds skyrl.utils.worker_setup.worker_setup_fn which sets the
multiprocessing start method to 'spawn', and wires it into ray.init
via the worker_process_setup_hook runtime_env key. Includes unit tests
verifying the hook is applied in Ray workers.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
The set_start_method call in main_base.py only affected the driver process.
Now that worker_process_setup_hook handles this for all Ray workers, the
driver-side call is redundant. Move the explanatory comment to worker_setup_fn
where it belongs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…tting

The worker_process_setup_hook only runs in Ray workers, not the driver.
Call worker_setup_fn() at module level in main_base (which main_generate
also imports) so the driver process also uses spawn.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@SumanthRH SumanthRH marked this pull request as ready for review March 17, 2026 20:51
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces a worker_process_setup_hook to consistently set the multiprocessing start method to spawn for all Ray workers, which is a good practice to avoid issues with fork in a Ray environment. The changes are well-implemented, including a new utility function, its integration into ray.init, and corresponding unit tests. My feedback focuses on improving the robustness of the new tests to ensure they don't have side effects on other tests.

Comment on lines +23 to +35
def test_worker_setup_fn_sets_spawn():
"""Test that worker_setup_fn sets the mp start method to spawn."""
# Reset to default first
multiprocessing.set_start_method("fork", force=True)
worker_setup_fn()
assert multiprocessing.get_start_method() == "spawn"


def test_worker_setup_fn_idempotent():
"""Test that calling worker_setup_fn twice doesn't raise."""
multiprocessing.set_start_method("spawn", force=True)
worker_setup_fn() # should not raise
assert multiprocessing.get_start_method() == "spawn"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

These tests modify the global multiprocessing start method but don't restore its original state. This can lead to side effects and flaky tests, as the state change persists for subsequent tests in the same process. To improve test isolation, it's best practice to save the original state before the test and restore it in a finally block. This also makes the tests more portable across different operating systems.

def test_worker_setup_fn_sets_spawn():
    """Test that worker_setup_fn sets the mp start method to spawn."""
    if 'fork' not in multiprocessing.get_all_start_methods():
        pytest.skip("fork start method not supported")

    original_method = multiprocessing.get_start_method(allow_none=True)
    try:
        # Reset to a different state first
        multiprocessing.set_start_method("fork", force=True)
        worker_setup_fn()
        assert multiprocessing.get_start_method() == "spawn"
    finally:
        if original_method:
            multiprocessing.set_start_method(original_method, force=True)


def test_worker_setup_fn_idempotent():
    """Test that calling worker_setup_fn twice doesn't raise."""
    original_method = multiprocessing.get_start_method(allow_none=True)
    try:
        multiprocessing.set_start_method("spawn", force=True)
        worker_setup_fn()  # should not raise
        assert multiprocessing.get_start_method() == "spawn"
    finally:
        if original_method:
            multiprocessing.set_start_method(original_method, force=True)

Copy link
Copy Markdown
Contributor

@devin-ai-integration devin-ai-integration Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.

View in Devin Review to see 3 additional findings.

Open in Devin Review

@SumanthRH SumanthRH merged commit dffae95 into main Mar 17, 2026
6 checks passed
@SumanthRH SumanthRH deleted the spawn branch March 18, 2026 00:06
SumanthRH added a commit that referenced this pull request Mar 19, 2026
…rt method to `spawn`" (#1344)

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as
ray-project/ray#61350 when dealing with worker
process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed
in ray-project/ray#61473 and we should be able
to make use of the setup hook after upgrading to the next ray release.
Until then, I've just reverted the changes and added `spawn` for the mp
context for our dataloader

I did a quick smoke test by running the gsm8k example and the script
enters the first step successfully
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
…awn` (#1333)

# What does this PR do?

Adds `skyrl.utils.worker_setup.worker_setup_fn` which sets the
multiprocessing start method to 'spawn', and wires it into `ray.init`
via the `worker_process_setup_hook`runtime_env key. Includes unit tests
verifying the hook is applied in Ray workers.

We've previously seen many examples where ray <> fork interact in weird
ways. There's code in the `skyrl_entrypoint` task (torch dataloader) as
well as in other ray worker processes (ex: megatron workers) that rely
on multiprocessing. Using worker setup hook provides us with a
consistent way to handle this for all Ray worker processes.

Test plan:

- [x]  CPU tests
- [x] One GPU test that uses ray : `test_model_wrapper.py`
- [x] New CPU tests for worker setup function

<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1333"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
…rt method to `spawn`" (#1344)

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as
ray-project/ray#61350 when dealing with worker
process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed
in ray-project/ray#61473 and we should be able
to make use of the setup hook after upgrading to the next ray release.
Until then, I've just reverted the changes and added `spawn` for the mp
context for our dataloader

I did a quick smoke test by running the gsm8k example and the script
enters the first step successfully
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants