Skip to content

Revert "[train] Add worker_process_setup_hook to set mp start method to spawn"#1344

Merged
SumanthRH merged 5 commits intomainfrom
revert-1333-spawn
Mar 19, 2026
Merged

Revert "[train] Add worker_process_setup_hook to set mp start method to spawn"#1344
SumanthRH merged 5 commits intomainfrom
revert-1333-spawn

Conversation

@SumanthRH
Copy link
Copy Markdown
Member

@SumanthRH SumanthRH commented Mar 18, 2026

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as ray-project/ray#61350 when dealing with worker process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed in ray-project/ray#61473 and we should be able to make use of the setup hook after upgrading to the next ray release. Until then, I've just reverted the changes and added spawn for the mp context for our dataloader

I did a quick smoke test by running the gsm8k example and the script enters the first step successfully


Open with Devin

SumanthRH and others added 3 commits March 18, 2026 16:40
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
x
Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
@SumanthRH SumanthRH marked this pull request as ready for review March 18, 2026 23:56
Copy link
Copy Markdown
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request reverts a previous change that introduced a worker_process_setup_hook to set the multiprocessing start method to spawn. This was causing issues with vllm and the Ray backend. The revert correctly removes the hook and applies a workaround by setting multiprocessing_context="spawn" for StatefulDataLoader and globally setting the start method to spawn in the main entrypoint. The changes are consistent with the goal of fixing the bug described. I have one suggestion to improve code consistency.

num_workers=0 if cfg.generator.inference_engine.enable_http_endpoint else 8,
drop_last=True if is_train else False,
generator=seeded_generator,
multiprocessing_context="spawn",
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

medium

For consistency with skyrl/train/utils/trainer_utils.py, it would be beneficial to add a comment explaining why multiprocessing_context is set to "spawn". This improves maintainability by ensuring that the reasoning for this important setting is clear wherever it is used.

        # NOTE (sumanthrh): We use ray and thus use `spawn` start method.
        # forking within ray leads to undefined behaviour and often causes hard to debug
        # memory leaks.  See: https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html
        multiprocessing_context="spawn",

devin-ai-integration[bot]

This comment was marked as resolved.

SumanthRH and others added 2 commits March 18, 2026 17:02
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@SumanthRH SumanthRH merged commit 860b5d6 into main Mar 19, 2026
5 of 6 checks passed
@erictang000 erictang000 deleted the revert-1333-spawn branch March 19, 2026 00:08
SumanthRH added a commit that referenced this pull request Mar 19, 2026
#1346)

# What does this PR do?

Fixes CI failure on main for the `SkyRL-Train-CPU` workflow:
https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938

After #1344 , we added `multiprocessing_context='spawn'` to the
`build_dataloader` function. It looks like there was one case where the
change here affected a test that was not affected by the usage of
`worker_process_startup_hook` previously. A CPU test
`test_dataloader_seeding` referenced a local dataset class in dataloader
map function. After switch to `spawn`, we need to ensure that the
dataset class is a global variable.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
SumanthRH added a commit that referenced this pull request Mar 20, 2026
#1346)

# What does this PR do?

Fixes CI failure on main for the `SkyRL-Train-CPU` workflow:
https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938

After #1344 , we added `multiprocessing_context='spawn'` to the
`build_dataloader` function. It looks like there was one case where the
change here affected a test that was not affected by the usage of
`worker_process_startup_hook` previously. A CPU test
`test_dataloader_seeding` referenced a local dataset class in dataloader
map function. After switch to `spawn`, we need to ensure that the
dataset class is a global variable.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
…rt method to `spawn`" (#1344)

Reverts #1333

Fixes #1342 and #1343 . It looks like we hit the same issue as
ray-project/ray#61350 when dealing with worker
process setup hook and vllm with the ray backend.

The long term fix is actually in the ray repo - the bug has been fixed
in ray-project/ray#61473 and we should be able
to make use of the setup hook after upgrading to the next ray release.
Until then, I've just reverted the changes and added `spawn` for the mp
context for our dataloader

I did a quick smoke test by running the gsm8k example and the script
enters the first step successfully
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
devpatelio pushed a commit that referenced this pull request Mar 20, 2026
#1346)

# What does this PR do?

Fixes CI failure on main for the `SkyRL-Train-CPU` workflow:
https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938

After #1344 , we added `multiprocessing_context='spawn'` to the
`build_dataloader` function. It looks like there was one case where the
change here affected a test that was not affected by the usage of
`worker_process_startup_hook` previously. A CPU test
`test_dataloader_seeding` referenced a local dataset class in dataloader
map function. After switch to `spawn`, we need to ensure that the
dataset class is a global variable.
<!-- devin-review-badge-begin -->

---

<a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346"
target="_blank">
  <picture>
<source media="(prefers-color-scheme: dark)"
srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1">
<img
src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1"
alt="Open with Devin">
  </picture>
</a>
<!-- devin-review-badge-end -->

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[train] vllm init with ray backend fails due to worker_process_setup_hook being set [BUG] __RAY_WORKER_PROCESS_SETUP_HOOK_ENV_VAR is not permitted

1 participant