Revert "[train] Add worker_process_setup_hook to set mp start method to spawn"#1344
Revert "[train] Add worker_process_setup_hook to set mp start method to spawn"#1344
worker_process_setup_hook to set mp start method to spawn"#1344Conversation
There was a problem hiding this comment.
Code Review
This pull request reverts a previous change that introduced a worker_process_setup_hook to set the multiprocessing start method to spawn. This was causing issues with vllm and the Ray backend. The revert correctly removes the hook and applies a workaround by setting multiprocessing_context="spawn" for StatefulDataLoader and globally setting the start method to spawn in the main entrypoint. The changes are consistent with the goal of fixing the bug described. I have one suggestion to improve code consistency.
| num_workers=0 if cfg.generator.inference_engine.enable_http_endpoint else 8, | ||
| drop_last=True if is_train else False, | ||
| generator=seeded_generator, | ||
| multiprocessing_context="spawn", |
There was a problem hiding this comment.
For consistency with skyrl/train/utils/trainer_utils.py, it would be beneficial to add a comment explaining why multiprocessing_context is set to "spawn". This improves maintainability by ensuring that the reasoning for this important setting is clear wherever it is used.
# NOTE (sumanthrh): We use ray and thus use `spawn` start method.
# forking within ray leads to undefined behaviour and often causes hard to debug
# memory leaks. See: https://docs.ray.io/en/latest/ray-core/patterns/fork-new-processes.html
multiprocessing_context="spawn",Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
#1346) # What does this PR do? Fixes CI failure on main for the `SkyRL-Train-CPU` workflow: https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938 After #1344 , we added `multiprocessing_context='spawn'` to the `build_dataloader` function. It looks like there was one case where the change here affected a test that was not affected by the usage of `worker_process_startup_hook` previously. A CPU test `test_dataloader_seeding` referenced a local dataset class in dataloader map function. After switch to `spawn`, we need to ensure that the dataset class is a global variable. <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
#1346) # What does this PR do? Fixes CI failure on main for the `SkyRL-Train-CPU` workflow: https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938 After #1344 , we added `multiprocessing_context='spawn'` to the `build_dataloader` function. It looks like there was one case where the change here affected a test that was not affected by the usage of `worker_process_startup_hook` previously. A CPU test `test_dataloader_seeding` referenced a local dataset class in dataloader map function. After switch to `spawn`, we need to ensure that the dataset class is a global variable. <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
…rt method to `spawn`" (#1344) Reverts #1333 Fixes #1342 and #1343 . It looks like we hit the same issue as ray-project/ray#61350 when dealing with worker process setup hook and vllm with the ray backend. The long term fix is actually in the ray repo - the bug has been fixed in ray-project/ray#61473 and we should be able to make use of the setup hook after upgrading to the next ray release. Until then, I've just reverted the changes and added `spawn` for the mp context for our dataloader I did a quick smoke test by running the gsm8k example and the script enters the first step successfully <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1344" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> --------- Signed-off-by: SumanthRH <sumanthrh99@gmail.com> Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
#1346) # What does this PR do? Fixes CI failure on main for the `SkyRL-Train-CPU` workflow: https://github.com/NovaSky-AI/SkyRL/actions/runs/23273262330/job/67670625938 After #1344 , we added `multiprocessing_context='spawn'` to the `build_dataloader` function. It looks like there was one case where the change here affected a test that was not affected by the usage of `worker_process_startup_hook` previously. A CPU test `test_dataloader_seeding` referenced a local dataset class in dataloader map function. After switch to `spawn`, we need to ensure that the dataset class is a global variable. <!-- devin-review-badge-begin --> --- <a href="https://app.devin.ai/review/novasky-ai/skyrl/pull/1346" target="_blank"> <picture> <source media="(prefers-color-scheme: dark)" srcset="https://static.devin.ai/assets/gh-open-in-devin-review-dark.svg?v=1"> <img src="https://static.devin.ai/assets/gh-open-in-devin-review-light.svg?v=1" alt="Open with Devin"> </picture> </a> <!-- devin-review-badge-end --> Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Reverts #1333
Fixes #1342 and #1343 . It looks like we hit the same issue as ray-project/ray#61350 when dealing with worker process setup hook and vllm with the ray backend.
The long term fix is actually in the ray repo - the bug has been fixed in ray-project/ray#61473 and we should be able to make use of the setup hook after upgrading to the next ray release. Until then, I've just reverted the changes and added
spawnfor the mp context for our dataloaderI did a quick smoke test by running the gsm8k example and the script enters the first step successfully