[vllm] feat: retires vllm spmd mode in the codebase #4411
Merged
vermouth1992 merged 3 commits into verl-project:main on Dec 4, 2025
Code Review
This pull request is a significant and well-executed refactoring to retire the legacy SPMD rollout path for vLLM and standardize on the async-only rollout. The changes are comprehensive, removing obsolete code, tests, and configurations, while updating examples and enforcing the new standard by raising errors on usage of the deprecated sync mode. The codebase is now cleaner and more focused. I've identified one edge case in a helper function that was moved during the refactoring, which could lead to a crash.
vermouth1992 approved these changes on Dec 4, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request on Dec 20, 2025

### What does this PR do?

Retires the legacy SPMD rollout path and standardizes the codebase on async-only rollout for vLLM (SGLang in the next PR). All Python modules, docs, workflows, and examples now reference the async server mode exclusively; the sync/SPMD runners, helpers, and CI jobs have been removed.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: _N/A (internal task to delete SPMD support)._
- [ ] Format the PR title as `[vllm, sglang, rollout, trainer, recipe, ci, doc] refactor: remove SPMD rollout`

### Test

Not run (SPMD suites deleted; async flow already covered by existing CI).

### API and Usage Example

All configs/scripts must now use `actor_rollout_ref.rollout.mode=async`. Example:

```bash
python -m verl.trainer.main_ppo \
    ... \
    actor_rollout_ref.rollout.name=vllm \
    actor_rollout_ref.rollout.mode=async \
    ...
```

### Design & Code Changes

- Deleted `verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py` and the entire SGLang SPMD engine, leaving only async implementations. Updated `BaseRollout` registry, `RolloutConfig`, and `main_ppo` to error on `mode=sync`.
- Removed SPMD-specific docs, tests (`tests/workers/rollout/test_sglang_*`, `test_vllm_spmd`, `test_vllm_model_rope_scaling`), and CI steps (`.github/workflows/vllm.yml`, `sgl.yml`). Simplified lint exclusions and helper scripts accordingly.
- Cleaned recipes/examples to default `rollout_mode=async` and eliminated conditional sync branches (`examples/**`, `recipe/**`, e2e scripts). Added explicit validation in agent-loop utilities and SFT runner to reject non-async requests.
- Updated documentation (FSDP/Megatron worker guides, hybrid flow, r1_ascend notes, FP8 guide) to describe async-only rollout and mention removal of the old SPMD pathway.

### Checklist Before Submitting

- [ ] Read the [Contribute Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting).
- [x] Add / Update [the documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI workflow](https://github.com/volcengine/verl/tree/main/.github/workflows) to cover all the code. If not feasible, explain why: _Removed obsolete SPMD jobs; async coverage already exists._
- [ ] Once your PR is ready for CI, notify the `ci-request` channel (or Feishu group).
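The rule that every config must set `actor_rollout_ref.rollout.mode=async`, and that `mode=sync` now errors, could be sketched roughly as below. This is only an illustration: the helper name `validate_rollout_mode` and the error wording are hypothetical and are not taken from the verl codebase, where the actual checks live in `RolloutConfig` and `main_ppo`.

```python
def validate_rollout_mode(mode: str) -> str:
    """Return the mode unchanged if it is "async"; otherwise raise.

    Hypothetical helper for illustration only; not verl's actual API.
    """
    if mode != "async":
        # Any non-async value (including the retired "sync") is rejected.
        raise ValueError(
            f"Unsupported rollout mode {mode!r}: the SPMD (sync) path was "
            "retired; set actor_rollout_ref.rollout.mode=async."
        )
    return mode


if __name__ == "__main__":
    validate_rollout_mode("async")  # accepted
    try:
        validate_rollout_mode("sync")
    except ValueError as err:
        print(f"rejected: {err}")
```

Failing at config-validation time, rather than deep inside a worker, is one plausible way to surface the removal early to users with older scripts.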
yurekami added a commit to yurekami/verl that referenced this pull request on Dec 29, 2025

…erl-project#4682)

The vLLMAsyncRollout.generate_sequences() method now provides a clear error message explaining:

- Sync mode was retired in PR verl-project#4411
- Users should use async server interface (vLLMReplica, AsyncLLMServerManager)
- HFRollout can be used for synchronous generation

Also updated generation.yaml config to use async mode and document the current limitation with main_generation.py workflow.

Fixes verl-project#4682

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
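As a rough sketch of what such an informative error might look like, the class below raises `NotImplementedError` with guidance instead of a bare exception. The class name `AsyncOnlyRolloutSketch` and the exact message text are assumptions made for illustration; they are not the actual `vLLMAsyncRollout` implementation.

```python
class AsyncOnlyRolloutSketch:
    """Hypothetical stand-in for an async-only rollout class (not verl code)."""

    def generate_sequences(self, prompts):
        # The synchronous entry point remains only to fail loudly with guidance.
        raise NotImplementedError(
            "generate_sequences() is not supported: sync (SPMD) mode was "
            "retired in PR #4411. Use the async server interface "
            "(vLLMReplica, AsyncLLMServerManager), or HFRollout for "
            "synchronous generation. See issue #4682 for details."
        )


if __name__ == "__main__":
    try:
        AsyncOnlyRolloutSketch().generate_sequences(prompts=None)
    except NotImplementedError as err:
        print(err)
```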
vermouth1992 pushed a commit that referenced this pull request on Dec 30, 2025

…) (#4729)

## Summary

This PR contains two contributions:

### 1. Fix for Issue #4682 - Informative error message for `generate_sequences`

- **Problem:** `vLLMAsyncRollout.generate_sequences()` raised a bare `NotImplementedError`, leaving users confused when running generation scripts
- **Root cause:** The vLLM SPMD (sync) mode was retired in PR #4411, but the generation workflow (`main_generation.py`) still expects a synchronous `generate_sequences()` method
- **Fix:** Added an informative error message explaining:
  - Sync mode was retired in PR #4411
  - Users should use the async server interface (`vLLMReplica`, `AsyncLLMServerManager`)
  - Alternative: use `HFRollout` for synchronous generation
  - Links to issue #4682 for details
- Also updated `generation.yaml` config comments to document the limitation

### 2. Documentation improvement for Issue #1345 - Google-style docstrings in `device.py`

Standardized all function docstrings in `verl/utils/device.py` to follow Google-style documentation format:

- `is_torch_npu_available()`: Added detailed description and return type
- `get_visible_devices_keyword()`: Clarified purpose and return values
- `get_device_name()`: Improved description of supported devices
- `get_torch_device()`: Documented fallback behavior
- `get_device_id()`: Concise description with example
- `get_nccl_backend()`: Explained HCCL vs NCCL selection
- `set_expandable_segments()`: Added OOM context and Note section
- `auto_set_ascend_device_name()`: Documented NPU auto-configuration
- `get_device_capability()`: Added proper type hints and description

## Test plan

- [x] Python syntax verification passed for all modified files
- [ ] CI tests should pass (no functional changes, only error messages and docstrings)

Fixes #4682
Contributes to #1345

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
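For readers unfamiliar with the Google docstring layout mentioned above, here is a minimal sketch. The function `choose_collective_backend` and its parameter are hypothetical and exist only to show the `Args:` / `Returns:` sections; this is not the body of `get_nccl_backend()` in `verl/utils/device.py`.

```python
def choose_collective_backend(npu_available: bool) -> str:
    """Select the collective communication backend name.

    Hypothetical example of the Google docstring layout; it is not the
    implementation found in ``verl/utils/device.py``.

    Args:
        npu_available: Whether an Ascend NPU is usable on this host.

    Returns:
        ``"hccl"`` when an NPU is available, otherwise ``"nccl"``.
    """
    return "hccl" if npu_available else "nccl"


if __name__ == "__main__":
    print(choose_collective_backend(npu_available=False))
```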
boren-ms pushed a commit to boren-ms/verl that referenced this pull request on Dec 30, 2025
jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request on Jan 9, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request on Jan 22, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request on Jan 22, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request on Jan 25, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request on Jan 25, 2026