[vllm] feat: retires vllm spmd mode in the codebase by PeterSH6 · Pull Request #4411 · verl-project/verl

PeterSH6 · 2025-12-04T08:05:31Z

What does this PR do?

Retires the legacy SPMD rollout path and standardizes the codebase on async-only rollout for vLLM (SGLang in the next PR). All Python modules, docs, workflows, and examples now reference the async server mode exclusively; the sync/SPMD runners, helpers, and CI jobs have been removed.

Checklist Before Starting

Search for similar PRs. Paste at least one query link here: N/A (internal task to delete SPMD support).
Format the PR title as [vllm, sglang, rollout, trainer, recipe, ci, doc] refactor: remove SPMD rollout

Test

Not run (SPMD suites deleted; async flow already covered by existing CI).

API and Usage Example

All configs/scripts must now use actor_rollout_ref.rollout.mode=async. Example:

python -m verl.trainer.main_ppo \
  ... \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.mode=async \
  ...
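When migrating launch scripts, it can help to check the override list before starting a job. The following is a minimal sketch, not code from the PR: `find_rollout_mode` and `check_overrides` are hypothetical helpers, and the only name taken from the description is the `actor_rollout_ref.rollout.mode` key.

```python
# Hypothetical pre-flight check for "key=value" style overrides.
# Only the key name below comes from the PR description.
ROLLOUT_MODE_KEY = "actor_rollout_ref.rollout.mode"


def find_rollout_mode(overrides):
    """Return the rollout mode found in the overrides, or None if absent."""
    mode = None
    for item in overrides:
        key, sep, value = item.partition("=")
        if sep and key.strip() == ROLLOUT_MODE_KEY:
            mode = value.strip()  # in this sketch, the last occurrence wins
    return mode


def check_overrides(overrides):
    """Classify an override list as 'ok', 'unsupported', or 'missing'."""
    mode = find_rollout_mode(overrides)
    if mode is None:
        return "missing"
    return "ok" if mode == "async" else "unsupported"


if __name__ == "__main__":
    args = [
        "actor_rollout_ref.rollout.name=vllm",
        "actor_rollout_ref.rollout.mode=async",
    ]
    print(check_overrides(args))
```

A script that still passes `mode=sync` would be reported as `unsupported` here; what the real entry point does in that case is described under Design & Code Changes.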

Design & Code Changes

Deleted verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py and the entire SGLang SPMD engine, leaving only async implementations. Updated BaseRollout registry, RolloutConfig, and main_ppo to error on mode=sync.
Removed SPMD-specific docs, tests (tests/workers/rollout/test_sglang_*, test_vllm_spmd, test_vllm_model_rope_scaling), and CI steps (.github/workflows/vllm.yml, sgl.yml). Simplified lint exclusions and helper scripts accordingly.
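Cleaned recipes/examples to default rollout_mode=async and eliminated conditional sync branches (examples/**, recipe/**, e2e scripts). Added explicit validation in agent-loop utilities and SFT runner to reject non-async requests.
Updated documentation (FSDP/Megatron worker guides, hybrid flow, r1_ascend notes, FP8 guide) to describe async-only rollout and mention removal of the old SPMD pathway.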
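The "error on mode=sync" behaviour described above could look roughly like the sketch below. This is an illustration under assumptions, not the PR's implementation: `RolloutConfigSketch`, `register_rollout`, `AsyncRolloutSketch`, and `build_rollout` are hypothetical names, and the real `RolloutConfig` and `BaseRollout` registry may be structured differently.

```python
from dataclasses import dataclass

# Assumption: after the PR, "async" is the only accepted rollout mode.
SUPPORTED_MODES = ("async",)


@dataclass
class RolloutConfigSketch:
    """Hypothetical stand-in for a rollout config with early validation."""

    name: str = "vllm"
    mode: str = "async"

    def __post_init__(self):
        # Fail at construction time instead of deep inside a worker.
        if self.mode not in SUPPORTED_MODES:
            raise ValueError(
                f"rollout mode {self.mode!r} is not supported; "
                "the SPMD (sync) path was removed, use mode='async'"
            )


# Hypothetical registry keyed by (engine name, mode).
_ROLLOUT_REGISTRY = {}


def register_rollout(name, mode):
    """Class decorator that records an implementation in the registry."""

    def decorator(cls):
        _ROLLOUT_REGISTRY[(name, mode)] = cls
        return cls

    return decorator


@register_rollout("vllm", "async")
class AsyncRolloutSketch:
    def __init__(self, config):
        self.config = config


def build_rollout(config):
    """Look up and construct the rollout class for a validated config."""
    try:
        cls = _ROLLOUT_REGISTRY[(config.name, config.mode)]
    except KeyError:
        raise ValueError(
            f"no rollout registered for {(config.name, config.mode)!r}"
        ) from None
    return cls(config)


if __name__ == "__main__":
    print(type(build_rollout(RolloutConfigSketch())).__name__)
```

Validating in the config object and again at registry lookup means a stale `mode=sync` setting is rejected with a clear message wherever it enters.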

Checklist Before Submitting

Read the Contribute Guide.
Apply pre-commit checks.
Add / Update the documentation.
Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: Removed obsolete SPMD jobs; async coverage already exists.
Once your PR is ready for CI, notify the ci-request channel (or Feishu group).

gemini-code-assist

Code Review

This pull request is a significant and well-executed refactoring to retire the legacy SPMD rollout path for vLLM and standardize on the async-only rollout. The changes are comprehensive, removing obsolete code, tests, and configurations, while updating examples and enforcing the new standard by raising errors on usage of the deprecated sync mode. The codebase is now cleaner and more focused. I've identified one edge case in a helper function that was moved during the refactoring, which could lead to a crash.

tests/workers/rollout/rollout_vllm/run_fsdp_vllm.py


…erl-project#4682)

The vLLMAsyncRollout.generate_sequences() method now provides a clear error message explaining:
- Sync mode was retired in PR verl-project#4411
- Users should use async server interface (vLLMReplica, AsyncLLMServerManager)
- HFRollout can be used for synchronous generation

Also updated generation.yaml config to use async mode and document the current limitation with main_generation.py workflow.

Fixes verl-project#4682

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
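The change described in this commit message, replacing a bare `NotImplementedError` with an explanatory one, can be sketched as follows. The class name `AsyncOnlyRolloutSketch` and the exact wording of the message are assumptions for illustration; only the three points in the message come from the commit text.

```python
class AsyncOnlyRolloutSketch:
    """Hypothetical async-only rollout that no longer supports sync calls."""

    def generate_sequences(self, prompts):
        # Tell the caller why the method is unavailable and what to use
        # instead, rather than raising a bare NotImplementedError.
        raise NotImplementedError(
            "generate_sequences() is not available: sync (SPMD) mode was "
            "retired in PR #4411. Use the async server interface "
            "(vLLMReplica, AsyncLLMServerManager), or HFRollout for "
            "synchronous generation. See issue #4682 for details."
        )


if __name__ == "__main__":
    try:
        AsyncOnlyRolloutSketch().generate_sequences(prompts=[])
    except NotImplementedError as exc:
        print(exc)
```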

…) (#4729)

## Summary

This PR contains two contributions:

### 1. Fix for Issue #4682 - Informative error message for `generate_sequences`

- **Problem:** `vLLMAsyncRollout.generate_sequences()` raised a bare `NotImplementedError`, leaving users confused when running generation scripts
- **Root cause:** The vLLM SPMD (sync) mode was retired in PR #4411, but the generation workflow (`main_generation.py`) still expects a synchronous `generate_sequences()` method
- **Fix:** Added an informative error message explaining:
  - Sync mode was retired in PR #4411
  - Users should use the async server interface (`vLLMReplica`, `AsyncLLMServerManager`)
  - Alternative: use `HFRollout` for synchronous generation
  - Links to issue #4682 for details
- Also updated `generation.yaml` config comments to document the limitation

### 2. Documentation improvement for Issue #1345 - Google-style docstrings in `device.py`

Standardized all function docstrings in `verl/utils/device.py` to follow Google-style documentation format:

- `is_torch_npu_available()`: Added detailed description and return type
- `get_visible_devices_keyword()`: Clarified purpose and return values
- `get_device_name()`: Improved description of supported devices
- `get_torch_device()`: Documented fallback behavior
- `get_device_id()`: Concise description with example
- `get_nccl_backend()`: Explained HCCL vs NCCL selection
- `set_expandable_segments()`: Added OOM context and Note section
- `auto_set_ascend_device_name()`: Documented NPU auto-configuration
- `get_device_capability()`: Added proper type hints and description

## Test plan

- [x] Python syntax verification passed for all modified files
- [ ] CI tests should pass (no functional changes, only error messages and docstrings)

Fixes #4682
Contributes to #1345

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
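For readers unfamiliar with the Google-style docstring format mentioned in the second contribution, here is a small sketch. `pick_collective_backend` is a hypothetical function written for this example; it is not the body or signature of `get_nccl_backend()` in `verl/utils/device.py`, and its `device_name` parameter is an assumption.

```python
def pick_collective_backend(device_name: str) -> str:
    """Select the collective communication backend for a device type.

    Args:
        device_name: Device type string, for example ``"cuda"`` or ``"npu"``.

    Returns:
        ``"hccl"`` when ``device_name`` is ``"npu"``, otherwise ``"nccl"``.

    Raises:
        ValueError: If ``device_name`` is empty.
    """
    if not device_name:
        raise ValueError("device_name must be a non-empty string")
    # In this sketch, NPU devices map to HCCL and everything else to NCCL.
    return "hccl" if device_name == "npu" else "nccl"


if __name__ == "__main__":
    print(pick_collective_backend("cuda"))
    print(pick_collective_backend("npu"))
```

The `Args`, `Returns`, and `Raises` sections are the parts of the format that documentation tooling typically parses.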


[vllm] feat: delete vllm spmd in the codebase

ca2e4bd

PeterSH6 requested review from FightingZhen, chenhaiq, eric-haibin-lin, ji-huazhong, tongyx361, vermouth1992 and wuxibin89 as code owners December 4, 2025 08:05

PeterSH6 added vllm related rollout labels Dec 4, 2025

gemini-code-assist bot reviewed Dec 4, 2025


tests/workers/rollout/rollout_vllm/run_fsdp_vllm.py (review thread resolved)

PeterSH6 changed the title from "[vllm] feat: retires vllm spmd in the codebase" to "[vllm] feat: retires vllm spmd mode in the codebase" Dec 4, 2025

PeterSH6 added 2 commits December 4, 2025 16:49

fix tests and scripts

61ba104

fix sft test

24a09cd

vermouth1992 approved these changes Dec 4, 2025


vermouth1992 merged commit fd893c7 into verl-project:main Dec 4, 2025
94 of 100 checks passed

PeterSH6 mentioned this pull request Dec 4, 2025

[sglang] feat: retires sglang spmd mode in the codebase #4422

Merged

7 tasks

moehanabi mentioned this pull request Dec 11, 2025

on-policy distillation with error "no running event loop" and "'NoneType' object has no attribute 'llm_engine'" #4464

Open

4 tasks

yurekami mentioned this pull request Dec 29, 2025

[rollout,docs] fix: improve error message (#4682) and docstrings (#1345) #4729

Merged

bisz9918-maker mentioned this pull request Jan 21, 2026

[Bug] Megatron backend async rollout fails with "RuntimeError: no running event loop" #5008

Closed

4 tasks
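Two of the linked issues quote the message "no running event loop". As general Python background, and not as a diagnosis of either issue, the sketch below shows where that message comes from: `asyncio.get_running_loop()` raises `RuntimeError` when called from code that is not executing inside an event loop. The helper names are hypothetical.

```python
import asyncio


def probe_running_loop():
    """Report whether the caller is executing inside a running event loop."""
    try:
        asyncio.get_running_loop()
    except RuntimeError as exc:
        # Raised when there is no running loop in the current thread.
        return str(exc)
    return "loop is running"


async def probe_inside_coroutine():
    # Called from within a coroutine, so a loop is running here.
    return probe_running_loop()


if __name__ == "__main__":
    print(probe_running_loop())                  # plain synchronous context
    print(asyncio.run(probe_inside_coroutine()))  # inside asyncio.run
```

Code written for a synchronous path that is later routed through async-only components can hit this boundary; whether that applies to the issues above is not established by this page.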

vermouth1992 merged 3 commits into verl-project:main from PeterSH6:gm/spmd