[vllm] feat: retires vllm spmd mode in the codebase#4411

Merged: vermouth1992 merged 3 commits into verl-project:main from PeterSH6:gm/spmd on Dec 4, 2025

PeterSH6 (Collaborator) commented Dec 4, 2025

What does this PR do?

Retires the legacy SPMD rollout path and standardizes the codebase on async-only rollout for vLLM (SGLang in the next PR). All Python modules, docs, workflows, and examples now reference the async server mode exclusively; the sync/SPMD runners, helpers, and CI jobs have been removed.

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: N/A (internal task to delete SPMD support).
  • Format the PR title as [vllm, sglang, rollout, trainer, recipe, ci, doc] refactor: remove SPMD rollout

Test

Not run (SPMD suites deleted; async flow already covered by existing CI).

API and Usage Example

All configs/scripts must now use actor_rollout_ref.rollout.mode=async. Example:

python -m verl.trainer.main_ppo \
  ... \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.mode=async \
  ...

Design & Code Changes

  • Deleted verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py and the entire SGLang SPMD engine, leaving only async implementations. Updated BaseRollout registry, RolloutConfig, and main_ppo to error on mode=sync.
  • Removed SPMD-specific docs, tests (tests/workers/rollout/test_sglang_*, test_vllm_spmd, test_vllm_model_rope_scaling), and CI steps (.github/workflows/vllm.yml, sgl.yml). Simplified lint exclusions and helper scripts accordingly.
  • Cleaned recipes/examples to default rollout_mode=async and eliminated conditional sync branches (examples/**, recipe/**, e2e scripts). Added explicit validation in agent-loop utilities and SFT runner to reject non-async requests.
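
The "error on mode=sync" behavior listed above could look roughly like the following. This is a minimal sketch, not the code from this PR: `RolloutConfigSketch` and its two fields are hypothetical stand-ins for `RolloutConfig`, and the error wording is assumed.

```python
from dataclasses import dataclass


@dataclass
class RolloutConfigSketch:
    """Hypothetical stand-in for RolloutConfig; not the real verl class."""

    name: str = "vllm"
    mode: str = "async"

    def __post_init__(self) -> None:
        # Reject the retired sync/SPMD mode (and any other non-async value)
        # as soon as the config object is built.
        if self.mode != "async":
            raise ValueError(
                f"rollout.mode={self.mode!r} is not supported; "
                "set actor_rollout_ref.rollout.mode=async."
            )
```

Validating at construction time means a stale `mode=sync` override fails before any worker starts, rather than deep inside a rollout call.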

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add / Update the documentation.
  • Add unit or end-to-end test(s) to the CI workflow to cover all the code. If not feasible, explain why: Removed obsolete SPMD jobs; async coverage already exists.
  • Once your PR is ready for CI, notify the ci-request channel (or Feishu group).

gemini-code-assist bot (Contributor) left a comment

Code Review

This pull request is a significant and well-executed refactoring to retire the legacy SPMD rollout path for vLLM and standardize on the async-only rollout. The changes are comprehensive, removing obsolete code, tests, and configurations, while updating examples and enforcing the new standard by raising errors on usage of the deprecated sync mode. The codebase is now cleaner and more focused. I've identified one edge case in a helper function that was moved during the refactoring, which could lead to a crash.

PeterSH6 changed the title from "[vllm] feat: retires vllm spmd in the codebase" to "[vllm] feat: retires vllm spmd mode in the codebase" on Dec 4, 2025
vermouth1992 merged commit fd893c7 into verl-project:main on Dec 4, 2025
94 of 100 checks passed
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
### What does this PR do?

Retires the legacy SPMD rollout path and standardizes the codebase on
async-only rollout for vLLM (SGLang in the next PR). All Python modules,
docs, workflows, and examples now reference the async server mode
exclusively; the sync/SPMD runners, helpers, and CI jobs have been
removed.

### Checklist Before Starting

- [ ] Search for similar PRs. Paste at least one query link here: _N/A
(internal task to delete SPMD support)._
- [ ] Format the PR title as `[vllm, sglang, rollout, trainer, recipe,
ci, doc] refactor: remove SPMD rollout`

### Test

Not run (SPMD suites deleted; async flow already covered by existing
CI).

### API and Usage Example

All configs/scripts must now use `actor_rollout_ref.rollout.mode=async`.
Example:

```bash
python -m verl.trainer.main_ppo \
  ... \
  actor_rollout_ref.rollout.name=vllm \
  actor_rollout_ref.rollout.mode=async \
  ...
```

### Design & Code Changes

- Deleted `verl/workers/rollout/vllm_rollout/vllm_rollout_spmd.py` and
the entire SGLang SPMD engine, leaving only async implementations.
Updated `BaseRollout` registry, `RolloutConfig`, and `main_ppo` to error
on `mode=sync`.
- Removed SPMD-specific docs, tests
(`tests/workers/rollout/test_sglang_*`, `test_vllm_spmd`,
`test_vllm_model_rope_scaling`), and CI steps
(`.github/workflows/vllm.yml`, `sgl.yml`). Simplified lint exclusions
and helper scripts accordingly.
- Cleaned recipes/examples to default `rollout_mode=async` and
eliminated conditional sync branches (`examples/**`, `recipe/**`, e2e
scripts). Added explicit validation in agent-loop utilities and SFT
runner to reject non-async requests.
- Updated documentation (FSDP/Megatron worker guides, hybrid flow,
r1_ascend notes, FP8 guide) to describe async-only rollout and mention
removal of the old SPMD pathway.

### Checklist Before Submitting

- [ ] Read the [Contribute
Guide](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md).
- [ ] Apply [pre-commit
checks](https://github.com/volcengine/verl/blob/main/CONTRIBUTING.md#code-linting-and-formatting).
- [x] Add / Update [the
documentation](https://github.com/volcengine/verl/tree/main/docs).
- [x] Add unit or end-to-end test(s) to [the CI
workflow](https://github.com/volcengine/verl/tree/main/.github/workflows)
to cover all the code. If not feasible, explain why: _Removed obsolete
SPMD jobs; async coverage already exists._
- [ ] Once your PR is ready for CI, notify the `ci-request` channel (or
Feishu group).
yurekami added a commit to yurekami/verl that referenced this pull request Dec 29, 2025
…erl-project#4682)

The vLLMAsyncRollout.generate_sequences() method now provides a clear
error message explaining:
- Sync mode was retired in PR verl-project#4411
- Users should use async server interface (vLLMReplica, AsyncLLMServerManager)
- HFRollout can be used for synchronous generation

Also updated generation.yaml config to use async mode and document
the current limitation with main_generation.py workflow.

Fixes verl-project#4682

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
vermouth1992 pushed a commit that referenced this pull request Dec 30, 2025
…) (#4729)

## Summary

This PR contains two contributions:

### 1. Fix for Issue #4682 - Informative error message for
`generate_sequences`
- **Problem:** `vLLMAsyncRollout.generate_sequences()` raised a bare
`NotImplementedError`, leaving users confused when running generation
scripts
- **Root cause:** The vLLM SPMD (sync) mode was retired in PR #4411, but
the generation workflow (`main_generation.py`) still expects a
synchronous `generate_sequences()` method
- **Fix:** Added an informative error message explaining:
  - Sync mode was retired in PR #4411
  - Users should use the async server interface (`vLLMReplica`, `AsyncLLMServerManager`)
  - Alternative: use `HFRollout` for synchronous generation
  - Links to issue #4682 for details
- Also updated `generation.yaml` config comments to document the
limitation
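
As an illustration of the fix described above, the sketch below replaces a bare `NotImplementedError` with one that carries the same four points. The class `AsyncRolloutSketch` is a hypothetical stand-in for `vLLMAsyncRollout`, and the exact message text is assumed rather than copied from the merged change.

```python
class AsyncRolloutSketch:
    """Hypothetical stand-in for vLLMAsyncRollout; not the real verl class."""

    def generate_sequences(self, prompts):
        # Synchronous generation no longer exists here, so fail loudly and
        # point the caller at the alternatives named in the PR description.
        raise NotImplementedError(
            "generate_sequences() is not available: vLLM sync (SPMD) mode was "
            "retired in PR #4411. Use the async server interface "
            "(vLLMReplica, AsyncLLMServerManager), or HFRollout for "
            "synchronous generation. See issue #4682 for details."
        )
```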

### 2. Documentation improvement for Issue #1345 - Google-style
docstrings in `device.py`
Standardized all function docstrings in `verl/utils/device.py` to follow
Google-style documentation format:
- `is_torch_npu_available()`: Added detailed description and return type
- `get_visible_devices_keyword()`: Clarified purpose and return values
- `get_device_name()`: Improved description of supported devices
- `get_torch_device()`: Documented fallback behavior
- `get_device_id()`: Concise description with example
- `get_nccl_backend()`: Explained HCCL vs NCCL selection
- `set_expandable_segments()`: Added OOM context and Note section
- `auto_set_ascend_device_name()`: Documented NPU auto-configuration
- `get_device_capability()`: Added proper type hints and description
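
For readers unfamiliar with the format, a Google-style docstring has a one-line summary followed by `Args:` and `Returns:` sections. The function below is a hypothetical example written for this note, not a copy of anything in `verl/utils/device.py`; it takes the NPU flag as a parameter so that it runs without any accelerator libraries installed.

```python
def pick_collective_backend(npu_available: bool) -> str:
    """Select the collective communication backend for the current device.

    Args:
        npu_available: Whether an Ascend NPU is available to the process.

    Returns:
        "hccl" when an NPU is available, otherwise "nccl".
    """
    # HCCL is the collective library for Ascend NPUs; NCCL is used for GPUs.
    return "hccl" if npu_available else "nccl"
```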

## Test plan
- [x] Python syntax verification passed for all modified files
- [ ] CI tests should pass (no functional changes, only error messages
and docstrings)

Fixes #4682
Contributes to #1345

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: yurekami <yurekami@users.noreply.github.com>
Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
boren-ms pushed a commit to boren-ms/verl that referenced this pull request Dec 30, 2025
…strings (verl-project#1345) (verl-project#4729)

jsfanfanfan pushed a commit to meituan-search/verl that referenced this pull request Jan 9, 2026
…strings (verl-project#1345) (verl-project#4729)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…strings (verl-project#1345) (verl-project#4729)

sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026
sophiayyya pushed a commit to sophiayyya/verl that referenced this pull request Jan 25, 2026
…strings (verl-project#1345) (verl-project#4729)
