[train][fullyAsync] Fix abort/pause broken after vllm 0.16.0 bump#1250
Merged
In vllm 0.16.0, InputProcessor.assign_request_id() now creates internal request IDs (with random suffix) separate from the user-provided external request IDs. The output_processor.request_states dict is keyed by these internal IDs, but engine.abort() with internal=False (default) looks them up in the external_req_ids mapping. Since internal IDs don't exist as external keys, the abort silently did nothing and requests completed normally with finish_reason="length" instead of "abort". Add _get_unfinished_request_ids() helper that uses external_req_ids when available (vllm 0.16.0+) and falls back to request_states keys for older versions. Apply to abort_generation() and both sync/async sleep() methods. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
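The failure mode described above can be reproduced in miniature. This is a toy sketch, not vllm's real classes; only the attribute and parameter names (`request_states`, `external_req_ids`, `internal=False`) mirror the behavior the commit message describes.

```python
# Toy reproduction of the silent-abort bug; ToyEngine is a stand-in for the
# vllm engine/output-processor pair, everything else is illustrative.
import secrets

class ToyEngine:
    def __init__(self):
        self.request_states = {}    # keyed by INTERNAL IDs in vllm 0.16.0+
        self.external_req_ids = {}  # external ID -> internal ID

    def add_request(self, external_id):
        internal_id = f"{external_id}-{secrets.token_hex(4)}"  # random suffix
        self.request_states[internal_id] = "RUNNING"
        self.external_req_ids[external_id] = internal_id

    def abort(self, request_ids, internal=False):
        aborted = []
        for rid in request_ids:
            if not internal:
                rid = self.external_req_ids.get(rid)  # external -> internal
                if rid is None:
                    continue  # unknown ID silently skipped: the bug path
            if self.request_states.get(rid) == "RUNNING":
                self.request_states[rid] = "ABORTED"
                aborted.append(rid)
        return aborted

engine = ToyEngine()
engine.add_request("req-0")

# Buggy call: internal IDs passed where external ones are expected -> no-op.
no_op = engine.abort(list(engine.request_states))
# Fixed call: external IDs resolve through the mapping and the abort lands.
fixed = engine.abort(list(engine.external_req_ids))
```

Passing the `request_states` keys with `internal=False` aborts nothing, while passing the external IDs aborts the request, which is exactly the mismatch the fix resolves.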
Code Review
This pull request addresses an issue with abort_generation() and sleep() in the vLLM engine, which was caused by a change in request ID handling in vllm version 0.16.0. The fix introduces a _get_unfinished_request_ids helper method to correctly retrieve request IDs for aborting, ensuring backward compatibility. The changes are logical and well-contained. My feedback includes a minor suggestion to improve type hinting for the new helper method.
SumanthRH approved these changes Mar 2, 2026
## Summary

- Fix `abort_generation()` and `sleep()` abort logic that broke silently after the vllm 0.16.0 bump (Bump vLLM to 0.16.0 with required dep updates #1240)
- Add `_get_unfinished_request_ids()` helper to resolve internal vs external request ID mismatch

## Root Cause
In vllm 0.16.0, `InputProcessor.assign_request_id()` now creates internal request IDs (with a random suffix) that are distinct from the user-provided external request IDs.

Our code was reading request IDs from `output_processor.request_states.keys()` (which are now internal IDs) and passing them to `engine.abort()` with `internal=False` (the default). The abort looked them up in the `external_req_ids` mapping, found nothing, and silently did nothing. Requests completed normally with `finish_reason="length"` instead of `"abort"`.

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with `finish_reason="abort"` so the retry loop can re-submit with accumulated tokens.

Related vllm changes:
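The pause/resume dependency can be illustrated with a toy retry loop. Everything here is hypothetical scaffolding; only the `finish_reason` values (`"abort"`, `"length"`) come from the PR.

```python
# Toy illustration of why pause/resume needs finish_reason == "abort":
# an aborted request returns its partial tokens, and the retry loop feeds
# them back in so no generated progress is lost across a pause.
from dataclasses import dataclass

@dataclass
class Output:
    token_ids: list
    finish_reason: str

class FakeEngine:
    """Replays a script of (tokens, finish_reason) pairs, one per call."""
    def __init__(self, script):
        self.script = list(script)

    def generate(self, tokens, max_new_tokens):
        new_tokens, reason = self.script.pop(0)
        return Output(new_tokens, reason)

def generate_with_resume(engine, prompt, max_new_tokens):
    accumulated = []
    while True:
        out = engine.generate(prompt + accumulated,
                              max_new_tokens - len(accumulated))
        accumulated += out.token_ids
        if out.finish_reason != "abort":       # "stop"/"length": truly done
            return accumulated, out.finish_reason
        # "abort": engine was paused; loop re-submits with partial progress

# First call is aborted mid-generation (pause); the retry resumes from there.
engine = FakeEngine([([1, 2, 3], "abort"), ([4, 5], "length")])
tokens, reason = generate_with_resume(engine, [0], 8)
```

When the abort silently no-ops, the engine instead runs to completion and reports `"length"`, so the loop above never gets the chance to resume with accumulated tokens.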
## Fix

Add a `_get_unfinished_request_ids()` static method on `BaseVLLMInferenceEngine` that:

- uses `output_processor.external_req_ids.keys()` when available (vllm 0.16.0+)
- falls back to `output_processor.request_states.keys()` for older vllm versions

Applied to all three abort call sites:
- `AsyncVLLMInferenceEngine.abort_generation()` — used by fully async pause/resume
- `AsyncVLLMInferenceEngine.sleep()` — cleanup before sleep
- `VLLMInferenceEngine.sleep()` — sync engine cleanup before sleep

## Test plan
- `test_abort_generation_vllm_engine` — passes (was failing with `assert 'length' == 'abort'`)
- `test_continue_generation_vllm_engine_chat_completion` — passes
- `test_continue_generation_generate_vllm_engine_generation` — passes
- Fully async run (`gsm8k_fully_async_ci` project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI): https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci