
[train][fullyAsync] Fix abort/pause broken after vllm 0.16.0 bump#1250

Merged
SumanthRH merged 1 commit into main from charlie/fix-pause-vllm0160
Mar 2, 2026

Conversation


@CharlieFRuan CharlieFRuan commented Mar 2, 2026

Summary

Root Cause

In vllm 0.16.0, InputProcessor.assign_request_id() now creates internal request IDs (with a random suffix) that are distinct from the user-provided external request IDs:

```python
request.external_req_id = request.request_id                          # save original as external
request.request_id = f"{request.external_req_id}-{random_uuid():.8}"  # new internal ID
```

Our code was reading request IDs from output_processor.request_states.keys() (which are now internal IDs) and passing them to engine.abort() with internal=False (the default). The abort looked them up in the external_req_ids mapping, found nothing, and silently did nothing. Requests completed normally with finish_reason="length" instead of "abort".

This broke fully async RL's pause/resume flow, which relies on abort returning partial outputs with finish_reason="abort" so the retry loop can re-submit with accumulated tokens.
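The failure mode above can be reproduced in miniature. The sketch below is a toy simulation (not real vllm code): the dict names mirror the vllm internals described here, but the `abort()` signature is simplified for illustration.

```python
# Toy simulation of the vllm 0.16.0 ID split. abort() with internal=False
# resolves the given ID through the external_req_ids mapping, so passing an
# internal ID (as our code did) silently matches nothing.
import uuid


def assign_request_id(external_id: str) -> tuple[str, str]:
    # vllm 0.16.0-style: internal ID = external ID + random 8-char suffix
    internal_id = f"{external_id}-{uuid.uuid4().hex[:8]}"
    return external_id, internal_id


external_req_ids: dict[str, str] = {}  # external -> internal (abort's lookup table)
request_states: dict[str, str] = {}    # internal -> state (what we were iterating)

ext, internal = assign_request_id("req-0")
external_req_ids[ext] = internal
request_states[internal] = "RUNNING"


def abort(req_id: str, internal_flag: bool = False) -> bool:
    """Return True iff a request was actually aborted."""
    key = req_id if internal_flag else external_req_ids.get(req_id)
    return request_states.pop(key, None) is not None


# Passing an internal ID with the default internal=False does nothing:
assert abort(internal) is False
# Passing the external ID resolves correctly and aborts the request:
assert abort(ext) is True
```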

Related vllm changes:

Fix

Add a _get_unfinished_request_ids() static method on BaseVLLMInferenceEngine that:

  • Uses output_processor.external_req_ids.keys() when available (vllm 0.16.0+)
  • Falls back to output_processor.request_states.keys() for older vllm versions

Applied to all three abort call sites:

  1. AsyncVLLMInferenceEngine.abort_generation() — used by fully async pause/resume
  2. AsyncVLLMInferenceEngine.sleep() — cleanup before sleep
  3. VLLMInferenceEngine.sleep() — sync engine cleanup before sleep
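The helper can be sketched as a small duck-typed version check. The attribute names follow the PR description; writing it as a free function (rather than the static method on `BaseVLLMInferenceEngine`) and distinguishing versions via `getattr` are assumptions made for this self-contained sketch.

```python
from types import SimpleNamespace


def get_unfinished_request_ids(output_processor) -> list[str]:
    """Return externally-visible IDs for all in-flight requests.

    Mirrors the _get_unfinished_request_ids() helper described above;
    the getattr-based version probe is an assumption of this sketch.
    """
    external = getattr(output_processor, "external_req_ids", None)
    if external is not None:
        # vllm >= 0.16.0: external -> internal mapping; abort() wants the keys
        return list(external.keys())
    # older vllm: request_states is keyed directly by user-provided IDs
    return list(output_processor.request_states.keys())


# vllm 0.16.0+ shape: request_states keyed by internal IDs, plus a mapping
new_proc = SimpleNamespace(
    external_req_ids={"req-0": "req-0-ab12cd34"},
    request_states={"req-0-ab12cd34": object()},
)
# pre-0.16.0 shape: request_states keyed by the user-provided IDs
old_proc = SimpleNamespace(request_states={"req-0": object()})

assert get_unfinished_request_ids(new_proc) == ["req-0"]
assert get_unfinished_request_ids(old_proc) == ["req-0"]
```

Either way the caller ends up with IDs that `engine.abort()` can resolve with its default `internal=False` lookup.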

Test plan

  • test_abort_generation_vllm_engine — passes (was failing with assert 'length' == 'abort')
  • test_continue_generation_vllm_engine_chat_completion — passes
  • test_continue_generation_generate_vllm_engine_generation — passes
  • E2E fully async gsm8k (gsm8k_fully_async_ci project) — ran ~12 training steps successfully with pause/resume working correctly

Light blue is the run after this fix (our nightly gsm8k fully async CI) https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_fully_async_ci



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request addresses an issue with abort_generation() and sleep() in the vLLM engine, which was caused by a change in request ID handling in vllm version 0.16.0. The fix introduces a _get_unfinished_request_ids helper method to correctly retrieve request IDs for aborting, ensuring backward compatibility. The changes are logical and well-contained. My feedback includes a minor suggestion to improve type hinting for the new helper method.


@devin-ai-integration devin-ai-integration bot left a comment


✅ Devin Review: No Issues Found

Devin Review analyzed this PR and found no potential bugs to report.


@SumanthRH SumanthRH merged commit c038635 into main Mar 2, 2026
6 of 7 checks passed
@CharlieFRuan CharlieFRuan deleted the charlie/fix-pause-vllm0160 branch March 2, 2026 23:58


Development

Successfully merging this pull request may close these issues.

[train] Fix pause and resume logic and generation tests after 0.16.0 upgrade
