fix(runner): pass request_id to model.preprocess() for per-request state by linyueqian · Pull Request #2746 · vllm-project/vllm-omni

linyueqian · 2026-04-13T16:39:21Z

Summary

OmniGPUModelRunner._preprocess() iterates over self.input_batch.req_ids but never passes the request ID to model.preprocess(). Models that maintain per-request state (e.g. VoxCPM2) fall back to a hardcoded "default" ID, so all concurrent requests share a single state object.

This one-line fix injects req_infos["request_id"] = req_id before the preprocess() call.
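
To illustrate the failure mode and the fix, here is a minimal, self-contained sketch. `ToyStatefulModel`, `RequestState`, and `run_preprocess` are hypothetical stand-ins, not the actual vllm-omni classes, and the real `model.preprocess()` signature may differ; only the `req_infos["request_id"] = req_id` line and the `"default"` fallback are taken from the description above.

```python
from dataclasses import dataclass


@dataclass
class RequestState:
    """Hypothetical per-request state; the real model's fields are not shown in this PR."""
    steps: int = 0


class ToyStatefulModel:
    """Stand-in for a model that keeps per-request state (e.g. VoxCPM2)."""

    def __init__(self):
        self.states = {}

    def preprocess(self, req_infos):
        # Mirrors the fallback described above: a missing ID maps to "default".
        req_id = req_infos.get("request_id", "default")
        state = self.states.setdefault(req_id, RequestState())
        state.steps += 1
        return req_id


def run_preprocess(model, req_ids, inject_request_id):
    """Simplified version of the runner loop over the batch's request IDs."""
    for req_id in req_ids:
        req_infos = {}  # assumption: the real runner fills this with per-request info
        if inject_request_id:
            req_infos["request_id"] = req_id  # the one-line fix
        model.preprocess(req_infos)
    return model.states


buggy = run_preprocess(ToyStatefulModel(), ["req-a", "req-b"], inject_request_id=False)
fixed = run_preprocess(ToyStatefulModel(), ["req-a", "req-b"], inject_request_id=True)

print(sorted(buggy))  # ['default']
print(sorted(fixed))  # ['req-a', 'req-b']
```

Without the injected ID, both requests land on the same `"default"` entry; with it, each request gets its own state object.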

Bugs fixed (found while testing #2690):

| Bug | Symptom | Root cause |
| --- | --- | --- |
| Stop logic failure | 2 concurrent requests produce ~58s audio for ~4s sentences | Shared state mixes stop signals; stop never cleanly triggers |
| Prefill shape mismatch | 4 concurrent requests crash with `RuntimeError: size mismatch` | Second `preprocess()` overwrites first's `prefill_masks`; `forward()` reads stale dimensions |
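
The prefill shape mismatch can be reproduced in miniature. The sketch below is one plausible way the overwrite leads to a size error, assuming masks are cached under the request ID; `ToyTalker` and `run_batch` are hypothetical, plain Python lists stand in for tensors, and the real `preprocess()`/`forward()` signatures are not shown in this PR.

```python
class ToyTalker:
    """Hypothetical model that caches a prefill mask per request ID."""

    def __init__(self):
        self.prefill_masks = {}

    def preprocess(self, req_infos, prompt_len):
        req_id = req_infos.get("request_id", "default")
        # One mask entry per prompt token; a later call with the same key overwrites it.
        self.prefill_masks[req_id] = [True] * prompt_len

    def forward(self, req_infos, hidden):
        req_id = req_infos.get("request_id", "default")
        mask = self.prefill_masks[req_id]
        if len(mask) != len(hidden):
            raise RuntimeError(
                f"size mismatch: mask {len(mask)} vs hidden {len(hidden)}"
            )
        return [h for h, keep in zip(hidden, mask) if keep]


def run_batch(batch, with_request_id):
    """Preprocess every request first, then run forward on each one."""
    model = ToyTalker()
    infos = {
        rid: ({"request_id": rid} if with_request_id else {})
        for rid in batch
    }
    for rid, prompt_len in batch.items():
        model.preprocess(infos[rid], prompt_len)
    return {
        rid: len(model.forward(infos[rid], [0.0] * prompt_len))
        for rid, prompt_len in batch.items()
    }


batch = {"req-a": 3, "req-b": 5}  # request ID -> prompt length
print(run_batch(batch, with_request_id=True))
```

Calling `run_batch(batch, with_request_id=False)` raises `RuntimeError` in this sketch, because the second `preprocess()` replaces the mask that the first request's `forward()` later reads.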

Known remaining issue (not addressed here): 4 concurrent requests hit msgspec.ValidationError: cannot unpack non-iterable NoneType object in the orchestrator IPC layer when requests finish at different times and mm_payload contains None audio entries. This is a separate orchestrator-level serialization bug.

Test plan

Tested on H20 (single GPU, enforce_eager=true):

Single request: RTF ~0.21, audio correct (unchanged)
2 concurrent requests: 2.72s + 5.28s audio (was 57s + 58s)
4 concurrent requests: prefill shape mismatch fixed, but blocked by orchestrator msgspec bug

OmniGPUModelRunner._preprocess() calls model.preprocess() per request but never passes the request_id. Models that maintain per-request state (e.g. VoxCPM2TalkerForConditionalGeneration) fall back to a hardcoded "default" id, causing all concurrent requests to share a single state.

This produces two bugs in batched inference:
- Stop logic failure: shared state mixes stop signals across requests, so requests never terminate (58s audio for 4s sentences)
- Prefill shape mismatch: second preprocess() overwrites first's masks, causing RuntimeError when forward() reads stale dimensions

Fix: inject req_id into req_infos before the preprocess() call.

Tested on H20 (single GPU, enforce_eager):
- 2 concurrent requests: audio duration 2.72s + 5.28s (was 57s + 58s)
- Single request: unchanged (RTF ~0.21)

Signed-off-by: Yueqian Lin <pandaleefree@gmail.com>

chatgpt-codex-connector · 2026-04-13T16:39:28Z

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits.
Credits must be used to enable repository wide code reviews.

linyueqian · 2026-04-13T17:19:21Z

Fix included directly in #2690 (commit 97c91a8). Closing this separate PR.

linyueqian requested a review from hsliuustc0106 as a code owner April 13, 2026 16:39

linyueqian mentioned this pull request Apr 13, 2026

[Perf]: Speedup VoxCPM2 TTS performance and Support PagedAttention #2690

Merged

linyueqian closed this Apr 13, 2026

linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:fix/preprocess-request-id
