
fix(runner): pass request_id to model.preprocess() for per-request state #2746

Closed

linyueqian wants to merge 1 commit into vllm-project:main from linyueqian:fix/preprocess-request-id


Conversation

@linyueqian
Collaborator

Summary

  • `OmniGPUModelRunner._preprocess()` iterates over `self.input_batch.req_ids` but never passes the request ID to `model.preprocess()`. Models that maintain per-request state (e.g. VoxCPM2) fall back to a hardcoded `"default"` ID, so all concurrent requests share a single state object.

This one-line fix injects `req_infos["request_id"] = req_id` before the `preprocess()` call.
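For context, here is a minimal sketch of where that assignment lands. Only the loop over `self.input_batch.req_ids`, the `req_infos` dict, and the `model.preprocess()` call are taken from this PR; the `_build_req_infos` helper and the exact call signature are placeholders, not vLLM's actual code:

```python
# Sketch only: the real _preprocess() does more per-request work than this.
def _preprocess(self):
    for req_id in self.input_batch.req_ids:
        req_infos = self._build_req_infos(req_id)  # placeholder for existing setup
        req_infos["request_id"] = req_id  # the fix: key per-request state by req_id
        self.model.preprocess(req_infos)
```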

Bugs fixed (found while testing #2690):

| Bug | Symptom | Root cause |
| --- | --- | --- |
| Stop logic failure | 2 concurrent requests produce ~58s audio for ~4s sentences | Shared state mixes stop signals; stop never cleanly triggers |
| Prefill shape mismatch | 4 concurrent requests crash with `RuntimeError: size mismatch` | Second `preprocess()` overwrites first's `prefill_masks`; `forward()` reads stale dimensions |
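
Both bugs share one mechanism: without a request ID, the model keys its per-request state on the literal string `"default"`, so concurrent requests read and write the same object. A toy sketch of that fallback (names and state fields are assumptions, not VoxCPM2's actual code):

```python
states = {}

def preprocess(req_infos):
    # Fallback before the fix: a missing request_id collapses to one shared key.
    req_id = req_infos.get("request_id", "default")
    return states.setdefault(req_id, {"stop": False, "prefill_masks": None})

a = preprocess({})  # request 1, no request_id
b = preprocess({})  # request 2, no request_id
assert a is b       # same object: stop flags and masks are shared

c = preprocess({"request_id": "req-1"})  # with the fix: distinct state
d = preprocess({"request_id": "req-2"})
assert c is not d
```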

Known remaining issue (not addressed here): 4 concurrent requests hit `msgspec.ValidationError: cannot unpack non-iterable NoneType object` in the orchestrator IPC layer when requests finish at different times and `mm_payload` contains None audio entries. This is a separate orchestrator-level serialization bug.
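
A stripped-down sketch of that failure pattern (the payload shape is an assumption; in the real system the error surfaces via msgspec during IPC decoding rather than as a bare `TypeError`):

```python
# Toy only: illustrates the unpack-on-None pattern, not the actual IPC types.
mm_payload = {"audio": None}  # request finished early; audio entry left as None
try:
    samples, sample_rate = mm_payload["audio"]
except TypeError as e:
    print(e)  # cannot unpack non-iterable NoneType object
```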

Test plan

Tested on H20 (single GPU, enforce_eager=true):

  • Single request: RTF ~0.21, audio correct (unchanged)
  • 2 concurrent requests: 2.72s + 5.28s audio (was 57s + 58s)
  • 4 concurrent requests: prefill shape mismatch fixed, but blocked by orchestrator msgspec bug
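
The PR does not include the test script, but a concurrency check along these lines would exercise the fix (endpoint, model name, and payload shape are assumptions):

```python
import asyncio
import aiohttp

async def synthesize(session, text):
    # Endpoint and payload are placeholders; the actual server setup
    # used for testing is not part of this PR.
    async with session.post(
        "http://localhost:8000/v1/audio/speech",
        json={"model": "voxcpm2", "input": text},
    ) as resp:
        return await resp.read()

async def main():
    texts = ["A short test sentence.", "Another short test sentence."]
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(synthesize(session, t) for t in texts))
    for text, audio in zip(texts, results):
        print(f"{text!r} -> {len(audio)} bytes")

asyncio.run(main())
```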

OmniGPUModelRunner._preprocess() calls model.preprocess() per request
but never passes the request_id. Models that maintain per-request state
(e.g. VoxCPM2TalkerForConditionalGeneration) fall back to a hardcoded
"default" id, causing all concurrent requests to share a single state.

This produces two bugs in batched inference:
- Stop logic failure: shared state mixes stop signals across requests,
  so requests never terminate (58s audio for 4s sentences)
- Prefill shape mismatch: second preprocess() overwrites first's masks,
  causing RuntimeError when forward() reads stale dimensions

Fix: inject req_id into req_infos before the preprocess() call.

Tested on H20 (single GPU, enforce_eager):
- 2 concurrent requests: audio duration 2.72s + 5.28s (was 57s + 58s)
- Single request: unchanged (RTF ~0.21)

Signed-off-by: Yueqian Lin <pandaleefree@gmail.com>
@chatgpt-codex-connector

Codex usage limits have been reached for code reviews. Please check with the admins of this repo to increase the limits by adding credits. Credits must be used to enable repository-wide code reviews.

@linyueqian
Collaborator Author

Fix included directly in #2690 (commit 97c91a8). Closing this separate PR.

linyueqian closed this Apr 13, 2026