[Bugfix][VoxCPM2]: Fix vectorized_gather OOB under concurrent prefill+decode batches (#2903)
Conversation
Root cause of the c>=2 vectorized_gather OOB / illegal memory access reported on VoxCPM2 after PR vllm-project#2803: preprocess() used _pending_requests (per-step prefix, cleared each forward) as if it were the full active batch. When a new prefill was scheduled after cached decode requests in the same batch, the decode requests were wrongly classified as stale and removed from _active_states; the next forward then recreated empty states, silently skipped residual_model for them, and desynchronized attn metadata -- producing NaN on the prefill slice and eventually crashing inside the CFM/LocDiT kernels. State cleanup is now driven solely by on_requests_finished -> _flush_deferred_cleanup at the end of forward(). A one-shot leak warning in _get_or_create_state guards against a future regression. Signed-off-by: Sy03 <1370724210@qq.com>
self._results_queue: list[tuple[str, torch.Tensor | None]] = []
self._audio_queue: list[tuple[str, Any]] = []
self._deferred_cleanup_ids: set[str] = set()
self._active_state_warn_threshold = max(512, 4 * self._max_batch_size)
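The deferred-cleanup contract the commit message describes can be sketched roughly like this. This is a minimal standalone sketch, not the actual `voxcpm2_talker.py` implementation: `TalkerStateTracker` and the dict-based per-request states are simplifications introduced here for illustration.

```python
import logging

logger = logging.getLogger(__name__)

# Floor for the one-shot leak warning (mirrors the PR's module-level constant).
_ACTIVE_STATE_LEAK_WARN_MIN = 512


class TalkerStateTracker:
    """Simplified sketch of the deferred-cleanup contract.

    Eviction is never decided inside preprocess(); finished ids arrive via
    on_requests_finished() and are flushed once at the end of forward().
    """

    def __init__(self, max_batch_size: int) -> None:
        self._active_states: dict[str, dict] = {}
        self._deferred_cleanup_ids: set[str] = set()
        self._warned_leak = False
        self._warn_threshold = max(_ACTIVE_STATE_LEAK_WARN_MIN, 4 * max_batch_size)

    def _get_or_create_state(self, req_id: str) -> dict:
        if req_id not in self._active_states:
            self._active_states[req_id] = {"prefill_completed": False}
            if not self._warned_leak and len(self._active_states) > self._warn_threshold:
                self._warned_leak = True  # one-shot by design
                logger.warning(
                    "possible state leak: %d active states", len(self._active_states)
                )
        return self._active_states[req_id]

    def on_requests_finished(self, finished_req_ids: set[str]) -> None:
        # Called with authoritative ids from the scheduler; defer, don't evict inline.
        self._deferred_cleanup_ids |= finished_req_ids

    def _flush_deferred_cleanup(self) -> None:
        # Runs once at the end of forward(): the only place states are removed.
        for req_id in self._deferred_cleanup_ids:
            self._active_states.pop(req_id, None)
        self._deferred_cleanup_ids.clear()
```

The key property is that removal decisions are driven only by the scheduler's finished-request signal, never by a per-step view of the batch.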
hsliuustc0106
left a comment
BLOCKER scan:
| Category | Result |
|---|---|
| Correctness | PASS |
| Reliability/Safety | PASS |
| Breaking Changes | PASS |
| Test Coverage | needs tests |
| Documentation | PASS |
| Security | PASS |
BLOCKING ISSUES:
- [Test Coverage] No regression test — this bugfix lacks an automated test to prevent silent regressions.
VERDICT: REQUEST_CHANGES
linyueqian
left a comment
Verified on H20 (1× H20 141GB) against merged main a683b1dd.
Crash reproduced on main (pre-PR): 30 staggered voice-clone requests at 0.5 s gap → 0/30 succeed, Worker log shows torch.AcceleratorError: CUDA error: an illegal memory access was encountered, engine dies, all 30 return HTTP 400. Matches the PR description exactly.
With this PR (30f528f7): 30/30 and then 60/60 staggered runs succeed, output durations 2.7–5.1 s, no MAX_DECODE_STEPS, no _active_states size warn. grep -E "EngineDead|illegal memory access|vectorized_gather|out of bounds|NaN|Traceback|AssertionError" returns 0.
Diagnosis is sound: traced `preprocess()`'s eviction, `forward()`'s `_pending_requests.clear()` at the tail (L821), and subsequent `preprocess()` calls appending one request at a time at L1164. When a prefill runs mid-batch, `_pending_requests` genuinely is a prefix-only view: any `prefill_completed=True` decode scheduled after it gets evicted, `forward()`'s `_switch_to_request` silently recreates a `prefill_completed=False` state, and that state then hits the skip-else at L768-771, making `torch.cat(residual_inputs)` shorter than the CUDA-graph-captured metadata slots. The #2803 CUDA graph just turned what used to be silent data corruption into a loud OOB, as the PR description correctly attributes.
Fix is strictly more correct than the removed logic. Approving on correctness; agree with @hsliuustc0106 that a regression test should land before merge.
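To make the prefix-only-view failure mode concrete, here is a toy illustration. The request ids and the one-line eviction condition are hypothetical simplifications of the removed logic, not the actual vLLM/VoxCPM2 code:

```python
# Toy reproduction of the eviction bug: _pending_requests is a per-step
# prefix, not the full active batch, so a live decode request scheduled
# after a mid-batch prefill looks "stale".
active_states = {"decode-1": "state", "decode-2": "state", "prefill-3": "state"}

# vLLM's runner appends requests one at a time; when preprocess() runs for
# the prefill, only the requests walked so far are in the pending list.
pending_requests = ["decode-1"]  # decode-2 has not been appended yet

# Buggy pre-PR logic (simplified): anything outside the prefix is "stale".
stale = [
    rid
    for rid in active_states
    if rid not in pending_requests and rid != "prefill-3"
]
# decode-2 is wrongly classified as stale and would be evicted, even though
# the scheduler still considers it an active cached-decode request.
```

Running this shows `stale == ["decode-2"]`, which is exactly the misclassification the fix removes.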
hsliuustc0106
left a comment
BLOCKING:
- Test Coverage — This bugfix needs a regression test. The manual 540-request validation is strong evidence but not reproducible in CI. A test that exercises the prefill+decode mixed batch path (even a lightweight unit test mocking the eviction condition) would prevent silent regressions.
- Extract leak-warn threshold floor to module-level constant `_ACTIVE_STATE_LEAK_WARN_MIN = 512` (was hardcoded).
- Annotate leak warn with `max_batch_size`; mark one-shot by design.
- Rewrite `preprocess()` prefill-branch comment to explain why `_pending_requests` is unsafe (per-step prefix) and name the real cleanup path (`on_requests_finished` -> `_flush_deferred_cleanup`, fed by vLLM `scheduler._free_request` via `gpu_ar_model_runner.py`).
- Add regression tests pinning state-eviction and deferred-cleanup contracts (no GPU / CUDA graph / compile dependency).
Signed-off-by: Sy03 <1370724210@qq.com>
All issues mentioned in the comments have been fixed. @linyueqian @hsliuustc0106
hsliuustc0106
left a comment
Good bugfix with comprehensive regression tests.
Minor suggestion: consider adding an e2e online serving test that reproduces the specific prefill+decode mixed batch pattern that triggered the crash, to complement the unit-level regression tests.
voxcpm2_talker.py imports librosa at module scope. The lightweight
unit-test environments (simple-unit-test, diffusion-cache-backend-test,
cuda-unit-test-*) don't install librosa, so collecting the test file
fails with ModuleNotFoundError before any test runs.
Add pytest.importorskip("librosa") next to the existing torch skip so
those shards skip cleanly. Full TTS-stack shards (which install
librosa) still exercise the regression tests.
Signed-off-by: Sy03 <1370724210@qq.com>
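The guard the commit describes is pytest's standard mechanism for optional dependencies. A sketch of the pattern (shown here with a stdlib module so the snippet runs anywhere; in the real test file the guarded modules are `torch` and `librosa`):

```python
import pytest

# Placing importorskip at the top of a test module converts a would-be
# collection-time ModuleNotFoundError into a clean skip on shards that
# don't install the optional dependency. In the VoxCPM2 test file this
# would be:
#     torch = pytest.importorskip("torch")
#     librosa = pytest.importorskip("librosa")
# Demonstrated with a module that is always available:
json = pytest.importorskip("json")
```

Shards that do install the dependency get the module object back and run the tests normally; shards that don't report the file as skipped rather than errored.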
Complements the unit-level regression in tests/model_executor/models/voxcpm2/test_talker_state_eviction.py with an e2e offline_inference test that exercises the real scheduler path: one long prompt (stays in decode across many steps) plus several short prompts (keep entering prefill), reproducing the "1 prefill + N cached decode" batch shape that triggered the original crash. Rides the existing VoxCPM2 Native AR E2E Test Buildkite shard (gpu_1_queue, HF_TOKEN, voxcpm install) — no new shard required. Signed-off-by: Sy03 <1370724210@qq.com>
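The batch shape that e2e test targets can be sketched as a prompt mix. The lengths and counts below are illustrative only, not the values used in the actual test:

```python
# One long prompt stays in decode across many steps while several short
# prompts keep entering prefill, recreating the "1 prefill + N cached
# decode" mixed-batch shape that triggered the original crash.
long_prompt = "word " * 400                      # remains in decode
short_prompts = [f"short prompt {i}" for i in range(4)]  # keep re-entering prefill
prompts = [long_prompt] + short_prompts
```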
Signed-off-by: Sy03 <1370724210@qq.com>
…+decode batches (vllm-project#2903) Signed-off-by: Sy03 <1370724210@qq.com>
Purpose
Fixes the `vectorized_gather_kernel: index out of bounds` / illegal memory access crash on VoxCPM2 concurrent serving after #2803. Reproduces within 1-2 staggered rounds once requests overlap in a prefill+decode mixed batch.
Root cause
`VoxCPM2TalkerForConditionalGeneration.preprocess()` evicted stale per-request states at the start of every prefill using `self._pending_requests` as the "current batch". But `_pending_requests` is cleared at the end of each `forward()` and repopulated by vLLM's runner one request at a time; when `preprocess()` runs for the k-th request, it only holds the first k-1.
In a prefill+decode mixed batch (1 new prefill + N cached decodes), when the prefill is walked mid-list, the cached decode requests scheduled after it are not yet in `_pending_requests`, get classified as stale, and are removed from `_active_states`. The next `forward()` lazily recreates empty states (`prefill_completed=False`), falls into the silent-skip `else` branch, and drops those tokens from `residual_inputs`. `torch.cat(residual_inputs)` is now shorter than `attn_metadata.slot_mapping`; residual_model's PagedAttention reads/writes KV with the wrong `cu_seqlens`, softmax overflows to NaN on the prefill slice, NaN propagates into the CFM (LocDiT) solver, and the next indexing kernel crashes.
Smoking gun (instrumented step 59): `pending=4` but `sm_len=43` with 9 `query_start_loc` segments; vLLM scheduled 8 requests, the talker only saw 4.
Why this only surfaced after #2803
The bug predates #2803. In the eager / `torch.compile` path, the metadata desync produces numerically wrong but finite outputs (subtly bad audio, no crash). #2803's manual CUDA Graph pins attn metadata pointers and buffer sizes at capture time; on replay, the shorter `batch_in` against captured-size metadata becomes a hard OOB. #2803 did not introduce the bug; it converted silent data corruption into a loud crash.
Fix
- Removed the `_pending_requests`-based eviction in `preprocess()`.
- State cleanup is now driven solely by `on_requests_finished()` → `_flush_deferred_cleanup()` at the end of `forward()`, which already gets correct `finished_req_ids` from vLLM.
- Added a one-shot `logger.warning` in `_get_or_create_state()` when `_active_states` grows past `max(512, 4 * max_num_seqs)` as a regression guard.
Investigated and ruled out
- `async_scheduling=true` race. Looked correlated but not causal; `async_scheduling=true` (repo default) is stable with this fix.
- `block_table` / `scheduler_metadata`. Tried owned block-table buffers, `MemPool(no_split=True)`, `scheduler_metadata=None` at capture, FA3 wake hook; none fixed staggered load.
- `.detach().clone()` on compiled outputs is a real correctness win for a different symptom (`logits_indices` getting INT_MAX-ish). Disabling Inductor entirely still crashes on staggered load.
- `LocDiT.decoder`; but LocDiT was the victim, not the villain: it received NaN inputs produced upstream.
- (`use_full_cuda_graph=True`, `max_num_splits=32`). Discarded: regressed small-batch RTF 0.106 → 0.205 (−39…−61% throughput at c=1/2/4). Not needed.
Deliberately NOT changed
- `async_scheduling` stays on, `max_num_seqs` unchanged.
- `torch.compile` mode and attn metadata handling.
- `.clone()` across per-request state.
Total change: 28 lines in a single file (`voxcpm2_talker.py`).
Test Plan
Run on 1× H20, clean `origin/main` worktree with this branch applied (no YAML changes):
- `grep -E "EngineDead|illegal|Traceback|NaN|out of bounds"` returns 0.
- `voxcpm2_ab_bench_online.py` (same script as [Perf] VoxCPM2: Speedup by manual CUDA Graph capture for scaffold/residual forward #2803) for perf comparison.
Test Result
Stability (540 requests cumulative, 0 failures)
Server-log scans: `EngineDeadError`=0, `illegal memory access`=0, `Traceback`=0.
Audio validity spot check
48 kHz, non-truncated, non-silent, WAV decodes cleanly.
Performance (no regression)
c=1 slightly faster; c=2/c=4 within run-to-run noise.
Before this PR
Same worktree without the fix: a staggered 60-request run crashes with `vectorized_gather_kernel: index out of bounds` in 1-2 rounds, the engine dies, and subsequent requests return HTTP 400 (`EngineDeadError`).
cc @linyueqian @hsliuustc0106