[Model][Core] Enable async_chunk streaming pipeline for CosyVoice3 #1703
Conversation
hsliuustc0106
left a comment
Review
Rating: 9/10 | Verdict: ✅ Approved
Summary
Excellent multi-category PR ([Model][Core]) hardening the CosyVoice3 async_chunk runtime: a comprehensive solution addressing state management, inference correctness, and precision trimming, backed by solid benchmark data.
Multi-Category Review Coverage
Primary: [Model] (vllm-omni-contrib)
- ✅ Stage config added (cosyvoice3_async_chunk.yaml)
- ✅ Input processor implemented (talker -> code2wav)
- ✅ Tests for model helpers (+99 lines)
- ✅ Tests for stage input processors (+191 lines)
Secondary: [Core] (Distributed/Runtime)
- ✅ State management: cleanup deferred to post-send phase (eliminates race conditions)
- ✅ Batch-safe code2wav forward pass
- ✅ Chunk deduplication mechanism
- ✅ Terminal EOF protocol
- ✅ Precision trimming optimization
- ✅ Connector tests (+99 lines)
Highlights
- Benchmark validated: TTFA -76.5%, audio fidelity maintained
- Root cause analysis: Detailed explanation of race conditions and dedup needs
- Test coverage: 389 lines of new tests covering all critical paths
- Production-ready: Addresses all follow-up issues from #498
Minor Suggestions (non-blocking)
- Connector state lifecycle: consider adding a state-machine diagram in comments to clarify the cleanup timing (pre-send vs post-send).
- Deduplication edge case: what happens if two consecutive chunks produce identical audio? The current logic dedups them, but should it emit a silent frame to maintain stream continuity?
- Config documentation: cosyvoice3_async_chunk.yaml could benefit from comments explaining the left_context_size and token_frame_rate parameters (a sketch of such comments follows this list).
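For the config-documentation point, a sketch of what such inline comments might look like. The parameter meanings are inferred from this thread (left-context trimming and token/frame accounting), not from authoritative docs, and the values shown are illustrative:

```yaml
# Number of codec frames from the previous chunk kept as conditioning
# context when synthesizing the next chunk; this span is trimmed from the
# emitted audio so consecutive chunks do not overlap in the output stream.
left_context_size: 25
# Codec token frames produced per second of audio; used to convert between
# token counts and audio durations (e.g. samples_per_token).
token_frame_rate: 25
```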
Pitfalls Check
| Directory | Pitfall | Status |
|---|---|---|
| distributed/omni_connectors/ | Race condition in cleanup | ✅ Fixed |
| model_executor/models/ | Batch-safe forward | ✅ Refactored |
| model_executor/stage_input_processors/ | Deduplication | ✅ Implemented |
| model_executor/stage_configs/ | Config validation | ✅ Complete |
Recommendation
Ready to merge. Thorough implementation with comprehensive tests and validated benchmarks.
Reviewed by OpenClaw with vllm-omni-skills 🦐
Multi-Category Review: Primary=vllm-omni-contrib, Secondary=distributed/runtime patterns
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56eb5a3ae9
```yaml
# Stage config for running CosyVoice3 with async_chunk architecture
# Stage 0: Talker (text prompt -> speech tokens streamed by chunks)
# Stage 1: Code2Wav (flow matching -> acoustic features -> waveform)
async_chunk: true
```
For the async_chunk yaml, is there any difference compared with the default yaml other than line 4?
Yes. Besides line 4, it also adds the stage0→stage1 streaming connector setup and switches to the async-chunk processor (talker2code2wav_async_chunk).
I'd suggest waiting until #1722 is merged before merging this.
@amy-why-3459 @R2-Y Could you help take a look?
```yaml
connector_get_max_wait: 300
codec_chunk_frames: 25
codec_left_context_frames: 25
codec_vocab_size: 6561
```
For the codec-related config, would it be better to put these in engine args, since they are model-side config?
It seems Qwen3-Omni does the same and puts them here.
Can you help add an e2e test for CosyVoice3 as well? You can learn from Qwen3-TTS's implementation.
output_async_chunk.wav
linyueqian
left a comment
Review: Tested locally on A100 (eager mode, float32)
Generated a WAV with the async_chunk config — pipeline runs end-to-end, but there are two bugs and an audio quality concern.
Bug 1: TTS-only models crash with "This model does not support generation"
File: vllm_omni/engine/async_omni_engine.py ~L544
CosyVoice3 has no comprehension stage, so "generate" is never added to supported_tasks. When SamplingParams is passed (which all TTS callers do), the vLLM input processor rejects it:
ValueError: This model does not support generation
Fix: Add "generate" alongside "speech" when final_output_type == "audio":
```python
if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
    supported_tasks.add("speech")
    supported_tasks.add("generate")  # TTS models use SamplingParams for AR decoding
```

Bug 2: Unbounded token_frames memory growth
File: vllm_omni/model_executor/stage_input_processors/cosyvoice3.py ~L149-201
token_frames = transfer_manager.code_prompt_token_ids[request_id] grows without bound for the entire request lifetime. After a chunk is emitted, frames older than left_context_size_cfg will never appear as context again. For long utterances this leaks memory proportional to total tokens generated.
Fix: Prune after each emission:
```python
# After building and returning the payload (before the return):
if left_context_size_cfg < len(token_frames):
    del token_frames[: len(token_frames) - left_context_size_cfg]
    state["emitted_token_len"] = left_context_size_cfg
else:
    state["emitted_token_len"] = length
```

Audio quality: audible noise at chunk boundaries
Tested with the async_chunk config and listened to the output WAV. There is audible noise/artifacts at the transitions between chunks. This is likely caused by the left-context trimming in cosyvoice3.py:forward() — the samples_per_token-based crop produces a hard cut at the chunk boundary without any crossfade or overlap-add smoothing.
The sync config (cosyvoice3.yaml) already defines streaming overlap parameters (token_overlap_len=20, mel_overlap_len, mel_window = np.hamming(...), speech_window), but the async_chunk code2wav path in CosyVoice3Model.forward() does a raw audio[crop:] slice at L536-538 instead of applying the Hamming window crossfade that the reference CosyVoice3 streaming implementation uses.
Suggested approach: apply the mel_window / speech_window overlap-add at chunk boundaries in the code2wav forward path, similar to how the reference CosyVoice3Code2Wav streaming parameters are designed to be used. Without this, chunk seams will always be audible.
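A minimal sketch of the kind of Hamming-window crossfade meant here, assuming two consecutive chunks share an overlap of n samples; the names are illustrative and do not reflect the PR's actual code:

```python
import numpy as np

def crossfade(prev_tail: np.ndarray, cur_head: np.ndarray) -> np.ndarray:
    """Overlap-add the shared span of two consecutive audio chunks."""
    n = len(prev_tail)
    assert len(cur_head) == n
    window = np.hamming(2 * n)
    fade_out, fade_in = window[n:], window[:n]  # falling / rising halves
    return prev_tail * fade_out + cur_head * fade_in
```

Replacing the raw audio[crop:] slice with an overlap-add along these lines is what removes the hard discontinuity at the seam.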
Minor nits
- cosyvoice3.py:156: `if length <= 0` — `length` is `len(list)`, so it can never be negative; `if not token_frames:` is clearer.
- cosyvoice3.py:139: `if bool(state.get("terminal_sent", False))` — the `bool()` wrapper is redundant.
- The serving_speech.py diff removes Voxtral TTS, Fish Speech, and upload_voice/delete_voice — these are unrelated to the async_chunk feature and should be in a separate PR.
All 28 unit tests pass with the two fixes above applied.
Thanks for your test. I didn't experience any pops or clicks in my previous tests, as the lookahead and transition processing I implemented had already eliminated the audio glitches present in earlier versions.
Test setup details

Model:
Prompt audio: Synthetic 3-second 24kHz sine tone (200Hz, exponential decay) — not real speech, so the speaker embedding quality is poor. This likely contributes to the artifacts.
Text input:
Sampling params (Stage 0 / Talker):

```python
SamplingParams(
    temperature=1.0, top_p=0.8, top_k=25,
    repetition_penalty=2.0,
    min_tokens=min_len,  # text_token_len * 2
    max_tokens=max_len,  # text_token_len * 20
    stop_token_ids=[6562],
    detokenize=False,
)
```

Stage config (modified from

Output: 7.84s audio, 188160 samples @ 24kHz, max amplitude 0.99, RMS 0.14

Note: the synthetic prompt audio (sine tone, not real speech) likely degraded the speaker embedding quality. A retest with a real speech prompt would give a fairer comparison. The chunk boundary artifacts I heard might also be less pronounced with a natural prompt. I will test again using actual speech.
Retest with official CosyVoice3 prompt audio

Retested using the official prompt audio.
Prompt text:
Output: 7.56s, 181440 samples @ 24kHz, RMS=0.091
Chunk boundary analysis (1-second windows): audio quality is noticeably better with the real speech prompt vs the synthetic sine tone used previously, and the speaker similarity is reasonable. However, there is still one large amplitude discontinuity at the chunk 5 boundary (jump=0.396), which is audible as a click/pop. The other boundaries are smoother. Same config as before. (A sketch of this kind of boundary check follows below.)

I can still hear a bit of glitching in the middle.
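Regarding the boundary analysis above, a sketch of the kind of check that produces numbers like jump=0.396, assuming 24 kHz audio and known chunk-boundary sample offsets (both assumptions; the reviewer's exact script is not shown in this thread):

```python
import numpy as np

def boundary_jumps(audio: np.ndarray, boundaries: list[int], win: int = 240) -> list[float]:
    """Peak-amplitude difference across a 10 ms window on each side of each boundary."""
    jumps = []
    for b in boundaries:
        left = np.max(np.abs(audio[max(0, b - win):b]))
        right = np.max(np.abs(audio[b:b + win]))
        jumps.append(float(abs(right - left)))
    return jumps
```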
Thank you for testing. I have confirmed that there is indeed an audio popping issue under this setting. I will make alignment fixes based on the upstream CosyVoice3 repository soon. Thanks again!
```python
@@ -89,6 +91,53 @@ def _make_buffer(self, *size, dtype, numpy=True):
        with maybe_disable_pin_memory_for_ray(self, total_bytes):
            return super()._make_buffer(*size, dtype=dtype, numpy=numpy)

    def _build_model_sampler_output_token_ids(self) -> list[list[int]]:
```
Do all TTS models need this function, or just CosyVoice3? cc @linyueqian
Currently this is only needed for CosyVoice3.
The key difference is that CosyVoice3 opts into prefer_model_sampler, and its custom RAS-style sampler explicitly depends on the decoded token history (output_token_ids) when making the next sampling decision.
Other TTS models such as Qwen3-TTS still use token history indirectly through the default vLLM sampler / model state, but they do not currently consume output_token_ids inside a custom model-level sampler, so this helper is not required for them today.
I kept it in GPUARModelRunner because the issue is in the generic prefer_model_sampler path rather than in a CosyVoice3-only module, and this also makes the contract correct for any future history-dependent model sampler.
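As a rough illustration of the contract described here (not the actual GPUARModelRunner code, whose internals are not shown in this thread), the helper would surface one decoded-token history per running request so a RAS-style sampler can inspect it:

```python
def _build_model_sampler_output_token_ids(self) -> list[list[int]]:
    # Hypothetical sketch: collect the already-decoded token ids for each
    # request in the current batch, in batch order, so a history-dependent
    # model sampler (e.g. CosyVoice3's RAS sampler) can penalize repetitions.
    # `self.requests_in_batch` and `req_state.output_token_ids` are assumed
    # names, not the runner's real attributes.
    return [list(req_state.output_token_ids) for req_state in self.requests_in_batch]
```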
Following up on the popping / overlap issue previously reported on this PR, I pushed a cleaned-up follow-up patch set and re-ran the CosyVoice3 quality validation on the validated clean branch before cherry-picking it onto the current PR head. What changed in this follow-up:
Sync / async wavs for listening comparison:
I've also fixed the CosyVoice3 talker sampler path so the model sampler receives the actual decoded-token history, which is required for correct RAS-style sampling behavior.
Can you update some benchmark stats? The samples seem to have similar quality compared to the sync one. Nice work!
OK, I've already updated it in the PR description.
Resolve conflicts, please.
Rebased onto current main (includes #2486 vLLM 0.19.0 compat fixes) and pushed. Also fixed the async_chunk yaml config (model_stage renames, default_sampling_params) and enabled the streaming e2e test.

Offline Benchmark (H100, enforce_eager=true)

Audio quality is good for both; no audible chunk boundary artifacts.

Online Streaming TTFA (H100, vLLM 0.19.0, enforce_eager=true)

Async chunk TTFA is consistently ~2.8s vs sync 4.3-8.7s (~35-68% improvement). Total end-to-end latency is higher due to connector overhead, but TTFA is the metric that matters for streaming UX. Both RTFs are < 1 for the offline path. The streaming-path overhead is expected since chunks are processed incrementally.
…av hardening

- Add CosyVoice3 async_chunk stage config and connector carry-over for per-request metadata.
- Make stage input processing and code2wav runtime batch-safe, with token safety and device alignment.
- Add basic unit coverage for batched stage inputs and async payload emission.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>

…iagnostics

- Call chunk_transfer_adapter.cleanup on AR finished requests to reclaim per-request async state.
- Add warning_once when left-context trim is requested but samples_per_token is unavailable.
- Add helper tests for _split_request_ids/_sanitize_codec_tokens and the left-context warning path.
- Add scheduler regression test for AR finished-request cleanup.
- Scope: hardening + diagnostics only; no default strategy/policy behavior change.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>

…hunk audio span

Signed-off-by: indevn <indevn@outlook.com>

…de2wav slicing

Signed-off-by: indevn <indevn@outlook.com>

Reuse request_output when falling back to per-completion multimodal outputs. Behavior is unchanged.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
- Rename model_stage in yaml: talker -> cosyvoice3_talker, code2wav -> cosyvoice3_code2wav
- Fix model_stage checks in RAS sampler/sample() to match renamed stages
- Add default_sampling_params: top_k=25, top_p=0.8 (matching upstream defaults), repetition_penalty=1.0001 (near-identity, forces output_token_ids tracking for RAS)
- Fix unit test model_stage references
- Enable streaming e2e test with async_chunk config (core_model level)

Signed-off-by: linyueqian <linyueqian@outlook.com>
The CosyVoice3 model's decoder outputs 6761 logits (speech_token_size + 200). The official inference code treats ALL 200 tokens >= 6561 as stop signals, but the vLLM implementation was masking 199 of them and only restoring token 6562. This funnelled stop probability through a single token, causing bimodal behaviour: either immediate EOS (silence) or no EOS at all (excessively long audio). Fix with three changes:

- compute_logits: merge all 200 stop logits into EOS via logsumexp, preserving the correct aggregate stop probability
- gpu_ar_model_runner: apply logit bias (min_tokens enforcement) before the custom model sampler — prefer_model_sampler was bypassing it
- serving_speech: compute dynamic min/max tokens for CosyVoice3 based on text length, matching the official min_token_text_ratio=2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
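The compute_logits change can be pictured with a small sketch. The token ids and tensor shape follow the commit message above; the function name and surrounding code are illustrative, not the PR's actual implementation:

```python
import torch

SPEECH_TOKEN_SIZE = 6561  # ids >= 6561 are the 200 stop tokens
EOS_TOKEN_ID = 6562       # the single EOS id the vLLM sampler keeps

def merge_stop_logits(logits: torch.Tensor) -> torch.Tensor:
    """Fold the aggregate probability mass of all stop tokens into EOS."""
    # logits: [batch, 6761]
    merged = torch.logsumexp(logits[:, SPEECH_TOKEN_SIZE:], dim=-1)
    out = logits.clone()
    out[:, SPEECH_TOKEN_SIZE:] = float("-inf")  # mask every stop id...
    out[:, EOS_TOKEN_ID] = merged               # ...then place their total mass on EOS
    return out
```

Under softmax, the EOS probability of the merged logits equals the summed probability of all 200 stop tokens in the original distribution, which is exactly the aggregate stop probability the commit says must be preserved.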
CLI --enforce-eager does not propagate to per-stage engine args. Set enforce_eager directly in the YAML for both stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
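A sketch of what that YAML change might look like; the stage names follow the renames above, but the exact config schema is an assumption:

```yaml
stages:
  - model_stage: cosyvoice3_talker
    engine_args:
      enforce_eager: true   # CLI --enforce-eager does not reach per-stage engines
  - model_stage: cosyvoice3_code2wav
    engine_args:
      enforce_eager: true
```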
Thanks for your work!
Purpose
This PR introduces and productionizes the `async_chunk` streaming pipeline for CosyVoice3, fulfilling the architectural roadmap laid out in #498.

The core change is to connect the `talker -> code2wav` path so that code2wav can start consuming codec chunks before the talker stage fully finishes, instead of waiting for the full utterance. This is the main integration step that brings CosyVoice3 onto the async_chunk execution model already used by vLLM-Omni.

Architectural Integration
Added a dedicated `cosyvoice3_async_chunk.yaml` config and implemented the `talker -> code2wav` async chunk processor, enabling the end-to-end async_chunk execution path for CosyVoice3.

Concurrency & State Management
Made the `code2wav` runtime batch-safe for multi-request chunk consumption, preserved prompt conditioning across chunked requests, and moved per-request async state reclamation to post-send to eliminate cleanup/save races; a minimal ordering sketch follows below.
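A minimal sketch of the post-send ordering, with illustrative names (build_chunk_payload, connector, and transfer_manager are assumptions, not the PR's actual API):

```python
# Sending the chunk first guarantees no consumer can observe a request whose
# state was already torn down, which is the cleanup/save race being eliminated.
payload = build_chunk_payload(request_id)  # assumed helper
connector.send(payload)                    # 1) ship the chunk downstream
if payload.is_terminal:                    # 2) only then reclaim per-request state
    transfer_manager.cleanup(request_id)
```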
Inference Correctness

Enforced terminal EOF semantics, filtered invalid codec tokens, and clamped chunk audio length around left-context trimming to keep streaming boundaries accurate.
Testing & Validation
Added unit coverage for batched stage inputs, async payload emission, cleanup ordering, terminal behavior, and `seq_token_counts`-aware slicing. Added an offline E2E test for CosyVoice3 as requested, following the Qwen3-TTS style.

Performance Benchmark
GPU: NVIDIA A800
CUDA Version : 12.8
Driver Version : 570.133.20
Benchmark (warmup=1, measured=3):

Final audio parity check (official prompt): finish_reason=stop, stop_reason=6562, num_tokens=203

Test Result
Async Wav:
official_prompt_async.wav
sha256: fde20bc9c768a5c66d77232364f04304eb2c9f62166c74d848f36b133ccc6822
Sync Wav:
official_prompt_sync.wav
sha256: 26aa9237606c64eee114d6db879a2528cefcf98e8c4e2da1a393803d470ba895
8.08s, 193920 samples @ 24kHz, RMS 0.072767

Validation
- tests/model_executor/stage_input_processors/test_cosyvoice3_stage_input_processors.py: 8 passed
- tests/model_executor/models/cosyvoice3/test_cosyvoice3_components.py: 13 passed
- tests/e2e/offline_inference/test_cosyvoice3.py -vv: cosyvoice3.yaml passed, cosyvoice3_async_chunk.yaml passed