[Model][Core] Enable async_chunk streaming pipeline for CosyVoice3 #1703
Conversation
hsliuustc0106
left a comment
Review
Rating: 9/10 | Verdict: ✅ Approved
Summary
Excellent multi-category PR ([Model][Core]) hardening the CosyVoice3 async_chunk runtime: a comprehensive solution addressing state management, inference correctness, and precision trimming, backed by solid benchmark data.
Multi-Category Review Coverage
Primary: [Model] (vllm-omni-contrib)
- ✅ Stage config added (cosyvoice3_async_chunk.yaml)
- ✅ Input processor implemented (talker -> code2wav)
- ✅ Tests for model helpers (+99 lines)
- ✅ Tests for stage input processors (+191 lines)
Secondary: [Core] (Distributed/Runtime)
- ✅ State management: cleanup deferred to post-send phase (eliminates race conditions)
- ✅ Batch-safe code2wav forward pass
- ✅ Chunk deduplication mechanism
- ✅ Terminal EOF protocol
- ✅ Precision trimming optimization
- ✅ Connector tests (+99 lines)
Highlights
- Benchmark validated: TTFA -76.5%, audio fidelity maintained
- Root cause analysis: Detailed explanation of race conditions and dedup needs
- Test coverage: 389 lines of new tests covering all critical paths
- Production-ready: Addresses all follow-up issues from #498
Minor Suggestions (non-blocking)
- Connector state lifecycle: consider adding a state-machine diagram in comments to clarify the cleanup timing (pre-send vs post-send).
- Deduplication edge case: what happens if two consecutive chunks produce identical audio? The current logic dedups them, but should it emit a silent frame to maintain stream continuity?
- Config documentation: cosyvoice3_async_chunk.yaml could benefit from comments explaining the left_context_size and token_frame_rate parameters (a sketch of such comments follows this list).
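For the config-documentation point, a sketch of what such inline comments might look like. The parameter meanings are inferred from this thread (left-context trimming and token/frame accounting), not from authoritative docs, and the values shown are illustrative:

```yaml
# Number of codec frames from the previous chunk kept as conditioning
# context when synthesizing the next chunk; this span is trimmed from the
# emitted audio so consecutive chunks do not overlap in the output stream.
left_context_size: 25
# Codec token frames produced per second of audio; used to convert between
# token counts and audio durations (e.g. samples_per_token).
token_frame_rate: 25
```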
Pitfalls Check
| Directory | Pitfall | Status |
|---|---|---|
| distributed/omni_connectors/ | Race condition in cleanup | ✅ Fixed |
| model_executor/models/ | Batch-safe forward | ✅ Refactored |
| model_executor/stage_input_processors/ | Deduplication | ✅ Implemented |
| model_executor/stage_configs/ | Config validation | ✅ Complete |
Recommendation
Ready to merge. Thorough implementation with comprehensive tests and validated benchmarks.
Reviewed by OpenClaw with vllm-omni-skills 🦐
Multi-Category Review: Primary=vllm-omni-contrib, Secondary=distributed/runtime patterns
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 56eb5a3ae9
```yaml
# Stage config for running CosyVoice3 with async_chunk architecture
# Stage 0: Talker (text prompt -> speech tokens streamed by chunks)
# Stage 1: Code2Wav (flow matching -> acoustic features -> waveform)
async_chunk: true
```
For the async_chunk yaml, is there any difference compared with the default yaml other than line 4?
Yes. Besides line 4, it also adds the stage0→stage1 streaming connector setup and switches to the async-chunk processor (talker2code2wav_async_chunk).
I'd suggest waiting until #1722 is merged before merging this.
@amy-why-3459 @R2-Y Could you help take a look?
```yaml
connector_get_max_wait: 300
codec_chunk_frames: 25
codec_left_context_frames: 25
codec_vocab_size: 6561
```
For the codec-related config, would it be better to put these in engine args, since they are model-side config?
It seems Qwen3-Omni does the same and puts them here.
Can you help add an e2e test for CosyVoice3 as well? You can learn from Qwen3-TTS's implementation.
output_async_chunk.wav
linyueqian
left a comment
Review: Tested locally on A100 (eager mode, float32)
Generated a WAV with the async_chunk config — pipeline runs end-to-end, but there are two bugs and an audio quality concern.
Bug 1: TTS-only models crash with "This model does not support generation"
File: vllm_omni/engine/async_omni_engine.py ~L544
CosyVoice3 has no comprehension stage, so "generate" is never added to supported_tasks. When SamplingParams is passed (which all TTS callers do), the vLLM input processor rejects it:
ValueError: This model does not support generation
Fix: Add "generate" alongside "speech" when final_output_type == "audio":
```python
if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
    supported_tasks.add("speech")
    supported_tasks.add("generate")  # TTS models use SamplingParams for AR decoding
```

Bug 2: Unbounded token_frames memory growth
File: vllm_omni/model_executor/stage_input_processors/cosyvoice3.py ~L149-201
token_frames = transfer_manager.code_prompt_token_ids[request_id] grows without bound for the entire request lifetime. After a chunk is emitted, frames older than left_context_size_cfg will never appear as context again. For long utterances this leaks memory proportional to total tokens generated.
Fix: Prune after each emission:
```python
# After building and returning the payload (before the return):
if left_context_size_cfg < len(token_frames):
    del token_frames[: len(token_frames) - left_context_size_cfg]
    state["emitted_token_len"] = left_context_size_cfg
else:
    state["emitted_token_len"] = length
```

Audio quality: audible noise at chunk boundaries
Tested with the async_chunk config and listened to the output WAV. There is audible noise/artifacts at the transitions between chunks. This is likely caused by the left-context trimming in cosyvoice3.py:forward() — the samples_per_token-based crop produces a hard cut at the chunk boundary without any crossfade or overlap-add smoothing.
The sync config (cosyvoice3.yaml) already defines streaming overlap parameters (token_overlap_len=20, mel_overlap_len, mel_window = np.hamming(...), speech_window), but the async_chunk code2wav path in CosyVoice3Model.forward() does a raw audio[crop:] slice at L536-538 instead of applying the Hamming window crossfade that the reference CosyVoice3 streaming implementation uses.
Suggested approach: apply the mel_window / speech_window overlap-add at chunk boundaries in the code2wav forward path, similar to how the reference CosyVoice3Code2Wav streaming parameters are designed to be used. Without this, chunk seams will always be audible.
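A minimal sketch of the kind of Hamming-window crossfade meant here, assuming two consecutive chunks share an overlap of n samples; the names are illustrative and do not reflect the PR's actual code:

```python
import numpy as np

def crossfade(prev_tail: np.ndarray, cur_head: np.ndarray) -> np.ndarray:
    """Overlap-add the shared span of two consecutive audio chunks."""
    n = len(prev_tail)
    assert len(cur_head) == n
    window = np.hamming(2 * n)
    fade_out, fade_in = window[n:], window[:n]  # falling / rising halves
    return prev_tail * fade_out + cur_head * fade_in
```

Replacing the raw audio[crop:] slice with an overlap-add along these lines is what removes the hard discontinuity at the seam.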
Minor nits
- cosyvoice3.py:156: `if length <= 0` — `length` is `len(list)`, so it can never be negative; `if not token_frames:` is clearer.
- cosyvoice3.py:139: `if bool(state.get("terminal_sent", False))` — the `bool()` wrapper is redundant.
- The serving_speech.py diff removes Voxtral TTS, Fish Speech, and upload_voice/delete_voice — these are unrelated to the async_chunk feature and should be in a separate PR.
All 28 unit tests pass with the two fixes above applied.
Thanks for your test. I didn't experience any pops or clicks in my previous tests, as the lookahead and transition processing I implemented had already eliminated the audio glitches present in earlier versions.
Test setup details

Model:
Prompt audio: Synthetic 3-second 24kHz sine tone (200Hz, exponential decay) — not real speech, so the speaker embedding quality is poor. This likely contributes to the artifacts.
Text input:
Sampling params (Stage 0 / Talker):

```python
SamplingParams(
    temperature=1.0, top_p=0.8, top_k=25,
    repetition_penalty=2.0,
    min_tokens=min_len,  # text_token_len * 2
    max_tokens=max_len,  # text_token_len * 20
    stop_token_ids=[6562],
    detokenize=False,
)
```

Stage config (modified from

Output: 7.84s audio, 188160 samples @ 24kHz, max amplitude 0.99, RMS 0.14

Note: the synthetic prompt audio (sine tone, not real speech) likely degraded the speaker embedding quality. A retest with a real speech prompt would give a fairer comparison. The chunk boundary artifacts I heard might also be less pronounced with a natural prompt. I will test again using actual speech.
Retest with official CosyVoice3 prompt audio

Retested using the official prompt audio.
Prompt text:
Output: 7.56s, 181440 samples @ 24kHz, RMS=0.091
Chunk boundary analysis (1-second windows): audio quality is noticeably better with the real speech prompt vs the synthetic sine tone used previously, and the speaker similarity is reasonable. However, there is still one large amplitude discontinuity at the chunk 5 boundary (jump=0.396), which is audible as a click/pop. The other boundaries are smoother. Same config as before. (A sketch of this kind of boundary check follows below.)

I can still hear a bit of glitching in the middle.
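Regarding the boundary analysis above, a sketch of the kind of check that produces numbers like jump=0.396, assuming 24 kHz audio and known chunk-boundary sample offsets (both assumptions; the reviewer's exact script is not shown in this thread):

```python
import numpy as np

def boundary_jumps(audio: np.ndarray, boundaries: list[int], win: int = 240) -> list[float]:
    """Peak-amplitude difference across a 10 ms window on each side of each boundary."""
    jumps = []
    for b in boundaries:
        left = np.max(np.abs(audio[max(0, b - win):b]))
        right = np.max(np.abs(audio[b:b + win]))
        jumps.append(float(abs(right - left)))
    return jumps
```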
Thank you for testing. I have confirmed that there is indeed an audio popping issue under this setting. I will make alignment fixes based on the upstream CosyVoice3 repository soon. Thanks again!
```python
@@ -89,6 +91,53 @@ def _make_buffer(self, *size, dtype, numpy=True):
        with maybe_disable_pin_memory_for_ray(self, total_bytes):
            return super()._make_buffer(*size, dtype=dtype, numpy=numpy)

    def _build_model_sampler_output_token_ids(self) -> list[list[int]]:
```
Do all TTS models need this function, or just CosyVoice3? cc @linyueqian
Currently this is only needed for CosyVoice3.
The key difference is that CosyVoice3 opts into prefer_model_sampler, and its custom RAS-style sampler explicitly depends on the decoded token history (output_token_ids) when making the next sampling decision.
Other TTS models such as Qwen3-TTS still use token history indirectly through the default vLLM sampler / model state, but they do not currently consume output_token_ids inside a custom model-level sampler, so this helper is not required for them today.
I kept it in GPUARModelRunner because the issue is in the generic prefer_model_sampler path rather than in a CosyVoice3-only module, and this also makes the contract correct for any future history-dependent model sampler.
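As a rough illustration of the contract described here (not the actual GPUARModelRunner code, whose internals are not shown in this thread), the helper would surface one decoded-token history per running request so a RAS-style sampler can inspect it:

```python
def _build_model_sampler_output_token_ids(self) -> list[list[int]]:
    # Hypothetical sketch: collect the already-decoded token ids for each
    # request in the current batch, in batch order, so a history-dependent
    # model sampler (e.g. CosyVoice3's RAS sampler) can penalize repetitions.
    # `self.requests_in_batch` and `req_state.output_token_ids` are assumed
    # names, not the runner's real attributes.
    return [list(req_state.output_token_ids) for req_state in self.requests_in_batch]
```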
Following up on the popping / overlap issue previously reported on this PR, I pushed a cleaned-up follow-up patch set and re-ran the CosyVoice3 quality validation on the validated clean branch before cherry-picking it onto the current PR head. What changed in this follow-up:
Sync / async wavs for listening comparison:
I've also fixed the CosyVoice3 talker sampler path so the model sampler receives the actual decoded-token history, which is required for correct RAS-style sampling behavior.
Can you update some benchmark stats? The samples seem to have similar quality compared to the sync one. Nice work!
OK, I've already updated it in the PR description.
Resolve conflicts, please.
Rebased onto current main (includes #2486 vLLM 0.19.0 compat fixes) and pushed. Also fixed the async_chunk yaml config (model_stage renames, default_sampling_params) and enabled the streaming e2e test.

Offline Benchmark (H100, enforce_eager=true)

Audio quality is good for both; no audible chunk boundary artifacts.

Online Streaming TTFA (H100, vLLM 0.19.0, enforce_eager=true)

Async chunk TTFA is consistently ~2.8s vs sync 4.3-8.7s (~35-68% improvement). Total end-to-end latency is higher due to connector overhead, but TTFA is the metric that matters for streaming UX. Both RTFs are < 1 for the offline path. The streaming-path overhead is expected since chunks are processed incrementally.
…av hardening

- Add CosyVoice3 async_chunk stage config and connector carry-over for per-request metadata.
- Make stage input processing and code2wav runtime batch-safe, with token safety and device alignment.
- Add basic unit coverage for batched stage inputs and async payload emission.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>

…iagnostics

- Call chunk_transfer_adapter.cleanup on AR finished requests to reclaim per-request async state.
- Add warning_once when left-context trim is requested but samples_per_token is unavailable.
- Add helper tests for _split_request_ids/_sanitize_codec_tokens and the left-context warning path.
- Add scheduler regression test for AR finished-request cleanup.
- Scope: hardening + diagnostics only; no default strategy/policy behavior change.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>

…hunk audio span

Signed-off-by: indevn <indevn@outlook.com>

…de2wav slicing

Signed-off-by: indevn <indevn@outlook.com>

Reuse request_output when falling back to per-completion multimodal outputs. Behavior is unchanged.

Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
- Rename model_stage in yaml: talker -> cosyvoice3_talker, code2wav -> cosyvoice3_code2wav
- Fix model_stage checks in RAS sampler/sample() to match renamed stages
- Add default_sampling_params: top_k=25, top_p=0.8 (matching upstream defaults), repetition_penalty=1.0001 (near-identity, forces output_token_ids tracking for RAS)
- Fix unit test model_stage references
- Enable streaming e2e test with async_chunk config (core_model level)

Signed-off-by: linyueqian <linyueqian@outlook.com>
The CosyVoice3 model's decoder outputs 6761 logits (speech_token_size + 200). The official inference code treats ALL 200 tokens >= 6561 as stop signals, but the vLLM implementation was masking 199 of them and only restoring token 6562. This funnelled stop probability through a single token, causing bimodal behaviour: either immediate EOS (silence) or no EOS at all (excessively long audio). Fix with three changes:

- compute_logits: merge all 200 stop logits into EOS via logsumexp, preserving the correct aggregate stop probability
- gpu_ar_model_runner: apply logit bias (min_tokens enforcement) before the custom model sampler — prefer_model_sampler was bypassing it
- serving_speech: compute dynamic min/max tokens for CosyVoice3 based on text length, matching the official min_token_text_ratio=2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
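The compute_logits change can be pictured with a small sketch. The token ids and tensor shape follow the commit message above; the function name and surrounding code are illustrative, not the PR's actual implementation:

```python
import torch

SPEECH_TOKEN_SIZE = 6561  # ids >= 6561 are the 200 stop tokens
EOS_TOKEN_ID = 6562       # the single EOS id the vLLM sampler keeps

def merge_stop_logits(logits: torch.Tensor) -> torch.Tensor:
    """Fold the aggregate probability mass of all stop tokens into EOS."""
    # logits: [batch, 6761]
    merged = torch.logsumexp(logits[:, SPEECH_TOKEN_SIZE:], dim=-1)
    out = logits.clone()
    out[:, SPEECH_TOKEN_SIZE:] = float("-inf")  # mask every stop id...
    out[:, EOS_TOKEN_ID] = merged               # ...then place their total mass on EOS
    return out
```

Under softmax, the EOS probability of the merged logits equals the summed probability of all 200 stop tokens in the original distribution, which is exactly the aggregate stop probability the commit says must be preserved.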
CLI --enforce-eager does not propagate to per-stage engine args. Set enforce_eager directly in the YAML for both stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
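A sketch of what that YAML change might look like; the stage names follow the renames above, but the exact config schema is an assumption:

```yaml
stages:
  - model_stage: cosyvoice3_talker
    engine_args:
      enforce_eager: true   # CLI --enforce-eager does not reach per-stage engines
  - model_stage: cosyvoice3_code2wav
    engine_args:
      enforce_eager: true
```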
Thanks for your work!
Purpose
This PR introduces and productionizes the `async_chunk` streaming pipeline for CosyVoice3, fulfilling the architectural roadmap laid out in #498.

The core change is to connect the `talker -> code2wav` path so that code2wav can start consuming codec chunks before the talker stage fully finishes, instead of waiting for the full utterance. This is the main integration step that brings CosyVoice3 onto the async_chunk execution model already used by vLLM-Omni.

Architectural Integration
Added a dedicated `cosyvoice3_async_chunk.yaml` config and implemented the `talker -> code2wav` async chunk processor, enabling the end-to-end async_chunk execution path for CosyVoice3.

Concurrency & State Management
Made the `code2wav` runtime batch-safe for multi-request chunk consumption, preserved prompt conditioning across chunked requests, and moved per-request async state reclamation to post-send to eliminate cleanup/save races; a minimal ordering sketch follows below.
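A minimal sketch of the post-send ordering, with illustrative names (build_chunk_payload, connector, and transfer_manager are assumptions, not the PR's actual API):

```python
# Sending the chunk first guarantees no consumer can observe a request whose
# state was already torn down, which is the cleanup/save race being eliminated.
payload = build_chunk_payload(request_id)  # assumed helper
connector.send(payload)                    # 1) ship the chunk downstream
if payload.is_terminal:                    # 2) only then reclaim per-request state
    transfer_manager.cleanup(request_id)
```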
Inference Correctness

Enforced terminal EOF semantics, filtered invalid codec tokens, and clamped chunk audio length around left-context trimming to keep streaming boundaries accurate.
Testing & Validation
Added unit coverage for batched stage inputs, async payload emission, cleanup ordering, terminal behavior, and `seq_token_counts`-aware slicing. Added an offline E2E test for CosyVoice3 as requested, following the Qwen3-TTS style.

Performance Benchmark
GPU: NVIDIA A800
CUDA Version : 12.8
Driver Version : 570.133.20
Benchmark (warmup=1, measured=3):

Final audio parity check (official prompt): finish_reason=stop, stop_reason=6562, num_tokens=203

Test Result
Async Wav:
official_prompt_async.wav
sha256: fde20bc9c768a5c66d77232364f04304eb2c9f62166c74d848f36b133ccc6822
Sync Wav:
official_prompt_sync.wav
sha256: 26aa9237606c64eee114d6db879a2528cefcf98e8c4e2da1a393803d470ba895
8.08s, 193920 samples @ 24kHz, RMS 0.072767

Validation
- tests/model_executor/stage_input_processors/test_cosyvoice3_stage_input_processors.py: 8 passed
- tests/model_executor/models/cosyvoice3/test_cosyvoice3_components.py: 13 passed
- tests/e2e/offline_inference/test_cosyvoice3.py -vv: cosyvoice3.yaml passed, cosyvoice3_async_chunk.yaml passed