
[Model][Core] Enable async_chunk streaming pipeline for CosyVoice3#1703

Merged
linyueqian merged 17 commits into vllm-project:main from indevn:feat/cosyvoice3-async-batch-benchmark-signedoff
Apr 5, 2026

Conversation

@indevn
Contributor

@indevn indevn commented Mar 6, 2026

Purpose

This PR introduces and productionizes the async_chunk streaming pipeline for CosyVoice3, fulfilling the architectural roadmap laid out in #498.

The core change is to connect the talker -> code2wav path so that code2wav can start consuming codec chunks before the talker stage fully finishes, instead of waiting for the full utterance. This is the main integration step that brings CosyVoice3 onto the async_chunk execution model already used by vLLM-Omni.
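
For readers new to the async_chunk execution model, a minimal illustrative sketch of this handoff is shown below. It is not the PR's actual implementation: the chunk-size constant, the generator names, and code2wav.decode_chunk are hypothetical stand-ins.

CHUNK_FRAMES = 25  # illustrative; mirrors the codec_chunk_frames idea


async def talker_chunks(codec_token_stream):
    """Group talker codec tokens into fixed-size chunks as they arrive."""
    buffer: list[int] = []
    async for token in codec_token_stream:
        buffer.append(token)
        if len(buffer) >= CHUNK_FRAMES:
            yield buffer[:CHUNK_FRAMES]
            buffer = buffer[CHUNK_FRAMES:]
    if buffer:  # flush the final partial chunk at end of stream
        yield buffer


async def stream_audio(codec_token_stream, code2wav):
    """code2wav starts decoding as soon as the first chunk is ready."""
    async for chunk in talker_chunks(codec_token_stream):
        yield await code2wav.decode_chunk(chunk)  # hypothetical decode API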

  1. Architectural Integration
    Added a dedicated cosyvoice3_async_chunk.yaml config and implemented the talker -> code2wav async chunk processor, enabling the end-to-end async_chunk execution path for CosyVoice3.

  2. Concurrency & State Management
    Made the code2wav runtime batch-safe for multi-request chunk consumption, preserved prompt conditioning across chunked requests, and moved per-request async state reclamation to post-send to eliminate cleanup/save races.

  3. Inference Correctness
    Enforced terminal EOF semantics, filtered invalid codec tokens, and clamped chunk audio length around left-context trimming to keep streaming boundaries accurate.

  4. Testing & Validation
    Added unit coverage for batched stage inputs, async payload emission, cleanup ordering, terminal behavior, and seq_token_counts-aware slicing. Added an offline E2E test for CosyVoice3 as requested, following the Qwen3-TTS style.

Performance Benchmark

GPU: NVIDIA A800
CUDA Version: 12.8
Driver Version: 570.133.20

Benchmark (warmup=1, measured=3):

  • sync: avg TTFA 3739.97 ms, avg total latency 3740.14 ms
  • async_chunk: avg TTFA 1184.31 ms, avg total latency 4512.14 ms
  • async_chunk TTFA improves by 68.33%, while end-to-end completion is 772.0 ms slower

Final audio parity check (official prompt):

  • sync and async final wavs are both 8.08s / 193920 samples @ 24kHz
  • async emits 4 chunks, max boundary jump 0.072767
  • sync stage-0 stops normally with finish_reason=stop, stop_reason=6562, num_tokens=203

Test Result

Async Wav:
official_prompt_async.wav
sha256: fde20bc9c768a5c66d77232364f04304eb2c9f62166c74d848f36b133ccc6822

Sync Wav:
official_prompt_sync.wav
sha256: 26aa9237606c64eee114d6db879a2528cefcf98e8c4e2da1a393803d470ba895

  • sync / async duration: 8.08s
  • sample count: 193920
  • async max boundary jump: 0.072767

Validation

  • tests/model_executor/stage_input_processors/test_cosyvoice3_stage_input_processors.py: 8 passed
  • tests/model_executor/models/cosyvoice3/test_cosyvoice3_components.py: 13 passed
  • tests/e2e/offline_inference/test_cosyvoice3.py -vv
    cosyvoice3.yaml passed
    cosyvoice3_async_chunk.yaml passed

@indevn indevn requested a review from hsliuustc0106 as a code owner March 6, 2026 05:46
Collaborator

@hsliuustc0106 hsliuustc0106 left a comment


Review

Rating: 9/10 | Verdict: ✅ Approved

Summary

Excellent multi-category PR ([Model][Core]) hardening the CosyVoice3 async_chunk runtime: a comprehensive solution covering state management, inference correctness, and precision trimming, backed by solid benchmark data.

Multi-Category Review Coverage

Primary: [Model] (vllm-omni-contrib)

  • ✅ Stage config added (cosyvoice3_async_chunk.yaml)
  • ✅ Input processor implemented (talker -> code2wav)
  • ✅ Tests for model helpers (+99 lines)
  • ✅ Tests for stage input processors (+191 lines)

Secondary: [Core] (Distributed/Runtime)

  • ✅ State management: cleanup deferred to post-send phase (eliminates race conditions)
  • ✅ Batch-safe code2wav forward pass
  • ✅ Chunk deduplication mechanism
  • ✅ Terminal EOF protocol
  • ✅ Precision trimming optimization
  • ✅ Connector tests (+99 lines)

Highlights

  • Benchmark validated: TTFA -76.5%, audio fidelity maintained
  • Root cause analysis: Detailed explanation of race conditions and dedup needs
  • Test coverage: 389 lines of new tests covering all critical paths
  • Production-ready: Addresses all follow-up issues from #498

Minor Suggestions (non-blocking)

  1. Connector state lifecycle: Consider adding a state machine diagram in comments to clarify the cleanup timing (pre-send vs post-send).

  2. Deduplication edge case: What happens if two consecutive chunks produce identical audio? Current logic dedups, but should this emit a silent frame to maintain stream continuity?

  3. Config documentation: cosyvoice3_async_chunk.yaml could benefit from comments explaining the left_context_size and token_frame_rate parameters.

Pitfalls Check

| Directory | Pitfall | Status |
| --- | --- | --- |
| distributed/omni_connectors/ | Race condition in cleanup | ✅ Fixed |
| model_executor/models/ | Batch-safe forward | ✅ Refactored |
| model_executor/stage_input_processors/ | Deduplication | ✅ Implemented |
| model_executor/stage_configs/ | Config validation | ✅ Complete |

Recommendation

Ready to merge. Thorough implementation with comprehensive tests and validated benchmarks.


Reviewed by OpenClaw with vllm-omni-skills 🦐

Multi-Category Review: Primary=vllm-omni-contrib, Secondary=distributed/runtime patterns


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 56eb5a3ae9


Comment thread vllm_omni/model_executor/models/cosyvoice3/cosyvoice3.py Outdated
# Stage config for running CosyVoice3 with async_chunk architecture
# Stage 0: Talker (text prompt -> speech tokens streamed by chunks)
# Stage 1: Code2Wav (flow matching -> acoustic features -> waveform)
async_chunk: true
Collaborator


for async_chunk yaml, is there any difference compared with default yaml except line 4?

Contributor Author


yes, besides line 4, it also adds the stage0→stage1 streaming connector setup and switches to the async-chunk processor (talker2code2wav_async_chunk).

Collaborator

I'd suggest waiting until #1722 is merged before merging this.

@indevn indevn changed the title [Model][Core] CosyVoice3 async_chunk runtime hardening [Model][Core] Enable async_chunk streaming pipeline for CosyVoice3 Mar 11, 2026
@Gaohan123 Gaohan123 added this to the v0.18.0 milestone Mar 13, 2026
@gcanlin gcanlin requested a review from linyueqian March 23, 2026 02:23
@gcanlin
Collaborator

gcanlin commented Mar 23, 2026

@amy-why-3459 @R2-Y Could you help take a look?

connector_get_max_wait: 300
codec_chunk_frames: 25
codec_left_context_frames: 25
codec_vocab_size: 6561
Contributor


For codec-related config, maybe better to put it in engine args, since it is model-side config?

Collaborator


Seems that Qwen3-Omni also puts them here.

Contributor


okay

Comment thread vllm_omni/entrypoints/openai/serving_speech.py
@indevn indevn force-pushed the feat/cosyvoice3-async-batch-benchmark-signedoff branch from b95b1ba to fa2913e on March 23, 2026 10:58
@linyueqian
Collaborator

can you help add an e2e test for cosyvoice 3 as well? you can learn from qwen3-tts's implementation

@linyueqian
Collaborator

output_async_chunk.wav
i think the quality can be better

Collaborator

@linyueqian linyueqian left a comment


Review: Tested locally on A100 (eager mode, float32)

Generated a WAV with the async_chunk config — pipeline runs end-to-end, but there are two bugs and an audio quality concern.


Bug 1: TTS-only models crash with "This model does not support generation"

File: vllm_omni/engine/async_omni_engine.py ~L544

CosyVoice3 has no comprehension stage, so "generate" is never added to supported_tasks. When SamplingParams is passed (which all TTS callers do), the vLLM input processor rejects it:

ValueError: This model does not support generation

Fix: Add "generate" alongside "speech" when final_output_type == "audio":

if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
    supported_tasks.add("speech")
    supported_tasks.add("generate")  # TTS models use SamplingParams for AR decoding

Bug 2: Unbounded token_frames memory growth

File: vllm_omni/model_executor/stage_input_processors/cosyvoice3.py ~L149-201

token_frames = transfer_manager.code_prompt_token_ids[request_id] grows without bound for the entire request lifetime. After a chunk is emitted, frames older than left_context_size_cfg will never appear as context again. For long utterances this leaks memory proportional to total tokens generated.

Fix: Prune after each emission:

# After building the payload, just before returning it:
if left_context_size_cfg < len(token_frames):
    del token_frames[: len(token_frames) - left_context_size_cfg]
    state["emitted_token_len"] = left_context_size_cfg
else:
    state["emitted_token_len"] = length

Audio quality: audible noise at chunk boundaries

Tested with the async_chunk config and listened to the output WAV. There is audible noise/artifacts at the transitions between chunks. This is likely caused by the left-context trimming in cosyvoice3.py:forward() — the samples_per_token-based crop produces a hard cut at the chunk boundary without any crossfade or overlap-add smoothing.

The sync config (cosyvoice3.yaml) already defines streaming overlap parameters (token_overlap_len=20, mel_overlap_len, mel_window = np.hamming(...), speech_window), but the async_chunk code2wav path in CosyVoice3Model.forward() does a raw audio[crop:] slice at L536-538 instead of applying the Hamming window crossfade that the reference CosyVoice3 streaming implementation uses.

Suggested approach: apply the mel_window / speech_window overlap-add at chunk boundaries in the code2wav forward path, similar to how the reference CosyVoice3Code2Wav streaming parameters are designed to be used. Without this, chunk seams will always be audible.
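
For illustration only, a Hamming-window overlap-add along these lines might look like the sketch below. The names and the chunk layout (each new chunk carrying `overlap` samples that duplicate the previous emission, and being longer than 2 * overlap) are assumptions, not the PR's code or the reference implementation.

import numpy as np

def crossfade_chunks(prev_tail, new_chunk, overlap):
    """Blend the overlapping region between two consecutive audio chunks."""
    window = np.hamming(2 * overlap)
    fade_in, fade_out = window[:overlap], window[overlap:]
    blended = prev_tail * fade_out + new_chunk[:overlap] * fade_in
    emit = np.concatenate([blended, new_chunk[overlap:-overlap]])
    next_tail = new_chunk[-overlap:]  # held back for the next boundary
    return emit, next_tail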


Minor nits

  • cosyvoice3.py:156: if length <= 0 — length is len(list) and can never be negative, so if not token_frames: is clearer.
  • cosyvoice3.py:139: if bool(state.get("terminal_sent", False)) — the bool() wrapper is redundant.
  • The serving_speech.py diff removes Voxtral TTS, Fish Speech, upload_voice/delete_voice — these are unrelated to the async_chunk feature and should be in a separate PR.

All 28 unit tests pass with the two fixes above applied.

@indevn
Contributor Author

indevn commented Mar 24, 2026

output_async_chunk.wav i think the quality can be better

Thanks for your test. I didn't experience any pops and clicks in my previous tests, as the lookahead and transition processing I implemented had already eliminated the audio glitches present in earlier versions.
I'll run a re-test later and check for any discrepancies in my local version.

@linyueqian
Collaborator

linyueqian commented Mar 24, 2026

Test setup details

Model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
GPU: NVIDIA A100-SXM4-80GB (single GPU)
Mode: enforce_eager: true for both stages (skipped CUDA graph capture for faster iteration)
Branch: PR rebased onto current main (5aef6b9)

Prompt audio: Synthetic 3-second 24kHz sine tone (200Hz, exponential decay) — not real speech, so the speaker embedding quality is poor. This likely contributes to the artifacts.
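
(For reference, such a prompt can be generated roughly as follows; the exact decay constant used in the test is not stated, so this is only an approximation.)

import numpy as np

SR = 24000
t = np.arange(3 * SR) / SR                            # 3-second time axis
prompt = np.sin(2 * np.pi * 200.0 * t) * np.exp(-t)   # 200 Hz tone, exponential decay
prompt = prompt.astype(np.float32)                    # e.g. soundfile.write("prompt.wav", prompt, SR)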

Text input:

TEXT = "Hello, this is a test of the CosyVoice three async chunk streaming pipeline."
PROMPT_TEXT = "You are a helpful assistant.<|endofprompt|>Testing my voices. Why should I not?"

Sampling params (Stage 0 / Talker):

SamplingParams(
    temperature=1.0, top_p=0.8, top_k=25,
    repetition_penalty=2.0,
    min_tokens=min_len,  # text_token_len * 2
    max_tokens=max_len,  # text_token_len * 20
    stop_token_ids=[6562],
    detokenize=False,
)

Stage config (modified from cosyvoice3_async_chunk.yaml):

  • Stage 0: gpu_memory_utilization: 0.3, enforce_eager: true
  • Stage 1: gpu_memory_utilization: 0.15, enforce_eager: true
  • Connector: codec_chunk_frames: 25, codec_left_context_frames: 25, codec_vocab_size: 6561

Output: 7.84s audio, 188160 samples @ 24kHz, max amplitude 0.99, RMS 0.14

Note: the synthetic prompt audio (sine tone, not real speech) likely degraded the speaker embedding quality. A retest with a real speech prompt would give a fairer comparison. The chunk boundary artifacts I heard might also be less pronounced with a natural prompt. I will test again using actual speech.

@linyueqian
Collaborator

linyueqian commented Mar 24, 2026

Retest with official CosyVoice3 prompt audio

Retested using the official zero_shot_prompt.wav from the CosyVoice repo (real Chinese female speech, 3.48s @ 24kHz).

Prompt text: "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。" (official transcript)
Synthesis text: "CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities." (from README)

Output: 7.56s, 181440 samples @ 24kHz, RMS=0.091

Chunk boundary analysis (1-second windows):

Chunk 0: RMS=0.0560
Chunk 1: RMS=0.0830, boundary_jump=0.0624
Chunk 2: RMS=0.0732, boundary_jump=0.0256
Chunk 3: RMS=0.0982, boundary_jump=0.0015
Chunk 4: RMS=0.1275, boundary_jump=0.0433
Chunk 5: RMS=0.1240, boundary_jump=0.3959  ← large discontinuity
Chunk 6: RMS=0.0720, boundary_jump=0.0044

Audio quality is noticeably better with a real speech prompt vs the synthetic sine tone used previously. The speaker similarity is reasonable. However, there is still one large amplitude discontinuity at the chunk 5 boundary (jump=0.396), which is audible as a click/pop. The other boundaries are smoother.

Same config as before: enforce_eager: true, GPU A100-80GB, codec_chunk_frames: 25, codec_left_context_frames: 25.
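
(The thread does not define the metrics precisely; one plausible reading, assumed here, is a per-window RMS plus the absolute sample difference across each 1-second window boundary:)

import numpy as np

def chunk_stats(wav, sr=24000, window_s=1.0):
    """Windowed RMS and amplitude jump across each window boundary (assumed metric)."""
    n = int(sr * window_s)
    stats = []
    for i in range(0, len(wav) - n + 1, n):
        window = wav[i:i + n]
        rms = float(np.sqrt(np.mean(window ** 2)))
        jump = abs(float(wav[i] - wav[i - 1])) if i > 0 else 0.0
        stats.append((rms, jump))
    return stats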

output_async_chunk.wav

i can still hear a bit of glitching in the middle.

@indevn
Contributor Author

indevn commented Mar 25, 2026

Retest with official CosyVoice3 prompt audio

Retested using the official zero_shot_prompt.wav from the CosyVoice repo (real Chinese female speech, 3.48s @ 24kHz).

Prompt text: "You are a helpful assistant.<|endofprompt|>希望你以后能够做的比我还好呦。" (official transcript)
Synthesis text: "CosyVoice is undergoing a comprehensive upgrade, providing more accurate, stable, faster, and better voice generation capabilities." (from README)

Output: 7.56s, 181440 samples @ 24kHz, RMS=0.091

Chunk boundary analysis (1-second windows):

Chunk 0: RMS=0.0560
Chunk 1: RMS=0.0830, boundary_jump=0.0624
Chunk 2: RMS=0.0732, boundary_jump=0.0256
Chunk 3: RMS=0.0982, boundary_jump=0.0015
Chunk 4: RMS=0.1275, boundary_jump=0.0433
Chunk 5: RMS=0.1240, boundary_jump=0.3959  ← large discontinuity
Chunk 6: RMS=0.0720, boundary_jump=0.0044

Audio quality is noticeably better with a real speech prompt vs the synthetic sine tone used previously. The speaker similarity is reasonable. However, there is still one large amplitude discontinuity at the chunk 5 boundary (jump=0.396), which is audible as a click/pop. The other boundaries are smoother.

Same config as before: enforce_eager: true, GPU A100-80GB, codec_chunk_frames: 25, codec_left_context_frames: 25.

output_async_chunk.wav

i can still hear a bit of glitching in the middle.

Thank you for testing. I have confirmed that there is indeed an audio popping issue under this setting. I will make alignment fixes based on the upstream CosyVoice3 repository soon. Thanks again!

@indevn indevn force-pushed the feat/cosyvoice3-async-batch-benchmark-signedoff branch from 26f8615 to ce3eb12 on March 28, 2026 13:23
@@ -89,6 +91,53 @@ def _make_buffer(self, *size, dtype, numpy=True):
with maybe_disable_pin_memory_for_ray(self, total_bytes):
return super()._make_buffer(*size, dtype=dtype, numpy=numpy)

def _build_model_sampler_output_token_ids(self) -> list[list[int]]:
Collaborator


do all tts models need this function or just cosyvoice3? cc @linyueqian

Contributor Author


Currently this is only needed for CosyVoice3.
The key difference is that CosyVoice3 opts into prefer_model_sampler, and its custom RAS-style sampler explicitly depends on the decoded token history (output_token_ids) when making the next sampling decision.
Other TTS models such as Qwen3-TTS still use token history indirectly through the default vLLM sampler / model state, but they do not currently consume output_token_ids inside a custom model-level sampler, so this helper is not required for them today.
I kept it in GPUARModelRunner because the issue is in the generic prefer_model_sampler path rather than in a CosyVoice3-only module, and this also makes the contract correct for any future history-dependent model sampler.
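
For context, the sketch below paraphrases why an RAS-style sampler needs the decoded token history; the window size, threshold, and fallback rule are assumptions for illustration, not the actual CosyVoice3 or vllm-omni code.

import torch

def ras_sample(logits, decoded, top_k=25, win_size=10, tau_r=0.1):
    """Draw from the top-k distribution; if the candidate already dominates
    the recent decoded history, fall back to the untruncated distribution."""
    probs = torch.softmax(logits, dim=-1)
    topk_probs, topk_ids = probs.topk(top_k)
    candidate = int(topk_ids[torch.multinomial(topk_probs, 1)])
    if decoded[-win_size:].count(candidate) >= tau_r * win_size:
        candidate = int(torch.multinomial(probs, 1))  # repetition-breaking fallback
    return candidate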


@indevn
Contributor Author

indevn commented Mar 28, 2026

Following up on the popping / overlap issue previously reported on this PR, I pushed a cleaned-up follow-up patch set and re-ran the CosyVoice3 quality validation on a clean branch before cherry-picking it onto the current PR head.

What changed in this follow-up:

  • Reworked the CosyVoice3 async code2wav path to align more closely with the upstream cumulative-mel streaming flow.
  • Switched async chunk decoding to use explicit token_offset-based emitted-suffix semantics, instead of relying on extra waveform-domain overlap stitching (see the sketch after this list).
  • Preserved per-request streaming vocoder state so chunk boundaries do not duplicate or truncate audio content.
  • Added an offline E2E test for CosyVoice3 as requested, following the Qwen3-TTS style.
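
(A rough, hypothetical illustration of the emitted-suffix idea — the real processor tracks more state than this: treat the code2wav output as a cumulative waveform and emit only the samples past the last emitted token offset.)

def emit_new_suffix(cumulative_wav, state, samples_per_token):
    """Return only the audio not yet emitted, then advance the token offset."""
    start = state.get("token_offset", 0) * samples_per_token
    new_audio = cumulative_wav[start:]
    state["token_offset"] = len(cumulative_wav) // samples_per_token
    return new_audio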

sync / async wavs for listening comparison:
official_prompt_sync.wav
official_prompt_async.wav


I've also fixed the CosyVoice3 talker sampler path so the model sampler receives the actual decoded-token history, which is required for correct RAS-style sampling behavior.
As a side note, this PR includes a commit to align the CosyVoice3 talker sampler semantics with upstream.
The upstream sampler is explicitly history-sensitive; it relies on reading the recently generated token list to enforce RAS (Repetition Aware Sampling). I've implemented similar behavior on the vllm-omni side.
A couple of quick notes for context:

  1. Scope/Abstraction: This is not a global feature intended for all TTS models. It is essentially a CosyVoice3 quirk, as it's currently the only model treating the decoded-token history as a first-class input for the sampler. Therefore, I intentionally avoided introducing any heavy global abstractions.
  2. Issue Attribution: The core purpose of this patch is to align the upstream RAS semantics. It is not the primary fix for the popping/overlap audio boundary issues mentioned earlier (those were addressed by the continuity-related commits).
    Implementation-wise, since the root cause lies in the state-passing timing along the model-sampler path, I scoped the fix within GPUARModelRunner. This keeps the workaround localized and avoids polluting the higher-level generic scheduling logic.

cc @hsliuustc0106 @linyueqian

@linyueqian
Collaborator

can you update some benchmark stats? the samples seem to have similar quality compared to the sync one. nice work!

@indevn
Contributor Author

indevn commented Mar 29, 2026

can you update some benchmark stats? the samples seem to have similar quality compared to the sync one. nice work!

ok, i've already updated it in the pr description

@linyueqian
Collaborator

resolve conflicts please

@linyueqian
Collaborator

linyueqian commented Apr 5, 2026

Rebased onto current main (includes #2486 vLLM 0.19.0 compat fixes) and pushed. Also fixed the async_chunk yaml config (model_stage renames, default_sampling_params) and enabled the streaming e2e test.

Offline Benchmark (H100, enforce_eager=true)

| Metric | Sync | Async Chunk |
| --- | --- | --- |
| Avg Latency | 5767ms | 6345ms |
| RTF | 0.78 | 0.86 |
| Audio Duration | 7.36s | 7.36s |

Audio quality is good for both, no audible chunk boundary artifacts.

Online Streaming TTFA (H100, vLLM 0.19.0, enforce_eager=true)

|  | Sync Stream | Async Chunk Stream |
| --- | --- | --- |
| TTFA (Run 1) | 4316ms | 2787ms |
| TTFA (Run 2) | 8704ms | 2789ms |
| Total (Run 1) | 4318ms | 9288ms |

Async chunk TTFA is consistently ~2.8s vs sync 4.3-8.7s (~35-68% improvement). Total end-to-end latency is higher due to connector overhead, but TTFA is the metric that matters for streaming UX.

Both RTFs are < 1 for the offline path. The streaming path overhead is expected since chunks are processed incrementally.

indevn added 2 commits April 4, 2026 20:49
…av hardening

- Add CosyVoice3 async_chunk stage config and connector carry-over for per-request metadata.
- Make stage input processing and code2wav runtime batch-safe, with token safety and device alignment.
- Add basic unit coverage for batched stage inputs and async payload emission.

Signed-off-by: indevn <indevn@outlook.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
…iagnostics

- Call chunk_transfer_adapter.cleanup on AR finished requests to reclaim per-request async state.

- Add warning_once when left-context trim is requested but samples_per_token is unavailable.

- Add helper tests for _split_request_ids/_sanitize_codec_tokens and left-context warning path.

- Add scheduler regression test for AR finished-request cleanup.

- Scope: hardening + diagnostics only; no default strategy/policy behavior change.

Signed-off-by: indevn <indevn@outlook.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
indevn added 9 commits April 4, 2026 20:50
…hunk audio span

Signed-off-by: indevn <indevn@outlook.com>
…de2wav slicing

Signed-off-by: indevn <indevn@outlook.com>
Reuse request_output when falling back to per-completion multimodal outputs.
Behavior is unchanged.

Signed-off-by: indevn <indevn@outlook.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
@linyueqian linyueqian force-pushed the feat/cosyvoice3-async-batch-benchmark-signedoff branch from e544011 to c83b847 on April 5, 2026 01:20
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 5, 2026
@linyueqian linyueqian force-pushed the feat/cosyvoice3-async-batch-benchmark-signedoff branch 4 times, most recently from d236115 to dd84a73 on April 5, 2026 01:55
- Rename model_stage in yaml: talker -> cosyvoice3_talker, code2wav -> cosyvoice3_code2wav
- Fix model_stage checks in RAS sampler/sample() to match renamed stages
- Add default_sampling_params: top_k=25, top_p=0.8 (matching upstream defaults),
  repetition_penalty=1.0001 (near-identity, forces output_token_ids tracking for RAS)
- Fix unit test model_stage references
- Enable streaming e2e test with async_chunk config (core_model level)

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian force-pushed the feat/cosyvoice3-async-batch-benchmark-signedoff branch from dd84a73 to e58e2ba on April 5, 2026 02:27
The CosyVoice3 model's decoder outputs 6761 logits (speech_token_size +
200).  The official inference code treats ALL 200 tokens >= 6561 as stop
signals, but the vLLM implementation was masking 199 of them and only
restoring token 6562.  This funnelled stop probability through a single
token, causing bimodal behaviour: either immediate EOS (silence) or no
EOS at all (excessively long audio).

Fix with three changes:
- compute_logits: merge all 200 stop logits into EOS via logsumexp,
  preserving the correct aggregate stop probability
- gpu_ar_model_runner: apply logit bias (min_tokens enforcement) before
  the custom model sampler — prefer_model_sampler was bypassing it
- serving_speech: compute dynamic min/max tokens for CosyVoice3 based
  on text length, matching the official min_token_text_ratio=2

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
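
(For illustration, the logsumexp merge described in the commit message above could look roughly like this; the token ids come from the message, while the function and variable names are assumed.)

import torch

SPEECH_TOKEN_SIZE = 6561  # ids >= 6561 are the 200 stop tokens
EOS_TOKEN_ID = 6562       # the single stop id vLLM watches for

def merge_stop_logits(logits):
    """Fold all stop-token logits into EOS, preserving total stop probability."""
    merged = torch.logsumexp(logits[..., SPEECH_TOKEN_SIZE:], dim=-1)
    out = logits.clone()
    out[..., SPEECH_TOKEN_SIZE:] = float("-inf")  # mask the other stop ids
    out[..., EOS_TOKEN_ID] = merged
    return out
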
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian enabled auto-merge (squash) April 5, 2026 04:53
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
CLI --enforce-eager does not propagate to per-stage engine args.
Set enforce_eager directly in the YAML for both stages.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
@linyueqian linyueqian merged commit 6fc38e0 into vllm-project:main Apr 5, 2026
8 checks passed
@indevn
Contributor Author

indevn commented Apr 6, 2026

resolve conflicts please

thanks for ur work!

skf-1999 pushed a commit to Semmer2/vllm-omni that referenced this pull request Apr 7, 2026
…llm-project#1703)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…llm-project#1703)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…llm-project#1703)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…llm-project#1703)

Signed-off-by: linyueqian <linyueqian@outlook.com>
Signed-off-by: indevn <indevn@outlook.com>
Co-authored-by: linyueqian <linyueqian@outlook.com>

Labels

ready label to trigger buildkite CI

Projects

None yet


7 participants