[BugFix] qwen3_tts chunk boundary handling logic in initial chunk (IC) #2378
Merged
tzhouam merged 2 commits into vllm-project:main, Apr 1, 2026
Update

| case | before | after | reason |
|---|---|---|---|
| n=16, finished=False | None | (8, 16) | 16<=16 → still IC phase, 16%8==0 → emit |
| n=24, finished=False | (8, 24) | None | normal phase, adjusted=8, 8%16!=0 → hold |
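The "after" column can be reproduced with a small emit/hold predicate. This is a minimal sketch reconstructed from the PR text, not the code in `qwen3_tts.py`: the name `should_emit` and its signature are hypothetical, it models only the `finished=False` path, and it returns just the emit/hold decision rather than the tuples shown in the table.

```python
def should_emit(n: int, chunk_size: int, initial_chunk_size: int) -> bool:
    """Emit/hold decision for n accumulated frames (finished=False only).

    Hypothetical helper reconstructed from the PR description; the real
    function in qwen3_tts.py may be structured differently.
    """
    if n <= 0:
        return False
    # IC is skipped entirely unless initial_chunk_size < chunk_size.
    if not 0 < initial_chunk_size < chunk_size:
        return n % chunk_size == 0
    initial_coverage = (chunk_size // initial_chunk_size) * initial_chunk_size
    if n <= chunk_size:
        # IC phase (inclusive bound): emit on every initial_chunk_size boundary.
        return n % initial_chunk_size == 0
    # Normal phase: count frames past the IC coverage.
    adjusted = n - initial_coverage
    return adjusted % chunk_size == 0


# ic=8, cs=16, matching the "after" column of the table.
print(should_emit(16, 16, 8))  # True  (still IC phase, 16 % 8 == 0)
print(should_emit(24, 16, 8))  # False (adjusted = 8, 8 % 16 != 0)
print(should_emit(32, 16, 8))  # True  (adjusted = 16, first normal emit)
```

The `0 <` part of the guard is an extra safety check added for the sketch; the PR text only mentions `initial_chunk_size < chunk_size`.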
- Added 1 normal-emit verification for `ic=8, cs=16`: `n=32` → `(16, 32)`: first normal emit at `initial_coverage + chunk_size = 16+16 = 32`
- Added 5 new cases for `ic=5, cs=25` (IC evenly divides chunk_size with higher multiplicity):
  - Demonstrates IC filling the entire first chunk: emit at 5, 12→hold, 25→emit, 30→hold, 50→first normal emit
  - Emit interval pattern: `5,5,5,5,5,25,25,...`, a smooth transition with no gap
  - This is the key scenario that exposes the bug in the old `<` logic: with strict `<`, IC would only emit at 5, 10, 15, 20 (skipping 25), then the normal phase wouldn't emit until frame 45, creating a 25-frame gap (as long as a full normal chunk) right after the last IC emit
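The old and new emit schedules for `ic=5, cs=25` can be compared side by side. The sketch below is a reconstruction from the PR text, assuming the old rule used `n < chunk_size` with `(chunk_size - 1) // ic` and the new rule uses `n <= chunk_size` with `chunk_size // ic`; `emit_points` and `intervals` are hypothetical names, and the code assumes `initial_chunk_size < chunk_size` and `finished=False`.

```python
def emit_points(chunk_size: int, initial_chunk_size: int,
                max_frames: int, inclusive: bool) -> list[int]:
    """Frame counts in 1..max_frames at which an emit happens.

    inclusive=True models the fixed rule (<= and chunk_size);
    inclusive=False models the old rule (< and chunk_size - 1).
    Both are reconstructions from the PR text, not the actual source.
    """
    cs, ic = chunk_size, initial_chunk_size
    base = cs if inclusive else cs - 1
    initial_coverage = (base // ic) * ic
    points = []
    for n in range(1, max_frames + 1):
        in_ic_phase = n <= cs if inclusive else n < cs
        if in_ic_phase:
            emit = n % ic == 0
        else:
            emit = (n - initial_coverage) % cs == 0
        if emit:
            points.append(n)
    return points


def intervals(points: list[int]) -> list[int]:
    """Distance between consecutive emits, starting from frame 0."""
    return [b - a for a, b in zip([0] + points, points)]


new = emit_points(25, 5, 75, inclusive=True)
old = emit_points(25, 5, 75, inclusive=False)
print(new, intervals(new))  # [5, 10, 15, 20, 25, 50, 75] [5, 5, 5, 5, 5, 25, 25]
print(old, intervals(old))  # [5, 10, 15, 20, 45, 70] [5, 5, 5, 5, 25, 25]
```

Under these assumptions the old rule never emits at frame 25, so every later normal emit lands 5 frames short of a chunk boundary (45, 70, ...), while the fixed rule emits at 25 and continues at 50, 75.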
Updated comments clarify the IC boundary rule:
# IC phase: length <= chunk_size (uses <=, consistent with fish_speech)
# IC emits fill the entire first chunk_size worth of frames, so the
# normal phase always starts at a clean chunk boundary.
# initial_coverage = (chunk_size // initial_chunk_size) * initial_chunk_size

Signed-off-by: Fattysand <fattysand@users.noreply.github.com>
Contributor

LGTM. Nice catch.

Collaborator

@JuanPZuluaga please also take a look. Thank you!
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request on Apr 2, 2026

Two bugs preventing Base (voice-clone) task from producing correct audio:

1. Speech tokenizer encoder ran in bfloat16, causing ~50% of encoded reference-audio codes to diverge from float32 baseline. The corrupted prompt prevents the talker from generating stop token 2150, producing ~318s of audio instead of ~8s. Fix: load encoder in float32.
2. Cherry-pick chunk boundary fix (vllm-project#2378): off-by-one in initial chunk phase boundary check caused the final codec chunk to be malformed (length 1, not divisible by 16 quantizers), resulting in 0-byte output even when stop token was correctly generated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: linyueqian <linyueqian@outlook.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request on Apr 9, 2026

vllm-project#2378)
Signed-off-by: Fattysand <fattysand@users.noreply.github.com>
Purpose
Fix the initial chunk (IC) coverage logic in `qwen3_tts.py` to align with the correct behavior already implemented in `fish_speech.py`.

Currently, `qwen3_tts.py` uses `<` and `chunk_size - 1`, which constrains IC coverage to strictly less than `chunk_size`, while `fish_speech.py` uses `<=` and no `-1`, allowing IC to cover up to `chunk_size`. This mismatch causes `qwen3_tts.py` to miss the last IC chunk (e.g. `cs=25, ic=5`: IC emits at 5, 10, 15, 20, then jumps to the normal phase emitting 21–45, skipping a 1–25 emit).

Proposed fix (only two lines changed):

Reference (`fish_speech.py`, lines 118 & 131):

Test Plan

This is a minimal two-line logic fix aligning `qwen3_tts.py` with the existing `fish_speech.py` implementation. No additional test scripts are needed: the change is self-contained and the edge cases have been manually verified (see below).

Test Result
Edge cases verified:

- `cs=25, ic=8`: `(24//8)*8 == (25//8)*8 == 24`, behavior unchanged.
- `ic == chunk_size`: guarded by `initial_chunk_size < chunk_size`, so IC is skipped entirely; unaffected.
- `finished=True` during IC: handled by the existing `context_length` logic.
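The first two edge cases come down to integer arithmetic and can be checked directly. The snippet below is a rough sketch with hypothetical helper names (`coverage`, `ic_enabled`); it assumes the old formula was `((chunk_size - 1) // ic) * ic`, as the description implies, and does not model the `finished=True` path.

```python
def coverage(chunk_size: int, initial_chunk_size: int, inclusive: bool) -> int:
    """IC coverage under the new (inclusive=True) or old (inclusive=False) formula.

    Reconstructed from the PR description, not copied from qwen3_tts.py.
    """
    base = chunk_size if inclusive else chunk_size - 1
    return (base // initial_chunk_size) * initial_chunk_size


def ic_enabled(chunk_size: int, initial_chunk_size: int) -> bool:
    """Guard quoted in the PR: IC only runs when initial_chunk_size < chunk_size."""
    return initial_chunk_size < chunk_size


# ic does not divide cs: both formulas agree, so behavior is unchanged.
print(coverage(25, 8, inclusive=False), coverage(25, 8, inclusive=True))  # 24 24

# ic divides cs: the old formula drops the last IC step.
print(coverage(25, 5, inclusive=False), coverage(25, 5, inclusive=True))  # 20 25
print(coverage(16, 8, inclusive=False), coverage(16, 8, inclusive=True))  # 8 16

# ic == chunk_size: the guard disables IC, so the change has no effect.
print(ic_enabled(16, 16))  # False
```

The two formulas differ only when `initial_chunk_size` divides `chunk_size` exactly, which is why the fix changes nothing for `cs=25, ic=8`.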
cc @Sy0307