[BugFix] qwen3_tts chunk boundary handling logic in initial chunk (IC) #2378

Merged
tzhouam merged 2 commits into vllm-project:main from Fattysand:fix/qwen3-tts-chunk-boundary
Apr 1, 2026

Fattysand commented Mar 31, 2026

Purpose

Fix the initial chunk (IC) coverage logic in qwen3_tts.py to align with the correct behavior already implemented in fish_speech.py.

Currently, qwen3_tts.py uses < and chunk_size - 1, which restricts IC coverage to strictly less than chunk_size, while fish_speech.py uses <= with no -1, allowing IC to cover up to chunk_size. Because of this mismatch, qwen3_tts.py misses the last IC emit whenever initial_chunk_size evenly divides chunk_size (e.g. cs=25, ic=5: IC emits at 5, 10, 15, 20, then jumps to the normal phase, which emits 21–45, skipping a 1–25 emit).

Proposed fix (only two lines changed):

# line 215: < → <=
in_initial_phase = initial_chunk_size > 0 and initial_chunk_size < chunk_size and length <= chunk_size

# lines 227-229: remove -1
initial_coverage = (
    (chunk_size // initial_chunk_size) * initial_chunk_size if 0 < initial_chunk_size < chunk_size else 0
)

Reference — fish_speech.py (lines 118 & 131):

in_initial_phase = initial_chunk_size > 0 and length <= chunk_size
initial_coverage = (chunk_size // initial_chunk_size) * initial_chunk_size if initial_chunk_size > 0 else 0
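
To make the cs=25, ic=5 example concrete, here is a minimal sketch that lists the lengths at which an emit would fire under the old and the fixed boundary. It is a simplified model, not the code in qwen3_tts.py: should_emit and emit_points are hypothetical names, the finished flag is ignored, and the emit rule (length % initial_chunk_size in the IC phase, (length - initial_coverage) % chunk_size afterwards) is an assumption inferred from this PR's description.

```python
def should_emit(length, chunk_size, initial_chunk_size, fixed=True):
    """Simplified emit decision (hypothetical helper; ignores `finished`)."""
    ic_enabled = 0 < initial_chunk_size < chunk_size
    if fixed:
        # Fixed logic: IC phase includes length == chunk_size, no "- 1".
        in_initial_phase = ic_enabled and length <= chunk_size
        span = chunk_size
    else:
        # Old logic: strict "<" and coverage computed from chunk_size - 1.
        in_initial_phase = ic_enabled and length < chunk_size
        span = chunk_size - 1
    if in_initial_phase:
        return length % initial_chunk_size == 0
    initial_coverage = (
        (span // initial_chunk_size) * initial_chunk_size if ic_enabled else 0
    )
    return (length - initial_coverage) % chunk_size == 0


def emit_points(max_length, chunk_size, initial_chunk_size, fixed=True):
    """All lengths in 1..max_length at which an emit would fire."""
    return [
        n
        for n in range(1, max_length + 1)
        if should_emit(n, chunk_size, initial_chunk_size, fixed)
    ]


old_points = emit_points(75, 25, 5, fixed=False)
new_points = emit_points(75, 25, 5, fixed=True)
print(old_points)  # [5, 10, 15, 20, 45, 70]
print(new_points)  # [5, 10, 15, 20, 25, 50, 75]
```

Under this model the old boundary never emits at 25, so every later emit sits 20 frames off the chunk grid, while the fixed boundary lands on 25, 50, 75.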

Test Plan

This is a minimal two-line logic fix aligning qwen3_tts.py with the existing fish_speech.py implementation. No additional test scripts are needed — the change is self-contained and the edge cases have been manually verified (see below).

Test Result

Edge cases verified:

  • Non-divisible (cs=25, ic=8): (24//8)*8 == (25//8)*8 == 24, behavior unchanged.
  • ic == chunk_size: Guarded by initial_chunk_size < chunk_size, IC skipped entirely — unaffected.
  • finished=True during IC: Handled by existing context_length logic.
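
The first two edge cases can be checked with a few lines of arithmetic. The sketch below is illustrative only: initial_coverage is a hypothetical standalone helper that mirrors the expression quoted in this PR, with a fixed switch to compare the old chunk_size - 1 form against the new one. The sweep at the end relies on the fact that (cs - 1) // ic and cs // ic differ only when ic divides cs.

```python
def initial_coverage(chunk_size, initial_chunk_size, fixed=True):
    """Frames covered by IC emits (hypothetical helper mirroring the PR's expression)."""
    if not 0 < initial_chunk_size < chunk_size:
        return 0  # IC disabled, including ic == chunk_size
    span = chunk_size if fixed else chunk_size - 1
    return (span // initial_chunk_size) * initial_chunk_size


# Non-divisible case: old and new agree.
print(initial_coverage(25, 8, fixed=False), initial_coverage(25, 8))  # 24 24
# Evenly dividing case: only here does the fix change the result.
print(initial_coverage(25, 5, fixed=False), initial_coverage(25, 5))  # 20 25
# ic == chunk_size: guarded, IC skipped entirely.
print(initial_coverage(25, 25))  # 0

# Sweep a small grid and collect every (cs, ic) pair whose coverage changed.
changed = [
    (cs, ic)
    for cs in range(2, 65)
    for ic in range(1, cs)
    if initial_coverage(cs, ic, fixed=False) != initial_coverage(cs, ic)
]
print(all(cs % ic == 0 for cs, ic in changed))  # True
```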

cc @Sy0307

@Fattysand Fattysand force-pushed the fix/qwen3-tts-chunk-boundary branch from da3e643 to fd4ae7e Compare March 31, 2026 11:50
@tzhouam tzhouam self-requested a review March 31, 2026 11:52
@tzhouam tzhouam added the ready label to trigger buildkite CI label Mar 31, 2026
@tzhouam tzhouam requested a review from linyueqian March 31, 2026 11:55
@Fattysand

Update test_qwen3_tts_async_chunk.py to match corrected IC boundary logic

The existing test cases for the "IC evenly divides chunk_size" edge case (ic=8, cs=16) were written under the assumption that IC phase uses strict < (i.e., length < chunk_size). This contradicts the <= boundary used in both fish_speech.py and the corrected qwen3_tts.py.

What changed in tests:

  1. Fixed 2 incorrect expectations for ic=8, cs=16:

| case | before | after | reason |
|---|---|---|---|
| n=16, finished=False | None | (8, 16) | 16<=16 → still IC phase, 16%8==0 → emit |
| n=24, finished=False | (8, 24) | None | normal phase, adjusted=8, 8%16!=0 → hold |

  2. Added 1 normal-emit verification for ic=8, cs=16:

    • n=32 → (16, 32): first normal emit at initial_coverage + chunk_size = 16+16 = 32
  3. Added 5 new cases for ic=5, cs=25 (IC evenly divides chunk_size with higher multiplicity):

    • Demonstrates IC filling the entire first chunk: emit at 5, 12→hold, 25→emit, 30→hold, 50→first normal emit
    • Emit interval pattern: 5,5,5,5,5,25,25,... — smooth transition with no gap
    • This is the key scenario that exposes the bug in the old < logic: with strict <, IC would only emit at 5,10,15,20 (skipping 25), then normal phase wouldn't emit until frame 45, creating a 25-frame gap between emits (20 → 45) while the first chunk boundary at 25 is never emitted

Updated comments clarify the IC boundary rule:

# IC phase: length <= chunk_size  (uses <=, consistent with fish_speech)
# IC emits fill the entire first chunk_size worth of frames, so the
# normal phase always starts at a clean chunk boundary.
# initial_coverage = (chunk_size // initial_chunk_size) * initial_chunk_size
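
The emit/hold expectations listed above can be replayed against a small model of the corrected rule. This is a sketch under assumptions, not the contents of test_qwen3_tts_async_chunk.py: should_emit is a hypothetical helper, finished is ignored, and only the emit-or-hold outcome is checked, not the tuple values the real tests return.

```python
def should_emit(length, chunk_size, initial_chunk_size):
    """Post-fix emit decision, simplified (hypothetical; no `finished` handling)."""
    ic_enabled = 0 < initial_chunk_size < chunk_size
    if ic_enabled and length <= chunk_size:
        # IC phase: emit on every multiple of initial_chunk_size.
        return length % initial_chunk_size == 0
    coverage = (
        (chunk_size // initial_chunk_size) * initial_chunk_size if ic_enabled else 0
    )
    # Normal phase: emit once a full chunk_size has accumulated past IC coverage.
    return (length - coverage) % chunk_size == 0


# (ic, cs, n, expect_emit), taken from the cases described in this comment.
CASES = [
    (8, 16, 16, True),   # 16 <= 16, still IC phase, 16 % 8 == 0
    (8, 16, 24, False),  # normal phase, adjusted = 8
    (8, 16, 32, True),   # first normal emit at 16 + 16
    (5, 25, 5, True),
    (5, 25, 12, False),
    (5, 25, 25, True),
    (5, 25, 30, False),
    (5, 25, 50, True),   # first normal emit at 25 + 25
]
results = [should_emit(n, cs, ic) == expected for ic, cs, n, expected in CASES]
print(all(results))  # True

# Emit interval pattern for ic=5, cs=25 over the first 75 frames.
points = [n for n in range(1, 76) if should_emit(n, 25, 5)]
intervals = [b - a for a, b in zip([0] + points, points)]
print(intervals)  # [5, 5, 5, 5, 5, 25, 25]
```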

Signed-off-by: Fattysand <fattysand@users.noreply.github.com>
@Fattysand Fattysand force-pushed the fix/qwen3-tts-chunk-boundary branch from b7204cc to 35c3a94 Compare March 31, 2026 13:27
Sy0307 commented Mar 31, 2026

LGTM. Nice catch.

@linyueqian

@JuanPZuluaga please also take a look. thank you!

@linyueqian linyueqian left a comment

LGTM

@tzhouam tzhouam merged commit 7274e15 into vllm-project:main Apr 1, 2026
7 of 8 checks passed
linyueqian added a commit to linyueqian/vllm-omni that referenced this pull request Apr 2, 2026
Two bugs preventing Base (voice-clone) task from producing correct audio:

1. Speech tokenizer encoder ran in bfloat16, causing ~50% of encoded
   reference-audio codes to diverge from float32 baseline. The corrupted
   prompt prevents the talker from generating stop token 2150, producing
   ~318s of audio instead of ~8s. Fix: load encoder in float32.

2. Cherry-pick chunk boundary fix (vllm-project#2378): off-by-one in initial chunk
   phase boundary check caused the final codec chunk to be malformed
   (length 1, not divisible by 16 quantizers), resulting in 0-byte output
   even when stop token was correctly generated.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: linyueqian <linyueqian@outlook.com>
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
vllm-project#2378)

Signed-off-by: Fattysand <fattysand@users.noreply.github.com>