[Qwen3TTS][ServingSpeech] Bugfix/voice upload and add optional ref_text by JuanPZuluaga · Pull Request #2046 · vllm-project/vllm-omni

JuanPZuluaga · 2026-03-20T13:06:24Z

Purpose

Fix voice upload + generation on the Base model.
On main, generating with an uploaded voice fails because "generate" isn't registered as a supported task for TTS stages.
We know that an optional ref_text improves cloning performance, so let's add it to the voice upload endpoint to enable in-context cloning (higher quality than x_vector-only mode).

Test Plan

# 1. Launch server (same for both branches)

CUDA_VISIBLE_DEVICES=0 \
  python -m vllm_omni.entrypoints.cli.main serve \
  Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --omni --host 127.0.0.1 --port 8000 \
  --stage-configs-path benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml \
  --trust-remote-code --stage-init-timeout 120

# 2. Upload voice 
curl -X POST http://127.0.0.1:8000/v1/audio/voices \
  -F "audio_sample=@benchmarks/qwen3-tts/sample_0.wav" \
  -F "consent=test" \
  -F "name=my_voice" \
  -F "ref_text=hello, welcome to the voice synthesis benchmark test"

# 3. Generate with uploaded voice
curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "After the quarterly meeting, we should discuss the projected revenue.", "voice": "my_voice", "language": "English"}' \
  --output test_output.wav

Test Result

outputs after the fix:

baseline.wav
gen_icl.wav
gen_xvec.wav

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 746ba821b0

chatgpt-codex-connector · 2026-03-20T13:11:56Z

+        if duration > 20.0:
+            raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 20s supported — use a shorter clip.")


Remove the shared 20s cap from reference-audio loading

_resolve_ref_audio() is used by all speech models, not just Qwen TTS, so this new duration > 20s rejection breaks Fish Speech voice cloning. In _prepare_speech_generation() the Fish Speech path always calls this helper before prompt construction, while our own Fish Speech UI/docs still recommend 10–30s reference clips (examples/online_serving/fish_speech/gradio_demo.py:193-237). Any documented 20–30s sample will now fail with a 400 even though that flow previously worked.
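One way to read this suggestion is to make the cap depend on the model rather than living in the shared helper. The sketch below is only an illustration: `MAX_REF_AUDIO_SECONDS`, `check_ref_audio_duration`, the model keys, and the numeric limits are all hypothetical and are not taken from the vllm-omni codebase.

```python
# Hypothetical per-model caps (illustrative values, not from the repository).
MAX_REF_AUDIO_SECONDS = {"qwen3_tts": 20.0, "fish_speech": 30.0}
DEFAULT_MAX_REF_AUDIO_SECONDS = 30.0


def check_ref_audio_duration(num_samples: int, sample_rate: int, model_type: str) -> float:
    """Return the clip duration in seconds, or raise ValueError if it exceeds the model's cap."""
    if sample_rate <= 0:
        raise ValueError("Sample rate must be positive.")
    duration = num_samples / sample_rate
    # Look up the cap for this model; unknown models fall back to the default.
    limit = MAX_REF_AUDIO_SECONDS.get(model_type, DEFAULT_MAX_REF_AUDIO_SECONDS)
    if duration > limit:
        raise ValueError(
            f"Reference audio too long ({duration:.1f}s). "
            f"Maximum {limit:.0f}s supported for {model_type}."
        )
    return duration
```

With these assumed limits, a 25s clip would pass for the hypothetical `fish_speech` key and be rejected for `qwen3_tts`, so a shared loader no longer breaks one model's documented range.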

chatgpt-codex-connector · 2026-03-20T13:11:56Z

        if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
            supported_tasks.add("speech")
+            # TTS stage-0 is an AR model, so we need to add generate
+            supported_tasks.add("generate")


Avoid advertising generate for speech-only stage graphs

Adding "generate" whenever any stage returns final_output_type == "audio" makes the API server instantiate all text-generation handlers behind api_server.py:611-766, even for pipelines with no comprehension stage. I checked vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml and cosyvoice3.yaml: neither marks a stage as is_comprehension, so before this change those servers only exposed speech APIs; after it they will also advertise /chat/completions and /completions on graphs that are not wired for text generation.
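A possible gating rule, sketched under assumptions: "generate" is advertised only when some stage is marked as a comprehension stage. The function name `resolve_supported_tasks` is hypothetical, and whether `is_comprehension` is actually available in `stage_metadata` (rather than only in the stage config YAML) is an assumption made for illustration.

```python
from typing import Dict, List, Set


def resolve_supported_tasks(stage_metadata: List[Dict]) -> Set[str]:
    """Derive the advertised tasks from per-stage metadata (hypothetical helper)."""
    tasks: Set[str] = set()
    # Any stage that emits audio makes the speech API available.
    if any(m.get("final_output_type") == "audio" for m in stage_metadata):
        tasks.add("speech")
    # Assumption: text generation is exposed only if a stage is flagged as comprehension.
    if any(m.get("is_comprehension") for m in stage_metadata):
        tasks.add("generate")
    return tasks
```

Under this rule a speech-only graph keeps exposing only the speech API, which is the behavior the comment describes for the Fish Speech and CosyVoice3 configs before the change.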

chatgpt-codex-connector · 2026-03-20T13:11:56Z

+                if stored_ref_text:
+                    params["ref_text"] = [stored_ref_text]
+                    params["x_vector_only_mode"] = [False]


Normalize blank uploaded transcripts before enabling ICL

upload_voice() now persists ref_text, and this branch treats any truthy stored value as a signal to force in-context cloning. A transcript like ' ' therefore uploads successfully, but every later synthesis for that voice will fail when qwen3_tts_talker.py:1389-1391 rejects the same value because ref_text.strip() is empty. Stripping or rejecting blank transcripts before setting x_vector_only_mode=False would avoid creating voices that can never be used.
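The normalization could look roughly like the following. This is a minimal sketch: `build_clone_params` is a hypothetical helper, and only the `ref_text` and `x_vector_only_mode` keys come from the diff above.

```python
from typing import Dict, List, Optional


def build_clone_params(stored_ref_text: Optional[str]) -> Dict[str, List]:
    """Enable in-context cloning only when the stored transcript has real content."""
    params: Dict[str, List] = {}
    # Strip first, so whitespace-only transcripts are treated as missing.
    ref_text = (stored_ref_text or "").strip()
    if ref_text:
        params["ref_text"] = [ref_text]
        params["x_vector_only_mode"] = [False]
    return params
```

A transcript such as `" "` then leaves the params untouched instead of forcing a mode that the talker would later reject.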

lishunyang12

Left a few comments. The ref_text plumbing looks correct overall, nice fix for the upload+generate flow.

lishunyang12 · 2026-03-22T17:27:06Z

            wav_np = np.mean(wav_np, axis=-1)
-        return wav_np.tolist(), int(sr)
+        sr = int(sr)
+        duration = len(wav_np) / sr if sr > 0 else 0.0


Duration validation only runs at generation time (_resolve_ref_audio), not at upload time. If someone uploads a 45s clip, upload_voice succeeds but every subsequent generation request will fail with a confusing error. Should validate duration bounds in upload_voice as well (or instead).

Perfect. I added a check there as well.
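For illustration, an upload-time check might share one validator with the generation path. Everything named here (`validate_ref_duration`, `upload_voice` with this signature, the in-memory `store`) is a hypothetical stand-in. The 1s and 30s bounds are the ones quoted in this PR's error messages.

```python
from typing import Dict, Sequence

MIN_REF_AUDIO_SECONDS = 1.0   # lower bound quoted in this PR
MAX_REF_AUDIO_SECONDS = 30.0  # upper bound quoted in this PR


def validate_ref_duration(num_samples: int, sample_rate: int) -> float:
    """Return the duration in seconds, or raise ValueError if it is out of bounds."""
    duration = num_samples / sample_rate if sample_rate > 0 else 0.0
    if duration < MIN_REF_AUDIO_SECONDS:
        raise ValueError(f"Reference audio too short ({duration:.1f}s).")
    if duration > MAX_REF_AUDIO_SECONDS:
        raise ValueError(f"Reference audio too long ({duration:.1f}s).")
    return duration


def upload_voice(store: Dict[str, dict], name: str,
                 samples: Sequence[float], sample_rate: int) -> dict:
    """Validate before persisting, so a bad clip fails at upload rather than at generation."""
    duration = validate_ref_duration(len(samples), sample_rate)
    store[name] = {"name": name, "duration": duration}
    return store[name]
```

Because validation runs before the voice is stored, a 45s clip never ends up as a voice that fails on every later request.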

lishunyang12 · 2026-03-22T17:27:06Z

+                "At least 1s of clear speech is required for speaker embedding."
+            )
+        if duration > 30.0:
+            raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")


This line is ~110 chars. Split the f-string to stay within the line-length limit.

Suggested change

-            raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")
+            raise ValueError(
+                f"Reference audio too long ({duration:.1f}s). "
+                "Maximum 30s supported — use a shorter clip."
+            )

I tried to change it, but the pre-commit run changes it back. I think the line length is correct.

lishunyang12 · 2026-03-22T17:27:06Z

            n = flat.numel()
            if n == 0 or n % q != 0:
-                if n > 0:
+                if n > 1:


Why n > 1 instead of n > 0? When q > 1, a single-element input is still malformed (not divisible by q). Changing this silently swallows the warning for n == 1.
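The difference between the two thresholds can be made concrete with a small stand-alone predicate. This is a sketch only: `should_warn` is a hypothetical name, and the assumption that the inner branch emits a warning is inferred from the comment above, not from the surrounding code.

```python
def should_warn(n: int, q: int, min_n: int) -> bool:
    """Mirror the nested condition: a malformed input warns only if n > min_n."""
    # Malformed means empty, or a length that is not a multiple of q.
    malformed = n == 0 or n % q != 0
    return malformed and n > min_n


# min_n=0 corresponds to the old `if n > 0`, min_n=1 to the new `if n > 1`.
old_behavior = should_warn(n=1, q=4, min_n=0)
new_behavior = should_warn(n=1, q=4, min_n=1)
```

For a single-element input with `q = 4`, the old threshold warns and the new one does not, which is the case the question is about.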

lishunyang12 · 2026-03-22T17:27:06Z

+        assert response.status_code == 200
+        result = response.json()
+        assert result["success"] is True
+        assert result["voice"]["name"] == "test_voice_rt"


This test doesn't assert that ref_text was actually stored. At minimum check result["voice"].get("ref_text") == "Hello world transcript" — otherwise the test passes even if ref_text is silently dropped.
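A sketch of the stronger assertion, using a stand-in for the endpoint so the example runs on its own. `fake_upload_voice` is hypothetical, and the assumption that the real response echoes `ref_text` under `result["voice"]` follows the suggestion above rather than a confirmed API shape.

```python
from typing import Optional


def fake_upload_voice(name: str, ref_text: Optional[str] = None) -> dict:
    """Stand-in for the upload endpoint's JSON response (hypothetical shape)."""
    voice = {"name": name}
    if ref_text is not None:
        voice["ref_text"] = ref_text
    return {"success": True, "voice": voice}


def test_upload_voice_stores_ref_text() -> None:
    result = fake_upload_voice("test_voice_rt", ref_text="Hello world transcript")
    assert result["success"] is True
    assert result["voice"]["name"] == "test_voice_rt"
    # The extra check: fail if ref_text is silently dropped.
    assert result["voice"].get("ref_text") == "Hello world transcript"
```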

JuanPZuluaga · 2026-03-23T07:13:09Z

@lishunyang12 changes done. Btw, I'm currently working on a smarter VoiceManager design for voice caching, shared across all TTS models, so that the speaker embedding is computed only once (possibly at voice-upload time) rather than multiple times during voice cloning. This should bring a noticeable speedup in batched voice-cloning decoding (#1701).
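The compute-once idea could be sketched as a small cache keyed by voice name. This is purely illustrative: `SpeakerEmbeddingCache` and its interface are hypothetical and do not describe the VoiceManager design mentioned above.

```python
from typing import Callable, Dict, List, Sequence


class SpeakerEmbeddingCache:
    """Compute a speaker embedding once per voice and reuse it afterwards (hypothetical)."""

    def __init__(self, compute_fn: Callable[[Sequence[float]], List[float]]) -> None:
        self._compute_fn = compute_fn
        self._cache: Dict[str, List[float]] = {}

    def get(self, voice_name: str, audio: Sequence[float]) -> List[float]:
        # Only the first request for a voice pays the embedding cost.
        if voice_name not in self._cache:
            self._cache[voice_name] = self._compute_fn(audio)
        return self._cache[voice_name]

    def invalidate(self, voice_name: str) -> None:
        # Drop the entry when a voice is deleted or re-uploaded.
        self._cache.pop(voice_name, None)
```

Populating the cache at upload time instead of on first use would move the cost out of the generation path entirely.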

Sy0307 · 2026-03-23T07:33:04Z

> @lishunyang12 changes done. Btw, I'm currently working on a smarter VoiceManager design for voice caching, shared across all TTS models, so that the speaker embedding is computed only once (possibly at voice-upload time) rather than multiple times during voice cloning. This should bring a noticeable speedup in batched voice-cloning decoding (#1701).

I suggest you propose a new RFC to implement it, along with some related work. I think multi-turn real-time voice chat is an important scenario and we can do more work on it. Feel free to reach out anytime.

JuanPZuluaga · 2026-03-23T21:04:53Z

> I suggest you propose a new RFC to implement it, along with some related work. I think multi-turn real-time voice chat is an important scenario and we can do more work on it. Feel free to reach out anytime.

This makes sense. Do you mean a full RFC regarding the voice caching manager? @Sy0307

Sy0307 · 2026-03-24T03:10:20Z

> > I suggest you propose a new RFC to implement it, along with some related work. I think multi-turn real-time voice chat is an important scenario and we can do more work on it. Feel free to reach out anytime.
>
> This makes sense. Do you mean a full RFC regarding the voice caching manager? @Sy0307

Yes, you can propose such an RFC. Additionally, I believe that multi-turn conversations and similar caching features are helpful for any model with real-time interactive multi-turn requirements. We can start with voice models for now, but I hope the design of this RFC can be extended to more real-time models, such as world models or real-time diffusion models. Reference: #1987 (comment)

Feel free to leave your thoughts.

JuanPZuluaga · 2026-03-24T17:41:29Z

> Yes, you can propose such an RFC. Additionally, I believe that multi-turn conversations and similar caching features are helpful for any model with real-time interactive multi-turn requirements. We can start with voice models for now, but I hope the design of this RFC can be extended to more real-time models, such as world models or real-time diffusion models. Reference: #1987 (comment)
>
> Feel free to leave your thoughts.

@Sy0307 this is super cool, btw. In the future we need to work on a way to keep a "chat/conversation" style, but for voice. Maybe even something over WebSockets (ws) for live real-time agents, with some kind of per-conversation statefulness?

linyueqian · 2026-03-25T20:02:42Z

resolve conflicts please, also fix ci

JuanPZuluaga · 2026-03-25T22:25:08Z

@linyueqian done. Thanks!

linyueqian

LGTM

linyueqian · 2026-03-26T13:22:38Z

fix ci please

JuanPZuluaga · 2026-03-26T14:09:21Z

@linyueqian thanks. I'm looking into it.

…xt (vllm-project#2046) Signed-off-by: JuanPZuluaga <juanz9312@gmail.com> Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

JuanPZuluaga added 9 commits March 20, 2026 09:17

support ref_text

db0ef93

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

support ref_text in serving speech

0accaed

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

move log to warning of code2wav

0b89a26

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

6f3e23a

… bugfix/voice-upload-and-ref-text

add to docs voice upload ref_text

e4508b1

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

update readme and voice upload test

75638c3

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

update test

b7140f3

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

add generate to AR stage0

eeb3449

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

746ba82

… bugfix/voice-upload-and-ref-text

JuanPZuluaga requested a review from hsliuustc0106 as a code owner March 20, 2026 13:06

JuanPZuluaga mentioned this pull request Mar 20, 2026

[RFC]: Qwen3-TTS Production Ready - February Milestone #938

Open

chatgpt-codex-connector Bot reviewed Mar 20, 2026

JuanPZuluaga added 6 commits March 20, 2026 13:26

revert 'generate', add cap to 30s and clean ref_text

610809d

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

01e4134

… bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

da52be7

… bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

7473da6

… bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

5160819

… bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

1d38a1f

… bugfix/voice-upload-and-ref-text

lishunyang12 reviewed Mar 22, 2026

JuanPZuluaga added 3 commits March 23, 2026 05:20

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

2fbdfae

… bugfix/voice-upload-and-ref-text

add clone sample limit at voice upload, add ref_text in tests

9ef29d7

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

added min/max global and add check in voice_upload

55a27c1

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

4347880

… bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

5cceff9

… bugfix/voice-upload-and-ref-text

merge main

431ae0b

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

JuanPZuluaga added 2 commits March 25, 2026 06:22

merge main

4c86c88

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

0d19a75

… bugfix/voice-upload-and-ref-text

JuanPZuluaga mentioned this pull request Mar 25, 2026

[Qwen3TTS] [TTS] [Feat] Refactor voice cache manager #2108

Merged

5 tasks

linyueqian added the ready label to trigger buildkite CI label Mar 25, 2026

linyueqian enabled auto-merge (squash) March 25, 2026 15:45

merge main, fix conflicts

fa508d7

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

auto-merge was automatically disabled March 25, 2026 22:22
Head branch was pushed to by a user without write access

linyueqian self-requested a review March 25, 2026 22:43

linyueqian approved these changes Mar 25, 2026

JuanPZuluaga and others added 2 commits March 26, 2026 07:08

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

37e5cb8

… bugfix/voice-upload-and-ref-text

Merge branch 'main' into bugfix/voice-upload-and-ref-text

d00efbb

JuanPZuluaga and others added 5 commits March 26, 2026 13:41

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

83c33ea

… bugfix/voice-upload-and-ref-text

Merge branch 'main' into bugfix/voice-upload-and-ref-text

8d38593

fix ci

9a4552a

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'bugfix/voice-upload-and-ref-text' of https://github.com…

97695e8

…/JuanPZuluaga/vllm-omni into bugfix/voice-upload-and-ref-text

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

f28645e

… bugfix/voice-upload-and-ref-text

linyueqian enabled auto-merge (squash) March 26, 2026 14:04

linyueqian merged commit 574ec99 into vllm-project:main Mar 26, 2026
7 of 8 checks passed

JuanPZuluaga deleted the bugfix/voice-upload-and-ref-text branch March 28, 2026 14:10

linyueqian mentioned this pull request Apr 5, 2026

[Bugfix] Accept 'speaker' as alias for 'voice' in TTS speech API #2424

Merged


4 participants
