
[Qwen3TTS][ServingSpeech] Bugfix/voice upload and add optional ref_text#2046

Merged
linyueqian merged 31 commits into
vllm-project:mainfrom
JuanPZuluaga:bugfix/voice-upload-and-ref-text
Mar 26, 2026

@JuanPZuluaga
Contributor

Purpose

  • Fix voice upload + generation on the Base model.
  • On main, generating with an uploaded voice fails because "generate" isn't registered as a supported task for TTS stages.
  • An optional ref_text improves cloning performance, so this PR adds it to the voice upload endpoint to enable in-context cloning (higher quality than x_vector-only mode).

Test Plan

1. Launch server (same for both branches)

CUDA_VISIBLE_DEVICES=0 \
  python -m vllm_omni.entrypoints.cli.main serve \
  Qwen/Qwen3-TTS-12Hz-0.6B-Base \
  --omni --host 127.0.0.1 --port 8000 \
  --stage-configs-path benchmarks/qwen3-tts/vllm_omni/configs/qwen3_tts_bs1.yaml \
  --trust-remote-code --stage-init-timeout 120
# 2. Upload voice 
curl -X POST http://127.0.0.1:8000/v1/audio/voices \
  -F "audio_sample=@benchmarks/qwen3-tts/sample_0.wav" \
  -F "consent=test" \
  -F "name=my_voice" \
  -F "ref_text=hello, welcome to the voice synthesis benchmark test"
# 3. Generate with uploaded voice
curl -X POST http://127.0.0.1:8000/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{"input": "After the quarterly meeting, we should discuss the projected revenue.", "voice": "my_voice", "language": "English"}' \
  --output test_output.wav

Test Result

outputs after the fix:

baseline.wav
gen_icl.wav
gen_xvec.wav



Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 746ba821b0

Comment on lines +680 to +681
if duration > 20.0:
    raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 20s supported — use a shorter clip.")

P1: Remove the shared 20s cap from reference-audio loading

_resolve_ref_audio() is used by all speech models, not just Qwen TTS, so this new duration > 20s rejection breaks Fish Speech voice cloning. In _prepare_speech_generation() the Fish Speech path always calls this helper before prompt construction, while our own Fish Speech UI/docs still recommend 10–30s reference clips (examples/online_serving/fish_speech/gradio_demo.py:193-237). Any documented 20–30s sample will now fail with a 400 even though that flow previously worked.
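
One possible way to address this, sketched under assumptions: keep the cap per model family instead of inside the shared helper. `MAX_REF_SECONDS_BY_MODEL`, the family keys, and `check_ref_duration` are hypothetical names, not the project's API. The 20s value comes from the diff above and the 30s value from the Fish Speech recommendation cited in the comment.

```python
from typing import Dict, Optional

# Hypothetical per-model caps in seconds; a missing entry means "no cap here".
MAX_REF_SECONDS_BY_MODEL: Dict[str, float] = {
    "qwen3_tts": 20.0,    # cap from the diff under review
    "fish_speech": 30.0,  # upper end of the documented 10-30s range
}


def check_ref_duration(duration: float, model_family: str) -> None:
    """Raise ValueError if the clip exceeds the cap for this model family."""
    cap: Optional[float] = MAX_REF_SECONDS_BY_MODEL.get(model_family)
    if cap is not None and duration > cap:
        raise ValueError(
            f"Reference audio too long ({duration:.1f}s). "
            f"Maximum {cap:.0f}s supported for {model_family}."
        )
```

With this shape, a 25s clip passes for `fish_speech` but is rejected for `qwen3_tts`, and the shared loader itself stays free of model-specific limits.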

Comment thread vllm_omni/engine/async_omni_engine.py Outdated
Comment on lines +533 to +536
if any(metadata.get("final_output_type") == "audio" for metadata in stage_metadata):
    supported_tasks.add("speech")
    # TTS stage-0 is an AR model, so we need to add generate
    supported_tasks.add("generate")

P1: Avoid advertising generate for speech-only stage graphs

Adding "generate" whenever any stage returns final_output_type == "audio" makes the API server instantiate all text-generation handlers behind api_server.py:611-766, even for pipelines with no comprehension stage. I checked vllm_omni/model_executor/stage_configs/fish_speech_s2_pro.yaml and cosyvoice3.yaml: neither marks a stage as is_comprehension, so before this change those servers only exposed speech APIs; after it they will also advertise /chat/completions and /completions on graphs that are not wired for text generation.

Comment on lines +816 to +818
if stored_ref_text:
    params["ref_text"] = [stored_ref_text]
    params["x_vector_only_mode"] = [False]

P2: Normalize blank uploaded transcripts before enabling ICL

upload_voice() now persists ref_text, and this branch treats any truthy stored value as a signal to force in-context cloning. A transcript like ' ' therefore uploads successfully, but every later synthesis for that voice will fail when qwen3_tts_talker.py:1389-1391 rejects the same value because ref_text.strip() is empty. Stripping or rejecting blank transcripts before setting x_vector_only_mode=False would avoid creating voices that can never be used.
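
A minimal sketch of the normalization suggested here, assuming the stored transcript is a plain string or None. The helper names `normalize_ref_text` and `apply_stored_ref_text` are hypothetical; only the `ref_text` and `x_vector_only_mode` keys come from the diff.

```python
from typing import Any, Dict, Optional


def normalize_ref_text(ref_text: Optional[str]) -> Optional[str]:
    """Return the stripped transcript, or None if it is missing or blank."""
    if ref_text is None:
        return None
    stripped = ref_text.strip()
    return stripped or None


def apply_stored_ref_text(params: Dict[str, Any], stored_ref_text: Optional[str]) -> None:
    """Enable in-context cloning only when a non-blank transcript is stored."""
    cleaned = normalize_ref_text(stored_ref_text)
    if cleaned is not None:
        params["ref_text"] = [cleaned]
        params["x_vector_only_mode"] = [False]
```

Calling `normalize_ref_text` at upload time as well would keep whitespace-only transcripts out of storage, so a voice uploaded with `' '` falls back to x_vector-only mode instead of failing on every later request.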

Collaborator

@lishunyang12 lishunyang12 left a comment

Left a few comments. The ref_text plumbing looks correct overall, nice fix for the upload+generate flow.

wav_np = np.mean(wav_np, axis=-1)
return wav_np.tolist(), int(sr)
sr = int(sr)
duration = len(wav_np) / sr if sr > 0 else 0.0
Collaborator

Duration validation only runs at generation time (_resolve_ref_audio), not at upload time. If someone uploads a 45s clip, upload_voice succeeds but every subsequent generation request will fail with a confusing error. Should validate duration bounds in upload_voice as well (or instead).

Contributor Author

Perfect. I added a check there as well.
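
The upload-time check discussed in this thread could look roughly like the sketch below. It is an illustration, not the code that was merged: `validate_ref_duration` and `store_voice` are hypothetical names, the in-memory `store` dict stands in for whatever persistence `upload_voice` uses, and the 1s and 30s bounds are taken from the error messages in the revised diff.

```python
MIN_REF_SECONDS = 1.0   # lower bound implied by the "At least 1s" message
MAX_REF_SECONDS = 30.0  # upper bound in the revised diff


def validate_ref_duration(num_samples: int, sample_rate: int) -> float:
    """Return the clip duration in seconds, or raise ValueError if out of bounds."""
    duration = num_samples / sample_rate if sample_rate > 0 else 0.0
    if duration < MIN_REF_SECONDS:
        raise ValueError(
            f"Reference audio too short ({duration:.1f}s). "
            "At least 1s of clear speech is required."
        )
    if duration > MAX_REF_SECONDS:
        raise ValueError(
            f"Reference audio too long ({duration:.1f}s). "
            "Maximum 30s supported."
        )
    return duration


def store_voice(store: dict, name: str, num_samples: int, sample_rate: int) -> dict:
    """Hypothetical upload step: validate first, persist only if valid."""
    duration = validate_ref_duration(num_samples, sample_rate)
    entry = {"name": name, "duration": duration}
    store[name] = entry
    return entry
```

Because validation runs before the voice is written, a 45s clip is rejected at upload and never becomes a stored voice that fails on every generation request.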

"At least 1s of clear speech is required for speaker embedding."
)
if duration > 30.0:
    raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")
Collaborator

This line is ~110 chars. Split the f-string to stay within the line-length limit.

Suggested change
- raise ValueError(f"Reference audio too long ({duration:.1f}s). Maximum 30s supported — use a shorter clip.")
+ raise ValueError(
+     f"Reference audio too long ({duration:.1f}s). "
+     "Maximum 30s supported — use a shorter clip."
+ )

Contributor Author

I tried to change it, but the pre-commit run changes it back. I think the line length is correct.

n = flat.numel()
if n == 0 or n % q != 0:
if n > 0:
if n > 1:
Collaborator

Why n > 1 instead of n > 0? When q > 1, a single-element input is still malformed (not divisible by q). Changing this silently swallows the warning for n == 1.

Contributor Author

fixed.
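
To make the reviewer's point concrete, here is a small sketch of the guard with plain integers. `is_malformed` and `should_warn` are hypothetical helpers; `n` and `q` mirror the names in the diff (element count and divisor).

```python
def is_malformed(n: int, q: int) -> bool:
    """True when n elements cannot be split into whole groups of q."""
    return n == 0 or n % q != 0


def should_warn(n: int, q: int) -> bool:
    """Warn for every non-empty malformed input, including n == 1."""
    return n > 0 and is_malformed(n, q)


def should_warn_with_n_gt_1(n: int, q: int) -> bool:
    """The variant questioned in review: it stays silent for n == 1."""
    return n > 1 and is_malformed(n, q)
```

For `q = 4`, `should_warn(1, 4)` is true while `should_warn_with_n_gt_1(1, 4)` is false, which is the silently swallowed warning the comment describes.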

assert response.status_code == 200
result = response.json()
assert result["success"] is True
assert result["voice"]["name"] == "test_voice_rt"
Collaborator

This test doesn't assert that ref_text was actually stored. At minimum check result["voice"].get("ref_text") == "Hello world transcript" — otherwise the test passes even if ref_text is silently dropped.

Contributor Author

added this
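
A sketch of the strengthened assertion, assuming the upload response has the shape used in the test above (`success`, `voice.name`, `voice.ref_text`). The helper name `check_upload_result` is hypothetical.

```python
def check_upload_result(result: dict, name: str, ref_text: str) -> None:
    """Assert that the upload succeeded and the transcript was stored."""
    assert result["success"] is True
    voice = result["voice"]
    assert voice["name"] == name
    # Fails if ref_text was silently dropped on the server side.
    assert voice.get("ref_text") == ref_text
```

In the test this would be called as `check_upload_result(response.json(), "test_voice_rt", "Hello world transcript")`.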

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 23, 2026

@lishunyang12 changes done. By the way, I'm currently working on a smarter VoiceManager for voice caching, shared across all TTS models, so that the speaker embedding is computed only once (perhaps at voice-upload time) instead of on every voice-cloning request. This should bring a noticeable speedup in batched voice-cloning decoding (#1701).
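
The compute-once idea can be illustrated with a small cache. This is only a sketch of the general technique, not the planned VoiceManager design: `VoiceEmbeddingCache` and `embed_fn` are hypothetical, and a real embedding would be a tensor rather than a list of floats.

```python
from typing import Callable, Dict, List


class VoiceEmbeddingCache:
    """Compute a speaker embedding once per voice name and reuse it."""

    def __init__(self, embed_fn: Callable[[str], List[float]]) -> None:
        self._embed_fn = embed_fn
        self._cache: Dict[str, List[float]] = {}

    def get(self, voice_name: str) -> List[float]:
        """Return the cached embedding, computing it on first use."""
        if voice_name not in self._cache:
            self._cache[voice_name] = self._embed_fn(voice_name)
        return self._cache[voice_name]

    def invalidate(self, voice_name: str) -> None:
        """Drop a cached embedding, e.g. when a voice is re-uploaded."""
        self._cache.pop(voice_name, None)
```

Calling `get` at voice-upload time would warm the cache, so batched cloning requests for the same voice skip the embedding step entirely.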

@Sy0307
Contributor

Sy0307 commented Mar 23, 2026

I suggest you propose a new RFC to implement it, along with some related work. I think multi-turn real-time voice chat is an important scenario and we can do more work on it. Feel free to reach out anytime.

@JuanPZuluaga
Contributor Author

This makes sense, do you mean a full RFC regarding Voice Caching manager? @Sy0307

@Sy0307
Contributor

Sy0307 commented Mar 24, 2026

Yes, you can propose such an RFC. Additionally, I believe that multi-turn conversations and similar caching features are helpful for any model with real-time interactive multi-turn requirements. We can start with voice models for now, but I hope the design of this RFC can be extended to more real-time models, such as world models or real-time diffusion models. Reference: #1987 (comment)

Feel free to leave your thoughts.

@JuanPZuluaga
Contributor Author

JuanPZuluaga commented Mar 24, 2026

@Sy0307 this is super cool, by the way. In the future we need to work on a way to keep a "chat/conversation" style, but for voice. Maybe even something over WebSockets (ws) for live real-time agents, with some kind of per-conversation state?

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
@linyueqian linyueqian added the ready label to trigger buildkite CI label Mar 25, 2026
@linyueqian linyueqian enabled auto-merge (squash) March 25, 2026 15:45
@linyueqian
Collaborator

linyueqian commented Mar 25, 2026

resolve conflicts please, also fix ci

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
auto-merge was automatically disabled March 25, 2026 22:22

Head branch was pushed to by a user without write access

@JuanPZuluaga
Contributor Author

@linyueqian done. Thanks!

@linyueqian linyueqian self-requested a review March 25, 2026 22:43
Collaborator

@linyueqian linyueqian left a comment

LGTM

@linyueqian
Collaborator

fix ci please

@linyueqian linyueqian enabled auto-merge (squash) March 26, 2026 14:04
@JuanPZuluaga
Contributor Author

@linyueqian thanks. I'm looking into it.

@linyueqian linyueqian merged commit 574ec99 into vllm-project:main Mar 26, 2026
7 of 8 checks passed
@JuanPZuluaga JuanPZuluaga deleted the bugfix/voice-upload-and-ref-text branch March 28, 2026 14:10
vraiti pushed a commit to vraiti/vllm-omni that referenced this pull request Apr 9, 2026
…xt (vllm-project#2046)

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
…xt (vllm-project#2046)

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
…xt (vllm-project#2046)

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Co-authored-by: Yueqian Lin <70319226+linyueqian@users.noreply.github.com>

Labels

ready label to trigger buildkite CI


4 participants