[BugFix]: Fix Qwen3-TTS code2wav fails when enforce_eager: false #2868

Merged: linyueqian merged 1 commit into vllm-project:main from ChefWu551:fix-code2wav-eager-force on Apr 23, 2026

@ChefWu551
Contributor

@ChefWu551 ChefWu551 commented Apr 17, 2026

Purpose

This PR fixes the issue described in PR #2866.
It also serves as a review of PR #2328.

Test Plan

python /workspace/vllm-omni/benchmarks/qwen3-tts/vllm_omni/bench_tts_serve.py \
--host 127.0.0.1 --port 8899 \
--task-type Base \
--ref-audio /workspace/resource/clone_2.wav \
--ref-text "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you." \
--num-prompts 10 \
--config-name base_baseline \
--result-dir benchmarks/qwen3-tts/results/


Test Result

Warming up with 3 requests...
  Warmup done.
  Running 10 requests with concurrency=4...
  concurrency=4: 100%|██████████████████████████████████████████████████████████████████████████████| 10/10 [00:15<00:00,  1.56s/it]

==================================================
             Serving Benchmark Result             
==================================================
Successful requests:                    10        
Failed requests:                        0         
Maximum request concurrency:            4         
Benchmark duration (s):                 15.60     
Request throughput (req/s):             0.64      
--------------------------------------------------
                End-to-end Latency                
--------------------------------------------------
Mean E2EL (ms):                         5546.95   
Median E2EL (ms):                       5522.70   
P99 E2EL (ms):                          6739.25   
==================================================
                   Audio Result                   
==================================================
Total audio duration generated (s):     42.24     
Audio throughput (audio duration/s):    2.71      
--------------------------------------------------
               Time to First Packet               
--------------------------------------------------
Mean AUDIO_TTFP (ms):                   768.17    
Median AUDIO_TTFP (ms):                 727.03    
P99 AUDIO_TTFP (ms):                    1049.49   
--------------------------------------------------
                 Real Time Factor                 
--------------------------------------------------
Mean AUDIO_RTF:                         1.330     
Median AUDIO_RTF:                       1.436     
P99 AUDIO_RTF:                          1.457     
==================================================
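The aggregate figures in this report can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only: the input values are copied from the benchmark output above, and it assumes that RTF is defined as per-request latency divided by the duration of the generated audio, which the benchmark output itself does not state.

```python
# Figures copied from the "Serving Benchmark Result" output above.
successful_requests = 10
benchmark_duration_s = 15.60
total_audio_s = 42.24
mean_e2el_ms = 5546.95

# Throughput figures are totals divided by wall-clock benchmark time.
request_throughput = successful_requests / benchmark_duration_s
audio_throughput = total_audio_s / benchmark_duration_s

# Assumption: RTF = request latency / generated audio duration.
# A ratio of means is only an approximation of the reported mean of
# per-request ratios, so a small gap to the reported 1.330 is expected.
mean_audio_per_request_s = total_audio_s / successful_requests
approx_rtf = (mean_e2el_ms / 1000) / mean_audio_per_request_s

print(f"request throughput: {request_throughput:.2f} req/s")
print(f"audio throughput:   {audio_throughput:.2f} audio s/s")
print(f"approximate RTF:    {approx_rtf:.2f}")
```

Under that assumed definition, an RTF above 1 means each request takes longer to finish than the audio it produces lasts, even though the aggregate audio throughput exceeds 1 because several requests run concurrently.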

The accuracy

Test Conclusion: Accuracy is satisfactory, and the content of the two audio segments is consistent.

Performance (not as good as shown in the chart)

| Concurrency | Metric | enforce_eager: false | enforce_eager: true |
| --- | --- | --- | --- |
| 1 | Total duration (s) | 19.36 | 19.07 |
| 1 | Request throughput (req/s) | 0.52 | 0.52 |
| 1 | Audio throughput (audio duration/s) | 2.18 | 2.21 |
| 1 | Mean E2E latency (ms) | 1935.29 | 1906.71 |
| 4 | Total duration (s) | 12.19 | 11.86 |
| 4 | Request throughput (req/s) | 0.82 | 0.84 |
| 4 | Audio throughput (audio duration/s) | 3.41 | 3.44 |
| 4 | Mean E2E latency (ms) | 4321.41 | 4251.28 |
| 10 | Total duration (s) | 10.34 | 10.19 |
| 10 | Request throughput (req/s) | 0.97 | 0.98 |
| 10 | Audio throughput (audio duration/s) | 4.01 | 4.24 |
| 10 | Mean E2E latency (ms) | 9225.24 | 9198.95 |

@hsliuustc0106
Collaborator

Fix looks correct. The tuple length check is defensive, which is good.

One question: when enforce_eager: false, what returns an OmniOutput tuple instead of an OmniOutput object? Is it torch.compile or graph mode? Adding a comment explaining the root cause would help future maintainers understand why this conversion is needed.

Also consider: could the check be stricter? For example, verify each tuple element type matches OmniOutput._field_types to catch mismatches earlier?
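For readers following the review, here is a minimal sketch of the kind of defensive conversion and stricter type check being discussed. It assumes `OmniOutput` is a `NamedTuple` (as the mention of `_field_types` suggests); the field names and the helpers `coerce_to_omni_output` and `check_field_types` are hypothetical and are not taken from the vllm-omni code. Note that `NamedTuple._field_types` was removed in Python 3.9, so the sketch reads the annotations through `typing.get_type_hints` instead. The thread does not confirm the root cause of the plain-tuple return, so none is assumed here.

```python
from typing import NamedTuple, get_type_hints


class OmniOutput(NamedTuple):
    # Hypothetical stand-in: the real OmniOutput in vllm-omni has other fields.
    codes: list
    sample_rate: int


def coerce_to_omni_output(result):
    """Return an OmniOutput, rebuilding it from a plain tuple when needed."""
    if isinstance(result, OmniOutput):
        return result
    # Defensive length check before unpacking a bare tuple into the fields.
    if isinstance(result, tuple) and len(result) == len(OmniOutput._fields):
        return OmniOutput(*result)
    raise TypeError(f"cannot convert {type(result).__name__} to OmniOutput")


def check_field_types(output):
    """Stricter check: compare each value against its annotated field type."""
    hints = get_type_hints(OmniOutput)
    for name, value in zip(OmniOutput._fields, output):
        expected = hints[name]
        # Only plain classes can be checked with isinstance; skip other hints.
        if isinstance(expected, type) and not isinstance(value, expected):
            raise TypeError(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
    return output
```

The length check alone accepts any tuple of the right size; the type check catches a mismatched element earlier, at the cost of one pass over the fields per call.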

@hsliuustc0106
Collaborator

Why is the RTF so high (>1)? Which hardware are you using?

@Sy0307
Contributor

Sy0307 commented Apr 18, 2026

Please verify the quality of the generated audio examples.

@ChefWu551
Contributor Author

Why is the RTF so high (>1)? Which hardware are you using?

GPU: NVIDIA RTX 40 series

@ChefWu551
Contributor Author

ChefWu551 commented Apr 20, 2026

Please verify the quality of the generated audio examples.

Good advice! I have tested the quality, and there seems to be no difference.
Here is the use case:


curl -X POST http://127.0.0.1:8899/v1/audio/speech \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/model/ModelScope/Qwen/Qwen3-TTS-12Hz-1.7B-Base",
    "input": "Once upon a time, in a small village, there lived a wise old owl. Every night, the owl would sit atop the tallest tree and share stories with the other animals. One stormy night, a lost little rabbit found its way to the tree. The owl, noticing the rabbit’s fear, invited it to listen to a tale of courage. As the storm raged on, the rabbit felt safe and warm. By morning, the rabbit had learned to face its fears and found the courage to return home. The owl’s stories had once again brought comfort and strength.",
    "task_type": "Base",
    "voice": "clone",
    "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
    "ref_text": "Okay. Yeah. I resent you. I love you. I respect you. But you know what? You blew it! And thanks to you.",
    "response_format": "wav"
  }' --output output_1.wav
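The same request can be issued from Python for scripted quality checks. This is a sketch using only the standard library; the endpoint and payload fields are copied from the curl command above, while the helper names `build_speech_request` and `save_speech` are hypothetical.

```python
import json
import urllib.request

# Endpoint copied from the curl command above.
URL = "http://127.0.0.1:8899/v1/audio/speech"


def build_speech_request(text, url=URL):
    """Build the POST request; payload fields mirror the curl example."""
    payload = {
        "model": "/model/ModelScope/Qwen/Qwen3-TTS-12Hz-1.7B-Base",
        "input": text,
        "task_type": "Base",
        "voice": "clone",
        "ref_audio": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen3-TTS-Repo/clone_2.wav",
        "ref_text": (
            "Okay. Yeah. I resent you. I love you. I respect you. "
            "But you know what? You blew it! And thanks to you."
        ),
        "response_format": "wav",
    }
    return urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def save_speech(request, out_path, timeout=300):
    """Send the request and write the returned audio bytes to out_path."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        with open(out_path, "wb") as f:
            f.write(resp.read())
```

With the server running, `save_speech(build_speech_request("Once upon a time..."), "output_1.wav")` would write the response to disk, which makes it easy to generate the same prompt under both settings and compare the files by ear.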

The result is:

@Sy0307
Contributor

Sy0307 commented Apr 20, 2026

Good verification. Can you compare the performance with eager mode enabled and disabled? Also, please consider the merge order of this PR and #2910.

@ChefWu551
Contributor Author

ChefWu551 commented Apr 20, 2026

Good verification. Can you compare the performance with eager mode enabled and disabled? Also, please consider the merge order of this PR and #2910.

Sure, I am working on this.

@Gaohan123 Gaohan123 added this to the v0.20.0 milestone Apr 20, 2026
@linyueqian linyueqian added the ready label to trigger buildkite CI label Apr 22, 2026
@linyueqian
Collaborator

Please fix the DCO check. Can you update this PR before this Friday? Thanks.

@ChefWu551
Contributor Author

Please fix the DCO check. Can you update this PR before this Friday? Thanks.

Sure.

@ChefWu551
Contributor Author

Good verification. Can you compare the performance with eager mode enabled and disabled? Also, please consider the merge order of this PR and #2910.

Server start command:

vllm serve /model/ModelScope/Qwen/Qwen3-TTS-12Hz-1.7B-Base \
--omni \
--allowed-local-media-path /workspace \
--port 8899 

Before merging PR #2910:

| Concurrency | Metric | enforce_eager: false | enforce_eager: true |
| --- | --- | --- | --- |
| 1 | Total duration (s) | 19.36 | 19.07 |
| 1 | Request throughput (req/s) | 0.52 | 0.52 |
| 1 | Audio throughput (audio duration/s) | 2.18 | 2.21 |
| 1 | Mean E2E latency (ms) | 1935.29 | 1906.71 |
| 4 | Total duration (s) | 12.19 | 11.86 |
| 4 | Request throughput (req/s) | 0.82 | 0.84 |
| 4 | Audio throughput (audio duration/s) | 3.41 | 3.44 |
| 4 | Mean E2E latency (ms) | 4321.41 | 4251.28 |
| 10 | Total duration (s) | 10.34 | 10.19 |
| 10 | Request throughput (req/s) | 0.97 | 0.98 |
| 10 | Audio throughput (audio duration/s) | 4.01 | 4.24 |
| 10 | Mean E2E latency (ms) | 9225.24 | 9198.95 |

After merging PR #2910:

| Concurrency | Metric | enforce_eager: false | enforce_eager: true |
| --- | --- | --- | --- |
| 1 | Total duration (s) | 18.83 | 18.99 |
| 1 | Request throughput (req/s) | 0.53 | 0.53 |
| 1 | Audio throughput (audio duration/s) | 2.24 | 2.22 |
| 1 | Mean E2E latency (ms) | 1882.35 | 1899.04 |
| 4 | Total duration (s) | 11.96 | 12.48 |
| 4 | Request throughput (req/s) | 0.84 | 0.80 |
| 4 | Audio throughput (audio duration/s) | 3.61 | 3.37 |
| 4 | Mean E2E latency (ms) | 4152.56 | 4399.58 |
| 10 | Total duration (s) | 10.53 | 10.26 |
| 10 | Request throughput (req/s) | 0.95 | 0.97 |
| 10 | Audio throughput (audio duration/s) | 3.99 | 4.16 |
| 10 | Mean E2E latency (ms) | 9474.58 | 9241.43 |
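To make the two comparisons easier to read, the mean E2E latency figures above can be expressed as a relative difference between the two settings. The sketch below only re-uses numbers from the tables; the percentages are derived here, not reported in the thread, and the helper `rel_diff_pct` is a hypothetical name.

```python
# Mean E2E latency (ms) from the tables above, keyed by concurrency:
# (eager disabled, eager enabled)
before_2910 = {1: (1935.29, 1906.71), 4: (4321.41, 4251.28), 10: (9225.24, 9198.95)}
after_2910 = {1: (1882.35, 1899.04), 4: (4152.56, 4399.58), 10: (9474.58, 9241.43)}


def rel_diff_pct(non_eager_ms, eager_ms):
    """Percent change of the non-eager latency relative to the eager one."""
    return (non_eager_ms - eager_ms) / eager_ms * 100


def summarize(table):
    """Map each concurrency level to its relative latency difference."""
    return {c: rel_diff_pct(*pair) for c, pair in table.items()}


before_pct = summarize(before_2910)
after_pct = summarize(after_2910)

for label, result in (("before #2910", before_pct), ("after #2910", after_pct)):
    for concurrency, pct in result.items():
        print(f"{label}, concurrency {concurrency}: {pct:+.2f}%")
```

A positive value means the run with eager mode disabled was slower. On these single runs the differences stay within a few percent in either direction, so repeated runs would be needed before reading much into them.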

Signed-off-by: wuyuefeng <565948592@qq.com>
@ChefWu551 ChefWu551 force-pushed the fix-code2wav-eager-force branch 2 times, most recently from bb1a036 to 16d91c6 Compare April 23, 2026 02:42
@ChefWu551
Contributor Author

Please fix the DCO check. Can you update this PR before this Friday? Thanks.

Thanks for the reminder. I have fixed the missing DCO sign-off and updated the PR branch.

Collaborator

@linyueqian linyueqian left a comment

LGTM

@linyueqian linyueqian merged commit c8efdbd into vllm-project:main Apr 23, 2026
8 checks passed
lengrongfu pushed a commit to lengrongfu/vllm-omni that referenced this pull request May 1, 2026
clodaghwalsh17 pushed a commit to clodaghwalsh17/nm-vllm-omni-ent that referenced this pull request May 12, 2026
5 participants