[Feat][Qwen3TTS] reduce TTFA with flexible initial phase#1583
[Feat][Qwen3TTS] reduce TTFA with flexible initial phase#1583hsliuustc0106 merged 22 commits intovllm-project:mainfrom
Conversation
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
Signed-off-by: pablo <pablo@agigo.ai>
|
Can we reduce TTFP by adjusting chunk_size? |
Indeed, but then:
With this approach, at least, we know that the TTFA can go below 500ms, without compromising too much the quality and not too much overhead. We could even increase the chunk size to higher values to have better audio quality without increasing TTFA. |
Signed-off-by: pablo <pablo@agigo.ai>
|
please let me know what you think @amy-why-3459 @linyueqian. TTFC can go to below ~300ms Also, should i add some test? |
|
Thank you so much for your contribution, it's a great idea. May I ask what your test scenario is? For example, concurrency and input/output length? |
|
Do you mean a real use case? If it's that, ideally, we would like to reduce TTFC on high concurrency loads for voice assistants, where we need very low latency for generating the first audio to the user. For instance, in batched offline decoding scenarios, i wouldn't see much importance to this value. |
…size Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-config-ttfp
|
It makes sense to me that there is Additionally, I have a question: is it better to implement initial_codec_chunk_frames as a stage-level configuration or a request-level configuration? Or could both exist, with the request-level config taking higher priority? Does this make sense? Welcome to discuss. |
this is a good idea actually, though it means adding yet more params to be set request-level. Let me know what you think, and we can implement it. |
Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-config-ttfp
|
I'd like to ask about the test scenarios in your Test Result. Also, could you please adapt this solution to qwen3-omni and mimo_audio as well? |
|
what are the definitions of TTFA/TTFP? |
Sorry about that i've mixed both, but they mean the same: Time to First Audio or Time to first Packet. So far, i hear glitches while doing "live streaming" due to the fact that the model cannot keep with real time processing (meaning RTFx is less than one). I expect it to be fast once we get the different models compiled, etc. |
|
Thanks for this feature! The TTFA improvement at I had a question about the warmup to normal phase transition for longer responses. Correct me if my understanding is wrong: At I suspect the benchmark avoids this because the test prompts are short enough (~25 frames, roughly 1 second) that the entire response is served within the initial phase and the normal phase is never entered. Would it be possible to test with a longer prompt (say, 5 to 10 seconds of output) to confirm? Also a small terminology question: the codebase already uses "warmup" in a few other contexts (model warmup, benchmark warmup requests, cache-dit warmup steps). Would a different name like "startup phase" or "initial-chunk phase" avoid ambiguity? Happy to defer to whatever the team prefers! |
Indeed! It was very surprising to me, that with only 2 chunks, the performance is so good! I did a signal processing analysis, and the difference is minimal. I'd put the script but it's too long...
you are correct, i can notice the glitches in the first In any case, if the current functionality is supported, further future optimizations already planned will allow us to use very low
I will try with longer sequences, but in my erlier experiments, i didn't see any issue TBH.
Startup phase sounds good! I'll change this! thanks for the feedback. |
@amy-why-3459 we would need to wait for #1423 in order to add the same functionality in qwen3-omni |
Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-config-ttfp
For example, the current configurations could be included as request-level parameters in something like |
Signed-off-by: pablo <pablo@agigo.ai>
this sounds cool, is it worth doing it? I can add the variable chunk size that goes into @amy-why-3459 @linyueqian i changed warmup to "initial phase" everywhere. Please let me know which are other things to try, so i can try, otherwise, we could check for merge? |
Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-config-ttfp
|
Ok, updated with configurable Test results below |
Signed-off-by: pablo <pablo@agigo.ai>
… feat/qwen3tts-config-ttfp
why RTF becomes worse in this scenario? |
|
I think they're very similar @hsliuustc0106. Could you please point out to which experiment specifically? |
in the first figure, concurrency at 1/4, the rtf looks higher when IC=2 |
Oh you’re correct. In this setting, probably the small RTF overhead is due to doing 12 forward passes of the code2wav (of ic2, 2x12=24 chunks) instead of 1 forward passes in ic0, which does it once we have cached 25 chunks. |
| window_frames = transfer_manager.code_prompt_token_ids[request_id][-end_index:] | ||
| if in_initial_phase: | ||
| # Initial-chunk phase: emit every initial_chunk_size frames with full accumulated context. | ||
| already_sent = transfer_manager.put_req_chunk[request_id] * initial_chunk_size |
There was a problem hiding this comment.
Where does put_req_chunk get incremented after a chunk is emitted? This function reads it but I don't see the write side in this diff.
There was a problem hiding this comment.
I think it is done already in:
specifically in ChunkTransferAdapter._send_single_request().
… feat/qwen3tts-config-ttfp
|
lgtm. BTW I believe this should probably become a general method that can be used to reduce TTFA/TTFV for all models with high real-time requirements. |
Thanks! yeah, which models do you think we should target? qwen3-omni is down in line. I can work on it. Also, I think we should implement a smarter way to select this chunk. |
qwen3.5-omni is on the way :) |
we should provide an elegant abstraction for this dynamic IC selection |
I’ll brainstorm a bit, and come with an useful abstraction based on the load in the ‘code2wav’ block. |
…t#1583) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai>
### vllm-omni-api - Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" - Changes: - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" ### vllm-omni-contrib - Source: [PR #1724](vllm-project/vllm-omni#1724) - Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" - Changes: - New feature: Revert "[Profile] Adding metrics for Diffusion/DiT Single diffusion Pipeline (#668)" ### vllm-omni-api - Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]: Add vae-patch-parallel CLI argument in online serving - Changes: - New feature: [Feature]: Add vae-patch-parallel CLI argument in online serving ### vllm-omni-contrib - Source: [PR #1716](vllm-project/vllm-omni#1716) - [Feature]: Add vae-patch-parallel CLI argument in online serving - Changes: - New feature: [Feature]: Add vae-patch-parallel CLI argument in online serving ### vllm-omni-contrib - Source: [PR #1693](vllm-project/vllm-omni#1693) - [skip CI][Docs] Add TTS model developer guide - Changes: - New feature: [skip CI][Docs] Add TTS model developer guide ### vllm-omni-audio-tts - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-distributed - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-perf - Source: [PR #1688](vllm-project/vllm-omni#1688) - [MiMo-Audio] Bugfix tp lg than 1 - Changes: - Bug fix: [MiMo-Audio] Bugfix tp lg than 1 ### vllm-omni-perf - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-distributed - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-api - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Additions: - `/v1/audio/speech` ### vllm-omni-quantization - Source: [PR #1687](vllm-project/vllm-omni#1687) - [BugFix] Return proper HTTP status for ErrorResponse in create_speech - Changes: - Bug fix: [BugFix] Return proper HTTP status for ErrorResponse in create_speech ### vllm-omni-cicd - Source: [PR #1683](vllm-project/vllm-omni#1683) - [CI] Remove high concurrency tests before issue #1374 fixed. - Changes: - Bug fix: [CI] Remove high concurrency tests before issue #1374 fixed. ### vllm-omni-audio-tts - Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS - Changes: - New feature: Add non-async chunk support for Qwen3-TTS ### vllm-omni-cicd - Source: [PR #1678](vllm-project/vllm-omni#1678) - Add non-async chunk support for Qwen3-TTS - Changes: - New feature: Add non-async chunk support for Qwen3-TTS ### vllm-omni-cicd - Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type ### vllm-omni-perf - Source: [PR #1677](vllm-project/vllm-omni#1677) - Replace hard-coded cuda generator with current_omni_platform.device_type ### vllm-omni-serving - Source: [PR #1675](vllm-project/vllm-omni#1675) - [Misc] remove logits_processor_pattern this field, because vllm have … ### vllm-omni-cicd - Source: [PR #1666](vllm-project/vllm-omni#1666) - [Cleanup] Move cosyvoice3 tests to model subdirectory ### vllm-omni-audio-tts - Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder - Changes: - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder ### vllm-omni-cicd - Source: [PR #1664](vllm-project/vllm-omni#1664) - [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder - Changes: - Bug fix: [Bugfix] Fix all-silence TTS output: use float32 for speech tokenizer decoder ### vllm-omni-distributed - Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk ### vllm-omni-contrib - Source: [PR #1656](vllm-project/vllm-omni#1656) - [Optimize][Qwen3-Omni] Reduce inter-packet latency in async chunk ### vllm-omni-quantization - Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models - Changes: - New feature: [UX] Add progress bar for diffusion models ### vllm-omni-perf - Source: [PR #1652](vllm-project/vllm-omni#1652) - [UX] Add progress bar for diffusion models - Changes: - New feature: [UX] Add progress bar for diffusion models ### vllm-omni-distributed - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-quantization - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-perf - Source: [PR #1651](vllm-project/vllm-omni#1651) - docs: Announce vllm-omni-skills community project ### vllm-omni-contrib - Source: [PR #1649](vllm-project/vllm-omni#1649) - [Misc] update wechat ### vllm-omni-perf - Source: [PR #1642](vllm-project/vllm-omni#1642) - [chore] add _repeated_blocks for regional compilation support - Changes: - New feature: [chore] add _repeated_blocks for regional compilation support ### vllm-omni-api - Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes - Changes: - New feature: [Bugfix] Add TTS request validation to prevent engine crashes ### vllm-omni-cicd - Source: [PR #1641](vllm-project/vllm-omni#1641) - [Bugfix] Add TTS request validation to prevent engine crashes - Changes: - New feature: [Bugfix] Add TTS request validation to prevent engine crashes ### vllm-omni-image-gen - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer - Additions: - text-to-image - Text-to-Image - Flux ### vllm-omni-quantization - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer - Additions: - FP8 support or improvements ### vllm-omni-contrib - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer ### vllm-omni-perf - Source: [PR #1640](vllm-project/vllm-omni#1640) - [FP8 Quantization] Add FP8 quantization support for Flux transformer - Changes: - New feature: [FP8 Quantization] Add FP8 quantization support for Flux transformer ### vllm-omni-contrib - Source: [PR #1631](vllm-project/vllm-omni#1631) - [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup - Changes: - Bug fix: [BugFix] Fix LongCat Sequence Parallelism / Small Cleanup ### vllm-omni-cicd - Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases ### vllm-omni-perf - Source: [PR #1628](vllm-project/vllm-omni#1628) - [Test][Qwen3-Omni]Modify Qwen3-Omni benchmark test cases ### vllm-omni-perf - Source: [PR #1619](vllm-project/vllm-omni#1619) - [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context - Changes: - Bug fix: [Bugfix] Fix Qwen3-TTS code predictor crash due to missing vLLM config context ### vllm-omni-perf - Source: [PR #1617](vllm-project/vllm-omni#1617) - [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph - Changes: - Performance improvement: [Refactor][Perf] Qwen3-TTS: re-prefill Code Predictor with torch.compile + enable Code2Wav decoder CUDA Graph ### vllm-omni-contrib - Source: [PR #1615](vllm-project/vllm-omni#1615) - [Doc] Fix links in the configuration doc - Changes: - Bug fix: [Doc] Fix links in the configuration doc ### vllm-omni-audio-tts - Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor - Changes: - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor ### vllm-omni-perf - Source: [PR #1614](vllm-project/vllm-omni#1614) - perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor - Changes: - Performance improvement: perf: replace per-element .item() GPU syncs with batch .tolist() in TTS code predictor ### vllm-omni-image-gen - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Additions: - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image - GLM-Image ### vllm-omni-api - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation ### vllm-omni-perf - Source: [PR #1609](vllm-project/vllm-omni#1609) - [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation - Changes: - Bug fix: [Bugfix] Fix filepath resolution for model with subdir and GLM-Image generation ### vllm-omni-contrib - Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance ### vllm-omni-perf - Source: [PR #1604](vllm-project/vllm-omni#1604) - [Model]: support Helios from ByteDance ### vllm-omni-serving - Source: [PR #1602](vllm-project/vllm-omni#1602) - [Bugfix] fix kernel error for qwen3-omni - Changes: - Bug fix: [Bugfix] fix kernel error for qwen3-omni ### vllm-omni-distributed - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-image-gen - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Additions: - HunyuanImage3 - HunyuanImage3Pipeline - HunyuanImage3 - HunyuanImage-3 - HunyuanImage-3 - HunyuanImage-3 - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage3Pipeline - HunyuanImage-3 ### vllm-omni-quantization - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-perf - Source: [PR #1598](vllm-project/vllm-omni#1598) - [BugFix] Fix load_weights error when loading HunyuanImage3.0 - Changes: - Bug fix: [BugFix] Fix load_weights error when loading HunyuanImage3.0 ### vllm-omni-audio-tts - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-api - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-cicd - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-contrib - Source: [PR #1583](vllm-project/vllm-omni#1583) - [Feat][Qwen3TTS] reduce TTFA with flexible initial phase - Changes: - New feature: [Feat][Qwen3TTS] reduce TTFA with flexible initial phase ### vllm-omni-api - Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor ### vllm-omni-serving - Source: [PR #1579](vllm-project/vllm-omni#1579) - [1/N][Refactor] Clean up dead code in output processor ### vllm-omni-distributed - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-cicd - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-perf - Source: [PR #1578](vllm-project/vllm-omni#1578) - [Feature][Bagel] Add CFG parallel mode - Changes: - New feature: [Feature][Bagel] Add CFG parallel mode ### vllm-omni-contrib - Source: [PR #1576](vllm-project/vllm-omni#1576) - 0.16.0 release ### vllm-omni-audio-tts - Source: [PR #1570](vllm-project/vllm-omni#1570) - [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio - Changes: - Bug fix: [bugfix] Fix unexpected argument 'is_finished' in function llm2code2wav_async_chunk of mimo-audio ### vllm-omni-api - Source: [PR #1566](vllm-project/vllm-omni#1566) - [Bugfix] Import InputPreprocessor into Renderer - Changes: - Bug fix: [Bugfix] Import InputPreprocessor into Renderer ### vllm-omni-distributed - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-quantization - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-perf - Source: [PR #1539](vllm-project/vllm-omni#1539) - [Debug] Enable curl retry aligned with openai ### vllm-omni-image-gen - Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln - Changes: - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln - Additions: - mindiesd - mindiesd - Qwen-Image-Edit-2509 - mindiesd - mindiesd - mindiesd - mindiesd ### vllm-omni-perf - Source: [PR #1537](vllm-project/vllm-omni#1537) - [NPU] [Features] [Bugfix] Support mindiesd adaln - Changes: - New feature: [NPU] [Features] [Bugfix] Support mindiesd adaln ### vllm-omni-serving - Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving - Changes: - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving ### vllm-omni-perf - Source: [PR #1536](vllm-project/vllm-omni#1536) - [Bugfix] Fix transformers 5.x compat issues in online TTS serving - Changes: - Bug fix: [Bugfix] Fix transformers 5.x compat issues in online TTS serving
…t#1583) Signed-off-by: pablo <pablo@agigo.ai> Co-authored-by: pablo <pablo@agigo.ai> Signed-off-by: lishunyang <lishunyang12@163.com>
|
When I use initial_codec_chunk_frames < 25 I always have 1s latency in the middle of every generation It hapens with concurrency 4 or more on 3090/A100 and 6 or more on 4090 |
|
@JuanPZuluaga ptal
|








PLEASE FILL IN THE PR DESCRIPTION HERE ENSURING ALL CHECKLIST ITEMS (AT THE BOTTOM) HAVE BEEN CONSIDERED.
Purpose
Related to #938.
In this PR, we lower the TTFA of Qwen3TTS by having a smaller number of frames required to start to stream the output. Ideally, this would help to perceive a faster response of the whole system by reducing the TTFA.
The current implementation is not too flexible, as if one wants to reduce the TTFA, we would end up always decoding with very small chunks, which can reduce audio quality.
initial_chunk_sizewhich dictates the chunk rate used to generate audio. we can call this phase "warmup phase":initial_chunk_size + full left contextchunk_size + left_context_sizePS: i decided to not add support of
left_context_size=-1(aka full left context), cause we might get OOM in very long sequences (too large context), though the user can just increase left context, to let's say 10s.I will add some audio samples later.
Test Plan
Test Result
initial_codec_chunk_frames = 0, 2, 5, 10, 15, 20"; 0 means noinitial_codec_chunk_framesicmeans, vlaue set forinitial_codec_chunk_framesTTFA (Time To First Audio) in milliseconds
Total Generation Time (ms)
Inter-Chunk Time (ms)
output_1_ic_0.wav
output_1_ic_2.wav
output_1_ic_5.wav
output_1_ic_10.wav
output_1_ic_15.wav
output_1_ic_20.wav
Note that
ic_15andic_20, would yield similar results.EDIT: removed the first sample (in the tables) due to overhead in compile.
EDIT2: updated the numbers with a fresh run.
Essential Elements of an Effective PR Description Checklist
supported_models.mdandexamplesfor a new model. Please runmkdocs serveto sync the documentation editions to./docs.BEFORE SUBMITTING, PLEASE READ https://github.com/vllm-project/vllm-omni/blob/main/CONTRIBUTING.md (anything written below this line will be removed by GitHub Actions)