[Feat][Qwen3-Omni] Shared code predictor module for Qwen3-TTS and Qwen3-Omni by JuanPZuluaga · Pull Request #2375 · vllm-project/vllm-omni

JuanPZuluaga · 2026-03-31T11:44:27Z


Purpose

This PR:

Unifies the code predictor logic of Qwen3-TTS and Qwen3-Omni into a shared CodePredictorWrapper base class in common/qwen3_code_predictor.py.
Makes both model-specific wrappers config-only subclasses, configured through a CodePredictorWrapperConfig dataclass with 5 behavioral flags: use_cuda_graphs, use_parallel_embedding, use_projection, return_proj_buf, sampling_mode.
Includes torch.compile(dynamic=False) plus CUDA graph capture per power-of-2 batch bucket, with epilogue_fusion=False to preserve float32 precision in RMSNorm/RoPE for audio quality (the precision issue was reported in previous PRs).
Fixes a bug in stage_init_utils.py: hasattr returned True for a None-valued custom_process_input_func; the check is replaced with a getattr(..., None) truthiness check.
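
The hasattr pitfall behind the stage_init_utils.py fix can be reproduced in a few lines. This is a minimal sketch, not the code from the PR: stage_cfg is a hypothetical stand-in object, and only the attribute name custom_process_input_func is taken from the description above.

```python
from types import SimpleNamespace

# Hypothetical stand-in for a stage config; only the attribute name
# custom_process_input_func comes from the PR description.
stage_cfg = SimpleNamespace(custom_process_input_func=None)

# hasattr only checks that the attribute exists, so a None value still passes.
old_check = hasattr(stage_cfg, "custom_process_input_func")

# getattr with a None default is falsy both when the attribute is missing
# and when it is present but set to None.
new_check = bool(getattr(stage_cfg, "custom_process_input_func", None))

print(old_check, new_check)  # True False
```

With the truthiness check, a config that declares the field but leaves it unset is treated the same as a config that omits it.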

Test Plan

Test Result

The e2e time is roughly the same overall, but the code predictor itself is slightly faster.

Some audio samples generated at concurrency=16:

output_0_4178681c-d9ac-423e-a274-8daaf2bd4b64.wav
output_1_fd088db1-2725-4321-9286-cf7d966dfff0.wav


amy-why-3459 · 2026-03-31T11:56:33Z

@LJH-LBJ PTAL

lishunyang12

left a few comments. the compile+bucket change looks solid overall — nice that it follows the TTS code predictor pattern.

lishunyang12 · 2026-04-02T15:37:29Z

            # Convert to numpy array and ensure correct format
+            # In async_chunk mode, audio may arrive as a list of chunks
+            if isinstance(audio_tensor, list):
+                import torch


torch is already used transitively elsewhere in this file (via audio_tensor.float()). Move the import to the top-level imports instead of burying it inside a conditional.

moved to the top.

lishunyang12 · 2026-04-02T15:37:29Z

+    def _ensure_buffers(self, device: torch.device, dtype: torch.dtype, min_bsz: int = 0) -> None:
+        """Pre-allocate projection buffer sized to max(max_num_seqs, min_bsz)."""
+        max_seq = self.num_code_groups + 1
+        max_bsz = max(self._vllm_config.scheduler_config.max_num_seqs, min_bsz)


The min_bsz parameter is not present in the TTS code predictor version of _ensure_buffers. Is this needed? max_num_seqs should already be the upper bound — if bsz > max_num_seqs something else has gone wrong.

Thanks, fixed now.

lishunyang12 · 2026-04-02T15:37:29Z

        proj_buf[:bsz, 0:1, :] = last_talker_hidden
        proj_buf[:bsz, 1:2, :] = layer0_embed

+        # Get pre-computed pos_ids for this bucket


Nit: _setup_compile does warmup internally which can be expensive. Might be worth adding a log line or comment at the call site so someone debugging a slow first-call knows to look there.

perfect, done
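
A log line around the warmup, as suggested in this nit, could look roughly like the sketch below. The function name setup_compile_with_warmup, the logger name, and the warmup_fn callback are all hypothetical; the real _setup_compile is not reproduced here.

```python
import logging
import time

logger = logging.getLogger("code_predictor_sketch")


def setup_compile_with_warmup(warmup_fn, bucket_sizes):
    """Run warmup_fn once per bucket and log how long the warmup took."""
    logger.info(
        "Warming up code predictor for buckets %s; the first call may be slow.",
        bucket_sizes,
    )
    start = time.perf_counter()
    for bsz in bucket_sizes:
        warmup_fn(bsz)  # stand-in for compiling/capturing one bucket
    elapsed = time.perf_counter() - start
    logger.info("Code predictor warmup finished in %.2fs.", elapsed)
    return elapsed


# Usage: record which buckets were warmed up.
warmed = []
elapsed = setup_compile_with_warmup(warmed.append, [1, 2, 4])
```

Someone debugging a slow first request can then match the delay to the two log lines instead of profiling the call site.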


hsliuustc0106

cc @amy-why-3459 @ZeldaHuang

ZeldaHuang · 2026-04-09T11:43:36Z

The implementations of _setup_compile, _warmup_buckets, and _padded_bsz seem to overlap significantly with those in qwen3_tts_code_predictor. We could abstract them into a separate class for future reuse.

JuanPZuluaga · 2026-04-09T11:49:38Z


@ZeldaHuang that's correct, there is a lot of overlap. Should I propose a shared module for Qwen3TTS and Qwen3Omni?

ZeldaHuang · 2026-04-09T11:58:02Z


You can include it in this PR if it’s not too complicated, and it would be great to add some tests to protect the module as well. Thanks!


ZeldaHuang · 2026-04-10T09:41:55Z

@JuanPZuluaga Hi, I noticed that you abstracted the whole code predictor model. Can you change the PR title to reflect that?


ZeldaHuang · 2026-04-11T09:27:10Z

@JuanPZuluaga To speed up the process, it would be better to first land just the torch.compile abstraction in this PR, and leave the rest (modeling, cudagraph support, etc.) for follow-up PRs.

JuanPZuluaga · 2026-04-11T09:30:19Z

Hi, I'll update the body. @ZeldaHuang


The issue is that these optimizations already exist in the Qwen3TTS code predictor model. If we drop them, Qwen3TTS would regress. The shared module keeps the full stack and gates CUDA graph capture behind CodePredictorWrapperConfig.use_cuda_graphs, so Qwen3TTS migrates with use_cuda_graphs=True (byte-identical behavior) and Qwen3Omni can opt in separately. If you still prefer to split it up, I can do that; please let me know.
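
To illustrate the config-gated design described here, a rough sketch follows. The five field names come from the PR description; the defaults, the sampling_mode value, the describe method, and the two subclass names are assumptions made only for this example and are not taken from the actual module.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodePredictorWrapperConfig:
    # Field names come from the PR description; the defaults and the
    # sampling_mode string are assumptions made for this sketch.
    use_cuda_graphs: bool = False
    use_parallel_embedding: bool = False
    use_projection: bool = False
    return_proj_buf: bool = False
    sampling_mode: str = "default"


class CodePredictorWrapper:
    """Shared base: behavior branches only on self.config."""

    config = CodePredictorWrapperConfig()

    def describe(self) -> str:
        mode = "cuda-graph replay" if self.config.use_cuda_graphs else "compiled call"
        return f"{type(self).__name__}: {mode}"


class TTSCodePredictorSketch(CodePredictorWrapper):
    # Hypothetical subclass: keeps CUDA graphs on, as Qwen3TTS would.
    config = CodePredictorWrapperConfig(use_cuda_graphs=True)


class OmniCodePredictorSketch(CodePredictorWrapper):
    # Hypothetical subclass: leaves CUDA graphs off until it opts in.
    config = CodePredictorWrapperConfig(use_cuda_graphs=False)


print(TTSCodePredictorSketch().describe())
print(OmniCodePredictorSketch().describe())
```

Because each subclass only overrides the config, the forward path lives in one place and per-model differences stay visible as flag values.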


ZeldaHuang · 2026-04-14T05:29:19Z


That makes sense. For this PR, we can focus on resolving the shared module first, while keeping the current CUDA graph capture approach for each code predictor unchanged.

ZeldaHuang · 2026-04-14T05:35:38Z

@@ -0,0 +1,654 @@
+"""Code Predictor -- optimized re-prefill, no KV cache.


Would it be better to rename this shared module from CodePredictor to QwenCodePredictor (since other models also use code predictors, such as Fish Speech), or to Qwen3OmniCodePredictor (since it was first introduced in Qwen3Omni)?

ZeldaHuang · 2026-04-14T12:10:07Z

@JuanPZuluaga Please fix conflicts


ZeldaHuang · 2026-04-15T03:31:53Z

CI Test Failure: Tensor Shape Mismatch in Code Predictor

The CI test test_mix_to_text_audio_001[omni_server0] is failing with a tensor dimension mismatch error.
Error Message

  RuntimeError: The expanded size of the tensor (5) must match the existing size (8) at non-singleton dimension 0. Target sizes:   [5, 1024]. Tensor sizes: [8, 1024]

Location: vllm_omni/model_executor/models/common/qwen3_code_predictor.py line 537

@JuanPZuluaga PTAL


JuanPZuluaga · 2026-04-15T04:22:51Z


Thanks for catching this issue:

The cause: the unified _ensure_buffers sizes _proj_buf to max_num_seqs (5 in CI), but during server init _capture_talker_mtp_graphs calls the code predictor with power-of-2 CUDA graph capture sizes (1, 2, 4, 8), and 8 exceeds max_num_seqs. The original Omni code avoided this by allocating proj_buf fresh on each call.

The fix is limited to _ensure_buffers: it now takes the actual batch size needed instead of reading max_num_seqs internally, so the buffer grows on demand.
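
The mismatch and the grow-on-demand fix can be sketched without a GPU. In the example below, plain Python lists stand in for tensors, and padded_bsz, ProjBufferSketch, and ensure_buffers are hypothetical names that only mirror the idea; they are not the code in qwen3_code_predictor.py.

```python
def padded_bsz(bsz: int) -> int:
    """Round a batch size up to the next power of two (1, 2, 4, 8, ...)."""
    if bsz < 1:
        raise ValueError("bsz must be >= 1")
    return 1 << (bsz - 1).bit_length()


class ProjBufferSketch:
    """Plain-Python stand-in for a pre-allocated projection buffer."""

    def __init__(self, hidden_size: int) -> None:
        self.hidden_size = hidden_size
        self.rows = []  # one list of floats per batch slot

    def ensure_buffers(self, bsz: int) -> None:
        # Grow only when the requested batch exceeds the current capacity;
        # otherwise keep the existing allocation.
        if bsz > len(self.rows):
            self.rows = [[0.0] * self.hidden_size for _ in range(bsz)]


buf = ProjBufferSketch(hidden_size=4)
buf.ensure_buffers(5)              # sized like max_num_seqs=5 in CI
buf.ensure_buffers(padded_bsz(5))  # capture bucket 8 forces a regrow
print(len(buf.rows))  # 8
```

Sizing from the requested batch rather than from a scheduler limit means a capture bucket larger than max_num_seqs no longer writes past the buffer.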


ZeldaHuang · 2026-04-15T06:13:54Z

            output_wav = os.path.join(output_dir, f"output_{request_id}.wav")

            # Convert to numpy array and ensure correct format
+            # In async_chunk mode, audio may arrive as a list of chunks


We already have examples/offline_inference/qwen3_omni/end2end_async_chunk.py to run offline inference with async_chunk enabled

I fixed this, @ZeldaHuang, and also updated the PR body so it is consistent with what was actually done. Thanks for the review :)


[Feat] Optimize Qwen3-Omni code predictor with torch.compile + bucket…

6873ecc

… warmup Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

JuanPZuluaga requested a review from hsliuustc0106 as a code owner March 31, 2026 11:44

JuanPZuluaga mentioned this pull request Mar 31, 2026

[Feat][Qwen3-Omni] Add CUDA graph support for Code2Wav decoder #2376

Merged

5 tasks

lishunyang12 reviewed Apr 2, 2026


JuanPZuluaga added 5 commits April 3, 2026 16:20

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

f305a02

… feat/cuda-graph-code-predictor

update and adressed comments

3d08fb3

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

8b66ed3

… feat/cuda-graph-code-predictor

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

f88761a

… feat/cuda-graph-code-predictor

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

5a516d1

… feat/cuda-graph-code-predictor

hsliuustc0106 reviewed Apr 9, 2026


JuanPZuluaga added 3 commits April 9, 2026 20:12

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

af185b1

… feat/cuda-graph-code-predictor

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

4809861

… feat/cuda-graph-code-predictor

abstract Qwen3TTS and Qwen3Omni CodePredictor into same class

b62f9f3

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

JuanPZuluaga added 4 commits April 10, 2026 12:55

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

2d92c5a

… feat/cuda-graph-code-predictor

update test

5eeaa8b

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

add wrapper test

2ed7dde

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

cb6dc82

… feat/cuda-graph-code-predictor

JuanPZuluaga changed the title ~~[Feat][Qwen3-Omni] Optimize code predictor with torch.compile bucket warmup~~ [Feat][Qwen3-Omni] Shared code predictor module for Qwen3-TTS and Qwen3-Omni Apr 11, 2026

hsliuustc0106 added the merge-test label (label to trigger buildkite merge test CI) Apr 11, 2026

JuanPZuluaga and others added 3 commits April 11, 2026 21:19

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

545723d

… feat/cuda-graph-code-predictor

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

837d72d

… feat/cuda-graph-code-predictor

Merge branch 'main' into feat/cuda-graph-code-predictor

39d04e6

ZeldaHuang reviewed Apr 14, 2026


JuanPZuluaga added 4 commits April 14, 2026 14:11

fix conflicts and merge main

fe5e00b

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'feat/cuda-graph-code-predictor' of github.com:JuanPZulu…

b51b012

…aga/vllm-omni into feat/cuda-graph-code-predictor

update naming

538eca2

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

28318aa

… feat/cuda-graph-code-predictor

_ensure_buffers now takes the actual batch size istead of max_seqs

0b12340

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

JuanPZuluaga added 6 commits April 15, 2026 04:29

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

0ebb4f6

… feat/cuda-graph-code-predictor

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

84b0ef9

… feat/cuda-graph-code-predictor

Merge branch 'feat/cuda-graph-code-predictor' of https://github.com/J…

10f174d

…uanPZuluaga/vllm-omni into feat/cuda-graph-code-predictor Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

95c389d

… feat/cuda-graph-code-predictor

merge

70064af

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

Merge branch 'main' of https://github.com/vllm-project/vllm-omni into…

c40d0ae

… feat/cuda-graph-code-predictor

ZeldaHuang reviewed Apr 15, 2026


revert example

4c4c581

Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>

ZeldaHuang added the ready label (label to trigger buildkite CI) Apr 15, 2026

ZeldaHuang enabled auto-merge (squash) April 15, 2026 07:14

ZeldaHuang approved these changes Apr 15, 2026


ZeldaHuang merged commit 82f8c93 into vllm-project:main Apr 15, 2026
8 checks passed

JuanPZuluaga deleted the feat/cuda-graph-code-predictor branch May 17, 2026 09:37
