[Feat][Qwen3-Omni] Shared code predictor module for Qwen3-TTS and Qwen3-Omni#2375
… warmup Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
@LJH-LBJ PTAL |
lishunyang12
left a comment
Left a few comments. The compile+bucket change looks solid overall; nice that it follows the TTS code predictor pattern.
| # Convert to numpy array and ensure correct format | ||
| # In async_chunk mode, audio may arrive as a list of chunks | ||
| if isinstance(audio_tensor, list): | ||
| import torch |
torch is already used transitively elsewhere in this file (via audio_tensor.float()). Move the import to the top-level imports instead of burying it inside a conditional.
moved to the top.
| def _ensure_buffers(self, device: torch.device, dtype: torch.dtype, min_bsz: int = 0) -> None: | ||
| """Pre-allocate projection buffer sized to max(max_num_seqs, min_bsz).""" | ||
| max_seq = self.num_code_groups + 1 | ||
| max_bsz = max(self._vllm_config.scheduler_config.max_num_seqs, min_bsz) |
The min_bsz parameter is not present in the TTS code predictor version of _ensure_buffers. Is this needed? max_num_seqs should already be the upper bound — if bsz > max_num_seqs something else has gone wrong.
Thanks, fixed now.
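The invariant raised in this thread (a batch never exceeds max_num_seqs, so the buffer can be sized to it) can be made explicit at the point where a batch is rounded up to a bucket. The following is only a minimal sketch: `build_buckets` and `select_bucket` are hypothetical helper names, not functions from this PR, and the choice to cap the last bucket at max_num_seqs is an assumption.

```python
def build_buckets(max_num_seqs: int) -> list[int]:
    """Return power-of-2 bucket sizes below max_num_seqs, then max_num_seqs itself.

    Hypothetical helper: the real module may build its buckets differently.
    """
    if max_num_seqs < 1:
        raise ValueError("max_num_seqs must be >= 1")
    buckets = []
    size = 1
    while size < max_num_seqs:
        buckets.append(size)
        size *= 2
    # The upper bound is always the final bucket, even if it is not a power of 2.
    buckets.append(max_num_seqs)
    return buckets


def select_bucket(bsz: int, buckets: list[int]) -> int:
    """Pick the smallest bucket that fits bsz; fail loudly if none does."""
    if bsz < 1:
        raise ValueError("bsz must be >= 1")
    for bucket in buckets:
        if bsz <= bucket:
            return bucket
    # Reaching this point means bsz > max_num_seqs, which indicates a bug upstream.
    raise ValueError(f"bsz={bsz} exceeds the largest bucket {buckets[-1]}")
```

Failing loudly here, rather than growing the buffer through an extra min_bsz argument, surfaces a scheduler-side problem instead of hiding it.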
| proj_buf[:bsz, 0:1, :] = last_talker_hidden | ||
| proj_buf[:bsz, 1:2, :] = layer0_embed | ||
| # Get pre-computed pos_ids for this bucket |
Nit: _setup_compile does warmup internally which can be expensive. Might be worth adding a log line or comment at the call site so someone debugging a slow first-call knows to look there.
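One way to act on this nit is to wrap the setup call in a small timing helper that logs before and after. This is a sketch under assumptions: `run_with_warmup_log` and the logger name are hypothetical, and `setup_fn` stands in for whatever callable performs compile and warmup (for example, a bound `_setup_compile`).

```python
import logging
import time
from typing import Callable

# Hypothetical logger name, chosen for illustration only.
logger = logging.getLogger("code_predictor")


def run_with_warmup_log(setup_fn: Callable[[], None], name: str = "_setup_compile") -> float:
    """Run setup_fn, logging start and elapsed time so a slow first call is easy to trace."""
    logger.info("%s: starting compile + warmup; the first call may be slow", name)
    start = time.perf_counter()
    setup_fn()
    elapsed = time.perf_counter() - start
    logger.info("%s: finished in %.2fs", name, elapsed)
    return elapsed
```

At the call site this would read something like `run_with_warmup_log(lambda: predictor._setup_compile())`, where `predictor` is again a placeholder name.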
… feat/cuda-graph-code-predictor
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
The implementations of |
@ZeldaHuang that's correct. It's quite overlapped; should I propose a shared module for |
You can include it in this PR if it’s not too complicated, and it would be great to add some tests to protect the module as well. Thanks! |
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
@JuanPZuluaga Hi, I noticed that you abstracted the whole code predictor model. Could you change the PR title accordingly?
… feat/cuda-graph-code-predictor
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
… feat/cuda-graph-code-predictor
@JuanPZuluaga To speed up the process, it would be better to first land just the torch.compile abstraction in this PR, and leave the rest (modeling, cudagraph support, etc.) for follow-up PRs. |
Hi, I'll update the body. @ZeldaHuang
The issue is that these optimizations are already implemented in the Qwen3TTS code-predictor model. If we drop them, Qwen3TTS would regress. The shared module keeps the full stack and gates cudagraph capture behind |
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
It makes sense. For this PR, we can focus on resolving the shared module first, while keeping the current CUDA graph capture approach for each code predictor unchanged.
| @@ -0,0 +1,654 @@ | |||
| """Code Predictor -- optimized re-prefill, no KV cache. | |||
Would it be better to rename this shared module from CodePredictor to QwenCodePredictor (since other models also use code predictors, such as Fish Speech), or to Qwen3OmniCodePredictor (since it was first introduced in Qwen3Omni)?
@JuanPZuluaga Please fix conflicts |
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
…aga/vllm-omni into feat/cuda-graph-code-predictor
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
… feat/cuda-graph-code-predictor
CI Test Failure: Tensor Shape Mismatch in Code Predictor
The CI test test_mix_to_text_audio_001[omni_server0] is failing with a tensor dimension mismatch error.
Location: vllm_omni/model_executor/models/common/qwen3_code_predictor.py line 537
@JuanPZuluaga PTAL
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
Thanks for catching this issue: the thing was that with the the fix is only |
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
…uanPZuluaga/vllm-omni into feat/cuda-graph-code-predictor Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
… feat/cuda-graph-code-predictor
… feat/cuda-graph-code-predictor
| output_wav = os.path.join(output_dir, f"output_{request_id}.wav") | ||
| # Convert to numpy array and ensure correct format | ||
| # In async_chunk mode, audio may arrive as a list of chunks |
We already have examples/offline_inference/qwen3_omni/end2end_async_chunk.py to run offline inference with async_chunk enabled
I fixed this @ZeldaHuang and also updated the PR body to be more consistent with what was done. Thanks for the review :)
Signed-off-by: JuanPZuluaga <juanz9312@gmail.com>
…n3-Omni (vllm-project#2375) Signed-off-by: JuanPZuluaga <juanz9312@gmail.com> Co-authored-by: Hongsheng Liu <liuhongsheng4@huawei.com>
Purpose
in this PR:
- Qwen3-TTS and Qwen3-Omni into a shared CodePredictorWrapper base class
- common/qwen3_code_predictor.py
- CodePredictorWrapperConfig dataclass with 5 behavioral flags: use_cuda_graphs, use_parallel_embedding, use_projection, return_proj_buf, sampling_mode
- torch.compile(dynamic=False) flag + CUDA graph capture per power-of-2 batch buckets, with epilogue_fusion=False to preserve float32 precision in RMSNorm/RoPE for audio quality (this was reported in previous PRs)
- stage_init_utils.py: hasattr returned True for None-valued custom_process_input_func; replaced with getattr(..., None) truthiness check
Test Plan
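One quick way to check the stage_init_utils.py fix described under Purpose is a small stand-alone reproduction of the `hasattr` pitfall. This is only a sketch: `StageConfig` and both helper functions are hypothetical stand-ins, not code from the repository; only the attribute name custom_process_input_func comes from the description.

```python
class StageConfig:
    """Hypothetical stand-in for a stage config that always defines the attribute."""

    def __init__(self, custom_process_input_func=None):
        self.custom_process_input_func = custom_process_input_func


def has_custom_func_buggy(cfg) -> bool:
    # hasattr only checks that the attribute exists, so a None value still passes.
    return hasattr(cfg, "custom_process_input_func")


def has_custom_func_fixed(cfg) -> bool:
    # getattr with a None default plus a truthiness check rejects both
    # a missing attribute and an attribute explicitly set to None.
    return bool(getattr(cfg, "custom_process_input_func", None))
```

With a default-constructed `StageConfig`, the buggy check reports a custom function even though the value is None, while the fixed check does not.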
Test Result
the e2e time is more or less the same overall, but the code-predictor is a bit faster.
some audios generated at concurrency=16:
output_0_4178681c-d9ac-423e-a274-8daaf2bd4b64.wav
output_1_fd088db1-2725-4321-9286-cf7d966dfff0.wav