[perf] Qwen3-Omni performance optimization #3164
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 5d8b019d2d
dst.copy_(src, non_blocking=src.is_cuda)
if src.is_cuda:
    torch.cuda.current_stream(device=self.device).synchronize()
return dst
Clone hidden-state CPU buffer before exposing to async paths
Returning dst directly here leaks a view into self._hidden_states_cpu_pinned, which is reused and overwritten on the next step. In chunk-transfer mode, omni_ar_scheduler.update_from_output passes pooler_output to OmniChunkTransferAdapter.save_async, and _send_single_request consumes it later on a background thread, so enqueued hidden tensors can be mutated before send and carry wrong step data. The previous code allocated a fresh CPU tensor each step, so this introduces a new correctness regression.
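The hazard described above can be modelled without a GPU. The following is a minimal sketch, not the PR's code: `StagingBuffer`, `stage_shared`, `stage_owned`, and `run` are hypothetical names, and a `bytearray` stands in for the reused pinned CPU tensor. `stage_shared` mirrors returning `dst` directly, while `stage_owned` mirrors returning a copy (the analogue of `dst.clone()`).

```python
import queue
import threading


class StagingBuffer:
    """Hypothetical stand-in for a pinned CPU buffer that is reused every step."""

    def __init__(self, size: int) -> None:
        self._buf = bytearray(size)

    def stage_shared(self, data: bytes) -> memoryview:
        # Analogue of `return dst`: the caller receives a view into the
        # reused buffer, so a later step overwrites what the caller sees.
        self._buf[: len(data)] = data
        return memoryview(self._buf)[: len(data)]

    def stage_owned(self, data: bytes) -> bytes:
        # Analogue of `return dst.clone()`: the caller receives an
        # independent copy that later steps cannot touch.
        self._buf[: len(data)] = data
        return bytes(self._buf[: len(data)])


def run(stage) -> list:
    """Enqueue three steps, then let a background thread 'send' them."""
    pending = queue.Queue()
    for step in (b"step-1", b"step-2", b"step-3"):
        pending.put(stage(step))  # enqueued now, consumed later

    sent = []

    def sender() -> None:
        while not pending.empty():
            sent.append(bytes(pending.get()))

    worker = threading.Thread(target=sender)
    worker.start()
    worker.join()
    return sent


buf = StagingBuffer(16)
shared_result = run(buf.stage_shared)  # every entry reflects the last step
owned_result = run(buf.stage_owned)    # each entry keeps its own step data
```

In this model the shared variant delivers the last step's bytes three times, which is the "wrong step data" failure mode the review describes. The copy costs one allocation per step, so it gives back part of what buffer reuse saved.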
dst.copy_(src, non_blocking=True)
return dst, src.device
Avoid returning reusable MM pinned buffers to output payloads
This returns cached pinned tensors by reference, but downstream payload construction (to_payload_element) often keeps slices/views of those tensors. Because GPUARModelRunner.sample_tokens now enables tensor_buffer_cache, the same buffers are rewritten on subsequent steps; with async chunk sending (save_async + background _send_single_request), previously queued multimodal payloads can be overwritten before transmission. That can corrupt inter-stage multimodal data for active requests.
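One way to keep buffer reuse while avoiding overwrites of queued payloads is to hand a buffer back for reuse only after its payload has been sent. The sketch below illustrates that idea under stated assumptions. It is not what the PR implements: `BufferPool`, `build_payload`, and `send` are hypothetical names, and a `bytearray` with a `memoryview` slice stands in for a pinned tensor and the view a payload keeps.

```python
from collections import deque


class BufferPool:
    """Hypothetical pool: a buffer is reused only after its consumer releases it."""

    def __init__(self, size: int) -> None:
        self._size = size
        self._free = deque()
        self.allocations = 0  # counts how many buffers were actually created

    def acquire(self) -> bytearray:
        if self._free:
            return self._free.popleft()
        self.allocations += 1
        return bytearray(self._size)

    def release(self, buf: bytearray) -> None:
        self._free.append(buf)


def build_payload(pool: BufferPool, data: bytes) -> dict:
    # The payload keeps a view (slice) of the buffer, as a payload
    # built from tensor slices would.
    buf = pool.acquire()
    buf[: len(data)] = data
    return {"buf": buf, "view": memoryview(buf)[: len(data)]}


def send(pool: BufferPool, payload: dict) -> bytes:
    # Only after the bytes have been read out is the buffer made reusable.
    out = bytes(payload["view"])
    payload["view"].release()
    pool.release(payload["buf"])
    return out


pool = BufferPool(8)
p1 = build_payload(pool, b"audio-1")  # two payloads are in flight at once,
p2 = build_payload(pool, b"audio-2")  # so they get distinct buffers
sent = [send(pool, p1), send(pool, p2)]
p3 = build_payload(pool, b"audio-3")  # reuses a buffer released above
sent.append(send(pool, p3))
```

Here three payloads are served by two allocations, and no in-flight payload shares storage with another. The trade-off is that the sender must reliably release buffers, including on error paths.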
0160799 to a08cc64
Is there any reason why async chunk scenarios did not benefit from the modifications?
The main benefit of this modification comes from
Updated
Signed-off-by: amy-why-3459 <wuhaiyan17@huawei.com>
Performance data for Qwen3-TTS has been updated.
Signed-off-by: sphinxkkkbc <binchengkang8@gmail.com>
Purpose
Fix: #3093
Qwen3-Omni performance optimization
Test Plan
Test Result
Before:
After:
Qwen3-Omni:
text+audio:
text-only:
Qwen3-TTS: