Run omni with latest vllm commit 1b6cb920e6ebcac57154e6154578c39d4892a16c #2182
QiuMike wants to merge 1 commit into vllm-project:main
Conversation
The vllm top commit 1b6cb920e6ebcac57154e6154578c39d4892a16c has some diffs with vllm-omni (vllm-project/vllm#32104, vllm-project/vllm#32951, vllm-project/vllm#37287, vllm-project/vllm#36483). This change just modifies vllm-omni to work with it. Signed-off-by: Michael Qiu <qiudayu.qdy@antgroup.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b47622f2cd
```diff
         addresses=addresses,
     )
-    engine_manager, coordinator, addresses = launch_cm.__enter__()
+    engine_manager, coordinator, addresses, _ = launch_cm.__enter__()
```
Preserve tensor_queue when launching stage engines
launch_core_engines() now yields a fourth value (tensor_queue), but this code discards it (_) and never propagates it into StartedLlmStage/client_addresses for StageEngineCoreClient. In the new vLLM path, that queue is what enables AsyncMPClient to set up out-of-band tensor IPC for multimodal payloads; dropping it forces fallback serialization for tensor data and can cause major latency/memory regressions for multimodal requests (especially when mm_tensor_ipc=torch_shm is configured).
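A minimal sketch of wiring the fourth value through, assuming `StartedLlmStage` can carry the extra handle (the `tensor_queue` field and keyword names below are assumptions, not the repo's actual API):

```python
# Sketch only: launch_core_engines() in recent vllm yields a fourth value,
# tensor_queue. Instead of discarding it as `_`, keep it and hand it to the
# stage record so StageEngineCoreClient can set up out-of-band tensor IPC.
# The StartedLlmStage field names here are hypothetical.
engine_manager, coordinator, addresses, tensor_queue = launch_cm.__enter__()

started_stage = StartedLlmStage(
    engine_manager=engine_manager,
    coordinator=coordinator,
    client_addresses=addresses,
    tensor_queue=tensor_queue,  # propagate rather than drop
)
```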
lishunyang12
left a comment
Left a couple of comments.
```diff
         addresses=addresses,
     )
-    engine_manager, coordinator, addresses = launch_cm.__enter__()
+    engine_manager, coordinator, addresses, _ = launch_cm.__enter__()
```
Should this 4th value (tensor_queue) actually be wired into StartedLlmStage? Discarding it silently could break tensor IPC for multimodal requests if upstream expects it to be propagated.
```diff
-    self.seq_lens.np[num_reqs:] = 0
-    self.seq_lens.copy_to_gpu()
+    self.optimistic_seq_lens_cpu[:num_reqs] = seq_lens
+    self.optimistic_seq_lens_cpu[num_reqs:].fill_(0)
```
Missing a vllm dependency bump — optimistic_seq_lens_cpu and the new _get_cumsum_and_arange(num_tokens, arange_out) signature only exist in recent vllm. Without pinning the minimum version this will break on older installs.
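One way to fail fast on older installs, as a sketch (the minimum version below is a placeholder, not a verified cutoff, and this is not the repo's actual pinning mechanism):

```python
# Sketch: raise at import time if the installed vllm predates the new
# optimistic_seq_lens_cpu buffer and _get_cumsum_and_arange signature.
from importlib.metadata import version

from packaging.version import Version

_MIN_VLLM = Version("0.8.0")  # placeholder: set to the first vllm release
                              # that actually ships these changes

if Version(version("vllm")) < _MIN_VLLM:
    raise RuntimeError(
        f"vllm-omni requires vllm >= {_MIN_VLLM}; found {version('vllm')}"
    )
```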
Thanks for your efforts. Would you mind taking a look at the branch dev/vllm-align or PR #1003, which is used to track the vllm commits?
The vllm top commit 1b6cb920e6ebcac57154e6154578c39d4892a16c has some diffs with vllm-omni:
vllm-project/vllm#32104
vllm-project/vllm#32951
vllm-project/vllm#37287
vllm-project/vllm#36483
This change just modifies vllm-omni to work with it.