Closed
2 changes: 1 addition & 1 deletion vllm_omni/engine/async_omni_engine.py
@@ -341,7 +341,7 @@ def _launch_llm_stage(
log_stats=False,
addresses=addresses,
)
- engine_manager, coordinator, addresses = launch_cm.__enter__()
+ engine_manager, coordinator, addresses, _ = launch_cm.__enter__()

P2: Preserve tensor_queue when launching stage engines

launch_core_engines() now yields a fourth value (tensor_queue), but this code discards it (_) and never propagates it into StartedLlmStage/client_addresses for StageEngineCoreClient. In the new vLLM path, that queue is what enables AsyncMPClient to set up out-of-band tensor IPC for multimodal payloads; dropping it forces fallback serialization for tensor data and can cause major latency/memory regressions for multimodal requests (especially when mm_tensor_ipc=torch_shm is configured).
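
One possible fix, sketched under assumptions: keep the fourth value instead of discarding it, and carry it on the started-stage record so the client setup can pick it up later. `StartedStage`, its `tensor_queue` field, and `fake_launch_core_engines` below are hypothetical stand-ins for illustration, not the real `StartedLlmStage` or `launch_core_engines()` definitions.

```python
from contextlib import contextmanager
from dataclasses import dataclass
from typing import Any, Iterator, Optional


@dataclass
class StartedStage:
    """Hypothetical stand-in for StartedLlmStage with an extra queue field."""

    engine_manager: Any
    coordinator: Any
    addresses: Any
    launch_cm: Any  # kept so the caller can call __exit__ during shutdown
    tensor_queue: Optional[Any] = None  # assumed field; not in the original class


@contextmanager
def fake_launch_core_engines() -> Iterator[tuple[Any, Any, Any, Any]]:
    """Hypothetical stand-in that yields four values, like the new upstream API."""
    yield "engine_manager", "coordinator", {"inputs": ["ipc://in"]}, object()


def launch_stage() -> StartedStage:
    launch_cm = fake_launch_core_engines()
    # Bind the fourth value to a name instead of "_" so it can be propagated.
    engine_manager, coordinator, addresses, tensor_queue = launch_cm.__enter__()
    return StartedStage(
        engine_manager=engine_manager,
        coordinator=coordinator,
        addresses=addresses,
        launch_cm=launch_cm,
        tensor_queue=tensor_queue,
    )


started = launch_stage()
```

Whether the queue belongs on the stage record or in `client_addresses` depends on how `StageEngineCoreClient` is constructed, which this diff does not show.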

Collaborator

Should this 4th value (tensor_queue) actually be wired into StartedLlmStage? Discarding it silently could break tensor IPC for multimodal requests if upstream expects it to be propagated.

started_stage = StartedLlmStage(
stage_id=metadata.stage_id,
metadata=metadata,
3 changes: 3 additions & 0 deletions vllm_omni/entrypoints/openai/api_server.py
@@ -630,6 +630,7 @@ async def omni_init_app_state(
OpenAIServingResponses(
engine_client,
state.openai_serving_models,
+ state.openai_serving_render,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
@@ -737,6 +738,7 @@ async def omni_init_app_state(
state.openai_serving_tokenization = OpenAIServingTokenization(
engine_client,
state.openai_serving_models,
+ state.openai_serving_render,
request_logger=request_logger,
chat_template=resolved_chat_template,
chat_template_content_format=args.chat_template_content_format,
@@ -786,6 +788,7 @@ async def omni_init_app_state(
ServingTokens(
engine_client,
state.openai_serving_models,
+ openai_serving_render=state.openai_serving_render,
request_logger=request_logger,
return_tokens_as_token_ids=args.return_tokens_as_token_ids,
enable_prompt_tokens_details=args.enable_prompt_tokens_details,
10 changes: 6 additions & 4 deletions vllm_omni/worker/gpu_model_runner.py
@@ -704,11 +704,13 @@ def _dummy_run(
seq_lens = [1] * num_decode_tokens + [num_prefill_tokens + 1] # type: ignore[assignment]
else:
seq_lens = max_query_len # type: ignore[assignment]
- self.seq_lens.np[:num_reqs] = seq_lens
- self.seq_lens.np[num_reqs:] = 0
- self.seq_lens.copy_to_gpu()
+ self.optimistic_seq_lens_cpu[:num_reqs] = seq_lens
+ self.optimistic_seq_lens_cpu[num_reqs:].fill_(0)

Collaborator

Missing a vllm dependency bump — optimistic_seq_lens_cpu and the new _get_cumsum_and_arange(num_tokens, arange_out) signature only exist in recent vllm. Without pinning the minimum version this will break on older installs.
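
A minimal sketch of an import-time guard that would fail early on older installs, assuming a version floor is chosen. `MIN_VLLM_VERSION` is a placeholder: this thread does not say which vllm release introduced these APIs, so the real value has to be filled in. The helper names are hypothetical, and the comparison deliberately ignores pre-release and local suffixes.

```python
import re
from importlib import metadata

# Placeholder only: the first vllm release with the new APIs is not stated here.
MIN_VLLM_VERSION = "0.0.0"


def release_tuple(version: str) -> tuple[int, ...]:
    """Return the leading numeric release segment, e.g. '0.11.2rc1' -> (0, 11, 2)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unparseable version: {version!r}")
    return tuple(int(part) for part in match.group().split("."))


def is_supported(installed: str, minimum: str = MIN_VLLM_VERSION) -> bool:
    """Compare release segments, padding with zeros so '0.11' equals '0.11.0'."""
    a, b = release_tuple(installed), release_tuple(minimum)
    width = max(len(a), len(b))
    a += (0,) * (width - len(a))
    b += (0,) * (width - len(b))
    return a >= b


def assert_vllm_supported(dist_name: str = "vllm") -> None:
    """Raise RuntimeError if the distribution is missing or older than the floor."""
    try:
        installed = metadata.version(dist_name)
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError(f"{dist_name} is not installed") from exc
    if not is_supported(installed):
        raise RuntimeError(
            f"{dist_name} {installed} is older than the required {MIN_VLLM_VERSION}"
        )
```

A runtime guard like this complements, rather than replaces, a minimum-version pin in the package's dependency metadata.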

+ self.seq_lens.copy_(self.optimistic_seq_lens_cpu, non_blocking=True)

- cum_num_tokens, _ = self._get_cumsum_and_arange(num_scheduled_tokens)
+ cum_num_tokens = self._get_cumsum_and_arange(
+     num_scheduled_tokens, self.query_pos.np
+ )
self.query_start_loc.np[1 : num_reqs + 1] = cum_num_tokens
self.query_start_loc.copy_to_gpu()
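
For readers unfamiliar with the helper, here is a pure-Python sketch of what the two-argument call is assumed to do: return the running total of scheduled tokens per request and write each token's position within its request into a caller-supplied buffer. `cumsum_and_arange` is a hypothetical illustration using plain lists; the real `_get_cumsum_and_arange` works on NumPy arrays and its exact behavior is not shown in this diff.

```python
from itertools import accumulate


def cumsum_and_arange(num_tokens: list[int], arange_out: list[int]) -> list[int]:
    """Return cumulative token counts and fill arange_out with per-request positions.

    Assumed semantics: the positions restart at 0 for every request and are
    written in place; entries past the total token count are left untouched.
    """
    cumsum = list(accumulate(num_tokens))
    total = cumsum[-1] if cumsum else 0
    if len(arange_out) < total:
        raise ValueError("arange_out is too small for the scheduled tokens")
    pos = 0
    for count in num_tokens:
        for offset in range(count):
            arange_out[pos] = offset
            pos += 1
    return cumsum


# Three requests scheduled with 2, 5, and 3 tokens.
query_pos = [-1] * 12  # preallocated buffer, larger than needed
cum_num_tokens = cumsum_and_arange([2, 5, 3], query_pos)
# Start offsets of each request, with a leading 0 as in query_start_loc.
query_start_loc = [0] + cum_num_tokens
```

Writing into a preallocated buffer avoids allocating a fresh array on every call, which is presumably why the new signature takes an output argument.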
