
Add tensor IPC transfer mechanism for multimodal data#32104

Open
brandonpelfrey wants to merge 63 commits into vllm-project:main from brandonpelfrey:tensor-ipc

Conversation

@brandonpelfrey

@brandonpelfrey brandonpelfrey commented Jan 11, 2026

Introduce Multimodal Content Tensor IPC/SHMEM Data Path

Following a request to break down the RFC/PR in #31925, this PR introduces an IPC/SHMEM pathway for sending multimodal content from the API Server to CoreEngine processes via multiprocessing Queues. Part of the intent is to reduce the size of the original PR by splitting out easier-to-review components that are required for the complete solution.

Note that this pathway is only used when the multimodal processing cache is disabled: the cache mechanism replaces tensors with integers, so cached tensors never travel over this new multiprocessing queue. While this is a known limitation, the pathway is still useful in many situations where a single-input-prompt task, e.g. video captioning, is run.

Purpose

In the above-mentioned RFC/PR, we demonstrated a method for enabling multi-GPU scaling in video-decode-heavy workloads. This requires a means of passing HW video decode results (sitting in VRAM in the API Server process) to CoreEngine process(es). When used with CUDA-device tensors, this IPC/SHMEM mechanism provides fast data transfer and avoids any GPU->CPU->GPU copies.
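For readers unfamiliar with the underlying mechanism: torch.multiprocessing shares tensor storage across processes by sending a small handle (a shared-memory segment name, or a CUDA IPC handle for VRAM) through the queue rather than the payload bytes themselves. A minimal CPU-only analogue using only the Python standard library (none of this is vLLM code; the names are illustrative):

```python
# CPU-only analogue of the shared-memory transfer this PR relies on.
# torch.multiprocessing.Queue does the equivalent for tensors, using
# OS shared memory for CPU storage and CUDA IPC handles for VRAM.
from multiprocessing import shared_memory

def send_blob(data: bytes) -> tuple[str, int]:
    """Place the payload in a shared segment; only a tiny handle
    (segment name + size) needs to cross the process boundary."""
    shm = shared_memory.SharedMemory(create=True, size=len(data))
    shm.buf[: len(data)] = data
    name = shm.name
    shm.close()  # the named segment persists until unlinked (POSIX)
    return name, len(data)

def recv_blob(name: str, size: int) -> bytes:
    """Attach to the segment by name; no payload bytes were queued."""
    shm = shared_memory.SharedMemory(name=name)
    out = bytes(shm.buf[:size])
    shm.close()
    shm.unlink()  # in this sketch the receiver frees the segment
    return out
```

In the PR, the small handle travels inside the serialized message while the storage itself is mapped directly in the consumer process, which is what avoids the GPU->CPU->GPU round trip for CUDA tensors.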

Test Plan

For testing, I am relying on existing CI coverage. Functional testing includes both a vllm serve + vllm bench combination (see below) and using this PR together with the above-mentioned PR to demonstrate that it also works in the GPU zero-copy case.

Serve command and Bench Commands

vllm serve nvidia/cosmos-reason1-7b \
    --limit-mm-per-prompt '{"video": 1}' \
    --allowed-local-media-path / \
    --enable-log-requests --disable-log-stats \
    --mm-processor-cache-gb 0 \
    --trust-remote-code \
    --tensor-parallel-size 1 \
    --max-model-len 65536 \
    --gpu-memory-utilization 0.6 \
    --no-enforce-eager \
    --max-num-seqs 64 \
    --no-enable-prefix-caching \
    --api-server-count 3 \
    --media-io-kwargs '{"video":{"num_frames":10}}' \
    --mm-processor-kwargs '{"size":{"shortest_edge":100352,"longest_edge":151200}}' \
    --maximum-concurrent-videos 140

vllm bench serve \
    --endpoint /v1/chat/completions --backend openai-chat \
    --model nvidia/cosmos-reason1-7b \
    --dataset-name sharegpt \
    --dataset-path '$DATASET_PATH' \
    --save-result \
    --save-detailed \
    --disable-shuffle \
    --num-warmups 20 \
    --num-prompts 1000 --max-concurrency 400 \
    --sharegpt-output-len 128

Test Results

CPU: AMD EPYC 9124 16-Core Processor
GPU: H100
Memory: 512GB
uname -r: 6.14.0-37-generic

Without IPC Tensor Datapath enabled

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             400
Benchmark duration (s):                  103.91
Total input tokens:                      16003
Total generated tokens:                  33806
Request throughput (req/s):              9.62
Output token throughput (tok/s):         325.35
Peak output token throughput (tok/s):    1636.00
Peak concurrent requests:                454.00
Total token throughput (tok/s):          479.36
---------------Time to First Token----------------
Mean TTFT (ms):                          34909.64
Median TTFT (ms):                        31583.43
P99 TTFT (ms):                           75718.16
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          86.46
Median TPOT (ms):                        26.23
P99 TPOT (ms):                           638.71
---------------Inter-token Latency----------------
Mean ITL (ms):                           224.89
Median ITL (ms):                         77.25
P99 ITL (ms):                            4442.88
==================================================

With IPC Tensor Datapath enabled

============ Serving Benchmark Result ============
Successful requests:                     1000
Failed requests:                         0
Maximum request concurrency:             400
Benchmark duration (s):                  104.25
Total input tokens:                      16003
Total generated tokens:                  33468
Request throughput (req/s):              9.59
Output token throughput (tok/s):         321.05
Peak output token throughput (tok/s):    2624.00
Peak concurrent requests:                464.00
Total token throughput (tok/s):          474.56
---------------Time to First Token----------------
Mean TTFT (ms):                          34538.27
Median TTFT (ms):                        29643.85
P99 TTFT (ms):                           76100.64
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          120.09
Median TPOT (ms):                        75.44
P99 TPOT (ms):                           856.38
---------------Inter-token Latency----------------
Mean ITL (ms):                           255.97
Median ITL (ms):                         77.76
P99 ITL (ms):                            5233.55
==================================================

After multiple runs, it appears that the performance is approximately identical (within noise).


Essential Elements of an Effective PR Description Checklist
  • [x] The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • [x] The test plan, such as providing test command.
  • [x] The test results, such as pasting the results comparison before and after, or e2e results
  • [x] (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

Note

Implements zero-copy IPC for multimodal tensors and wires it through engine/client startup with configs and tests.

  • Introduces TensorIpcData/TensorIpcHandle and updates MsgpackEncoder/MsgpackDecoder to send/receive CUDA and CPU tensors via torch.multiprocessing.Queue (per-engine queues), routed by set_target_engine; falls back to standard serialization when disabled/unavailable
  • Plumbs tensor queues through engine lifecycle: created per DP engine in vllm.v1.engine.utils, included in handshake metadata (index only), passed to EngineCoreProc/DPEngineCoreProc, and used by input decoders; CoreEngineClient configures encoder with queues and IPC setting
  • Adds configuration and flags: MultiModalConfig gains max_concurrent_videos and multimodal_tensor_ipc; ModelConfig/arg parsing expose --maximum-concurrent-videos and --enable/--disable-multimodal-tensor-ipc; new env VLLM_MULTIMODAL_TENSOR_IPC (default True)
  • Starts API servers with shared tensor_queues via APIServerProcessManager; CLI serve passes queues
  • Adds comprehensive tests in tests/v1/test_tensor_ipc_queue.py for CUDA/CPU IPC, multiple producers, buffer management, and IPC disablement
  • Minor fixes: base64-encode image tensors on CPU; only pin memory for CPU tensors during concatenation

Written by Cursor Bugbot for commit d527841933b4dcd62a95d7e1ab58455f9b0cc88f. This will update automatically on new commits. Configure here.
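The encoder-side routing described in the first bullet can be sketched roughly as follows. These are stand-in names (SketchEncoder, encode_tensor), not the actual MsgpackEncoder/TensorIpcHandle implementations, and a plain queue.Queue stands in for torch.multiprocessing.Queue:

```python
# Hypothetical sketch of the routing described above: when IPC is
# enabled and a queue exists for the target engine, a tensor payload
# travels out-of-band and only a small handle dict goes in the message;
# otherwise the encoder falls back to ordinary serialization.
import pickle
from dataclasses import dataclass, field
from queue import Queue  # stands in for torch.multiprocessing.Queue

@dataclass
class SketchEncoder:
    tensor_queues: dict[int, Queue] = field(default_factory=dict)
    ipc_enabled: bool = True
    target_engine: int = 0

    def set_target_engine(self, idx: int) -> None:
        self.target_engine = idx

    def encode_tensor(self, tensor_id: str, payload: bytes):
        q = self.tensor_queues.get(self.target_engine)
        if self.ipc_enabled and q is not None:
            q.put((tensor_id, payload))       # out-of-band transfer
            return {"ipc_handle": tensor_id}  # tiny handle in the message
        return pickle.dumps(payload)          # fallback: inline bytes
```

The decoder side mirrors this: on seeing a handle dict it pulls the matching payload from its queue, and on seeing inline bytes it deserializes as usual.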



Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a significant feature: an IPC/SHMEM pathway for multimodal tensors to improve performance in multi-GPU setups. The changes are extensive, touching configuration, argument parsing, engine core logic, and serialization utilities. The addition of comprehensive tests for the new IPC queue functionality is commendable. I've identified a critical issue in the engine core logic that could break data parallelism for non-MoE models, along with a couple of important bug fixes for handling tensors on CUDA devices in multimodal data processing.

@mergify
Copy link

mergify bot commented Jan 11, 2026

Hi @brandonpelfrey, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy or markdownlint failing?
mypy and markdownlint are run differently in CI. If the failure is related to either of these checks, please use the following commands to run them locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10
# For markdownlint
pre-commit run --hook-stage manual markdownlint


@DarkLight1337
Member

Just to be sure , the benchmarks in the PR description are run without GPU preprocessing right?

@brandonpelfrey
Author

Just to be sure , the benchmarks in the PR description are run without GPU preprocessing right?

Correct. There is no GPU preprocessing on the API Server. Note: I'm currently resolving some of the bot-identified issues (formatting, etc.).

@brandonpelfrey
Author

Force-pushed purely to resolve DCO.

Member

@njhill njhill left a comment


@brandonpelfrey apologies again for taking so long to re-review. It's quite a large PR with nontrivial changes to core parts of the code, so I needed to dedicate some time to it.

Generally it would be appreciated if you could spend more time on self-review.

And it would still be good to understand how we're thinking about reconciling this with the existing mm shm-based tensor propagation (i.e. from #20452).

@mergify

mergify bot commented Mar 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brandonpelfrey.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Mar 9, 2026

Hi @brandonpelfrey, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10



@brandonpelfrey
Author

brandonpelfrey commented Mar 13, 2026

@njhill I believe all the concerns that Nick and I discussed about separating tensor-transfer state from the Msgspec Encoder/Decoder have been addressed.

EngineZmqAddresses is no longer used to pass tensor_queues between processes; the tensor_queue_index is still present. I would propose that this data structure now more generally holds information about how processes can reach each other, in which case the queue index makes sense, and the class could perhaps be renamed to reflect that slightly more generic purpose.

Unused serialization/deserialization pathways were identified via coverage maps and removed.

Latest testing succeeded:

$ VLLM_LOGGING_LEVEL=DEBUG vllm serve llava-hf/llava-1.5-7b-hf \
    --host 127.0.0.1 --port 8000 \
    --limit-mm-per-prompt.image 1 \
    --max-model-len 4096 \
    --allowed-local-media-path /

curl http://127.0.0.1:8000/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d '{
      "model": "llava-hf/llava-1.5-7b-hf",
      "messages": [
        {
          "role": "user",
          "content": [
            {"type": "text", "text": "Describe this image briefly."},
            {
              "type": "image_url",
              "image_url": {
                "url": "file:///workspace/vllm/tests/v1/ec_connector/integration/hato.jpg"
              }
            }
          ]
        }
      ],
      "max_tokens": 64
    }'
{"id":"chatcmpl-ac21e4b789bbc4ba","object":"chat.completion","created":1773374565,"model":"llava-hf/llava-1.5-7b-hf","choices":[{"index":0,"message":{"role":"assistant","content":" The image captures a group of pigeons walking in a row along the edge of a brick brick courtyard. There are at least eleven pigeons amongst the group.\n\nAround the pigeons, some pedestrians are noticeable, with one person standing near the bottom left side","refusal":null,"annotations":null,"audio":null,"function_call":null,"tool_calls":[],"reasoning":null},"logprobs":null,"finish_reason":"length","stop_reason":null,"token_ids":null}],"service_tier":null,"system_fingerprint":null,"usage":{"prompt_tokens":593,"total_tokens":657,"completion_tokens":64,"prompt_tokens_details":null},"prompt_logprobs":null,"prompt_token_ids":null,"kv_transfer_params":null}

I've confirmed in the debug logs that tensors are sent and received as expected.

(APIServer pid=608529) INFO:     Started server process [608529]
(APIServer pid=608529) INFO:     Waiting for application startup.
(APIServer pid=608529) INFO:     Application startup complete.
(APIServer pid=608529) DEBUG 03-13 03:59:49 [v1/sample/logits_processor/__init__.py:63] No logitsprocs plugins installed (group vllm.logits_processors).
(APIServer pid=608529) DEBUG 03-13 03:59:49 [v1/engine/tensor_ipc.py:94] Sent tensor 131200840719280_0 for request chatcmpl-8e72c411f307d2b8-ad244d70 (shape=torch.Size([3, 336, 336]), device=cpu) via IPC queue (shared memory)
(EngineCore pid=608624) DEBUG 03-13 03:59:49 [v1/engine/tensor_ipc.py:168] Received tensor 131200840719280_0 for request chatcmpl-8e72c411f307d2b8-ad244d70 (shape=torch.Size([3, 336, 336]), device=cpu) via IPC queue (shared memory)
(EngineCore pid=608624) DEBUG 03-13 03:59:49 [v1/engine/core.py:1185] EngineCore loop active.
(EngineCore pid=608624) DEBUG 03-13 03:59:49 [v1/worker/gpu_model_runner.py:3655] Running batch with cudagraph_mode: NONE, batch_descriptor: BatchDescriptor(num_tokens=593, num_reqs=None, uniform=False, has_lora=False, num_active_loras=0), should_ubatch: False, num_tokens_across_dp: None


Member

@njhill njhill left a comment


@brandonpelfrey I have done another pass. I think there is still some work needed, and also some basic issues that could have been caught by self-review and/or AI review.

I think we can still better decouple this from the encoder/decoder by slightly generalizing the existing aux_buffer handling: for example, the last ("data") element of the tuple could be a dict that the sender returns rather than an int, and that opaque dict would just be passed through to the receiver. Then the encoder/decoder wouldn't need any knowledge of the request id, etc.

I also feel the changes to the core classes are still too big, but I need to spend a bit more time on that.
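The shape of the suggested generalization can be sketched with a toy codec. Everything here is hypothetical (FakeTensor, the hook names, and the `__tensor_ipc__` marker are illustrative, not vLLM's actual msgpack encoder): tensors are swapped for whatever opaque dict a sender hook returns, and the decoder hands that dict back to a receiver hook unchanged, so the codec itself knows nothing about request ids or IPC details.

```python
# Toy stand-in for a tensor; real code would use torch.Tensor.
class FakeTensor:
    def __init__(self, data):
        self.data = data

def encode(obj, send_hook):
    """Replace tensors with an opaque dict produced by the sender's hook."""
    if isinstance(obj, FakeTensor):
        return {"__tensor_ipc__": send_hook(obj)}  # opaque to the codec
    if isinstance(obj, dict):
        return {k: encode(v, send_hook) for k, v in obj.items()}
    return obj

def decode(obj, recv_hook):
    """Hand the opaque dict back to the receiver's hook, untouched."""
    if isinstance(obj, dict):
        if "__tensor_ipc__" in obj:
            return recv_hook(obj["__tensor_ipc__"])
        return {k: decode(v, recv_hook) for k, v in obj.items()}
    return obj

# Stand-in transport: a real sender would return a shm/CUDA-IPC handle here.
store = {}

def send_hook(t):
    handle = {"key": id(t)}
    store[handle["key"]] = t.data
    return handle

def recv_hook(handle):
    return FakeTensor(store[handle["key"]])

msg = {"pixel_values": FakeTensor([1, 2, 3]), "meta": "x"}
out = decode(encode(msg, send_hook), recv_hook)
print(out["pixel_values"].data, out["meta"])
```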

raise ValueError(
"torch_shm is known to fail without "
"VLLM_WORKER_MULTIPROC_METHOD set to spawn"
)
Member


Is there a reason for this check being here rather than along with the other validation in config/model.py?

Actually I think there is somewhere that we auto-default to spawn based on certain other conditions, could check whether that logic could be updated instead.

Author

@brandonpelfrey brandonpelfrey Mar 17, 2026


@njhill This mimics existing logic elsewhere in this same file, where WhisperForConditionalGeneration enforces spawn the same way by checking the method in post_init. In general, when I've found an established pattern, I have tried to mimic it. Is it preferable to have this in config/model.py?

I looked for your "auto-default to spawn based on certain conditions": entrypoints/utils.py has a function to establish the default to spawn. It is not based on any other logic or configuration though, just a default.

Member


Ah ok, I was thinking of this method

def _maybe_force_spawn():
.. but I guess that is only based on some other global state and not the config.

It's not ideal that that function might end up being called after the checks here (I think in the vllm serve case it may end up being called first but only by accident)... we really need to clean up our config validation / imputation approach imo (not for this PR of course).

Anyhow I guess here it might be better to set to spawn if not already set (and log info message if changed), rather than failing?

Author


This is what we were doing before. I agree that all of these checks could likely be centralized in a different PR (I can help with that later too). I actually think a failure is better: spawn is already the default, so if the user specified something else, they may need it for a particular reason, and silently changing it with only a log message could be missed until some later failure needs to be diagnosed. An early failure is preferable. Thoughts?
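The fail-fast behavior being argued for above could look roughly like this; the function name, env-var lookup, and default handling are illustrative, not vLLM's actual validation code:

```python
import os

def validate_mp_method(required: str = "spawn") -> str:
    """Hypothetical sketch of the fail-fast check discussed above."""
    method = os.environ.get("VLLM_WORKER_MULTIPROC_METHOD", required)
    if method != required:
        # Fail early: a user who explicitly chose another method may depend
        # on it, and a silent override would only surface as a confusing
        # failure much later.
        raise ValueError(
            f"torch_shm requires VLLM_WORKER_MULTIPROC_METHOD={required!r}, "
            f"got {method!r}"
        )
    return method
```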

Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
Signed-off-by: Brandon Pelfrey <bpelfrey@nvidia.com>
from vllm.v1.utils import tensor_data

if TYPE_CHECKING:
from vllm.v1.engine.tensor_ipc import TensorIpcReceiver, TensorIpcSender
Author


I don't see this as a problem, but just to explain why these are imported this way: if we import from tensor_ipc directly, tensor_ipc imports "from vllm.v1.engine import EngineCoreRequestType" to get an enum. That enum is defined in the package's __init__, which triggers a circular import, so I import under TYPE_CHECKING to break the cycle. In my mind, moving EngineCoreRequestType to a separate Python module that avoids re-importing the entire engine package would be preferable, but I am not trying to complicate this PR any further.
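The deferred-import pattern described above, in isolation: names needed only for type annotations are imported under `typing.TYPE_CHECKING`, which is True for static type checkers but False at runtime, so the import never executes and the runtime cycle is broken. The `attach_sender` helper below is purely illustrative.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Seen by type checkers only; never executed at runtime, so it cannot
    # participate in a circular import.
    from vllm.v1.engine.tensor_ipc import TensorIpcSender

def attach_sender(sender: "TensorIpcSender") -> str:
    # Quoted annotation, so the name need not exist at runtime.
    return type(sender).__name__

print(attach_sender(object()))
```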

@brandonpelfrey
Author

brandonpelfrey commented Mar 18, 2026

@njhill Latest issues have been addressed. I did not quite understand the particular ask to model TensorIpcHandle as a variant of aux_buffers with a different type in one element of the tuple; based on my understanding, it sounded a little more difficult to reason about. If you have a particular implementation suggestion, please help me understand it more explicitly.

I also feel the changes to the core classes are still too big, but I need to spend a bit more time on that.

Given that these queues need to be made available across multiple configs and processes, I believe I have minimized the changes to the core classes. However, because this is a new communication mechanism, I have not heard another suggestion that would avoid them.

If we cannot merge this before Friday I would suggest we use part of the Friday meeting to close on any remaining issues on this PR.

I've walked over the full change again to make sure there are no remaining references to multiple queues (since we only support one) and no lingering leftovers from old changes, stray whitespace, etc. In parallel, I'm asking agents to review this.

Thanks again for your help on this!


Labels

frontend multi-modality Related to multi-modality (#4194) performance Performance-related issues v1

Projects

Status: Todo
Status: To Triage


5 participants