[core] [DO NOT REVIEW] Enabling Zero-Copy Video with PyNvVideoCodec and IPC#31925
brandonpelfrey wants to merge 3 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces hardware-accelerated video decoding using PyNvVideoCodec and an IPC mechanism for zero-copy tensor sharing between processes. The changes include adding the new video backend, implementing a semaphore to limit concurrent video processing, and setting up the necessary IPC queues and serialization logic. The overall implementation is comprehensive, covering configuration, testing, and integration into the serving endpoints. My main feedback is to pin the new PyNvVideoCodec dependency to ensure reproducible builds.
torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
# FlashInfer should be updated together with the Dockerfile
flashinfer-python==0.5.3
PyNvVideoCodec
The PyNvVideoCodec dependency is added without a specific version. This can lead to non-reproducible builds if a new version of the library is released with breaking changes. To ensure stability and reproducibility, it's recommended to pin this dependency to a specific version, for example: PyNvVideoCodec==X.Y.Z.
class VideoMediaIO(MediaIO[tuple[npt.NDArray, dict[str, Any]]]):
@VIDEO_LOADER_REGISTRY.register("pynvvideocodec")
Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?
In general, how much VRAM do you expect this to use?
> Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?
That is correct. This PR does not yet include it (it was raised in the Slack thread for this work), but for now we need to check for, and raise an error on, any attempt to use this decode+IPC path with more than one GPU. There is currently no handling for cross-GPU frame tensor data, and at the time a frontend request is made there is no way to know the "destination" GPU in a DP>1, TP=1 case. For now, the supported way to use this feature is to run multiple vLLM instances, each with a single GPU exposed, in which case this approach works. In internal testing we have done this by launching N containers, each with vLLM and one GPU.
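The missing safeguard could look roughly like this (a hedged sketch, not code from this PR; the function name and the way the visible-GPU count is passed in are assumptions):

```python
# Hypothetical safeguard (not in this PR): reject the PyNvVideoCodec IPC
# path when more than one GPU is visible, since decoded frames always land
# on GPU 0 and cannot yet be routed to another device.
def check_single_gpu_for_nvdec(backend: str, visible_gpu_count: int) -> None:
    if backend != "pynvvideocodec":
        return  # other video backends are unaffected
    if visible_gpu_count > 1:
        raise ValueError(
            "The pynvvideocodec video backend with IPC currently supports "
            "only a single visible GPU (DP=TP=1). Restrict "
            "CUDA_VISIBLE_DEVICES or run one vLLM instance per GPU."
        )
```

In practice `visible_gpu_count` would come from something like `torch.cuda.device_count()` at engine startup.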
> In general, how much VRAM do you expect this to use?
This is also something I am gathering data on from the PyNvVideoCodec owners right now. @benchislett's suggestion has been to perform some form of "maximum" video decode during the normal memory-estimation pass so that the decoder's memory is naturally accounted for. If we instead need to account for it statically, the data from the PyNvVideoCodec team can be used. I'll report back once I have that data.
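For the static-accounting fallback, a rough upper bound can be computed from the worst-case decoded-surface footprint (an illustrative sketch only; the surface count, resolution, and NV12 assumption are mine, not numbers from the PyNvVideoCodec team):

```python
def estimate_decode_vram_bytes(num_surfaces: int, height: int, width: int,
                               bytes_per_pixel: float = 1.5) -> int:
    """Upper bound on decoded-surface memory; NV12 stores ~1.5 bytes/pixel.

    This ignores the decoder's internal working memory, which would need
    to come from measurement or vendor data.
    """
    return int(num_surfaces * height * width * bytes_per_pixel)

# e.g. budget for 8 in-flight 4K surfaces
budget = estimate_decode_vram_bytes(8, 2160, 3840)
```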
> it is not possible at the time a frontend request is made to know what the "destination" GPU would be in a DP>1 TP=1 case, so at this point, the supported path to exploit this feature would be to run multiple instances of vLLM, each with one GPU exposed, in which case this approach can work. The way we have been doing this in internal testing is to launch N containers, each with vLLM and one GPU.
This won't work for TP > 1 either, right? All of the API server processes can see all GPUs, while each model runner process only sees one of them.
That's correct. This specifically supports DP=1, TP=1 at this point. We can support DP only via containers or similar instancing mechanisms, and TP>1 would require some kind of broadcast mechanism in the future. Let me update the RFC to contrast the approaches discussed so far, both internally and in the thread. For simultaneously supporting single-GPU use and scaling via instancing, this currently appears to be one of the better options immediately available.
assert self.aux_buffers is not None

# Check if this is a CUDA tensor and we have queues available
if (
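The branch in this hunk can be sketched in isolation like so (a toy model, not vLLM's actual serializer; `FakeTensor`, the queue, and the return encodings are all illustrative stand-ins):

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    device: str  # "cuda" or "cpu"


class Serializer:
    """Toy model of the custom-serialization branch under discussion."""

    def __init__(self, tensor_queue=None):
        # tensor_queue stands in for the zero-copy IPC queue; aux_buffers
        # is the existing fallback side-channel for tensor payloads.
        self.tensor_queue = tensor_queue
        self.aux_buffers: list[FakeTensor] = []

    def encode(self, t: FakeTensor) -> str:
        # CUDA tensor and a queue is available -> hand over zero-copy
        if t.device == "cuda" and self.tensor_queue is not None:
            self.tensor_queue.append(t)
            return f"queue:{len(self.tensor_queue) - 1}"
        # Otherwise fall back to the aux-buffer path
        self.aux_buffers.append(t)
        return f"aux:{len(self.aux_buffers) - 1}"
```

The receiving side would resolve `queue:`-style placeholders back into tensors popped from the shared queue.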
Have you tried using the tensor queue for CPU tensors as well? I wonder whether it's any better than the previous implementation
I have not, though honestly I think it would work well. If any processing on the frontend ends in PyTorch tensors, it could presumably use this pathway. To limit unintended side effects, I've scoped it down to what you see here. As a follow-up we could do that testing and relax the restriction to all tensors if it proves safe and faster.
> If any processing on the frontend ends in pytorch tensors
Basically all of the HF processors return a dictionary of tensors
If you can demonstrate that the tensor queue works well on CPU tensors, perhaps we could implement that separately first before the main content of this PR
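A quick way to get that data would be a micro-benchmark comparing a pickle round-trip against a shared-memory handover for large CPU arrays (a rough stand-in for sending CPU tensors through the tensor queue; numpy is used instead of torch to keep the sketch dependency-light, and the timings are only indicative):

```python
import pickle
import time

import numpy as np
from multiprocessing import shared_memory

arr = np.random.rand(16, 3, 224, 224).astype(np.float32)

# Baseline: serialize + deserialize through pickle (two full copies).
t0 = time.perf_counter()
restored = pickle.loads(pickle.dumps(arr))
pickle_s = time.perf_counter() - t0

# Shared-memory handover: one copy in; a consumer process would map the
# same segment by name and read it with no further copies.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
t0 = time.perf_counter()
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr
shared_s = time.perf_counter() - t0

roundtrip_ok = np.array_equal(restored, arr)
shared_ok = np.array_equal(view, arr)
del view  # release the exported buffer before closing the segment
shm.close()
shm.unlink()
print(f"pickle: {pickle_s * 1e3:.2f} ms, shared-memory copy: {shared_s * 1e3:.2f} ms")
```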
Force-pushed from d238030 to 85f4f56 (Compare)
This pull request has merge conflicts that must be resolved before it can be merged.
NOT (yet) FOR GENERAL REVIEW
Introduces HW-accelerated video decode on NVIDIA GPUs, which lowers the CPU-utilization bottleneck caused by CPU-based decoders under high concurrency. As noted in the RFC, this is especially helpful when video decode is a significant portion of a request's total lifetime, i.e. when the target output token count is small, as in video captioning.
This is a draft PR for communicating implementation details in the vLLM slack and needs a bit more work before a true review takes place. This code was used to generate the data in the RFC.
This PR implements (details in the RFC):
NOTE: This implementation lacks a small safeguard mechanism, as it is only valid in single-GPU DP=TP=1 cases. There are still several meaningful use cases, for example where scaling on a node is achieved by running multiple containers, each with one GPU/vLLM instance.
Purpose
See RFC.
Test Plan
N/A, not for review.
Test Result
N/A, not for review. Benchmarking results are in the RFC.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.