[core] [DO NOT REVIEW] Enabling Zero-Copy Video with PyNvVideoCodec and IPC#31925

Draft
brandonpelfrey wants to merge 3 commits into vllm-project:main from brandonpelfrey:gpu-video-ipc

Conversation

@brandonpelfrey brandonpelfrey commented Jan 7, 2026

NOT (yet) FOR GENERAL REVIEW

Introduces HW-accelerated video decode on NVIDIA GPUs, which alleviates the CPU utilization bottlenecks caused by CPU-based decoders at high concurrency. Note from the RFC that this is especially helpful in use cases where video decode is a significant portion of a request's total lifetime, i.e. when the target output token count is small, as is the case in video captioning.

This is a draft PR for communicating implementation details in the vLLM slack and needs a bit more work before a true review takes place. This code was used to generate the data in the RFC.

This PR implements (details in the RFC):

  • A new PyNvVideoCodec-based, HW-accelerated VideoLoader implementation which decodes directly into VRAM
  • A multiprocessing-Queue, GPU IPC-based mechanism for zero-transfer/zero-copy sending of video frame data to CoreEngine processes
  • A rate-limiting mechanism which only allows a fixed number of in-flight video requests using this decoder to be serviced, avoiding a situation where a burst of video requests would decode faster than inference and exhaust VRAM

NOTE: This implementation is lacking a small safeguard mechanism, as it is only valid in the single-GPU DP=TP=1 case. There are several use cases where this is still meaningful, for example where scaling on a node is achieved by running multiple containers, each with one GPU/vLLM instance.
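The missing safeguard described above could look something like the following. All names are hypothetical; the PR explicitly notes this check is not yet implemented:

```python
# Hypothetical safeguard sketch (names assumed, not the PR's code): refuse
# the GPU-IPC video decode path unless the deployment is a single-GPU
# DP=TP=1 instance.
def check_gpu_video_ipc_supported(
    data_parallel_size: int, tensor_parallel_size: int
) -> None:
    if (data_parallel_size, tensor_parallel_size) != (1, 1):
        raise ValueError(
            "The PyNvVideoCodec GPU-IPC video path requires DP=TP=1 "
            f"(got DP={data_parallel_size}, TP={tensor_parallel_size}); "
            "scale out by running one single-GPU vLLM instance per GPU."
        )
```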

Purpose

See RFC.

Test Plan

N/A, not for review.

Test Result

N/A, not for review. Benchmarking results are in the RFC.


Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR, such as "Fix some issue (link existing issues this PR will resolve)".
  • The test plan, such as providing test command.
  • The test results, such as pasting the results comparison before and after, or e2e results
  • (Optional) The necessary documentation update, such as updating supported_models.md and examples for a new model.
  • (Optional) Release notes update. If your change is user facing, please update the release notes draft in the Google Doc.

github-actions bot commented Jan 7, 2026

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which executes a small, essential subset of CI tests to quickly catch errors.

You can ask your reviewers to trigger select CI tests on top of fastcheck CI.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

🚀

@brandonpelfrey brandonpelfrey marked this pull request as draft January 7, 2026 22:09
@mergify mergify bot added ci/build frontend multi-modality Related to multi-modality (#4194) nvidia v1 labels Jan 7, 2026

mergify bot commented Jan 7, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brandonpelfrey.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces hardware-accelerated video decoding using PyNvVideoCodec and an IPC mechanism for zero-copy tensor sharing between processes. The changes include adding the new video backend, implementing a semaphore to limit concurrent video processing, and setting up the necessary IPC queues and serialization logic. The overall implementation is comprehensive, covering configuration, testing, and integration into the serving endpoints. My main feedback is to pin the new PyNvVideoCodec dependency to ensure reproducible builds.

torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
# FlashInfer should be updated together with the Dockerfile
flashinfer-python==0.5.3
PyNvVideoCodec
Contributor


Severity: high

The PyNvVideoCodec dependency is added without a specific version. This can lead to non-reproducible builds if a new version of the library is released with breaking changes. To ensure stability and reproducibility, it's recommended to pin this dependency to a specific version, for example: PyNvVideoCodec==X.Y.Z.



class VideoMediaIO(MediaIO[tuple[npt.NDArray, dict[str, Any]]]):
@VIDEO_LOADER_REGISTRY.register("pynvvideocodec")
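The decorator in the excerpt above suggests a name-keyed loader registry. A minimal sketch of that pattern (hypothetical and simplified; vLLM's actual VIDEO_LOADER_REGISTRY may differ in detail):

```python
# Minimal decorator-based registry sketch (for illustration only).
class VideoLoaderRegistry:
    def __init__(self) -> None:
        self._loaders: dict[str, type] = {}

    def register(self, name: str):
        def wrap(cls: type) -> type:
            self._loaders[name] = cls  # map backend name -> loader class
            return cls
        return wrap

    def get(self, name: str) -> type:
        return self._loaders[name]


VIDEO_LOADER_REGISTRY = VideoLoaderRegistry()


@VIDEO_LOADER_REGISTRY.register("pynvvideocodec")
class PyNvVideoCodecLoader:
    """Stand-in for the HW-accelerated loader class."""
```

Selecting the backend by name (e.g. from a config value) then resolves to the registered class without the call site importing it directly.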
Member


Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?

Member

@DarkLight1337 DarkLight1337 Jan 8, 2026


In general, how much VRAM do you expect this to use?

Author


Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?

That is correct. Not present in this PR at the moment (but raised in the Slack thread for this work), we need to check and raise an error if a user attempts to use this decode+IPC path with more than one GPU for the time being. There is no handling at the moment for cross-GPU frame tensor data, and it is not possible at the time a frontend request is made to know what the "destination" GPU would be in a DP>1 TP=1 case. So at this point, the supported path to exploit this feature is to run multiple instances of vLLM, each with one GPU exposed, in which case this approach works. The way we have been doing this in internal testing is to launch N containers, each with vLLM and one GPU.

In general, how much VRAM do you expect this to use?

This is also something I am getting data from the pynvvideocodec owners right now. @benchislett's suggestion has been that we do some form of "maximum" video decoding during the normal memory estimation so we naturally account for it. In the case that we need to statically 'account' for it, the data from the pynvvideocodec team can be used. I'll report back that data once I have it.
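While that data is pending, a back-of-envelope bound on the raw size of decoded frames is straightforward (the numbers below are illustrative assumptions, not measured data from the PyNvVideoCodec team):

```python
# Back-of-envelope sketch (illustrative, not measured data): raw VRAM held
# by decoded frames before the model consumes them.
def decoded_frames_bytes(
    num_frames: int,
    height: int,
    width: int,
    channels: int = 3,       # RGB
    bytes_per_element: int = 1,  # uint8
) -> int:
    return num_frames * height * width * channels * bytes_per_element


if __name__ == "__main__":
    # e.g. 32 sampled 1080p RGB uint8 frames per clip:
    n = decoded_frames_bytes(32, 1080, 1920)
    print(f"{n / 2**20:.0f} MiB")  # ~190 MiB per clip
```

Decoder-internal buffers (DPBs, NV12 surfaces, etc.) add to this, which is presumably what the data from the PyNvVideoCodec team would capture.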

Member

@DarkLight1337 DarkLight1337 Jan 8, 2026


it is not possible at the time a frontend request is made to know what the "destination" GPU would be in a DP>1 TP=1 case, so at this point, the supported path to exploit this feature would be to run multiple instances of vLLM, each with one GPU exposed, in which case this approach can work. The way we have been doing this in internal testing is to launch N containers, each with vLLM and one GPU.

This won't work for TP > 1 either right? Since all of the API server processes can see all GPUs, while each model runner process only see 1 of them.

Author

@brandonpelfrey brandonpelfrey Jan 8, 2026


That's correct. This specifically supports DP=1 TP=1 at this point. We can support DP only via containers or similar instancing mechanisms, and TP>1 would require some kind of broadcast mechanism in the future. Let me update the RFC to contrast the various approaches discussed so far, both internally and in this thread. For simultaneously supporting a single GPU and enabling scaling via instancing, this currently appears to be one of the better options immediately available.

assert self.aux_buffers is not None

# Check if this is a CUDA tensor and we have queues available
if (
Member


Have you tried using the tensor queue for CPU tensors as well? I wonder whether it's any better than the previous implementation

Author


I have not, though honestly I think it should work well. If any processing on the frontend ends in PyTorch tensors, it could presumably use this pathway. To limit any unintended side effects, I've scoped it down to what you see here. As a follow-up we could do that testing and relax this restriction to all tensors if we see it is safe and faster.
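For the CPU-tensor case, the same zero-copy idea can be sketched with stdlib shared memory: the payload stays in a shared segment and only its handle crosses the queue. This is an analogy to the PR's CUDA-IPC path, not its actual code (PyTorch itself shares CUDA tensors across torch.multiprocessing queues via IPC handles in a similar spirit):

```python
from multiprocessing import Process, Queue, shared_memory


def _consumer(q: Queue, out: Queue) -> None:
    name = q.get()
    view = shared_memory.SharedMemory(name=name)  # attach by name: no payload copy
    out.put(bytes(view.buf[:4]))
    view.close()


def send_frame_zero_copy() -> bytes:
    """Hypothetical demo: only a shared-memory handle crosses the queue."""
    shm = shared_memory.SharedMemory(create=True, size=16)
    try:
        shm.buf[:4] = b"RGBA"  # stand-in for decoded frame bytes
        q, out = Queue(), Queue()
        p = Process(target=_consumer, args=(q, out))
        p.start()
        q.put(shm.name)  # the handle, not the data, is serialized
        result = out.get()
        p.join()
        return result
    finally:
        shm.close()
        shm.unlink()


if __name__ == "__main__":
    print(send_frame_zero_copy())
```

This avoids pickling the tensor payload through the queue, which is where the hoped-for win over the previous implementation would come from.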

Member


cc @njhill

Member


If any processing on the frontend ends in pytorch tensors

Basically all of the HF processors return a dictionary of tensors

Member

@DarkLight1337 DarkLight1337 Jan 8, 2026


If you can demonstrate that the tensor queue works well on CPU tensors, perhaps we could implement that separately first before the main content of this PR


mergify bot commented Jan 13, 2026

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @brandonpelfrey.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Jan 13, 2026