[core] [DO NOT REVIEW] Enabling Zero-Copy Video with PyNvVideoCodec and IPC#31925
brandonpelfrey wants to merge 3 commits into vllm-project:main
Conversation
This pull request has merge conflicts that must be resolved before it can be merged.
Code Review
This pull request introduces hardware-accelerated video decoding using PyNvVideoCodec and an IPC mechanism for zero-copy tensor sharing between processes. The changes include adding the new video backend, implementing a semaphore to limit concurrent video processing, and setting up the necessary IPC queues and serialization logic. The overall implementation is comprehensive, covering configuration, testing, and integration into the serving endpoints. My main feedback is to pin the new PyNvVideoCodec dependency to ensure reproducible builds.
torchvision==0.24.0 # Required for phi3v processor. See https://github.com/pytorch/vision?tab=readme-ov-file#installation for corresponding version
# FlashInfer should be updated together with the Dockerfile
flashinfer-python==0.5.3
PyNvVideoCodec
The PyNvVideoCodec dependency is added without a specific version. This can lead to non-reproducible builds if a new version of the library is released with breaking changes. To ensure stability and reproducibility, it's recommended to pin this dependency to a specific version, for example: PyNvVideoCodec==X.Y.Z.
class VideoMediaIO(MediaIO[tuple[npt.NDArray, dict[str, Any]]]):
@VIDEO_LOADER_REGISTRY.register("pynvvideocodec")
Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?
In general, how much VRAM do you expect this to use?
> Hmm from my understanding of this code, this means that GPU 0 will always be used for decoding the videos?
That is correct. This PR does not yet include it (it was raised in the Slack thread for this work), but for now we need to check for, and raise an error on, any attempt to use this decode+IPC path with more than one GPU. There is currently no handling for cross-GPU frame tensor data, and at the time a frontend request is made there is no way to know the "destination" GPU in a DP>1, TP=1 case. For now, the supported way to use this feature is to run multiple vLLM instances, each with a single GPU exposed, in which case this approach works. In internal testing we have done this by launching N containers, each with vLLM and one GPU.
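The missing safeguard could look roughly like this (a hedged sketch, not code from this PR; the function name and the way the visible-GPU count is passed in are assumptions):

```python
# Hypothetical safeguard (not in this PR): reject the PyNvVideoCodec IPC
# path when more than one GPU is visible, since decoded frames always land
# on GPU 0 and cannot yet be routed to another device.
def check_single_gpu_for_nvdec(backend: str, visible_gpu_count: int) -> None:
    if backend != "pynvvideocodec":
        return  # other video backends are unaffected
    if visible_gpu_count > 1:
        raise ValueError(
            "The pynvvideocodec video backend with IPC currently supports "
            "only a single visible GPU (DP=TP=1). Restrict "
            "CUDA_VISIBLE_DEVICES or run one vLLM instance per GPU."
        )
```

In practice `visible_gpu_count` would come from something like `torch.cuda.device_count()` at engine startup.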
> In general, how much VRAM do you expect this to use?
This is also something I am gathering data on from the PyNvVideoCodec owners right now. @benchislett's suggestion has been to perform some form of "maximum" video decode during the normal memory-estimation pass so that the decoder's memory is naturally accounted for. If we instead need to account for it statically, the data from the PyNvVideoCodec team can be used. I'll report back once I have that data.
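For the static-accounting fallback, a rough upper bound can be computed from the worst-case decoded-surface footprint (an illustrative sketch only; the surface count, resolution, and NV12 assumption are mine, not numbers from the PyNvVideoCodec team):

```python
def estimate_decode_vram_bytes(num_surfaces: int, height: int, width: int,
                               bytes_per_pixel: float = 1.5) -> int:
    """Upper bound on decoded-surface memory; NV12 stores ~1.5 bytes/pixel.

    This ignores the decoder's internal working memory, which would need
    to come from measurement or vendor data.
    """
    return int(num_surfaces * height * width * bytes_per_pixel)

# e.g. budget for 8 in-flight 4K surfaces
budget = estimate_decode_vram_bytes(8, 2160, 3840)
```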
> it is not possible at the time a frontend request is made to know what the "destination" GPU would be in a DP>1 TP=1 case, so at this point, the supported path to exploit this feature would be to run multiple instances of vLLM, each with one GPU exposed, in which case this approach can work. The way we have been doing this in internal testing is to launch N containers, each with vLLM and one GPU.
This won't work for TP > 1 either, right? All of the API server processes can see all GPUs, while each model runner process only sees one of them.
That's correct. This specifically supports DP=1, TP=1 at this point. We can support DP only via containers or similar instancing mechanisms, and TP>1 would require some kind of broadcast mechanism in the future. Let me update the RFC to contrast the approaches discussed so far, both internally and in the thread. For simultaneously supporting single-GPU use and scaling via instancing, this currently appears to be one of the better options immediately available.
assert self.aux_buffers is not None

# Check if this is a CUDA tensor and we have queues available
if (
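The branch in this hunk can be sketched in isolation like so (a toy model, not vLLM's actual serializer; `FakeTensor`, the queue, and the return encodings are all illustrative stand-ins):

```python
from dataclasses import dataclass


@dataclass
class FakeTensor:
    device: str  # "cuda" or "cpu"


class Serializer:
    """Toy model of the custom-serialization branch under discussion."""

    def __init__(self, tensor_queue=None):
        # tensor_queue stands in for the zero-copy IPC queue; aux_buffers
        # is the existing fallback side-channel for tensor payloads.
        self.tensor_queue = tensor_queue
        self.aux_buffers: list[FakeTensor] = []

    def encode(self, t: FakeTensor) -> str:
        # CUDA tensor and a queue is available -> hand over zero-copy
        if t.device == "cuda" and self.tensor_queue is not None:
            self.tensor_queue.append(t)
            return f"queue:{len(self.tensor_queue) - 1}"
        # Otherwise fall back to the aux-buffer path
        self.aux_buffers.append(t)
        return f"aux:{len(self.aux_buffers) - 1}"
```

The receiving side would resolve `queue:`-style placeholders back into tensors popped from the shared queue.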
Have you tried using the tensor queue for CPU tensors as well? I wonder whether it's any better than the previous implementation
I have not, though honestly I think it would work well. If any processing on the frontend ends in PyTorch tensors, it could presumably use this pathway. To limit unintended side effects, I've scoped it down to what you see here. As a follow-up we could do that testing and relax the restriction to all tensors if it proves safe and faster.
> If any processing on the frontend ends in pytorch tensors
Basically all of the HF processors return a dictionary of tensors
If you can demonstrate that the tensor queue works well on CPU tensors, perhaps we could implement that separately first before the main content of this PR
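A quick way to get that data would be a micro-benchmark comparing a pickle round-trip against a shared-memory handover for large CPU arrays (a rough stand-in for sending CPU tensors through the tensor queue; numpy is used instead of torch to keep the sketch dependency-light, and the timings are only indicative):

```python
import pickle
import time

import numpy as np
from multiprocessing import shared_memory

arr = np.random.rand(16, 3, 224, 224).astype(np.float32)

# Baseline: serialize + deserialize through pickle (two full copies).
t0 = time.perf_counter()
restored = pickle.loads(pickle.dumps(arr))
pickle_s = time.perf_counter() - t0

# Shared-memory handover: one copy in; a consumer process would map the
# same segment by name and read it with no further copies.
shm = shared_memory.SharedMemory(create=True, size=arr.nbytes)
t0 = time.perf_counter()
view = np.ndarray(arr.shape, dtype=arr.dtype, buffer=shm.buf)
view[:] = arr
shared_s = time.perf_counter() - t0

roundtrip_ok = np.array_equal(restored, arr)
shared_ok = np.array_equal(view, arr)
del view  # release the exported buffer before closing the segment
shm.close()
shm.unlink()
print(f"pickle: {pickle_s * 1e3:.2f} ms, shared-memory copy: {shared_s * 1e3:.2f} ms")
```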
Force-pushed from d238030 to 85f4f56 (Compare)
This pull request has merge conflicts that must be resolved before it can be merged.
NOT (yet) FOR GENERAL REVIEW
Introduces HW-accelerated video decode on NVIDIA GPUs, which lowers the CPU-utilization bottleneck caused by CPU-based decoders under high concurrency. As noted in the RFC, this is especially helpful when video decode is a significant portion of a request's total lifetime, i.e. when the target output token count is small, as in video captioning.
This is a draft PR for communicating implementation details in the vLLM slack and needs a bit more work before a true review takes place. This code was used to generate the data in the RFC.
This PR implements (details in the RFC):
NOTE: This implementation lacks a small safeguard mechanism, as it is only valid in single-GPU DP=TP=1 cases. There are still several meaningful use cases, for example where scaling on a node is achieved by running multiple containers, each with one GPU/vLLM instance.
Purpose
See RFC.
Test Plan
N/A, not for review.
Test Result
N/A, not for review. Benchmarking results are in the RFC.
Essential Elements of an Effective PR Description Checklist
`supported_models.md` and `examples` for a new model.