-
-
Notifications
You must be signed in to change notification settings - Fork 7.1k
[model] Support for Llava-Next-Video model #7559
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge). To run full CI, you can do one of these:
🚀 |
c27fd58
to
d95701b
Compare
This is an exciting development! This should be the last of the common modality types to cover for now. |
2beafbe
to
d6c783b
Compare
393f206
to
3179393
Compare
@DarkLight1337 @ywang96 The initial support for Llava-Next-Video is done, which is also the first support for video in vLLM. Could you help review this PR? It now supports integrating a single video into a prompt with the "<video>" symbol. I am happy to tell you that compared to SGLang which fixes the number of input frames when launching the model, vLLM now has a stronger implementation that supports any number of input frames per video. |
This is great and thank you very much for making this contribution! @TKONIY I will review this PR either today or tomorrow! |
Thank you! |
12f4d45
to
d4a0112
Compare
Dear @ywang96, are you available today to review this PR? |
@TKONIY Reviewing it now! |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you for the PR @TKONIY! It seems that this PR only focuses on the offline inference, is that correct? OpenAI Vision API does support video frames as input, so we should make changes to our frontend to support the end-to-end online inference. (This can be in a later PR if you plan to work on it as well)
Overall I think PR is in the right track! I left a few comments as the first round of the review, mostly regarding the code cleanups. Please take a look, thanks!
e28f0b6
to
65dd96a
Compare
tokens_per_frame = get_llava_next_video_frame_feature_size(hf_config) | ||
video_feature_size = frames_per_video * tokens_per_frame | ||
|
||
if isinstance(vision_config, CLIPVisionConfig): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think LLaVA models should enable the Siglip vision encoder.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right! It seems not complicated, but I haven't had time to verify the Siglip implementation. I would like to leave this to other PRs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am excited about it. Many of our models use the siglip. BTW, I am currently using your PR to verify our video models.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Feel free to take a look at our llava-next implementation, where we allow both CLIP and SIGLIP to be the vision tower
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks! I saw it in Llava Next implementation. I'll try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@litianjian @ywang96 I tried to integrate SIGLIP like llava-next
did. But I am not confident about its correctness. If would be nice if you could have a check on my latest commits.
Thanks for your review. It would be exciting if video input is supported by chat completion APIs. I have read the code of our front end and I think it requires changes to some interfaces, so I did not implement it in this PR. I may open an RFC to discuss it if I got time. |
@ywang96 Thanks for your review! I have resolved most of the comments except |
Thank you for addressing my comments! I have replied regarding your question as well! Could you add a correctness test for this model in |
I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry. |
@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution! |
I am back and working on it now. |
Hey @TKONIY! Sounds good - I was actually just working on this branch but still run into the issue of Also FYI, #8049 should be merged pretty soon in case you also want to add the frontend support. |
There is an error in the newest release version of the Transformers library. I fixed them with this commit huggingface/transformers@a27182b so that vLLM can correctly load the model config. Therefore, vLLM could only run with the Transformers library after this commit, which has not been released. |
video_pixels = inputs["data"] | ||
|
||
if isinstance(video_pixels, torch.Tensor): | ||
b, num_frames, c, h, w = video_pixels.shape |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For some reason the shape of video_pixels has an additional dimension between batch size and num_frames so I had to change this line to b, _, num_frames, c, h, w = video_pixels.shape
to make this PR work with the main branch of transformers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this is due to the recent update to batched inputs, which introduced an additional level to the tensor shape where the second dimension now refers to the number of multimodal inputs (here, the number of videos).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is because of the multi-multimodal input supports. My PR doesn't support multiple video and images per input yet. So I will just leave it as b, _, num_frames, c, h, w
yet.
46960dc
to
925c49e
Compare
44e239b
to
6d481dd
Compare
Can you check the build failure for AMD? Other PRs don't have this problem so it is likely caused by your changes. |
I will check, thank you. |
Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]>
So we can now use the main branch to implement qwenvl2, internvl2 and other multi-modal models of openai vllm server video inference? |
No, OpenAI API support is not out yet (unless you consider multi-image input). |
So you mean that the openai api must support video input? |
I mean that for now, you can pass in multiple images as a "video" to the model. Explicit video support will be added later as mentioned in #7558. |
Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Alvant <[email protected]>
Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: Amit Garg <[email protected]>
Co-authored-by: Roger Wang <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Co-authored-by: Cyrus Leung <[email protected]> Signed-off-by: LeiWang1999 <[email protected]>
Roadmap
VideoPlugin
toMultiModalPlugin
.LLM.generate
API for a single video input.LlavaNextVideoForConditionalGeneration
model with single video input.[15 Aug. Update] Waiting for the configuration file in hugging-face to be fixed. https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf/discussions/4
Related
This PR is in the roadmap of RFC #7558
FIX #5124
FIX #6571