[model] Support for Llava-Next-Video model #7559

Merged (29 commits) on Sep 11, 2024

Conversation

TKONIY
Contributor

@TKONIY TKONIY commented Aug 15, 2024

Roadmap

  • Add VideoPlugin to MultiModalPlugin.
  • LLM.generate API for a single video input.
    LLM.generate({
        "prompt": "<video> please summarize this video",
        "multi_modal_data": {
            "video": video  # currently only np.ndarray is supported
        }
    })
  • Support LlavaNextVideoForConditionalGeneration model with single video input.
    [15 Aug. Update] Waiting for the configuration file in hugging-face to be fixed. https://huggingface.co/llava-hf/LLaVA-NeXT-Video-7B-hf/discussions/4
  • Add example for llava-next-video.
  • Support all the video input types that transformers accepts:
    VideoInput = Union[
        List["PIL.Image.Image"], # Supported
        "np.ndarray",
        "torch.Tensor",
        List["np.ndarray"],
        List["torch.Tensor"],
        List[List["PIL.Image.Image"]],
        List[List["np.ndarray"]],
        List[List["torch.Tensor"]],
    ]
  • Support multiple image-video-mixed input.
  • Support Siglip.
  • Support Chat Completion APIs.
  • Support prefix caching
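
For anyone who wants to try the single-video API from the roadmap, here is a minimal sketch of assembling the request dict. `build_video_request` and `VIDEO_PLACEHOLDER` are hypothetical helpers, not part of vLLM, and the exactly-one-placeholder check is an assumption based on this PR supporting a single video per prompt.

```python
from typing import Any, Dict

# Placeholder symbol used in the roadmap example above.
VIDEO_PLACEHOLDER = "<video>"


def build_video_request(prompt: str, video: Any) -> Dict[str, Any]:
    """Assemble the dict shape shown in the roadmap for one video input.

    Hypothetical helper: rejects prompts that do not contain exactly one
    "<video>" placeholder, since only single-video input is in scope here.
    """
    count = prompt.count(VIDEO_PLACEHOLDER)
    if count != 1:
        raise ValueError(
            f"expected exactly one {VIDEO_PLACEHOLDER!r} in the prompt, found {count}"
        )
    return {"prompt": prompt, "multi_modal_data": {"video": video}}


# Usage sketch (needs vLLM, NumPy, and the model weights, so it is left
# commented out; the array layout is an assumption):
#
#   import numpy as np
#   from vllm import LLM
#
#   llm = LLM(model="llava-hf/LLaVA-NeXT-Video-7B-hf")
#   video = np.zeros((8, 336, 336, 3), dtype=np.uint8)  # assumed (frames, H, W, C)
#   outputs = llm.generate(
#       build_video_request("<video> please summarize this video", video)
#   )
```

The helper only validates the prompt and wraps the data; decoding a video file into frames is out of scope for this sketch.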

Related

This PR is in the roadmap of RFC #7558

FIX #5124
FIX #6571

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs do not trigger a full CI run by default. Instead, only the fastcheck CI runs, which consists of a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of the default ones by unblocking the steps in your fast-check build on the Buildkite UI.

Once the PR is approved and ready to go, please make sure to run full CI as it is required to merge (or just use auto-merge).

To run full CI, you can do one of these:

  • Comment /ready on the PR
  • Add ready label to the PR
  • Enable auto-merge.

🚀

@DarkLight1337
Member

This is an exciting development! This should be the last of the common modality types to cover for now.

@DarkLight1337 DarkLight1337 self-assigned this Aug 15, 2024
@TKONIY TKONIY force-pushed the llava-next-video branch 2 times, most recently from 2beafbe to d6c783b Compare August 19, 2024 18:53
@TKONIY TKONIY marked this pull request as ready for review August 19, 2024 18:54
@TKONIY
Contributor Author

TKONIY commented Aug 19, 2024

@DarkLight1337 @ywang96 The initial support for Llava-Next-Video is done, which is also the first support for video in vLLM. Could you help review this PR?

It now supports integrating a single video into a prompt with the "<video>" symbol. I am happy to tell you that compared to SGLang which fixes the number of input frames when launching the model, vLLM now has a stronger implementation that supports any number of input frames per video.

@ywang96
Member

ywang96 commented Aug 19, 2024

This is great and thank you very much for making this contribution! @TKONIY

I will review this PR either today or tomorrow!

@TKONIY
Contributor Author

TKONIY commented Aug 20, 2024

Thank you!

@TKONIY
Contributor Author

TKONIY commented Aug 21, 2024

Dear @ywang96, are you available today to review this PR?

@ywang96
Member

ywang96 commented Aug 21, 2024

@TKONIY Reviewing it now!

Member

@ywang96 ywang96 left a comment

Thank you for the PR @TKONIY! It seems that this PR only focuses on the offline inference, is that correct? OpenAI Vision API does support video frames as input, so we should make changes to our frontend to support the end-to-end online inference. (This can be in a later PR if you plan to work on it as well)

Overall I think the PR is on the right track! I left a few comments as the first round of the review, mostly regarding code cleanups. Please take a look, thanks!

tokens_per_frame = get_llava_next_video_frame_feature_size(hf_config)
video_feature_size = frames_per_video * tokens_per_frame

if isinstance(vision_config, CLIPVisionConfig):
Contributor

I think LLaVA models should enable the Siglip vision encoder.

Contributor Author

You are right! It seems not complicated, but I haven't had time to verify the Siglip implementation. I would like to leave this to other PRs.

Contributor

I am excited about it. Many of our models use SigLIP. BTW, I am currently using your PR to verify our video models.

Member

Feel free to take a look at our llava-next implementation, where we allow both CLIP and SIGLIP to be the vision tower

Contributor Author

Thanks! I saw it in Llava Next implementation. I'll try.

Contributor Author

@litianjian @ywang96 I tried to integrate SIGLIP like llava-next did, but I am not confident about its correctness. It would be nice if you could check my latest commits.
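
The hunk under review computes `video_feature_size = frames_per_video * tokens_per_frame`. As a rough sketch of that arithmetic, assuming `tokens_per_frame` is the square patch grid left after spatial pooling: the helper names below and the numbers 336, 14, and 2 are illustrative assumptions, not values read from this PR or from the model config.

```python
def tokens_per_frame(image_size: int, patch_size: int, spatial_pool_stride: int) -> int:
    """Tokens contributed by one frame, under the assumptions stated above."""
    # The vision encoder splits a square frame into a grid of patches.
    patches_per_side = image_size // patch_size
    # Spatial pooling (assumed) shrinks that grid along each side.
    pooled_per_side = patches_per_side // spatial_pool_stride
    return pooled_per_side * pooled_per_side


def video_feature_size(frames_per_video: int, per_frame: int) -> int:
    """Same product as in the hunk: frames_per_video * tokens_per_frame."""
    return frames_per_video * per_frame


# Illustrative numbers only: 336 // 14 = 24 patches per side, pooled to 12,
# so 12 * 12 = 144 tokens per frame and 8 * 144 = 1152 tokens for 8 frames.
per_frame = tokens_per_frame(image_size=336, patch_size=14, spatial_pool_stride=2)
print(per_frame, video_feature_size(8, per_frame))
```

Because the frame count is just a multiplier, the same code path covers videos of any length, which matches the "any number of input frames" claim earlier in the thread.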

@TKONIY
Contributor Author

TKONIY commented Aug 22, 2024

Thanks for your review. It would be exciting if video input were supported by the chat completion APIs. I have read the code of our front end and I think it requires changes to some interfaces, so I did not implement it in this PR. I may open an RFC to discuss it if I get time.

@TKONIY
Contributor Author

TKONIY commented Aug 23, 2024

@ywang96 Thanks for your review! I have resolved most of the comments except max_llm_tokens. Please have a check. Thank you!

@ywang96
Member

ywang96 commented Aug 23, 2024

Thank you for addressing my comments! I have replied regarding your question as well!

Could you add a correctness test for this model in tests/models, like for all other models we have?

@TKONIY
Contributor Author

TKONIY commented Aug 26, 2024

I am working on it. But I am busy with a conference this week. So the progress would not be fast. Sorry.

@ywang96
Member

ywang96 commented Aug 26, 2024

@TKONIY No worries at all and I can work on adding the test too if you don't mind. Thank you for the contribution!

@TKONIY
Contributor Author

TKONIY commented Aug 31, 2024

I am back and working on it now.

@ywang96
Member

ywang96 commented Aug 31, 2024

Hey @TKONIY! Sounds good - I was actually just working on this branch but still ran into this error: ValueError: The checkpoint you are trying to load has model type `llava_next_video` but Transformers does not recognize this architecture. This could be because of an issue with the checkpoint, or because your version of Transformers is out of date. Are you also seeing this in your dev environment?

Also FYI, #8049 should be merged pretty soon in case you also want to add the frontend support.

@TKONIY
Contributor Author

TKONIY commented Aug 31, 2024

There is an error in the newest release of the Transformers library. I fixed it in this commit huggingface/transformers@a27182b so that vLLM can correctly load the model config. As a result, vLLM can only run this model with a Transformers build that includes this commit, which has not been released yet.
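
To find out up front whether the installed Transformers knows the `llava_next_video` model type, a small check like the following may help. This is only a sketch: `transformers_recognizes` is a hypothetical helper, and it only tests whether the model type is registered in `transformers.CONFIG_MAPPING`; it cannot tell whether the config fix from the commit above is included.

```python
from importlib import import_module
from typing import Optional


def transformers_recognizes(model_type: str) -> Optional[bool]:
    """Return True/False if Transformers is installed, or None if it is not.

    Hypothetical helper: looks the model type up in the auto-config registry,
    which is what raises the "does not recognize this architecture" error.
    """
    try:
        transformers = import_module("transformers")
    except ImportError:
        return None
    return model_type in transformers.CONFIG_MAPPING


status = transformers_recognizes("llava_next_video")
if status is None:
    print("Transformers is not installed")
elif status:
    print("llava_next_video is registered in this Transformers build")
else:
    print("llava_next_video is unknown here; a newer Transformers build is needed")
```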

video_pixels = inputs["data"]

if isinstance(video_pixels, torch.Tensor):
    b, num_frames, c, h, w = video_pixels.shape
Member

@ywang96 ywang96 Aug 31, 2024

For some reason the shape of video_pixels has an additional dimension between batch size and num_frames so I had to change this line to b, _, num_frames, c, h, w = video_pixels.shape to make this PR work with the main branch of transformers

Member

I think this is due to the recent update to batched inputs, which introduced an additional level to the tensor shape where the second dimension now refers to the number of multimodal inputs (here, the number of videos).

Contributor Author

I think it is because of the support for multiple multimodal inputs. My PR doesn't support multiple videos and images per input yet, so I will leave it as b, _, num_frames, c, h, w for now.
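
The two shapes discussed in this thread can be handled in one place. Below is a minimal sketch, assuming the 5-D layout is (batch, frames, channels, height, width) and the 6-D layout adds a number-of-videos dimension in second position, as described above. `unpack_video_shape` is a hypothetical helper, not code from this PR.

```python
from typing import Sequence, Tuple


def unpack_video_shape(shape: Sequence[int]) -> Tuple[int, int, int, int, int, int]:
    """Return (batch, num_videos, num_frames, channels, height, width).

    Accepts the older 5-D shape and the newer 6-D shape; a 5-D input is
    treated as one video per batch item.
    """
    if len(shape) == 5:
        b, num_frames, c, h, w = shape
        return b, 1, num_frames, c, h, w
    if len(shape) == 6:
        b, num_videos, num_frames, c, h, w = shape
        return b, num_videos, num_frames, c, h, w
    raise ValueError(f"expected a 5-D or 6-D video shape, got {len(shape)} dimensions")


# torch.Size behaves like a tuple, so video_pixels.shape could be passed directly.
print(unpack_video_shape((2, 8, 3, 336, 336)))
print(unpack_video_shape((2, 1, 8, 3, 336, 336)))
```

Returning the video count explicitly, instead of discarding it with `_`, would make it easier to lift the single-video restriction later.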

@DarkLight1337
Member

DarkLight1337 commented Sep 10, 2024

Can you check the build failure for AMD? Other PRs don't have this problem so it is likely caused by your changes.

@TKONIY
Contributor Author

TKONIY commented Sep 10, 2024

I will check, thank you.

@DarkLight1337 DarkLight1337 added the ready ONLY add when PR is ready to merge/full CI is needed label Sep 10, 2024
@DarkLight1337 DarkLight1337 enabled auto-merge (squash) September 11, 2024 03:35
@youkaichao youkaichao disabled auto-merge September 11, 2024 05:21
@youkaichao youkaichao merged commit 6a512a0 into vllm-project:main Sep 11, 2024
70 of 72 checks passed
dtrifiro pushed a commit to opendatahub-io/vllm that referenced this pull request Sep 12, 2024
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
@PancakeAwesome

So can we now use the main branch for video inference with qwenvl2, internvl2, and other multi-modal models through the vLLM OpenAI-compatible server?

@DarkLight1337
Member

No, OpenAI API support is not out yet (unless you consider multi-image input).

@PancakeAwesome

So you mean that the openai api must support video input?

@DarkLight1337
Member

I mean that for now, you can pass in multiple images as a "video" to the model. Explicit video support will be added later as mentioned in #7558.

Alvant pushed a commit to compressa-ai/vllm that referenced this pull request Oct 26, 2024
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Alvant <[email protected]>
garg-amit pushed a commit to garg-amit/vllm that referenced this pull request Oct 28, 2024
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: Amit Garg <[email protected]>
LeiWang1999 pushed a commit to LeiWang1999/vllm-bitblas that referenced this pull request Mar 26, 2025
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Signed-off-by: LeiWang1999 <[email protected]>
Labels
ready ONLY add when PR is ready to merge/full CI is needed
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Model]: Llava-Next-Video support [New Model]: LLaVA-NeXT-Video support
6 participants