
Add LLaVa NeXT Video #31252

Merged
zucchini-nlp merged 24 commits into huggingface:main from zucchini-nlp:llava_next_video
Jun 26, 2024

Conversation

Member

@zucchini-nlp zucchini-nlp commented Jun 5, 2024

What does this PR do?

Adds a new model for video LLMs, built by tuning LLaVA-NeXT on a video dataset; it is the current SOTA for videos on the VideoMME benchmark.

This PR introduces some changes to how we usually handle interleaved multimodal data. I need @amyeroberts' opinion on whether it's a viable option for these kinds of models. I added a separate VideoProcessor class, which has all the same parameters as the ImageProcessor but different logic. The reason: the image processor file was already full of extra helper functions, and adding more conditions seemed to degrade readability.

Also, I added a chat template; @NielsRogge, can you verify that it's correct? 😄 (it's in convert_model.py). I will push all templates to the Hub later; currently only the 7B model has a working template.
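Conceptually, the template boils down to something like this toy sketch (the exact prompt format here is my assumption, not the shipped Jinja template; each user turn gets a video placeholder token):

```python
# Hypothetical stand-in for the chat template: prefix user turns with a
# <video> placeholder and close with an assistant turn marker. The real
# template ships with the checkpoints via convert_model.py.
def build_prompt(messages, video_token="<video>"):
    parts = []
    for msg in messages:
        if msg["role"] == "user":
            parts.append(f"USER: {video_token}\n{msg['content']}")
        else:
            parts.append(f"ASSISTANT: {msg['content']}")
    parts.append("ASSISTANT:")
    return " ".join(parts)

print(build_prompt([{"role": "user", "content": "Why is this video funny?"}]))
```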

Fixes #31164

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 5, 2024 07:51
@zucchini-nlp zucchini-nlp changed the title Add LLaVa NeXT Video [WIP] Add LLaVa NeXT Video Jun 5, 2024
@zucchini-nlp zucchini-nlp marked this pull request as draft June 5, 2024 07:52
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment

the import needs to be like this I believe

@zucchini-nlp zucchini-nlp marked this pull request as ready for review June 7, 2024 08:04
@zucchini-nlp
Member Author

ready for review!

@zucchini-nlp zucchini-nlp changed the title [WIP] Add LLaVa NeXT Video Add LLaVa NeXT Video Jun 7, 2024
Contributor

@amyeroberts amyeroberts left a comment

Thanks for all the work adding this model!

Most comments are about copied from and diff file choices

Contributor

To be consistent with other models, this should really be LlavaNextVideoImageProcessor.

I'd say we should do this for this model, and then in a separate PR we can consider introducing a VideoProcessor class, which we might want to use for new models and as a possible alias for the image processor of the other video models.

The outstanding question would be for Video-LLaVA, which takes both images and videos. Given the model, I'd say it should probably be a VideoProcessor.

Member Author

Okay, but this model also accepts both as input. Right now it uses the plain LlavaNext image processor for images and adds a new class, LlavaNextVideoImageProcessor, for video processing. So for VideoLlava we can separate it into two classes in the same way: one for image and another for video processing.

The only major change is that the arg name for images will be pixel_values instead of pixel_values_images if we do so.

I will call this one LlavaNextVideoImageProcessor then and try to unify video-image models in a separate PR.
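The split being discussed can be sketched like this (class names follow the PR; the bodies and output keys are hypothetical, only to show one processor per modality so the helpers stay in separate files):

```python
# Toy sketch: separate processor classes for images vs. videos, each
# returning its own key, so modality-specific helpers don't mix.
class LlavaNextImageProcessor:
    def __call__(self, images):
        # images: a flat list of frames/arrays
        return {"pixel_values": [f"resized({im})" for im in images]}

class LlavaNextVideoImageProcessor:
    def __call__(self, videos):
        # videos: a list of clips, each clip a list of frames
        return {
            "pixel_values_videos": [
                [f"resized({frame})" for frame in clip] for clip in videos
            ]
        }

out = LlavaNextVideoImageProcessor()(videos=[["f0", "f1"]])
print(sorted(out))  # ['pixel_values_videos']
```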

Contributor

@NielsRogge NielsRogge left a comment

Some initial comments, in particular the diff should probably be entirely defined in diff_llava_next_video.py, which then populates the config and modeling files

@zucchini-nlp
Member Author

zucchini-nlp commented Jun 11, 2024

@amyeroberts ready for re-review

  • Diff was not working and wasn't copying content from the parent class (LlavaNext). I made some changes in the diff-converter file, so it would be nice to get @ArthurZucker's review on that. However, diff still doesn't work when we have to overwrite the docstring for the config, so I simply made my own new config file. I also tried diff-converting example2; it still doesn't copy the docstring and I couldn't solve it yet.
  • VideoProcessor is gone, so now we just have two ImageProcessor classes, one per modality, simply to isolate unrelated code into two files.
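For readers unfamiliar with the diff-converter idea, here is a minimal conceptual stand-in (this is an illustration, not the real converter; the method names are examples): start from the parent model's methods and overwrite only those the diff file redefines.

```python
# Conceptual sketch of what the diff converter produces: the parent class's
# methods are copied, then the diff class's overrides are applied on top.
class LlavaNextParent:
    def forward(self):
        return "image-forward"

    def pack_image_features(self):
        return "shared-helper"

class LlavaNextVideoDiff:
    def forward(self):
        return "video-forward"

merged = {k: v for k, v in vars(LlavaNextParent).items() if callable(v)}
merged.update({k: v for k, v in vars(LlavaNextVideoDiff).items() if callable(v)})
Generated = type("LlavaNextVideoGenerated", (), merged)

print(Generated().forward())              # video-forward (overridden by the diff)
print(Generated().pack_image_features())  # shared-helper (copied from the parent)
```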

Collaborator

@ArthurZucker ArthurZucker left a comment

🔥 kudos for using the diff file so well

Comment on lines 335 to 337
Collaborator

So you mean original_methods.items() is overriding some things?

Otherwise the goal is indeed to overwrite the methods using the func.with_changes!

Member Author

This happens only when the original class has two methods with identical names, in my case a property and its setter. If we use a dict, only one is retained, since a dict can't hold two entries with the same key.

Not sure if there are cases where directly iterating might cause errors 🤔
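The collision can be reproduced in a few lines (the config class below is a made-up example): a property and its setter are two function defs with the same name, so collecting parsed methods into a dict silently drops one, while iterating over the defs directly keeps both.

```python
import ast

# A property and its setter share a name; keying by name loses one of them.
src = '''
class DummyConfig:
    @property
    def vocab_size(self):
        return self._vocab_size

    @vocab_size.setter
    def vocab_size(self, value):
        self._vocab_size = value
'''

cls = ast.parse(src).body[0]
defs = [n for n in cls.body if isinstance(n, ast.FunctionDef)]
print(len(defs))                       # 2 -- both the property and the setter
print(len({n.name: n for n in defs}))  # 1 -- the dict kept only one
```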

Collaborator

Hahaha, I see!
Then it means we are not taking the decorator into account. Indeed, that is a good catch and was probably affecting @younesbelkada in his Cohere porting.

Contributor

@amyeroberts amyeroberts left a comment

Wow - looks great! 🔥

Some comments relating to the diff logic and the generated model file, but overall looks great

Contributor

@ArthurZucker I can foresee hitting potential issues when e.g. we have more than one modeling file, for example the models under data2vec.

Seeing the config file here made me realise this. Rather than trying to handle which imports belong to which file (e.g. configuration_llava_next vs. modeling_llava_next), we could make this simpler (and less error-prone) by having a diff file for each, e.g. diff_modeling_llava_next_video.py and diff_configuration_llava_next_video.py.

Collaborator

I agree, but there should be an easy way to solve the imports: basically paste all imports everywhere and remove what is not used.
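That pruning step could look roughly like this (an assumption on my part: usage is detected via ast.Name references, which ignores corner cases such as names used only inside strings or __all__):

```python
import ast

# Paste-all-then-prune: keep an import only if the bound name is actually
# referenced somewhere in the module body.
source = '''
import math
import os

def area(r):
    return math.pi * r * r
'''

tree = ast.parse(source)
used = {n.id for n in ast.walk(tree) if isinstance(n, ast.Name)}
kept = [
    alias.name
    for node in tree.body if isinstance(node, ast.Import)
    for alias in node.names
    if (alias.asname or alias.name) in used
]
print(kept)  # ['math'] -- 'os' is pruned as unused
```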

Collaborator

Keeping the config and tokenizer and whatever else in a single diff file for now will be simpler IMO!

Contributor

I don't think we want to copy across the copied from statements from the original file 👀

Member Author

@zucchini-nlp zucchini-nlp Jun 13, 2024

hehe, this is diff-generated, but I'll see if that can be fixed + ping @ArthurZucker for visibility

UPD: Found how to remove those and can push a commit if needed. Btw, make fix-copies currently doesn't check diff files, so imo the "Copied from" statements are needed to catch and propagate code changes

Collaborator

Yep, we could/should add a check into make fix-copies to compare the hash of the file against the hash of the diff-converter-generated file.
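The proposed check amounts to something like this (the hook into make fix-copies is hypothetical; the file contents below are placeholders): regenerate the file from the diff, then compare content hashes to catch manual edits that bypassed the diff.

```python
import hashlib

# Compare the diff-converter's output against the committed file by hash;
# a mismatch would mean someone edited the generated file directly.
def sha256_of(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

generated = "class LlavaNextVideoModel:\n    ...\n"  # diff-converter output
on_disk   = "class LlavaNextVideoModel:\n    ...\n"  # file in the repo

in_sync = sha256_of(generated) == sha256_of(on_disk)
print(in_sync)  # True -- a False here would fail the check
```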

Member Author

@zucchini-nlp zucchini-nlp Jun 20, 2024

I'll leave this one for the next PR (to whoever wants to add the checks) as it's not related to LLaVa-NeXT :)

Edit: will open an issue to not forget

Contributor

Do we need to define all_generative_models here?

Member Author

Same as in another PR: VLMs cannot use GenerationTesterMixin out of the box due to pixel_values, custom generation code, or generation-related code partially living in the modeling file (e.g. expand_attn_mask). It's on my todo list, but it requires more unification of VLMs first.

@zucchini-nlp
Member Author

@amyeroberts this one is also ready for the last review

Contributor

@amyeroberts amyeroberts left a comment

Great work - thanks for adding the model and for testing the diff converter!

@amyeroberts
Contributor

@zucchini-nlp The only thing before merge is to run the slow tests for the model with a [run_slow] llava-next-video commit message

zucchini-nlp and others added 3 commits June 21, 2024 21:04
…_video.py

Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
Co-authored-by: amyeroberts <22614925+amyeroberts@users.noreply.github.com>
@zucchini-nlp
Member Author

Hmm, I am not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

@amyeroberts
Contributor

Hmm, I am not sure the slow tests are working; they were skipped here and in BlipVideo. I think the commit message format is correct.

Ah, we didn't have the run-slow label on the PR, which might have caused it to be skipped. @zucchini-nlp Could you try again?

@amyeroberts amyeroberts mentioned this pull request Jun 24, 2024
@ydshieh ydshieh force-pushed the llava_next_video branch from b0433ed to bc51d0c Compare June 26, 2024 07:19
@zucchini-nlp zucchini-nlp merged commit e71f286 into huggingface:main Jun 26, 2024
Successfully merging this pull request may close these issues.

LLaVA-NeXT-Video support