Add video modality for InstructBLIP #30182
Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
src/transformers/models/instructblip/image_processing_instructblip.py
This can perhaps also be Copied from
Okay, let me try. Then I'll probably have to add "copy ignore" on the preprocess method.
No, that's only if you add Copied from to the class. In that case you can add "Ignore copy" above methods that you don't want to copy.
Wow that's awesome, thanks for working on that!
I'm just concerned about 2 things:
- making sure that we have a robust API for multimodal processors that is consistent
- the current InstructBLIP models on the hub all use BlipImageProcessor. This PR would introduce a new image processor, so I guess we would then need to update the auto mapping to make sure AutoImageProcessor still works as expected.
cc @molbap since we'd like to standardize multimodal processors, this one isn't making it easier 😅
at some point we will have a VideoTextToTextPipeline, and we'll need to make sure they all have the same API.
See also the ImageTextToTextPipeline, which is being worked on in #29572. Although technically it could work if we just expect the following to work for any video-text-to-text model:
inputs = processor(videos=..., text=..., return_tensors="pt")
generated_ids = model.generate(**inputs, max_new_tokens=20)
generated_text = processor.batch_decode(generated_ids, skip_special_tokens=True)
Sure! Re normalizing processors: for models taking video inputs, vivit, videomae and tvlt have videos: ImageInput, but tvp has videos: Union[ImageInput, List[ImageInput], List[List[ImageInput]]]. However, x_clip reuses videomae's processor.
Overall, ImageInput is defined as a type union of images or a list of images. It looks like in the future we might prefer supporting at least a list of lists of images, so a VideoInput defined as such could make sense, or a union of types as done in x_clip.
Unfortunately, video llava processing is going a different way to easily be able to interleave modalities. I guess that is part of what is being discussed internally in Slack.
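A VideoInput along these lines could be sketched as below. This is a hedged sketch only: the Frame alias and the make_batched helper are illustrative of the tvp-style nesting discussed above, not the actual transformers definitions.

```python
from typing import List, Union

import numpy as np

Frame = np.ndarray  # one decoded frame, e.g. (height, width, channels)
# either a single video (a list of frames) or a batch of videos
VideoInput = Union[List[Frame], List[List[Frame]]]


def make_batched(videos: VideoInput) -> List[List[Frame]]:
    """Normalize a single video into a batch of one video."""
    if len(videos) > 0 and isinstance(videos[0], list):
        return videos  # already a batch of videos
    return [videos]


clip = [np.zeros((224, 224, 3)) for _ in range(4)]  # one 4-frame video
batch = make_batched(clip)
assert len(batch) == 1 and len(batch[0]) == 4
```

A normalization step like this lets the processor accept both single videos and batches without widening every downstream signature.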
Yep, seems like for now it works only when calling InstructBlipImageProcessor specifically.
amyeroberts left a comment
Thanks for working on adding this capability!
Two general comments:
- We have to be careful here with the mappings wrt backwards compatibility and expected behaviour. As a user, I should be able to do:
image_processor = InstructBlipImageProcessor()
images = image_processor(images, return_tensors="pt")
and get exactly the same output as I was getting before with the blip image processor.
- We should avoid adding lots of if-statements in existing modeling code and instead add a new model.
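The backwards-compatibility expectation in the first point can be phrased as a small check. This is a sketch only; outputs_match is a hypothetical helper, not part of the PR.

```python
import numpy as np


def outputs_match(old_pixel_values: np.ndarray, new_pixel_values: np.ndarray) -> bool:
    """A new InstructBlip image processor should reproduce the old Blip output:
    same shape (no extra frames axis) and identical pixel values."""
    return (
        old_pixel_values.shape == new_pixel_values.shape
        and bool(np.allclose(old_pixel_values, new_pixel_values))
    )


old = np.random.rand(1, 3, 224, 224)        # (batch, channels, height, width)
assert outputs_match(old, old.copy())       # identical output: OK
assert not outputs_match(old, old[:, None]) # extra frames axis: breaks the contract
```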
The addition of all the if statements here indicates this should really be split into a separate model. For things like torch.compile we really want to avoid models accepting inputs with a varying number of dimensions, e.g. adding InstructBLIPForVideoQuestionAnswering instead.
This is going to break things. Either:
- We add the new image processor to the auto map for existing checkpoints. This might lead to differences in output as the image processor used is different. AFAICT the processing steps are the same but the output shape isn't.
- We load the old image processors, which will break or emit warnings with the videos input.
Is it ever going to support both? Otherwise this message can be misleading.
If we're adding an image processor for InstructBlip, and it can process images, then it should process images consistently with how they were processed by the previous image processor (blip).
Whilst the processing steps are the same, i.e. the processed image is consistent, the output shape won't be, because this will add an extra axis to the output images, i.e. they become (batch_size, num_frames, num_channels, height, width). Instead, we should keep the same output shape for images; this will allow us to add this to the image_processing_auto.py mapping for instruct blip.
@amyeroberts What if:
@amyeroberts ping
@zucchini-nlp one related thing that I'll merge this week, regarding processors: #30511 I added a
@zucchini-nlp Sorry for the late reply here.
This is a neat solution. My main question would be: what does this do to the shapes of the returned tensors? If they're not the same, then we'll need to add a new class in the modeling file. In terms of the image processor:
My vote would be for a new model. Having to handle video and images within the same image processor is a good indication they should be separated.
Thanks for the detailed explanations! At this point it should be possible to go with the first option; I will have to go back and check. If not, making a separate model sounds good, since BLIP will never work with both modalities at the same time. @molbap I see, probably having separate VideoProcessor files would be a better solution for multi-modal models, instead of packing it all into the ImageProcessor.
I went for the "let's keep one set of processing and modeling files" way. Currently the following changes are applied:
Slow tests are passing locally + added a few tests for the video generation.
Thanks for the continued work on this, and apologies for the delay in my review. Skimming over the PR and thinking about this more, I think we should split this up such that we have a video image processor and a video model. We want to avoid conditional outputs from our base processing objects, as well as conditional logic within our models' forward passes. As we won't interleave images and videos, i.e. we don't expect a user to be using both video and images at the same time, we don't need to be able to handle these all at once.
@amyeroberts Okay, I made Video InstructBlip its own model with its separate modeling and image processing files. Added some tests and indicated in the model doc that it's the same model as InstructBlip except for the video processing capability. Ready for review. @ArthurZucker I made more changes to the diff converter in this PR, as it didn't work in some cases. Specifically:
Plus these are the changes in the LLaVa-NeXT-Video PR, just duplicating to have it all written down somewhere :)
Still have to fix cases when only the "init" docstring is changed; I couldn't make it work yet.
amyeroberts left a comment
Looks great - thanks for splitting this up and adding this model!
Just a few comments and some questions about the changes to the diff_converter. The main comment is about making sure generate is properly tested.
src/transformers/models/instructblipvideo/diff_instructblipvideo.py
class InstructBlipVideoForConditionalGenerationDecoderOnlyTest(
    ModelTesterMixin, GenerationTesterMixin, unittest.TestCase
):
    all_model_classes = (InstructBlipVideoForConditionalGeneration,) if is_torch_available() else ()
Don't we need to define all_generative_models here too, to properly test the generation mixin?
GenerationMixin still can't work for VLMs, and I am planning to properly add it after some unification of VLM processors. Otherwise we'll have too many conditional checks inside the tests.
AFAIK all VLMs are currently tested in IntegrationTests for that.
utils/diff_model_converter.py
self.transformers_imports = {}  # maps import names like "from transformers.models.xxx" to the parsed AST module
self.imported_mapping = {}  # stores the names of the imported classes, with their source: {"LlamaModel": "transformers.model.llama.modeling_llama"}
self.visited_module = {}  # modules visited, like "transformers.models.llama.modeling_llama"
self.visited_module_config = {}  # modules visited, like "transformers.models.llama.modeling_llama", in the config file; needed to not mix config vs modeling imports
This comment isn't completely clear to me, as "transformers.models.llama.modeling_llama" is a modeling import and not a config import, and the instructblip configuration file doesn't import transformers.models.llama.modeling_llama.
Sorry, maybe it needs another example. This was needed because the auto-generated config wasn't getting imports from its "super config". In other words, it was only copying imports from diff files, with unused ones removed by ruff.
One solution may be to indicate all imports in the diff, but if they aren't used, ruff removes them eventually. In my case PretrainedConfig wasn't being imported, for example.
Yep, this is also something I noticed; mixing imports was not handled super well.
utils/diff_model_converter.py
)

if super_file_name not in self.visited_module:  # only extract classes once
    visited_module = self.visited_module_config if "Config" in class_name else self.visited_module
How well does this extend to other files we might have under the model folder, e.g. do we need flags for visited_module_processor etc.?
Not sure if processors work with diff. I actually tried to add the image processor to the diff but it messed up, so I believe that isn't supported yet.
Not yet supported, but planned for sure.
parser.add_argument(
    "--old_model_name",
    required=False,
    help="The name of the model from which the copying is done in CamelCase. If not provided is inferred from diff-file",
)
parser.add_argument(
    "--new_model_name",
    required=False,
    help="The name of the new model being added in CamelCase. If not provided is inferred from diff-file",
)
Is this for models with composite config files, e.g. CLIPVisionConfig, which we don't want to infer as CLIPVision?
This is for models that do camel case without underscores, like InstructBlip (instructblip) vs LlavaNext (llava_next). In the second case we can infer where to put a capital letter, while in the former it's impossible, so I decided to give users the freedom of passing model names.
That is a great addition indeed! Would even add this comment in the help 😉
for dependency, _ in list_dependencies:
    node = class_finder.global_nodes.get(dependency, None)
-   if node is not None:
+   if node is not None and "Config" not in class_name:
Same question here: do we need to account for all other classes, e.g. "Processor" not in class_name?
This was added because the diff converter was importing configs into the configuration files, even though they are defined as a class a few lines below.
It would be good to get a second review of the changes in this file from @ArthurZucker.
Overall good to me!
Separating config imports is the way to go, and further separating processor imports later on will be needed as well.
ArthurZucker left a comment
The diff converter changes look nice IMO, but we should not need to import all the classes. The new diff converter is able to parse dependencies, so tell me if this is not the case!
I am late to the party here, but normally you should not have to import all these classes. The diff converter will automatically detect dependencies and copy classes that are required!
Unless this is an edge case?
We discussed this on Slack and decided we shouldn't have separate imports for each file, and should let ruff clean out unnecessary ones. So I'm manually filtering the issue with configs and adding all imports. That worked for InstructBlip.
@amyeroberts this one is ready for the final review I guess
amyeroberts left a comment
Another great addition - thanks!
As with Llava Next Video, we just need a final run of the slow tests and we're good to merge ❤️
Got the CI green, including slow tests. Will merge the PR.
What does this PR do?
I made these changes a month ago and forgot to contribute them. This PR adds video processing capabilities for InstructBLIP models. The paper states InstructBLIP was trained and evaluated on video along with images, and the original repo has some code showing how video inference works.
Seems like this feature has some interest from the community (see here), so I believe we can add it.