[Idefics] add image_embeddings option in generate-related methods
#25442
Conversation
ArthurZucker
left a comment
Could you just run make style! 🤗
The documentation is not available anymore as the PR was closed or merged.
VictorSanh
left a comment
I am missing something in the logic: let's say we call generate with pixel_values. The first time it calls the forward, it computes the image hidden states through the vision_encoder (and optionally through the perceiver). How are these image hidden states passed to the second call to the forward?
VictorSanh
left a comment
The logic looks about right to me.
I doubt the whole thing works if return_dict_in_generate is True. In particular, the call to model_kwargs["encoder_outputs"] (in greedy_search) will crash.
Also, I am not an expert in the generate function. Computing the encoder hidden states (i.e. the vision/perceiver hidden states in our case) inside the prepare... function seems curious to me. Looking at modeling_t5.py, the encoder_hidden_states are computed inside the forward of the model (and then returned through Seq2SeqModelOutput).
But perhaps it's ok? I will let the transformers folks comment instead.
```python
model_kwargs["perceiver_embeddings"] = self.model.perceiver_resampler(image_encoder_embeddings)

image_seq_len, image_hidden_size = model_kwargs["perceiver_embeddings"].size(1), model_kwargs[
    "perceiver_embeddings"
].size(2)
model_kwargs["perceiver_embeddings"] = model_kwargs["perceiver_embeddings"].view(
    batch_size, num_images, image_seq_len, image_hidden_size
)
```
do the resizes before assigning it into model_kwargs["perceiver_embeddings"]
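The suggested refactor can be sketched with plain lists standing in for tensors (`reshape` plays the role of `.view()`; all names are illustrative): do the reshape on a local variable first, then assign into model_kwargs exactly once.

```python
def reshape(flat, batch_size):
    # Stand-in for .view(batch_size, num_images, ...): regroup a flat
    # (batch_size * num_images, ...) sequence into per-sample chunks.
    per_sample = len(flat) // batch_size
    return [flat[i * per_sample:(i + 1) * per_sample] for i in range(batch_size)]

def prepare(model_kwargs, image_encoder_embeddings, batch_size):
    perceiver_embeddings = [e + 1 for e in image_encoder_embeddings]  # toy resampler
    # Resize first, then a single clean assignment into model_kwargs.
    model_kwargs["perceiver_embeddings"] = reshape(perceiver_embeddings, batch_size)
    return model_kwargs
```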
Sure I'll change that
Yes, I'm not sure it is very standard.
ArthurZucker
left a comment
Thanks for working on this! Left a few comments. Could you also add some tests to make sure you can pass both expected embeddings and have the expected behaviour! 🤗
```python
if pixel_values is not None:
    batch_size, num_images = pixel_values.shape[:2]
    pixel_values = pixel_values.contiguous().view(batch_size * num_images, *pixel_values.shape[2:])
    image_encoder_embeddings = self.model.vision_model(pixel_values=pixel_values).last_hidden_state

elif image_encoder_embeddings is not None:
    batch_size, num_images, image_seq_len, image_hidden_size = image_encoder_embeddings.size()
    image_encoder_embeddings = image_encoder_embeddings.view(
        batch_size * num_images, image_seq_len, image_hidden_size
    )

if self.config.use_resampler:
    if perceiver_embeddings is None:
        perceiver_embeddings = self.model.perceiver_resampler(image_encoder_embeddings)
        image_seq_len, image_hidden_size = perceiver_embeddings.size(1), perceiver_embeddings.size(2)
        model_kwargs["perceiver_embeddings"] = perceiver_embeddings.view(
            batch_size, num_images, image_seq_len, image_hidden_size
        )
    else:
        model_kwargs["perceiver_embeddings"] = perceiver_embeddings
else:
    image_seq_len, image_hidden_size = image_encoder_embeddings.size(1), image_encoder_embeddings.size(2)
    model_kwargs["image_encoder_embeddings"] = image_encoder_embeddings.view(
        batch_size, num_images, image_seq_len, image_hidden_size
    )
```
Regarding @VictorSanh's comment, it makes more sense indeed to compute these values in the forward and return them as outputs. Our API usually works this way, so let's keep it that way. It should also simplify this function!
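The compute-in-forward pattern being recommended can be sketched like this (a simplified toy, not the actual transformers classes; names are illustrative): the forward computes the image hidden states and carries them in its output object, t5/encoder_outputs style, and generate copies them back into model_kwargs for the next step.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CausalLMOutputWithImages:
    logits: list
    image_hidden_states: Optional[list] = None

def forward(input_ids, pixel_values=None, image_hidden_states=None):
    if image_hidden_states is None and pixel_values is not None:
        # Compute the expensive vision features inside forward...
        image_hidden_states = [v * 10 for v in pixel_values]
    # ...and return them as part of the output.
    return CausalLMOutputWithImages(logits=[len(input_ids)],
                                    image_hidden_states=image_hidden_states)

def update_model_kwargs(outputs, model_kwargs):
    # generate() reuses the returned states instead of recomputing them.
    model_kwargs["image_hidden_states"] = outputs.image_hidden_states
    model_kwargs["pixel_values"] = None
    return model_kwargs
```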
Ok, will look into this.
```python
if model_kwargs["image_encoder_embeddings"] is not None:
    model_kwargs["image_encoder_embeddings"] = model_kwargs["image_encoder_embeddings"].index_select(
        0, expanded_return_idx
    )

elif model_kwargs["perceiver_embeddings"] is not None:
    model_kwargs["perceiver_embeddings"] = model_kwargs["perceiver_embeddings"].index_select(
        0, expanded_return_idx
    )
```
Is this something that only happens with generate? (if yes, then no worries let's keep it here!)
Not sure what you mean here... Basically, to the best of my understanding, this is used in beam_search, as you need to expand the inputs' batch size to match the number of beams.
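The expansion being described can be sketched with lists standing in for tensors (illustrative names): every per-batch input must be repeated num_beams times along dim 0 so each beam sees its own copy, which is what `index_select(0, expanded_return_idx)` achieves.

```python
def make_expanded_return_idx(batch_size, num_beams):
    # torch equivalent: torch.arange(batch_size).repeat_interleave(num_beams)
    return [i for i in range(batch_size) for _ in range(num_beams)]

def index_select(rows, idx):
    # Stand-in for tensor.index_select(0, idx): pick rows along dim 0.
    return [rows[i] for i in idx]

idx = make_expanded_return_idx(2, 3)             # [0, 0, 0, 1, 1, 1]
expanded = index_select(["img_a", "img_b"], idx)  # one copy per beam
```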
Ok 👍🏻 then this is all good 😉
Actually, if you put the outputs in encoder_outputs, the default https://github.com/ArthurZucker/transformers/blob/main/src/transformers/generation/utils.py#L721 should work!
VictorSanh
left a comment
yaaay! congrats on your first PR to Transformers (I think?)!
lgtm! I'll let @ArthurZucker double confirm
ArthurZucker
left a comment
Left a few small nits, good to go otherwise! Congrats 🔥
Co-authored-by: Arthur <[email protected]>
```python
    image_hidden_states = image_embeddings.to(dtype=self.dtype, device=input_ids.device)
elif image_encoder_embeddings is not None:
    batch_size, num_images, image_seq_len, image_hidden_size = image_encoder_embeddings.size()
    image_hidden_states = image_encoder_embeddings.to(dtype=self.dtype, device=input_ids.device)
```
Last nit: is this required? I would think that accelerate handles this since it's not a tensor created on the fly (unless the casting is what's required, not necessarily moving to a different device!)
Not sure, maybe not. I kept it out of caution when modifying this part. @VictorSanh, do you know if there was a particular reason for adding this?
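The distinction being discussed can be sketched with a minimal tensor stand-in (assumed, simplified semantics; not real torch): `.to(dtype=..., device=...)` bundles two separate concerns, device placement (which dispatch machinery may handle for module inputs) and dtype casting (which the model must still do itself if a user passes, say, float32 embeddings to a float16 model). Like torch, the stand-in returns the same object when nothing changes.

```python
class FakeTensor:
    def __init__(self, data, dtype="float32", device="cpu"):
        self.data, self.dtype, self.device = data, dtype, device

    def to(self, dtype=None, device=None):
        # Mirrors torch semantics: a no-op returns self, not a copy.
        if dtype in (None, self.dtype) and device in (None, self.device):
            return self
        return FakeTensor(self.data, dtype or self.dtype, device or self.device)

emb = FakeTensor([1.0, 2.0], dtype="float32", device="cpu")
cast = emb.to(dtype="float16", device="cpu")  # only the dtype changes
```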
And if it's not too much work, a test calling generate to make sure this all works well 🤗
I think there's a test here
Okay
* rename * restore * mappings * unedited tests+docs * docs * fixes * fix auto-sync breakage * cleanup * wip * wip * add fetch_images * remove einops dependency * update * fix * fix * fix * fix * fix * re-add * add batching * rework * fix * improve * add Leo as I am extending his work * cleanup * fix * cleanup * slow-test * fix * fix * fixes * deal with warning * rename modified llama classes * rework fetch_images * alternative implementation * cleanup * strict version * cleanup * [`IDEFICS`] Fix idefics ci (#25056) * Fix IDEFICS CI * fix test file * fixup * some changes to make tests pass * fix * fixup * Update src/transformers/models/idefics/configuration_idefics.py Co-authored-by: Stas Bekman <[email protected]> --------- Co-authored-by: Stas Bekman <[email protected]> * remove compat checks * style * explain that Idefics is not for training from scratch * require pt>=2.0 * fix idefics vision config (#25092) * fix idefics vision config * fixup * clean * Update src/transformers/models/idefics/configuration_idefics.py --------- Co-authored-by: Stas Bekman <[email protected]> * cleanup * style * cleanup * Apply suggestions from code review Co-authored-by: Sylvain Gugger <[email protected]> * upcase * sequence of images * handle the case with no images * Update src/transformers/image_processing_utils.py Co-authored-by: Victor SANH <[email protected]> * support pure lm take 2 * support tokenizer options * parameterize num_channels * fix upcase * s|IdeficsForCausalLM|IdeficsForVisionText2Text|g * manual to one line * addressing review * unbreak * remove clip dependency * fix test * consistency * PIL import * Idefics prefix * Idefics prefix * hack to make tests work * style * fix * fix * revert * try/finally * cleanup * clean up * move * [`IDEFICS`] Fix idefics config refactor (#25149) * refactor config * nuke init weights * more refactor * oops * remove visual question answering pipeline support * Update src/transformers/models/idefics/clip.py Co-authored-by: Stas 
Bekman <[email protected]> * Update src/transformers/models/idefics/modeling_idefics.py * cleanup * mv clip.py vision.py * tidyup --------- Co-authored-by: Stas Bekman <[email protected]> Co-authored-by: Stas Bekman <[email protected]> * fix * license * condition on pt * fix * style * fix * rm torchvision dependency, allow custom transforms * address review * rework device arg * add_eos_token * s/transforms/transform/ * fix top level imports * fix return value * cleanup * cleanup * fix * style * license * license * Update src/transformers/models/idefics/image_processing_idefics.py Co-authored-by: Sylvain Gugger <[email protected]> * add a wrapper to freeze vision layears * tidyup * use the correct std/mean settings * parameterize values from config * add tests/models/idefics/test_image_processing_idefics.py * add test_processor_idefics.py * cleanup * cleanups * fix * fix * move to the right group * style * Apply suggestions from code review Co-authored-by: Sylvain Gugger <[email protected]> * add perceiver config * reset * missing arg docs * Apply suggestions from code review Co-authored-by: Leo Tronchon <[email protected]> * address review comments * inject automatic end of utterance tokens (#25218) * inject automatic end of utterance tokens * fix * fix * fix * rework to not use the config * not end_of_utterance_token at the end * Update src/transformers/models/idefics/processing_idefics.py Co-authored-by: Sylvain Gugger <[email protected]> * address review * Apply suggestions from code review Co-authored-by: Joao Gante <[email protected]> * Update src/transformers/image_processing_utils.py Co-authored-by: Nicolas Patry <[email protected]> * [`Idefics`] add image_embeddings option in generate-related methods (#25442) * add image_embeddings option in generate-related methods * style * rename image_embeddings and allow perceiver embeddings precomputation * compute embeddings within generate * make is_encoder_decoder= True the default in config * nested if else fix * 
better triple check * switch if elif order for pixel values / img embeds * update model_kwargs perceiver only at the end * use _prepare_model_inputs instead of encoder_decoder logic * fix comment typo * fix config default for is_encoder_decoder * style * add typehints * precompute in forward * doc builder * style * pop instead of get image hidden states * Trigger CI * Update src/transformers/models/idefics/modeling_idefics.py Co-authored-by: Arthur <[email protected]> * Update src/transformers/models/idefics/modeling_idefics.py Co-authored-by: Arthur <[email protected]> * fix * + indentation + style * simplify a bit the use_resampler logic using comments * update diocstrings * Trigger CI --------- Co-authored-by: Arthur <[email protected]> * fix rebase changes * unbreak #25237 - to be fixed in follow up PRs * is_composition = False * no longer needed --------- Co-authored-by: leot13 <[email protected]> Co-authored-by: Younes Belkada <[email protected]> Co-authored-by: Sylvain Gugger <[email protected]> Co-authored-by: Victor SANH <[email protected]> Co-authored-by: Joao Gante <[email protected]> Co-authored-by: Nicolas Patry <[email protected]> Co-authored-by: Arthur <[email protected]>
Update Idefics generate-related functions to allow for precomputed image embeddings