-
Notifications
You must be signed in to change notification settings - Fork 33.5k
VLMs: patch_size -> num_image_tokens in processing
#33424
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from 6 commits
af34e5c
95b5e9d
850a8c7
5dc84ee
8dde9f4
4aea55e
0ce015d
4d5ca86
77aac79
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -51,6 +51,9 @@ class LlavaProcessor(ProcessorMixin): | |
| The tokenizer is a required input. | ||
| patch_size (`int`, *optional*): | ||
| Patch size from the vision tower. | ||
| num_additional_image_tokens (`int`, *optional*, defaults to 0): | ||
| Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other | ||
| extra tokens appended, no need to set this arg. | ||
| vision_feature_select_strategy (`str`, *optional*): | ||
| The feature selection strategy used to select the vision feature from the vision backbone. | ||
| Shoudl be same as in model's config | ||
|
|
@@ -61,7 +64,13 @@ class LlavaProcessor(ProcessorMixin): | |
| """ | ||
|
|
||
| attributes = ["image_processor", "tokenizer"] | ||
| valid_kwargs = ["chat_template", "patch_size", "vision_feature_select_strategy", "image_token"] | ||
| valid_kwargs = [ | ||
| "chat_template", | ||
| "patch_size", | ||
| "vision_feature_select_strategy", | ||
| "image_token", | ||
| "num_additional_image_tokens", | ||
| ] | ||
| image_processor_class = "AutoImageProcessor" | ||
| tokenizer_class = "AutoTokenizer" | ||
|
|
||
|
|
@@ -70,12 +79,14 @@ def __init__( | |
| image_processor=None, | ||
| tokenizer=None, | ||
| patch_size=None, | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. For backwards compatibility, we can't just remove accepting patch_size immediately. We need to deprecate it and not allow both You can use the @deprecate_kwarg decorator to handle this |
||
| num_additional_image_tokens=0, | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. should probably be placed at the end for BC as well 😉
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. also it was set to
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this one is important
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. oke, will fix it and merge after making sure tests pass |
||
| vision_feature_select_strategy=None, | ||
| chat_template=None, | ||
| image_token="<image>", # set the default and let users change if they have peculiar special tokens in rare cases | ||
| **kwargs, | ||
| ): | ||
| self.patch_size = patch_size | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Same here - we need to deprecate this property |
||
| self.num_additional_image_tokens = num_additional_image_tokens | ||
| self.vision_feature_select_strategy = vision_feature_select_strategy | ||
| self.image_token = image_token | ||
| super().__init__(image_processor, tokenizer, chat_template=chat_template) | ||
|
|
@@ -147,9 +158,11 @@ def __call__( | |
| # Replace the image token with the expanded image token sequence | ||
| pixel_values = image_inputs["pixel_values"] | ||
| height, width = get_image_size(to_numpy_array(pixel_values[0])) | ||
| num_image_tokens = (height // self.patch_size) * (width // self.patch_size) + 1 | ||
| num_image_tokens = (height // self.patch_size) * ( | ||
| width // self.patch_size | ||
| ) + self.num_additional_image_tokens | ||
| if self.vision_feature_select_strategy == "default": | ||
| num_image_tokens -= 1 | ||
| num_image_tokens -= self.num_additional_image_tokens | ||
|
|
||
| prompt_strings = [] | ||
| for sample in text: | ||
|
|
@@ -158,8 +171,9 @@ def __call__( | |
| else: | ||
| logger.warning_once( | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. this will be in 4.47 |
||
| "Expanding inputs for image tokens in LLaVa should be done in processing. " | ||
| "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
| "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
| "Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
| "with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` " | ||
| "and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
| "Using processors without these attributes in the config is deprecated and will throw an error in v4.47." | ||
| ) | ||
|
|
||
|
|
||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -54,6 +54,9 @@ class LlavaNextProcessor(ProcessorMixin): | |
| The tokenizer is a required input. | ||
| patch_size (`int`, *optional*): | ||
| Patch size from the vision tower. | ||
| num_additional_image_tokens (`int`, *optional*, defaults to 0): | ||
| Number of additional tokens added to the image embeddings, such as CLS (+1). If the backbone has no CLS or other | ||
| extra tokens appended, no need to set this arg. | ||
| vision_feature_select_strategy (`str`, *optional*): | ||
| The feature selection strategy used to select the vision feature from the vision backbone. | ||
| Shoudl be same as in model's config | ||
|
|
@@ -64,7 +67,13 @@ class LlavaNextProcessor(ProcessorMixin): | |
| """ | ||
|
|
||
| attributes = ["image_processor", "tokenizer"] | ||
| valid_kwargs = ["chat_template", "patch_size", "vision_feature_select_strategy", "image_token"] | ||
| valid_kwargs = [ | ||
| "chat_template", | ||
| "patch_size", | ||
| "vision_feature_select_strategy", | ||
| "image_token", | ||
| "num_additional_image_tokens", | ||
| ] | ||
| image_processor_class = "AutoImageProcessor" | ||
| tokenizer_class = "AutoTokenizer" | ||
|
|
||
|
|
@@ -73,12 +82,14 @@ def __init__( | |
| image_processor=None, | ||
| tokenizer=None, | ||
| patch_size=None, | ||
| num_additional_image_tokens=0, | ||
| vision_feature_select_strategy=None, | ||
| chat_template=None, | ||
| image_token="<image>", # set the default and let users change if they have peculiar special tokens in rare cases | ||
| **kwargs, | ||
| ): | ||
| self.patch_size = patch_size | ||
| self.num_additional_image_tokens = num_additional_image_tokens | ||
| self.vision_feature_select_strategy = vision_feature_select_strategy | ||
| self.image_token = image_token | ||
| super().__init__(image_processor, tokenizer, chat_template=chat_template) | ||
|
|
@@ -141,8 +152,9 @@ def __call__( | |
| if self.patch_size is None or self.vision_feature_select_strategy is None: | ||
| logger.warning_once( | ||
| "Expanding inputs for image tokens in LLaVa-NeXT should be done in processing. " | ||
| "Please add `patch_size` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
| "with `processor.patch_size = {{patch_size}}` and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
| "Please add `patch_size`, `num_additional_image_tokens` and `vision_feature_select_strategy` to the model's processing config or set directly " | ||
| "with `processor.patch_size = {{patch_size}}`, `processor.num_additional_image_tokens = {{num_additional_image_tokens}}` " | ||
| "and processor.vision_feature_select_strategy = {{vision_feature_select_strategy}}`. " | ||
| "Using processors without these attributes in the config is deprecated and will throw an error in v4.47." | ||
|
Collaborator
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. same here (unless this was not released!) |
||
| ) | ||
| else: | ||
|
|
@@ -155,7 +167,7 @@ def __call__( | |
| orig_height, orig_width = image_size | ||
| num_image_tokens = self._get_number_of_features(orig_height, orig_width, height, width) | ||
| if self.vision_feature_select_strategy == "default": | ||
| num_image_tokens -= 1 | ||
| num_image_tokens -= self.num_additional_image_tokens | ||
| sample = sample.replace(self.image_token, "<placeholder>" * num_image_tokens, 1) | ||
| prompt_strings.append(sample) | ||
| prompt_strings = [sample.replace("<placeholder>", self.image_token) for sample in prompt_strings] | ||
|
|
@@ -178,7 +190,7 @@ def _get_number_of_features(self, orig_height: int, orig_width: int, height: int | |
| orig_height, orig_width, patches_height, patches_width, scale_height, scale_width | ||
| ) | ||
| # The base patch covers the entire image (+1 for the CLS) | ||
| base_features = patches_height * patches_width + 1 | ||
| base_features = patches_height * patches_width + self.num_additional_image_tokens | ||
| num_image_tokens = unpadded_features + newline_features + base_features | ||
| return num_image_tokens | ||
|
|
||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
they are added where (to which image embedding and why) also I am not even sure I understand the usage of this myself. As it will be in all processors, can you explain a bit more why someone should set this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, the very first reason for this PR is that some ViT backbones can add a CLS token to image patches, while other do not. Therefore our current processors can't work with SigLIP because we hardcoded the CLS token addition in code.
We had two options to fix it:
num_image_tokensarg and let users specify any amount so we don't have to infer how many patches there will be frompatch_sizeandimage_size. Some processor already do that like Paligemmapatch-sizeandimage-size. I realize we have no models that accept non-square images, but that was the strongest objection from core maintainer against thenum-image-tokensSo after discussions with Pavel, we decided to make as less changes as possible and still support SigLIP by adding
num_additional_tokens. It will be 1 if the vision tower adds CLS, and otherwise 0. It can be more than 1 in case there is any vision tower with extra tokens added on top of image patchesThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok thanks for explaning!