Merged
Changes from all commits (50 commits)
3a87a41
i guess reverted all CdGen classes
zucchini-nlp Mar 26, 2025
8d7088a
style
zucchini-nlp Mar 26, 2025
95ac049
llava onevision
zucchini-nlp Mar 26, 2025
f0e917e
fix copies
zucchini-nlp Mar 27, 2025
85b1e7a
Merge branch 'main' into vlm-base-models
zucchini-nlp Mar 27, 2025
5e4d0e8
fix some tests
zucchini-nlp Mar 27, 2025
02e7b6e
some more tests
zucchini-nlp Mar 27, 2025
c0e41e6
dump
zucchini-nlp Mar 27, 2025
ef70523
Merge branch 'main' into vlm-base-models
zucchini-nlp Mar 28, 2025
06b8227
skip these
zucchini-nlp Mar 28, 2025
5655657
nevermind, i am dumb
zucchini-nlp Mar 28, 2025
083b9bc
revert fix not needed
zucchini-nlp Mar 28, 2025
4fe8a82
Merge branch 'main' into vlm-base-models
zucchini-nlp Mar 31, 2025
2e6caa4
fixup
zucchini-nlp Mar 31, 2025
d397075
Merge branch 'main' into vlm-base-models
zucchini-nlp Mar 31, 2025
0d1409f
Merge branch 'main' into vlm-base-models
zucchini-nlp Mar 31, 2025
a32e47e
Merge branch 'main' into vlm-base-models
zucchini-nlp Apr 1, 2025
5c019fe
fixup
zucchini-nlp Apr 4, 2025
32a67b1
Merge remote-tracking branch 'upstream/main' into vlm-base-models
zucchini-nlp Apr 4, 2025
a9b3816
another fixup
zucchini-nlp Apr 4, 2025
1f7172c
more fixup to make ci finally happy
zucchini-nlp Apr 4, 2025
1e5ee3b
merge main
zucchini-nlp Apr 22, 2025
c6bfa8d
fixup after rebasing
zucchini-nlp Apr 22, 2025
7631fdb
fix qwen tests
zucchini-nlp Apr 22, 2025
da33a04
add internVL + typos here and there
zucchini-nlp Apr 22, 2025
141c102
image token index -> id
zucchini-nlp Apr 22, 2025
ba58575
style
zucchini-nlp Apr 22, 2025
4a73546
fix init weights
zucchini-nlp Apr 22, 2025
4d4ae05
Merge remote-tracking branch 'upstream/main' into vlm-base-models
zucchini-nlp Apr 22, 2025
6298cc4
Merge branch 'main' into vlm-base-models
zucchini-nlp Apr 24, 2025
a25e02d
revert blip-2 not supported
zucchini-nlp May 1, 2025
3bbf3fd
address comments
zucchini-nlp May 1, 2025
8087394
Merge remote-tracking branch 'upstream/main' into vlm-base-models
zucchini-nlp May 1, 2025
32cbc87
Merge remote-tracking branch 'upstream/main' into vlm-base-models
zucchini-nlp May 1, 2025
43999e8
fix copies
zucchini-nlp May 1, 2025
43639f4
revert blip2 test file as well
zucchini-nlp May 1, 2025
d31a4c9
as discussed internally, revert back CdGen models
zucchini-nlp May 2, 2025
e7ff08c
fix some tests
zucchini-nlp May 2, 2025
c265726
fix more tests for compile
zucchini-nlp May 2, 2025
db069f1
CI red
zucchini-nlp May 2, 2025
d309ead
fix copies
zucchini-nlp May 2, 2025
f5b18eb
enumerate explicitly allowed models
zucchini-nlp May 2, 2025
c58c4f2
address comments
zucchini-nlp May 6, 2025
9971e7f
fix tests
zucchini-nlp May 7, 2025
f601c52
fixup
zucchini-nlp May 7, 2025
4e617b4
merge main
zucchini-nlp May 7, 2025
df62bdf
style again
zucchini-nlp May 7, 2025
2509f77
add tests for new model class
zucchini-nlp May 7, 2025
ce4374b
another fixup ( x _ x )
zucchini-nlp May 7, 2025
24d127f
[fixup] unused attributes can be removed post-deprecation
zucchini-nlp May 7, 2025
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/aria.md
@@ -102,6 +102,10 @@ response = processor.decode(output_ids, skip_special_tokens=True)

[[autodoc]] AriaTextModel

## AriaModel

[[autodoc]] AriaModel

## AriaTextForCausalLM

[[autodoc]] AriaTextForCausalLM
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/aya_vision.md
@@ -237,6 +237,10 @@ for i, output in enumerate(batch_outputs):

[[autodoc]] AyaVisionConfig

## AyaVisionModel

[[autodoc]] AyaVisionModel

## AyaVisionForConditionalGeneration

[[autodoc]] AyaVisionForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/emu3.md
@@ -174,6 +174,10 @@ for i, image in enumerate(images['pixel_values']):
[[autodoc]] Emu3TextModel
- forward

## Emu3Model

[[autodoc]] Emu3Model

## Emu3ForCausalLM

[[autodoc]] Emu3ForCausalLM
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/fuyu.md
@@ -103,6 +103,10 @@ The `LlamaTokenizer` is used as it is a standard wrapper around sentencepiece.

[[autodoc]] FuyuConfig

## FuyuModel

[[autodoc]] FuyuModel

## FuyuForCausalLM

[[autodoc]] FuyuForCausalLM
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/gemma3.md
@@ -254,6 +254,10 @@ visualizer("<img>What is shown in this image?")
[[autodoc]] Gemma3TextModel
- forward

## Gemma3Model

[[autodoc]] Gemma3Model

## Gemma3ForCausalLM

[[autodoc]] Gemma3ForCausalLM
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/got_ocr2.md
@@ -277,6 +277,10 @@ alt="drawing" width="600"/>

[[autodoc]] GotOcr2Processor

## GotOcr2Model

[[autodoc]] GotOcr2Model

## GotOcr2ForConditionalGeneration

[[autodoc]] GotOcr2ForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/instructblip.md
@@ -69,6 +69,10 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipQFormerModel
- forward

## InstructBlipModel

[[autodoc]] InstructBlipModel

## InstructBlipForConditionalGeneration

[[autodoc]] InstructBlipForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/instructblipvideo.md
@@ -73,6 +73,10 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipVideoQFormerModel
- forward

## InstructBlipVideoModel

[[autodoc]] InstructBlipVideoModel
- forward

## InstructBlipVideoForConditionalGeneration

[[autodoc]] InstructBlipVideoForConditionalGeneration
5 changes: 5 additions & 0 deletions docs/source/en/model_doc/internvl.md
@@ -340,6 +340,11 @@ This example showcases how to handle a batch of chat conversations with interlea
[[autodoc]] InternVLVisionModel
- forward

## InternVLModel

[[autodoc]] InternVLModel
- forward

## InternVLForConditionalGeneration

[[autodoc]] InternVLForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/llava.md
@@ -256,6 +256,10 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] LlavaProcessor

## LlavaModel

[[autodoc]] LlavaModel

## LlavaForConditionalGeneration

[[autodoc]] LlavaForConditionalGeneration
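Every documentation diff in this PR has the same shape: a new bare `XxxModel` heading with an autodoc stub, inserted ahead of the existing head-bearing class. As a rough illustration of what these new base classes are for, here is a minimal sketch, assuming the new `LlavaModel` follows the usual base-model convention of returning hidden states rather than logits (the checkpoint is a real Hub repo; the placeholder image and prompt are illustrative, not code from this PR):

```python
# A minimal sketch, not code from this PR: load the new base class and pull
# multimodal hidden states without a language-modeling head.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaModel

processor = AutoProcessor.from_pretrained("llava-hf/llava-1.5-7b-hf")
model = LlavaModel.from_pretrained("llava-hf/llava-1.5-7b-hf")

image = Image.new("RGB", (336, 336))  # stand-in for a real image
inputs = processor(images=image, text="USER: <image>\nDescribe the image.", return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Base models stop at the hidden states; heads live on *ForConditionalGeneration.
print(outputs.last_hidden_state.shape)
```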
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/llava_next.md
@@ -315,6 +315,10 @@ model = AutoModelForImageTextToText.from_pretrained(

[[autodoc]] LlavaNextProcessor

## LlavaNextModel

[[autodoc]] LlavaNextModel

## LlavaNextForConditionalGeneration

[[autodoc]] LlavaNextForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/llava_next_video.md
@@ -262,6 +262,10 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(

[[autodoc]] LlavaNextVideoImageProcessor

## LlavaNextVideoModel

[[autodoc]] LlavaNextVideoModel

## LlavaNextVideoForConditionalGeneration

[[autodoc]] LlavaNextVideoForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/llava_onevision.md
@@ -313,6 +313,10 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(

[[autodoc]] LlavaOnevisionVideoProcessor

## LlavaOnevisionModel

[[autodoc]] LlavaOnevisionModel

## LlavaOnevisionForConditionalGeneration

[[autodoc]] LlavaOnevisionForConditionalGeneration
3 changes: 3 additions & 0 deletions docs/source/en/model_doc/mistral3.md
@@ -227,6 +227,9 @@ This example also how to use `BitsAndBytes` to load the model in 4bit quantizati

[[autodoc]] Mistral3Config

## Mistral3Model

[[autodoc]] Mistral3Model

## Mistral3ForConditionalGeneration

4 changes: 4 additions & 0 deletions docs/source/en/model_doc/mllama.md
@@ -130,6 +130,10 @@ print(processor.decode(output[0], skip_special_tokens=True))
[[autodoc]] MllamaTextModel
- forward

## MllamaModel

[[autodoc]] MllamaModel

## MllamaForCausalLM

[[autodoc]] MllamaForCausalLM
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/paligemma.md
@@ -174,6 +174,10 @@ visualizer("<img> What is in this image?")

[[autodoc]] PaliGemmaProcessor

## PaliGemmaModel

[[autodoc]] PaliGemmaModel

## PaliGemmaForConditionalGeneration

[[autodoc]] PaliGemmaForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/qwen2_5_vl.md
@@ -240,6 +240,10 @@ model = Qwen2_5_VLForConditionalGeneration.from_pretrained(

[[autodoc]] Qwen2_5_VLProcessor

## Qwen2_5_VLTextModel

[[autodoc]] Qwen2_5_VLTextModel
- forward

## Qwen2_5_VLModel

5 changes: 5 additions & 0 deletions docs/source/en/model_doc/qwen2_vl.md
@@ -296,6 +296,11 @@ model = Qwen2VLForConditionalGeneration.from_pretrained(

[[autodoc]] Qwen2VLProcessor

## Qwen2VLTextModel

[[autodoc]] Qwen2VLTextModel
- forward

## Qwen2VLModel

[[autodoc]] Qwen2VLModel
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/video_llava.md
@@ -215,6 +215,10 @@ model = VideoLlavaForConditionalGeneration.from_pretrained(

[[autodoc]] VideoLlavaProcessor

## VideoLlavaModel

[[autodoc]] VideoLlavaModel

## VideoLlavaForConditionalGeneration

[[autodoc]] VideoLlavaForConditionalGeneration
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/vipllava.md
@@ -101,6 +101,10 @@ A chat between a curious human and an artificial intelligence assistant. The ass

[[autodoc]] VipLlavaConfig

## VipLlavaModel

[[autodoc]] VipLlavaModel

## VipLlavaForConditionalGeneration

[[autodoc]] VipLlavaForConditionalGeneration
47 changes: 46 additions & 1 deletion src/transformers/modeling_utils.py
@@ -216,6 +216,28 @@ def is_local_dist_rank_0():
"kaiming_normal": nn.init.kaiming_normal,
}

# DO NOT MODIFY, KEPT FOR BC ONLY
VLMS = [
"aria",
"aya_vision",
"emu3",
"fuyu",
"got_ocr2",
"gemma3",
"internvl",
"llava",
"llava_next",
"llava_next_video",
"llava_onevision",
"mistral3",
"mllama",
"paligemma",
"qwen2_vl",
"qwem2_5_vl",
"video_llava",
"vipllava",
]


@contextmanager
def no_init_weights():
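A note on how the `VLMS` list above is consumed: the guards added further down in `save_pretrained` and `from_pretrained` test each entry as a substring of the lowered class name, so a single entry can cover a whole model family. A small sketch of that behavior (class names are real; the list is abbreviated from the one above):

```python
# Sketch of the membership test used by the guards below: entries match as
# substrings of the lowered class name, so one entry covers a model family.
VLMS = ["llava", "paligemma"]

for cls_name in ["LlavaForConditionalGeneration", "LlavaNextVideoModel", "PaliGemmaModel"]:
    print(cls_name, any(allowed_name in cls_name.lower() for allowed_name in VLMS))
# LlavaForConditionalGeneration True
# LlavaNextVideoModel True
# PaliGemmaModel True
```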
@@ -1778,6 +1800,8 @@ class PreTrainedModel(nn.Module, ModuleUtilsMixin, PushToHubMixin, PeftAdapterMi
main_input_name = "input_ids"
model_tags = None

_checkpoint_conversion_mapping = {} # used for BC support in VLMs, not meant to be used by new models

_auto_class = None
_no_split_modules = None
_skip_keys_device_placement = None
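The mapping itself is a dict from regex patterns over legacy checkpoint keys to their new, standardized prefixes; `from_pretrained` applies it forward when loading old checkpoints, and `save_pretrained` inverts it so files on the Hub keep the legacy layout. The contents below are an assumption modeled on the Llava-style refactor, not a verbatim copy from the PR:

```python
# A hedged sketch of what the class-level mapping might look like for a
# Llava-style model after this refactor; patterns are illustrative.
_checkpoint_conversion_mapping = {
    "^language_model.model": "model.language_model",
    "^vision_tower": "model.vision_tower",
    "^multi_modal_projector": "model.multi_modal_projector",
    "^language_model.lm_head": "lm_head",
}
```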
@@ -3484,6 +3508,21 @@ def save_pretrained(
module_map[name + f".{key}"] = module
state_dict = model_to_save.state_dict()

if any(allowed_name in self.__class__.__name__.lower() for allowed_name in VLMS):
reverse_key_mapping = {v: k for k, v in self._checkpoint_conversion_mapping.items()}

original_state_dict = {}
for key, value in state_dict.items():
for pattern, replacement in reverse_key_mapping.items():
replacement = replacement.lstrip("^") # strip off un-needed chars and patterns
replacement = re.sub(r"\(.*?\)", "", pattern)
key, n_replace = re.subn(pattern, replacement, key)
# Early exit of the loop
if n_replace > 0:
break
original_state_dict[key] = value
state_dict = original_state_dict

# Translate state_dict from smp to hf if saving with smp >= 1.10
if IS_SAGEMAKER_MP_POST_1_10:
for smp_to_hf, _ in smp.state.module_manager.translate_functions:
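The reverse pass above can be traced on a toy key. This sketch reuses the illustrative mapping from the previous example and exercises only the key rewriting, not the full save path:

```python
import re

# Toy trace of the reverse mapping used when saving; the mapping is the
# illustrative one from the sketch above, not taken from the PR.
mapping = {"^language_model.model": "model.language_model"}
reverse_key_mapping = {v: k for k, v in mapping.items()}

state_dict = {"model.language_model.layers.0.mlp.up_proj.weight": "tensor..."}

original_state_dict = {}
for key, value in state_dict.items():
    for pattern, replacement in reverse_key_mapping.items():
        replacement = replacement.lstrip("^")              # "^language_model.model" -> "language_model.model"
        replacement = re.sub(r"\(.*?\)", "", replacement)  # drop any capture groups
        key, n_replace = re.subn(pattern, replacement, key)
        if n_replace > 0:
            break
    original_state_dict[key] = value

print(original_state_dict)
# {'language_model.model.layers.0.mlp.up_proj.weight': 'tensor...'}
```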
@@ -4071,7 +4110,13 @@ def from_pretrained(
gguf_file = kwargs.pop("gguf_file", None)
tp_plan = kwargs.pop("tp_plan", None)
tp_size = kwargs.pop("tp_size", None)
key_mapping = kwargs.pop("key_mapping", None)

# Load models with hardcoded key mapping on class for VLMs only, to keep BC and standardize model
if any(allowed_name in cls.__name__.lower() for allowed_name in VLMS):
[Review thread on this change]

@ManuelFay (Contributor, Jun 5, 2025): This is quite brittle and breaks adapters (in PEFT). How would you go about this? I'm thinking we can propagate the key_mapping to the PEFT integration in the from_pretrained function?

Contributor: Since a lot of people (including me) use adapters with VLMs, that's quite a big breaking change.

Collaborator: For anyone looking, this was fixed!

key_mapping = kwargs.pop("key_mapping", cls._checkpoint_conversion_mapping)
else:
key_mapping = kwargs.pop("key_mapping", None)

# Not used anymore -- remove them from the kwargs
_ = kwargs.pop("resume_download", None)
_ = kwargs.pop("trust_remote_code", None)