Address changes in transformers VLM architecture #2554
Conversation
[transformers PR #37033](huggingface/transformers#37033) re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things). This breaks the following test:

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that all soft-prompting methods need a task type, since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'`, as this would not deal with `labels` either. Luckily, the VLM behaves almost like an LM (e.g., `get_input_embeddings` refers to the underlying LM), so we can target the VLM itself; the soft prompt methods then need to detect whether we're fine-tuning a VLM so that the respective config variables are taken from `base_model.text_config` instead of `base_model` directly.
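To illustrate the `text_config` handling described above, here is a minimal sketch of the idea; the helper name is made up for this example, and the actual PEFT change may differ in its details:

```python
def _resolve_prompt_learning_config(model_config):
    # New-style VLM configs nest the language model settings (hidden_size,
    # num_hidden_layers, num_attention_heads, ...) under `text_config`, while
    # plain causal LM configs expose them at the top level.
    text_config = getattr(model_config, "text_config", None)
    return text_config if text_config is not None else model_config

# e.g. num_layers = _resolve_prompt_learning_config(base_model.config).num_hidden_layers
```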
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
BenjaminBossan left a comment:
Thanks for fixing this, the solution LGTM. Failing MacOS tests are unrelated.
> However, old code that uses `model.language_model` as a target will fail with `AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'`. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model, we'd need the VLM's model config (e.g., to check for the `text_config` key) or access to the VLM itself. We cannot do much here, sadly.
I agree with your conclusion here. My suggestion is to make a note that, in the next release text, we should mention this possible backwards incompatibility and provide an example of how to migrate the code.
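Such a migration example could look roughly like the following; the model class and checkpoint id are only placeholders for illustration:

```python
from peft import PrefixTuningConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Before: wrap only the inner language model. With the re-arranged architecture,
# model.language_model is a bare LlamaModel, so this now fails with
# AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'.
# model.language_model = get_peft_model(model.language_model, peft_config)

# After: wrap the top-level VLM itself; the soft prompt config values are then
# taken from the nested text_config, as described in this PR.
model = get_peft_model(model, peft_config)
```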
FIX Transformers VLM architecture changes

Follow-up to #2554. See the discussion in huggingface/transformers#38627. To quote:

> transformers PR #37033 re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things).

A consequence of this is that the keys in the PEFT state_dict now also follow the new architecture. This means that:

1. If a PEFT checkpoint was saved with the old architecture but is loaded with the new architecture, loading fails.
2. If a PEFT checkpoint was saved with the new architecture but is loaded with the old architecture, loading fails.

Case 1 can be addressed by making use of the newly added `_checkpoint_conversion_mapping` attribute on models with the new architecture. In transformers, it is used to map old model state_dicts to the new state_dict format. In PEFT, with some fiddling, we can use the same mapping to make old PEFT state_dicts compatible with the new architecture (backwards compatibility).

Case 2, however, is not easily addressed. We would need a reverse mapping, which could easily be derived from `_checkpoint_conversion_mapping`, but since this attribute doesn't exist on old models, we cannot do that. Therefore, new checkpoints created with PEFT on these models won't load successfully when users use old transformers (forward compatibility). These cases are covered by the added unit tests, and the test covering case 2 is marked as xfail. If we could reliably detect that we are in case 2, we could warn the user and advise them to upgrade transformers, but I don't know if it's possible to figure this out.

Note that we skip prompt learning methods when applying the mapping. This is because they don't have the `"base_model.model."` prefix, which we need to remove before mapping; they just use `"base_model."`. We could remove only `"base_model."`, but the subsequent sub-module could also be called `model`, resulting in what looks like `"base_model.model."`. To avoid this confusion, we skip prefix tuning. Since it should be applied to the language model part directly and applies itself to the outer model (unlike LoRA et al.), skipping should be fine.

We also allow users to pass their own `key_mapping` to `from_pretrained` and `load_adapter`, though the documentation advises against it. This argument could theoretically be used as a workaround in case there is indeed an issue with prompt learning state_dicts.

Apart from these changes, I also made a small change to account for huggingface/transformers#38017 (comment).
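A rough sketch of the backwards-compatibility remapping described above, assuming the regex-pattern-to-replacement dict format that transformers uses for `_checkpoint_conversion_mapping`; the helper name and the prefix handling are simplified and not the actual PEFT implementation:

```python
import re

PEFT_PREFIX = "base_model.model."

def remap_old_peft_state_dict(peft_state_dict, base_model):
    # For Llava-style models the mapping looks roughly like
    # {"^language_model.model": "model.language_model", ...} (regex -> replacement).
    mapping = getattr(base_model, "_checkpoint_conversion_mapping", None) or {}
    remapped = {}
    for key, value in peft_state_dict.items():
        if key.startswith(PEFT_PREFIX):
            # Strip the PEFT wrapper prefix, apply the transformers key mapping,
            # then re-add the prefix so the keys match the new module layout.
            inner = key[len(PEFT_PREFIX):]
            for pattern, replacement in mapping.items():
                inner = re.sub(pattern, replacement, inner)
            key = PEFT_PREFIX + inner
        remapped[key] = value
    return remapped
```

A `key_mapping` dict passed explicitly to `from_pretrained` or `load_adapter` would play roughly the same role as the mapping looked up here.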
transformers PR #37033 re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things).

This breaks the following code (from `test_vision_models`):

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that `model.language_model` is not a `*CausalLM` anymore but a base model (e.g., `LlamaModel`). PEFT assumes a task-specific model, and all soft-prompting methods need a task type since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'` as a workaround because this would not deal with `labels` either.

Luckily, the newly structured VLM behaves almost like an LM (e.g., `get_input_embeddings` just refers to the underlying LM), so we can target the VLM itself; the soft prompt methods then need to detect whether we're fine-tuning a VLM so that the respective config variables are taken from `base_model.text_config` instead of `base_model` directly.

Backward compatibility

Old models trained with PEFT should still work, since VLM models like Llava use the `lm_head` and `language_model` weights from the state dict and re-arrange them accordingly. However, old code that uses `model.language_model` as a target will fail with `AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'`. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model, we'd need the VLM's model config (e.g., to check for the `text_config` key) or access to the VLM itself. We cannot do much here, sadly.
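Purely hypothetical, to make the last point concrete: if the surrounding VLM's config were available, the detection could look roughly like the sketch below; the problem is precisely that PEFT only sees the inner `LlamaModel` at that point, so a `vlm_config` argument does not exist in practice.

```python
def targets_vlm_language_model(vlm_config, wrapped_model) -> bool:
    # Hypothetical check: a nested text_config on the outer config marks a VLM,
    # and a wrapped model without a generation interface suggests the user passed
    # the inner language model (e.g., LlamaModel) instead of the VLM itself.
    is_vlm = getattr(vlm_config, "text_config", None) is not None
    lacks_generation = not hasattr(wrapped_model, "prepare_inputs_for_generation")
    return is_vlm and lacks_generation
```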