
Conversation

@githubnemo
Collaborator

transformers PR #37033 re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things).

This breaks the following code (from test_vision_models):

```
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

with

AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'

The reason is that model.language_model is no longer a *CausalLM but a base model (e.g., LlamaModel).
PEFT assumes a task-specific model: all soft-prompting methods need a task type, since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use task_type='FEATURE_EXTRACTION' as a workaround because that task type would not deal with labels either.

Luckily, the newly structured VLM almost behaves like a LM (e.g., get_input_embeddings just refers to the underlying LM), so we can target the VLM itself. The soft prompt methods then need to detect that we're fine-tuning a VLM so that they take the respective config variables from base_model.text_config instead of base_model directly.
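
A minimal sketch of what that detection could look like (the helper name is hypothetical, not the actual PEFT code; it only assumes that VLM configs expose a text_config attribute while plain LM configs do not):

```
from transformers import AutoConfig

def resolve_text_config(config):
    # Hypothetical helper: for a VLM, the language-model settings
    # (hidden size, number of layers, ...) live under `text_config`;
    # for a plain LM the config itself already is the text config.
    return getattr(config, "text_config", None) or config

config = AutoConfig.from_pretrained("llava-hf/llava-1.5-7b-hf")
text_config = resolve_text_config(config)
print(text_config.hidden_size, text_config.num_hidden_layers)
```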

Backward compatibility

Old models trained with PEFT should still work, since VLMs like Llava take the lm_head and language_model weights from the state dict and re-arrange them accordingly. However, old code that uses model.language_model as a target will fail with AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model we'd need the VLM's model config (e.g., to check for the text_config key) or access to the VLM itself. We cannot do much here, sadly.

@githubnemo githubnemo requested a review from BenjaminBossan May 26, 2025 15:31
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Member

@BenjaminBossan BenjaminBossan left a comment


Thanks for fixing this, the solution LGTM. Failing MacOS tests are unrelated.

> However, old code that uses model.language_model as a target will fail with AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model we'd need the VLM's model config (e.g., to check for the text_config key) or access to the VLM itself. We cannot do much here, sadly.

I agree with your conclusion here. My suggestion is to take a note so that in the next release notes we mention this possible backwards incompatibility and provide an example of how to migrate the code.
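
For illustration, a migration sketch along those lines (the model class and checkpoint id are placeholders; the exact choice depends on the VLM being tuned):

```
from transformers import LlavaForConditionalGeneration
from peft import PrefixTuningConfig, get_peft_model

# Placeholder checkpoint; any VLM affected by the re-arrangement behaves the same way.
model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Old code (breaks with the new architecture): model.language_model is now a bare
# LlamaModel without prepare_inputs_for_generation.
# model.language_model = get_peft_model(model.language_model, peft_config)

# New code: wrap the top-level VLM instead; the soft prompt settings are read
# from its text_config.
model = get_peft_model(model, peft_config)
```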

@githubnemo githubnemo merged commit 5a42bb7 into huggingface:main Jun 2, 2025
10 of 14 checks passed
BenjaminBossan added a commit to BenjaminBossan/peft that referenced this pull request Jun 6, 2025
Follow up to huggingface#2554
See discussion in huggingface/transformers#38627

To quote:

> transformers PR #37033 re-arranges the way visual language models are
built by moving the LM head from the language model to the top-level
VLM (among other things).

A consequence of this is that the keys in the PEFT state_dict now also
follow the new architecture. This means that:

1. If a PEFT checkpoint was saved with the old architecture but is
   loaded with the new architecture, loading fails.
2. If a PEFT checkpoint was saved with the new architecture but is
   loaded with the old architecture, loading fails.

1. can be addressed by making use of the newly added
_checkpoint_conversion_mapping attribute for models with the new
architecture. In transformers, this is used to map old model state_dicts
to the new state_dict format. In PEFT, with some fiddling, we can use
the same mapping to make old PEFT state_dicts compatible with the new
architecture (backwards compatibility).
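
Roughly, such a remapping could look like the sketch below. This is not the actual PEFT implementation; it assumes the mapping is a dict of regex patterns to replacement strings (as transformers applies it when loading old state_dicts) and that LoRA-style PEFT keys carry a "base_model.model." prefix:

```
import re

def remap_peft_state_dict(peft_state_dict, conversion_mapping, prefix="base_model.model."):
    # Strip the PEFT tuner prefix, apply the old-to-new key mapping provided by
    # transformers, then re-attach the prefix.
    remapped = {}
    for key, value in peft_state_dict.items():
        has_prefix = key.startswith(prefix)
        bare_key = key[len(prefix):] if has_prefix else key
        for pattern, replacement in conversion_mapping.items():
            bare_key, n_subs = re.subn(pattern, replacement, bare_key)
            if n_subs:
                break
        remapped[(prefix + bare_key) if has_prefix else bare_key] = value
    return remapped

# The mapping itself would come from the (new-architecture) base model, e.g.:
# conversion_mapping = getattr(base_model, "_checkpoint_conversion_mapping", {})
```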

However, 2. is not easily addressed. We would need a reverse mapping for
this. This could be easily derived from _checkpoint_conversion_mapping,
but since this attribute doesn't exist on old models, we cannot do that.
Therefore, new checkpoints created with PEFT on these models won't load
successfully when users use old transformers (forward compatibility).

These cases are covered by the added unit tests, which means that the
test covering case 2 currently fails.

If we could reliably detect that we are in case 2, we could warn the
user and advise them to upgrade transformers, but I don't know if it's
possible to figure this out.

Note that we skip prompt learning methods when applying the mapping. This is
because their keys don't have the "base_model.model." prefix, which we need to
remove before mapping; they only use "base_model.". We could remove just
"base_model." instead, but the subsequent sub-module could itself be called
"model", resulting in what looks like "base_model.model." and causing
confusion. To avoid this, we skip prefix tuning. Since it should be applied to
the language model part directly and applies itself on the outer model (unlike
LoRA et al.), skipping should be fine.

We also allow users to pass their own key_mapping to from_pretrained and
load_adapter, though the documentation advises against it. This argument
could theoretically be used as a workaround in case there is indeed an
issue with prompt learning state_dicts.
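
Purely as an illustrative usage sketch of that escape hatch (the checkpoint, adapter path, and mapping pattern below are placeholders, and the key_mapping argument is the one described above):

```
from transformers import LlavaForConditionalGeneration
from peft import PeftModel

base_model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")

# Only as a last resort: supply a custom old-key -> new-key mapping when the
# saved adapter keys do not match the current architecture.
peft_model = PeftModel.from_pretrained(
    base_model,
    "path/to/saved-adapter",  # placeholder path
    key_mapping={"^language_model.model": "model.language_model"},  # illustrative pattern
)
```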

Apart from these changes, I also made a small change to account for
huggingface/transformers#38017 (comment).
BenjaminBossan added a commit that referenced this pull request Jun 23, 2025
FIX Transformers VLM architecture changes

Follow up to #2554
See discussion in huggingface/transformers#38627

To quote:

> transformers PR #37033 re-arranges the way visual language models are
built by moving the LM head from the language model to the top-level
VLM (among other things).

A consequence of this is that the keys in the PEFT state_dict now also
follow the new architecture. This means that:

1. If a PEFT checkpoint was saved with the old architecture but is
   loaded with the new architecture, loading fails.
2. If a PEFT checkpoint was saved with the new architecture but is
   loaded with the old architecture, loading fails.

1. can be addressed by making use of the newly added
_checkpoint_conversion_mapping attribute for models with the new
architecture. In transformers, this is used to map old model state_dicts
to the new state_dict format. In PEFT, with some fiddling, we can use
the same mapping to make old PEFT state_dicts compatible with the new
architecture (backwards compatibility).

However, 2. is not easily addressed. We would need a reverse mapping for
this. This could be easily derived from _checkpoint_conversion_mapping,
but since this attribute doesn't exist on old models, we cannot do that.
Therefore, new checkpoints created with PEFT on these models won't load
successfully when users use old transformers (forward compatibility).

These cases are covered by the added unit tests, which means that the
test covering case 2 is marked as xfail.

If we could reliably detect that we are in case 2, we could warn the
user and advise them to upgrade transformers, but I don't know if it's
possible to figure this out.

We also allow users to pass their own key_mapping to from_pretrained and
load_adapter, though the documentation advises against it. This argument
could theoretically be used as a workaround in case there is indeed an
issue with prompt learning state_dicts.

Apart from these changes, I also made a small change to account for
huggingface/transformers#38017 (comment).
yao-matrix pushed a commit to yao-matrix/peft that referenced this pull request Jun 25, 2025
FIX Transformers VLM architecture changes

efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025
efraimdahl pushed a commit to efraimdahl/peft that referenced this pull request Jul 12, 2025
FIX Transformers VLM architecture changes
