Address changes in transformers VLM architecture #2554
Conversation
[transformers PR #37033](huggingface/transformers#37033) re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things). This breaks the following test:

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that all soft-prompting methods need a task type, since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'`, as this would not deal with `labels` either. Luckily, the VLM behaves almost like an LM (e.g., `get_input_embeddings` refers to the underlying LM), so we can target the VLM itself; the soft prompt methods then need to detect whether we're fine-tuning a VLM so that the respective config variables are taken from `base_model.text_config` instead of `base_model` directly.
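To illustrate the `text_config` handling described above, here is a minimal sketch of the idea; the helper name is made up for this example, and the actual PEFT change may differ in its details:

```python
def _resolve_prompt_learning_config(model_config):
    # New-style VLM configs nest the language model settings (hidden_size,
    # num_hidden_layers, num_attention_heads, ...) under `text_config`, while
    # plain causal LM configs expose them at the top level.
    text_config = getattr(model_config, "text_config", None)
    return text_config if text_config is not None else model_config

# e.g. num_layers = _resolve_prompt_learning_config(base_model.config).num_hidden_layers
```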
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
BenjaminBossan left a comment:
Thanks for fixing this, the solution LGTM. Failing MacOS tests are unrelated.
> However, old code that uses `model.language_model` as a target will fail with `AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'`. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model, we'd need the VLM's model config (e.g., to check for the `text_config` key) or access to the VLM itself. We cannot do much here, sadly.
I agree with your conclusion here. My suggestion is to make a note that, in the next release text, we should mention this possible backwards incompatibility and provide an example of how to migrate the code.
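Such a migration example could look roughly like the following; the model class and checkpoint id are only placeholders for illustration:

```python
from peft import PrefixTuningConfig, get_peft_model
from transformers import LlavaForConditionalGeneration

model = LlavaForConditionalGeneration.from_pretrained("llava-hf/llava-1.5-7b-hf")
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)

# Before: wrap only the inner language model. With the re-arranged architecture,
# model.language_model is a bare LlamaModel, so this now fails with
# AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'.
# model.language_model = get_peft_model(model.language_model, peft_config)

# After: wrap the top-level VLM itself; the soft prompt config values are then
# taken from the nested text_config, as described in this PR.
model = get_peft_model(model, peft_config)
```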
FIX Transformers VLM architecture changes

Follow-up to #2554. See the discussion in huggingface/transformers#38627. To quote:

> transformers PR #37033 re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things).

A consequence of this is that the keys in the PEFT state_dict now also follow the new architecture. This means that:

1. If a PEFT checkpoint was saved with the old architecture but is loaded with the new architecture, loading fails.
2. If a PEFT checkpoint was saved with the new architecture but is loaded with the old architecture, loading fails.

Case 1 can be addressed by making use of the newly added `_checkpoint_conversion_mapping` attribute on models with the new architecture. In transformers, it is used to map old model state_dicts to the new state_dict format. In PEFT, with some fiddling, we can use the same mapping to make old PEFT state_dicts compatible with the new architecture (backwards compatibility).

Case 2, however, is not easily addressed. We would need a reverse mapping, which could easily be derived from `_checkpoint_conversion_mapping`, but since this attribute doesn't exist on old models, we cannot do that. Therefore, new checkpoints created with PEFT on these models won't load successfully when users use old transformers (forward compatibility). These cases are covered by the added unit tests, and the test covering case 2 is marked as xfail. If we could reliably detect that we are in case 2, we could warn the user and advise them to upgrade transformers, but I don't know if it's possible to figure this out.

Note that we skip prompt learning methods when applying the mapping. This is because they don't have the `"base_model.model."` prefix, which we need to remove before mapping; they just use `"base_model."`. We could remove only `"base_model."`, but the subsequent sub-module could also be called `model`, resulting in what looks like `"base_model.model."`. To avoid this confusion, we skip prefix tuning. Since it should be applied to the language model part directly and applies itself to the outer model (unlike LoRA et al.), skipping should be fine.

We also allow users to pass their own `key_mapping` to `from_pretrained` and `load_adapter`, though the documentation advises against it. This argument could theoretically be used as a workaround in case there is indeed an issue with prompt learning state_dicts.

Apart from these changes, I also made a small change to account for huggingface/transformers#38017 (comment).
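A rough sketch of the backwards-compatibility remapping described above, assuming the regex-pattern-to-replacement dict format that transformers uses for `_checkpoint_conversion_mapping`; the helper name and the prefix handling are simplified and not the actual PEFT implementation:

```python
import re

PEFT_PREFIX = "base_model.model."

def remap_old_peft_state_dict(peft_state_dict, base_model):
    # For Llava-style models the mapping looks roughly like
    # {"^language_model.model": "model.language_model", ...} (regex -> replacement).
    mapping = getattr(base_model, "_checkpoint_conversion_mapping", None) or {}
    remapped = {}
    for key, value in peft_state_dict.items():
        if key.startswith(PEFT_PREFIX):
            # Strip the PEFT wrapper prefix, apply the transformers key mapping,
            # then re-add the prefix so the keys match the new module layout.
            inner = key[len(PEFT_PREFIX):]
            for pattern, replacement in mapping.items():
                inner = re.sub(pattern, replacement, inner)
            key = PEFT_PREFIX + inner
        remapped[key] = value
    return remapped
```

A `key_mapping` dict passed explicitly to `from_pretrained` or `load_adapter` would play roughly the same role as the mapping looked up here.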
transformers PR #37033 re-arranges the way visual language models are built by moving the LM head from the language model to the top-level VLM (among other things).

This breaks the following code (from `test_vision_models`):

```python
peft_config = PrefixTuningConfig(task_type="CAUSAL_LM", num_virtual_tokens=20)
model.language_model = get_peft_model(model.language_model, peft_config)
```

The reason is that `model.language_model` is not a `*CausalLM` anymore but a base model (e.g., `LlamaModel`). PEFT assumes a task-specific model, and all soft-prompting methods need a task type since each task type has specific handling of the soft prompt (e.g., padding the labels according to the number of virtual tokens for causal LM). We also can't simply use `task_type='FEATURE_EXTRACTION'` as a workaround because this would not deal with `labels` either.

Luckily, the newly structured VLM behaves almost like an LM (e.g., `get_input_embeddings` just refers to the underlying LM), so we can target the VLM itself; the soft prompt methods then need to detect whether we're fine-tuning a VLM so that the respective config variables are taken from `base_model.text_config` instead of `base_model` directly.

Backward compatibility

Old models trained with PEFT should still work, since VLM models like Llava use the `lm_head` and `language_model` weights from the state dict and re-arrange them accordingly. However, old code that uses `model.language_model` as a target will fail with `AttributeError: 'LlamaModel' object has no attribute 'prepare_inputs_for_generation'`. I fear that we have no way of mitigating this since we don't have access to the 'top-level' model. The only thing we could do is raise a more helpful error message, but to reliably detect the case that we're targeting a VLM's language model, we'd need the VLM's model config (e.g., to check for the `text_config` key) or access to the VLM itself. We cannot do much here, sadly.
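Purely hypothetical, to make the last point concrete: if the surrounding VLM's config were available, the detection could look roughly like the sketch below; the problem is precisely that PEFT only sees the inner `LlamaModel` at that point, so a `vlm_config` argument does not exist in practice.

```python
def targets_vlm_language_model(vlm_config, wrapped_model) -> bool:
    # Hypothetical check: a nested text_config on the outer config marks a VLM,
    # and a wrapped model without a generation interface suggests the user passed
    # the inner language model (e.g., LlamaModel) instead of the VLM itself.
    is_vlm = getattr(vlm_config, "text_config", None) is not None
    lacks_generation = not hasattr(wrapped_model, "prepare_inputs_for_generation")
    return is_vlm and lacks_generation
```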