Merged
Commits
143 commits
4c65977
Add return_dict to get_text_features methods to allow returning 'Base…
tomaarsen Dec 2, 2025
47c2418
Add return_dict to get_image_features methods to allow returning 'Bas…
tomaarsen Dec 2, 2025
b6d6df3
make fixup
tomaarsen Dec 2, 2025
aa51419
Ignore discrepancies for pooler_output, focus on last_hidden_state
tomaarsen Dec 4, 2025
278b068
Update get_image_features for the missing architectures
tomaarsen Dec 12, 2025
3b14045
Update all get_audio_features
tomaarsen Dec 12, 2025
b7e0d66
Update get_video_features, except instructblipvideo
tomaarsen Dec 12, 2025
41bcca8
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Dec 12, 2025
7eb89b6
Run ruff formatting
tomaarsen Dec 12, 2025
57af63d
Patch Glm4v VisionModel forward with BaseModelOutputWithPooling
tomaarsen Dec 12, 2025
7285187
Patch instructblip, although backwards incompatibility stands
tomaarsen Dec 12, 2025
fd7be52
Patch Kosmos2 and Ovis2
tomaarsen Dec 12, 2025
3f183fd
Reformat Ovis2
tomaarsen Dec 12, 2025
391aac9
Avoid now-deprecated return_attentions
tomaarsen Dec 12, 2025
f8c887f
Remove NumFrames
tomaarsen Dec 16, 2025
9a251ce
Proposal to simplify get_..._features via TransformersKwargs & check_…
tomaarsen Dec 16, 2025
858d9d4
Revert check_model_inputs, adopt can_return_tuple, accept BC on get_.…
tomaarsen Dec 16, 2025
2a64303
Fix typo: can_return_dict -> can_return_tuple
tomaarsen Dec 16, 2025
fc8ee93
Adopt can_return_tuple for many get_image_features
tomaarsen Dec 16, 2025
00aa0f5
Update all get_audio_features, some edge cases handled (e.g. gemma3n)
tomaarsen Dec 16, 2025
1ccbf5a
Update most get_video_features, some edge case remain, e.g. instruct…
tomaarsen Dec 16, 2025
78fa904
Patch Fuyu, just return BaseModelOutputWithPooling without pooler
tomaarsen Dec 16, 2025
f082a8e
Introduce ModelOutput subclass for Chameleon, patch get_image_features
tomaarsen Dec 16, 2025
9ddd3b4
Update modeling files with new output formats for get_..._features
tomaarsen Dec 17, 2025
006b2a5
Update fast_vlm modeling forward from modular llava to remove image_s…
tomaarsen Dec 17, 2025
afd5e64
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Dec 18, 2025
1d6639b
Update colqwen2 its self.vlm.model.visual call to expect BaseModelOutput
tomaarsen Dec 18, 2025
d52def3
Replace prior return_dict with check_model_inputs on qwen2_5_vl its V…
tomaarsen Dec 18, 2025
ff67663
Use BaseModelOutputWithProjectionAttentions for Kosmos2 to allow retu…
tomaarsen Dec 18, 2025
22522c4
Update Emu akin to Chameleon
tomaarsen Dec 18, 2025
37a53c3
Update the blip architectures with a naive fix
tomaarsen Dec 18, 2025
440914b
Convert remaining modulars (emu3, janus), patch emu3
tomaarsen Dec 18, 2025
b6dbddd
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Dec 18, 2025
48353a5
Patch blip test
tomaarsen Dec 18, 2025
531321c
Update deepseek_vl using a new BaseModelOutputWithHighResVisionEncodings
tomaarsen Dec 18, 2025
70577d2
Remove 'copied' for blip_2, instructblip and kosmos2 as they required…
tomaarsen Dec 18, 2025
f6f90d6
Patch qwen3_vl and qwen3_vl_moe, where I used last_hidden_state inste…
tomaarsen Dec 18, 2025
7af0b66
Run repo-consistency
tomaarsen Dec 18, 2025
8db6370
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Dec 31, 2025
cbe007b
Use kwargs["output_hidden_states"] = True to hardcode output_hidden_s…
tomaarsen Dec 31, 2025
7c34c6e
Update new GlmAsr get_audio_features on ForConditionalGeneration
tomaarsen Dec 31, 2025
d9edd99
Run make style
tomaarsen Dec 31, 2025
763ddf6
Try to add _can_record_outputs to florence2
tomaarsen Dec 31, 2025
8420640
Override JanusVisionModel.forward to avoid bad q-former copy from Blip2
tomaarsen Dec 31, 2025
e0ea300
Import missing BaseModelOutput
tomaarsen Dec 31, 2025
78bd0d0
Pop deprecated 'return_attentions', setting 'return_dict' won't be us…
tomaarsen Dec 31, 2025
d348d93
Reintroduce kwargs filtering in llava etc. for safety re. image_sizes
tomaarsen Dec 31, 2025
71ea85a
Use BaseModelOutputWithPooling superclass consistently for custom get…
tomaarsen Dec 31, 2025
8c59e95
Update Blip-2 family and its BaseModelOutputWithVisionQformerOutputs
tomaarsen Jan 7, 2026
3fff252
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 7, 2026
3f4c34b
Update glm4v _can_record_outputs
tomaarsen Jan 7, 2026
b39b6d1
Remove check_model_inputs in granite_speech
tomaarsen Jan 7, 2026
af0ccb1
Run make style
tomaarsen Jan 7, 2026
f8e08d9
Add _can_record_outputs to Ovis2VisionModel
tomaarsen Jan 7, 2026
2d747d9
Update get_text_features/get_video_features from pe_video
tomaarsen Jan 7, 2026
008e15d
Update missing case on sam3
tomaarsen Jan 7, 2026
e92efb9
Update get_text_features type hints to Union[tuple, BaseModelOutputWi…
tomaarsen Jan 7, 2026
b06a2d2
Add _can_record_inputs to qwen2_5_omni and qwen2_5_vl
tomaarsen Jan 7, 2026
4a573af
Update get_image_features and get_video_features on ernie4_5_vl_moe
tomaarsen Jan 7, 2026
2c677f9
Update get_image_features type hints to Union[tuple, BaseModelOutputW…
tomaarsen Jan 7, 2026
1a8d14b
Remove @auto_docstring from pe_video, it's seemingly not used on that…
tomaarsen Jan 7, 2026
87d22d3
Update get_video_features type hints to Union[tuple, BaseModelOutputW…
tomaarsen Jan 7, 2026
8d5802e
Fix pe_video import issue
tomaarsen Jan 7, 2026
a9ff924
Update forward, test, and docstring for sam3
tomaarsen Jan 7, 2026
8ad35e7
Update get_audio_features type hints to Union[tuple, BaseModelOutputW…
tomaarsen Jan 7, 2026
7c99867
Add simple test case for get_text_features
tomaarsen Jan 8, 2026
35feb85
First attempt to get get_image_features under test, still 26 failures
tomaarsen Jan 8, 2026
a64634b
Resolve several test failures, progress still slow and inconsistent
tomaarsen Jan 9, 2026
b5b334f
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 12, 2026
5ad8ca5
Split up get_..._features tests more, should be simpler to disable/cu…
tomaarsen Jan 12, 2026
0284715
Fix emu3 tests, also track non-temporal ResNet in hidden_states
tomaarsen Jan 12, 2026
be41c04
Patch chameleon, emu3, ernie4_5, janus
tomaarsen Jan 12, 2026
2743053
Skip output_attentions for FastVLM, timm doesn't accept it
tomaarsen Jan 12, 2026
76371d8
Patch groupvit, instructblip, ovis2
tomaarsen Jan 12, 2026
88a5804
Patch paddleocr_vl, qwen2_5_omni, qwen2_5_vl, qwen2_vl, and skip test…
tomaarsen Jan 12, 2026
13875af
Patch qwen3_omni_moe, sam family, edgetam
tomaarsen Jan 12, 2026
e480bc0
Kill now unused BaseModelOutputWithFeatureMaps
tomaarsen Jan 12, 2026
2bd9a49
Remove left-over return_dict from prior attempt
tomaarsen Jan 12, 2026
5455038
Allow for output_hidden_states in theory, but skip impossible tests
tomaarsen Jan 12, 2026
3f75c03
Introduce tests for get_audio_features, fixed all architectures
tomaarsen Jan 12, 2026
5e7d821
Introduce tests for get_video_features, only ernie4_5_vl_moe is failing
tomaarsen Jan 12, 2026
1b8ab38
Call post_init on GraniteSpeechCTCEncoder, which was given a PreTrain…
tomaarsen Jan 12, 2026
3467798
Update llava_onevision test suite, only create video pixel_values in …
tomaarsen Jan 13, 2026
6f23bf5
Create custom video input for ernie4_5_vl_moe
tomaarsen Jan 13, 2026
a8e5f92
Skip CLIP family tests; they don't support output_hidden_states/outpu…
tomaarsen Jan 13, 2026
508955e
Breaking: update Blip2Model.get_text_features to no longer output logits
tomaarsen Jan 13, 2026
df4d751
Satisfy test_num_layers_is_small test for align
tomaarsen Jan 13, 2026
1254b29
Test against last_hidden_state against batch_size and hidden_size
tomaarsen Jan 13, 2026
c8b712f
Skip last_hidden_state shape tests for unusual cases
tomaarsen Jan 14, 2026
d6f0fb9
Update docstrings via auto_docstring for all get_..._features methods
tomaarsen Jan 14, 2026
51638d6
Ensure all auto_doc arguments are documented
tomaarsen Jan 14, 2026
af3b70f
Remove redundant docstrings
tomaarsen Jan 14, 2026
4d522c7
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 14, 2026
3564045
Also patch the new glm_image for get_image_features/output_hidden_states
tomaarsen Jan 14, 2026
f7100d3
Update modular files as per check_docstring rules ...
tomaarsen Jan 14, 2026
a41491f
Update glm-image dates via fix-repo
tomaarsen Jan 14, 2026
de56122
FloatTensor -> LongTensor for image_tokens
tomaarsen Jan 15, 2026
d6fd917
Add simple last_hidden_state description, fix output typing of Gemma3…
tomaarsen Jan 15, 2026
7329ebc
Add missing `-> tuple | BaseModel...` on check_model_inputs
tomaarsen Jan 15, 2026
72a9ac9
Ensure forward typing with check_model_inputs is `-> tuple | BaseMode…
tomaarsen Jan 15, 2026
9b67014
Undo accidental rename of Ovis2VisionAttention
tomaarsen Jan 15, 2026
cd88179
Fix incorrect type hints for blip family
tomaarsen Jan 15, 2026
b58f3c5
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 15, 2026
e747669
Patch get_image_features for lighton_ocr
tomaarsen Jan 15, 2026
95a55ad
Explicitly use Ovis2VisionAttention in Ovis2VisionEncoderLayer in mod…
tomaarsen Jan 15, 2026
ef77832
Update use of get_image_features for lighton_ocr
tomaarsen Jan 15, 2026
194a1bd
Rerun python utils/add_dates.py
tomaarsen Jan 15, 2026
0ce7bac
Remove tie_last_hidden_states=False from check_model_inputs from ...
tomaarsen Jan 15, 2026
6604784
Revert accidental metaclip import change
tomaarsen Jan 19, 2026
0746344
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 19, 2026
ed5c136
Add missing return_dict=True in get_..._features methods
tomaarsen Jan 19, 2026
3f0c754
Add `output_hidden_states=True` in InternVL get_image_features
tomaarsen Jan 19, 2026
061527d
Add missing docstring for llava_next_video get_video_features
tomaarsen Jan 19, 2026
af776e9
Quick clean-up in _video_features_prepare_config_and_inputs test helper
tomaarsen Jan 19, 2026
125a49d
model.set_attn_implementation instead of config._attn_implementation
tomaarsen Jan 19, 2026
71f9f76
Add simple docstring to some helper methods re. inputs.
tomaarsen Jan 19, 2026
c69c4c5
Explain why get_..._features test inputs are overridden
tomaarsen Jan 19, 2026
72891b9
Undo incorrect return_dict=True change in deepseek_vl_hybrid
tomaarsen Jan 19, 2026
0d61f66
Revert accidental metaclip import change
tomaarsen Jan 19, 2026
fa32eff
Adopt **vision_outputs in instructblip, but mess remains
tomaarsen Jan 19, 2026
1a381aa
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 22, 2026
a1e6767
Avoid kwargs["output_hidden_states"] = True in get_..._features methods
tomaarsen Jan 22, 2026
d9001cc
Update check_model_inputs to default vision args based on config
tomaarsen Jan 22, 2026
0923216
Unrelated but important: patch set_attn_implementation for Windows
tomaarsen Jan 22, 2026
e3b774e
Revert output_hidden_states changes on InternVL
tomaarsen Jan 22, 2026
37a495c
Extend d9001cc (check_model_inputs); remove more vision_feature_layer…
tomaarsen Jan 22, 2026
bf9182d
Patch unusual bug: llava_next_video used self.vision_feature_layer
tomaarsen Jan 22, 2026
15c2a59
Add unused use_cache to TimmWrapperModel to patch FastVLM
tomaarsen Jan 22, 2026
92fe926
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 22, 2026
d860470
Update check_config_attributes to allow for vision attributes
tomaarsen Jan 22, 2026
45d2c33
Add tests for config.return_dict=False
tomaarsen Jan 22, 2026
5199c47
permute and quantize separately for the comment
tomaarsen Jan 22, 2026
9865895
Ditch shared custom_args for ernie4_5_vl_moe
tomaarsen Jan 22, 2026
276dcaa
Move Ernie4_5_VL_MoeVisionAttention next to VisionBlock
tomaarsen Jan 22, 2026
c804de4
Add missing "attentions" from Florence2 _can_record_outputs
tomaarsen Jan 22, 2026
72a1a09
Clarify kwargs.get("image_sizes") in modeling_llava
tomaarsen Jan 22, 2026
43ec4b3
Remove commented skip_test_image_features_output_shape in chameleon t…
tomaarsen Jan 22, 2026
4515b29
Add a migration guide under 'Library-wide changes with lesser impact'
tomaarsen Jan 22, 2026
cd4c0cb
Parameterize get_..._features tests with return_dict (True, False, N…
tomaarsen Jan 23, 2026
292ef3a
Add comment re. TimmWrapper _can_record_outputs
tomaarsen Jan 23, 2026
355bcb4
Shrink Gemma3nAudioEncoderModelOutput with auto_docstring & superclass
tomaarsen Jan 23, 2026
bf0ae70
Revert "Unrelated but important: patch set_attn_implementation for Wi…
tomaarsen Jan 23, 2026
d8e786f
Merge branch 'main' into feat/normalize_get_features_methods
tomaarsen Jan 23, 2026
35 changes: 33 additions & 2 deletions MIGRATION_GUIDE_V5.md
@@ -453,6 +453,37 @@ We dropped support for two torch APIs:

Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs `dynamo` and `export`.

### Feature extraction helpers: `get_*_features`

Many multi-modal models expose convenience methods such as `get_text_features`, `get_image_features`, `get_audio_features`, and `get_video_features` to run inference on a single modality without calling `model(**inputs)` directly.

Starting with v5, these 4 helper methods now return a `BaseModelOutputWithPooling` (or a subclass) instead of only a pooled embedding tensor:

- `last_hidden_state`: unpooled token/patch/frame embeddings for the requested modality.
- `pooler_output`: pooled representation (what most models previously returned from `get_*_features`).
- `hidden_states`: full hidden states for all layers when `output_hidden_states=True` is passed.
- `attentions`: attention maps when `output_attentions=True` is passed.

> [!IMPORTANT]
> There is **no single universal shape** for `last_hidden_state` or `pooler_output`. It's recommended to inspect a small forward pass before making assumptions about shapes or semantics.

If your code previously did something like this:

```python
text_embeddings = model.get_text_features(**inputs)
```

and you used `text_embeddings` as a tensor, you should now explicitly pass `return_dict=True` and take the `pooler_output` field from the returned `BaseModelOutputWithPooling`:

```python
outputs = model.get_text_features(**inputs, return_dict=True)
text_embeddings = outputs.pooler_output
```

This will match the previous behavior in the large majority of cases. If your model-specific implementation returned a tuple of results before, those values should now be accessible as fields on the corresponding `BaseModelOutputWithPooling` subclass.
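The relationship between the old and new return values can be sketched with a small self-contained example. Note that `FakeBaseModelOutputWithPooling` is a hypothetical stand-in for the real `BaseModelOutputWithPooling` (plain lists replace tensors here, purely for illustration):

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical stand-in for transformers.modeling_outputs.BaseModelOutputWithPooling,
# with plain lists instead of torch tensors (illustration only).
@dataclass
class FakeBaseModelOutputWithPooling:
    last_hidden_state: list                # unpooled token/patch embeddings
    pooler_output: Optional[list] = None   # pooled embedding (the old return value)
    hidden_states: Optional[tuple] = None  # per-layer states, if requested
    attentions: Optional[tuple] = None     # attention maps, if requested

# Pre-v5: get_text_features(**inputs) returned the pooled embedding directly.
legacy_embeddings = [0.1, 0.2, 0.3]

# v5: the same values now live on a structured output object.
outputs = FakeBaseModelOutputWithPooling(
    last_hidden_state=[[0.5, 0.6], [0.7, 0.8]],
    pooler_output=[0.1, 0.2, 0.3],
)

# Migration: read the pooled embedding from the `pooler_output` field.
text_embeddings = outputs.pooler_output
assert text_embeddings == legacy_embeddings
```

The real output class behaves the same way for attribute access; the optional fields stay `None` unless the corresponding `output_hidden_states=True` / `output_attentions=True` flags are passed.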

Linked PR: https://github.com/huggingface/transformers/pull/42564

## Quantization changes

We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted
@@ -558,7 +589,7 @@ Linked PRs:
- `use_mps_device` -> mps will be used by default if detected
- `fp16_backend` and `half_precision_backend` -> we will only rely on torch.amp as everything has been upstream to torch
- `no_cuda` -> `use_cpu`
- `include_tokens_per_second` -> `include_num_input_tokens_seen`
- `use_legacy_prediction_loop` -> we only use `evaluation_loop` function from now on

### Removing deprecated arguments in `Trainer`
@@ -574,7 +605,7 @@

### New defaults for `Trainer`

- `use_cache` in the model config will be set to `False`. You can still change the cache value through `TrainingArguments` `use_cache` argument if needed.

## Pipelines

2 changes: 2 additions & 0 deletions docs/source/en/model_doc/aimv2.md
Contributor: We have the updated signatures but no documents in general on the usage or general tips. This will become a hidden feature this way, we should promote this (we already have way too much undocumented features 😭)

Member Author: Agreed, I'll try to see if auto_docstring has a default intro that can be partially reused for the get_..._features methods.

Contributor: Can we also add another file just for general guidance on this - autodocstring is one thing but just a general introduction would be nice as well. Doesn't have to big but giving a gist and detail what embeddings are expected etc

@@ -89,6 +89,8 @@ probs = outputs.logits_per_image.softmax(dim=-1)

[[autodoc]] Aimv2Model
- forward
- get_text_features
- get_image_features

## Aimv2VisionModel

1 change: 1 addition & 0 deletions docs/source/en/model_doc/aria.md
@@ -175,3 +175,4 @@ print(response)

[[autodoc]] AriaForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/audioflamingo3.md
@@ -401,3 +401,4 @@ are forwarded, so you can tweak padding or tensor formats just like when calling

[[autodoc]] AudioFlamingo3ForConditionalGeneration
- forward
- get_audio_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/aya_vision.md
@@ -274,3 +274,4 @@ print(processor.tokenizer.decode(generated[0], skip_special_tokens=True))

[[autodoc]] AyaVisionForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/blip-2.md
@@ -97,6 +97,7 @@ If you're interested in submitting a resource to be included here, please feel f
[[autodoc]] Blip2ForConditionalGeneration
- forward
- generate
- get_image_features

## Blip2ForImageTextRetrieval

2 changes: 2 additions & 0 deletions docs/source/en/model_doc/chameleon.md
@@ -203,8 +203,10 @@ model = ChameleonForConditionalGeneration.from_pretrained(

[[autodoc]] ChameleonModel
- forward
- get_image_features

## ChameleonForConditionalGeneration

[[autodoc]] ChameleonForConditionalGeneration
- forward
- get_image_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/cohere2_vision.md
@@ -125,11 +125,13 @@ print(outputs)

[[autodoc]] Cohere2VisionForConditionalGeneration
- forward
- get_image_features

## Cohere2VisionModel

[[autodoc]] Cohere2VisionModel
- forward
- get_image_features

## Cohere2VisionImageProcessorFast

1 change: 1 addition & 0 deletions docs/source/en/model_doc/deepseek_vl.md
@@ -223,6 +223,7 @@ model = DeepseekVLForConditionalGeneration.from_pretrained(

[[autodoc]] DeepseekVLModel
- forward
- get_image_features

## DeepseekVLForConditionalGeneration

1 change: 1 addition & 0 deletions docs/source/en/model_doc/deepseek_vl_hybrid.md
@@ -222,6 +222,7 @@ model = DeepseekVLHybridForConditionalGeneration.from_pretrained(

[[autodoc]] DeepseekVLHybridModel
- forward
- get_image_features

## DeepseekVLHybridForConditionalGeneration

1 change: 1 addition & 0 deletions docs/source/en/model_doc/edgetam.md
@@ -330,3 +330,4 @@ EdgeTAM can use masks from previous predictions as input to refine segmentation:

[[autodoc]] EdgeTamModel
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/edgetam_video.md
@@ -294,3 +294,4 @@ Tracked 2 objects through 200 frames

[[autodoc]] EdgeTamVideoModel
- forward
- get_image_features
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/ernie4_5_vl_moe.md
@@ -222,8 +222,12 @@ print(output_text)

[[autodoc]] Ernie4_5_VL_MoeModel
- forward
- get_video_features
- get_image_features

## Ernie4_5_VL_MoeForConditionalGeneration

[[autodoc]] Ernie4_5_VL_MoeForConditionalGeneration
- forward
- get_video_features
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/fast_vlm.md
@@ -171,3 +171,4 @@ Flash Attention 2 is an even faster, optimized version of the previous optimizat

[[autodoc]] FastVlmForConditionalGeneration
- forward
- get_image_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/florence2.md
@@ -177,11 +177,13 @@ print(parsed_answer)

[[autodoc]] Florence2Model
- forward
- get_image_features

## Florence2ForConditionalGeneration

[[autodoc]] Florence2ForConditionalGeneration
- forward
- get_image_features

## Florence2VisionBackbone

1 change: 1 addition & 0 deletions docs/source/en/model_doc/gemma3.md
@@ -271,6 +271,7 @@ visualizer("<img>What is shown in this image?")

[[autodoc]] Gemma3ForConditionalGeneration
- forward
- get_image_features

## Gemma3ForSequenceClassification

3 changes: 3 additions & 0 deletions docs/source/en/model_doc/gemma3n.md
@@ -188,6 +188,8 @@ echo -e "Plants create energy through a process known as" | transformers run --t

[[autodoc]] Gemma3nModel
- forward
- get_image_features
- get_audio_features

## Gemma3nForCausalLM

@@ -198,6 +200,7 @@ echo -e "Plants create energy through a process known as" | transformers run --t

[[autodoc]] Gemma3nForConditionalGeneration
- forward
- get_image_features

[altup]: https://proceedings.neurips.cc/paper_files/paper/2023/hash/f2059277ac6ce66e7e5543001afa8bb5-Abstract-Conference.html
[attention-mask-viz]: https://github.com/huggingface/transformers/blob/beb9b5b02246b9b7ee81ddf938f93f44cfeaad19/src/transformers/utils/attention_visualizer.py#L139
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/glm46v.md
@@ -78,8 +78,12 @@ This model was contributed by [Raushan Turganbay](https://huggingface.co/Raushan

[[autodoc]] Glm46VModel
- forward
- get_video_features
- get_image_features

## Glm46VForConditionalGeneration

[[autodoc]] Glm46VForConditionalGeneration
- forward
- get_video_features
- get_image_features
12 changes: 8 additions & 4 deletions docs/source/en/model_doc/glm4v.md
@@ -215,19 +215,23 @@ print(output_text)
## Glm4vVisionModel

[[autodoc]] Glm4vVisionModel
- forward

## Glm4vTextModel

[[autodoc]] Glm4vTextModel
- forward

## Glm4vModel

[[autodoc]] Glm4vModel
- forward
- get_video_features
- get_image_features

## Glm4vForConditionalGeneration

[[autodoc]] Glm4vForConditionalGeneration
- forward
- get_video_features
- get_image_features
4 changes: 4 additions & 0 deletions docs/source/en/model_doc/glm4v_moe.md
@@ -76,8 +76,12 @@ This model was contributed by [Raushan Turganbay](https://huggingface.co/Raushan

[[autodoc]] Glm4vMoeModel
- forward
- get_video_features
- get_image_features

## Glm4vMoeForConditionalGeneration

[[autodoc]] Glm4vMoeForConditionalGeneration
- forward
- get_video_features
- get_image_features
4 changes: 3 additions & 1 deletion docs/source/en/model_doc/glm_image.md
@@ -16,7 +16,7 @@ limitations under the License.
⚠️ Note that this file is in Markdown but contain specific syntax for our doc-builder (similar to MDX) that may not be rendered properly in your Markdown viewer.

-->
*This model was released on 2026-01-10 and added to Hugging Face Transformers on 2026-01-13.*

# GlmImage

@@ -199,8 +199,10 @@ print(f"Output tokens: {output_tokens}")

[[autodoc]] GlmImageModel
- forward
- get_image_features

## GlmImageForConditionalGeneration

[[autodoc]] GlmImageForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/got_ocr2.md
@@ -291,3 +291,4 @@ alt="drawing" width="600"/>

[[autodoc]] GotOcr2ForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/granite_speech.md
@@ -170,3 +170,4 @@ for i, transcription in enumerate(transcriptions):

[[autodoc]] GraniteSpeechForConditionalGeneration
- forward
- get_audio_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/idefics2.md
@@ -208,11 +208,13 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] Idefics2Model
- forward
- get_image_features

## Idefics2ForConditionalGeneration

[[autodoc]] Idefics2ForConditionalGeneration
- forward
- get_image_features

## Idefics2ImageProcessor

2 changes: 2 additions & 0 deletions docs/source/en/model_doc/idefics3.md
Original file line number Diff line number Diff line change
Expand Up @@ -70,11 +70,13 @@ This model was contributed by [amyeroberts](https://huggingface.co/amyeroberts)

[[autodoc]] Idefics3Model
- forward
- get_image_features

## Idefics3ForConditionalGeneration

[[autodoc]] Idefics3ForConditionalGeneration
- forward
- get_image_features

## Idefics3ImageProcessor

1 change: 1 addition & 0 deletions docs/source/en/model_doc/instructblip.md
@@ -78,3 +78,4 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipForConditionalGeneration
- forward
- generate
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/instructblipvideo.md
@@ -83,3 +83,4 @@ The attributes can be obtained from model config, as `model.config.num_query_tok
[[autodoc]] InstructBlipVideoForConditionalGeneration
- forward
- generate
- get_video_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/internvl.md
@@ -339,11 +339,13 @@ This example showcases how to handle a batch of chat conversations with interlea

[[autodoc]] InternVLModel
- forward
- get_image_features

## InternVLForConditionalGeneration

[[autodoc]] InternVLForConditionalGeneration
- forward
- get_image_features

## InternVLProcessor

1 change: 1 addition & 0 deletions docs/source/en/model_doc/janus.md
@@ -229,6 +229,7 @@ for i, image in enumerate(images['pixel_values']):

[[autodoc]] JanusModel
- forward
- get_image_features

## JanusForConditionalGeneration

1 change: 1 addition & 0 deletions docs/source/en/model_doc/kosmos-2.md
@@ -96,6 +96,7 @@ This model was contributed by [Yih-Dar SHIEH](https://huggingface.co/ydshieh). T

[[autodoc]] Kosmos2Model
- forward
- get_image_features

## Kosmos2ForConditionalGeneration

2 changes: 2 additions & 0 deletions docs/source/en/model_doc/lfm2_vl.md
@@ -92,8 +92,10 @@ processor.batch_decode(outputs, skip_special_tokens=True)[0]

[[autodoc]] Lfm2VlModel
- forward
- get_image_features

## Lfm2VlForConditionalGeneration

[[autodoc]] Lfm2VlForConditionalGeneration
- forward
- get_image_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/lighton_ocr.md
@@ -73,8 +73,10 @@ print(output_text)

[[autodoc]] LightOnOcrModel
- forward
- get_image_features

## LightOnOcrForConditionalGeneration

[[autodoc]] LightOnOcrForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/llama4.md
@@ -426,6 +426,7 @@ model = Llama4ForConditionalGeneration.from_pretrained(

[[autodoc]] Llama4ForConditionalGeneration
- forward
- get_image_features

## Llama4ForCausalLM

1 change: 1 addition & 0 deletions docs/source/en/model_doc/llava.md
@@ -260,3 +260,4 @@ A list of official Hugging Face and community (indicated by 🌎) resources to h

[[autodoc]] LlavaForConditionalGeneration
- forward
- get_image_features
1 change: 1 addition & 0 deletions docs/source/en/model_doc/llava_next.md
@@ -216,3 +216,4 @@ print(processor.decode(output[0], skip_special_tokens=True))

[[autodoc]] LlavaNextForConditionalGeneration
- forward
- get_image_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/llava_next_video.md
@@ -259,3 +259,5 @@ model = LlavaNextVideoForConditionalGeneration.from_pretrained(

[[autodoc]] LlavaNextVideoForConditionalGeneration
- forward
- get_image_features
- get_video_features
2 changes: 2 additions & 0 deletions docs/source/en/model_doc/llava_onevision.md
@@ -322,3 +322,5 @@ model = LlavaOnevisionForConditionalGeneration.from_pretrained(

[[autodoc]] LlavaOnevisionForConditionalGeneration
- forward
- get_image_features
- get_video_features