
Remove many output_attentions and other traced outputs on 100+ models #43590

Open
molbap wants to merge 176 commits into main from update_all_decorators

Conversation

@molbap
Contributor

@molbap molbap commented Jan 29, 2026

What does this PR do?

New model additions often follow old standards that don't use check_model_inputs or can_return_tuple; it's a frequent first review comment and something that can easily slip through. This PR does a wide scan to remove all remaining occurrences systematically.

Background

Every model used to manually resolve output_attentions, output_hidden_states, and return_dict in each forward, then collect intermediate outputs in a loop, then convert to tuple at the end. That's ~30 lines of boilerplate per model, reimplemented everywhere with subtle inconsistencies.

Two decorators now handle this:

  • @capture_outputs goes on the base model forward (the one with the layer loop). It reads output_attentions/output_hidden_states from kwargs or config, installs hooks on modules listed in _can_record_outputs, collects intermediate outputs automatically, injects them into the ModelOutput, and handles return_dict. The model just needs to declare which module classes produce which outputs (e.g. _can_record_outputs = {"hidden_states": DecoderLayer, "attentions": Attention}).

  • @can_return_tuple goes on wrapper forwards (ForCausalLM, ForSequenceClassification, VLM wrappers) that only need return_dict conversion. Wrapper models should not use @capture_outputs to avoid nested hook chains.
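As a rough illustration of the capture mechanism, here is a self-contained toy sketch (not the transformers implementation; the Layer class, its hook mechanics, and the plain-dict output are all invented for illustration):

```python
import functools

class Layer:
    """Toy decoder layer with a minimal post-forward hook mechanism."""
    def __init__(self, scale):
        self.scale = scale
        self._hooks = []

    def register_hook(self, fn):
        self._hooks.append(fn)
        return lambda: self._hooks.remove(fn)  # handle to remove the hook

    def __call__(self, x):
        out = [v * self.scale for v in x]
        for fn in self._hooks:
            fn(out)
        return out

def capture_outputs(forward):
    """Install hooks on modules declared in _can_record_outputs, collect
    their outputs, and inject them into the returned output dict."""
    @functools.wraps(forward)
    def wrapper(self, *args, **kwargs):
        want = kwargs.pop("output_hidden_states", False)
        recorded, removers = [], []
        if want:
            layer_cls = self._can_record_outputs["hidden_states"]
            for module in self.layers:
                if isinstance(module, layer_cls):
                    removers.append(module.register_hook(recorded.append))
        output = forward(self, *args, **kwargs)
        for remove in removers:
            remove()
        if want:
            output["hidden_states"] = tuple(recorded)
        return output
    return wrapper

class ToyModel:
    # Declare which module classes produce which outputs
    _can_record_outputs = {"hidden_states": Layer}

    def __init__(self):
        self.layers = [Layer(2), Layer(3)]

    @capture_outputs
    def forward(self, x, **kwargs):
        # The base forward is just the bare layer loop
        for layer in self.layers:
            x = layer(x)
        return {"last_hidden_state": x}

model = ToyModel()
out = model.forward([1.0], output_hidden_states=True)
```

The point of the design is visible even in the toy: the forward contains only the layer loop, while all collection logic lives in the decorator.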

What changes per model

  • output_attentions, output_hidden_states, return_dict dropped from forward signatures, replaced by **kwargs: Unpack[TransformersKwargs]
  • Explicit parameter resolution lines removed
  • Manual all_hidden_states += (hidden_states,) collection loops removed
  • Decoder layers return a single tensor instead of a tuple
  • Attention modules always return (attn_output, attn_weights) — the if not output_attentions: attn_weights = None guard is removed since hooks capture directly from the module output
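The return_dict conversion done by the wrapper decorator can be sketched the same way (a toy, not the library code; the ToyCausalLMOutput dataclass and the None-dropping behavior are assumptions for illustration):

```python
import functools
from dataclasses import dataclass, fields
from typing import Optional

def can_return_tuple(forward):
    """Convert a dataclass-style output to a tuple when return_dict=False."""
    @functools.wraps(forward)
    def wrapper(self, *args, **kwargs):
        return_dict = kwargs.pop("return_dict", True)
        output = forward(self, *args, **kwargs)
        if not return_dict:
            # Drop None fields when converting, a common to_tuple() convention
            return tuple(
                getattr(output, f.name)
                for f in fields(output)
                if getattr(output, f.name) is not None
            )
        return output
    return wrapper

@dataclass
class ToyCausalLMOutput:
    logits: list
    loss: Optional[float] = None

class ToyForCausalLM:
    @can_return_tuple
    def forward(self, input_values, **kwargs):
        return ToyCausalLMOutput(logits=[v + 1 for v in input_values])

model = ToyForCausalLM()
as_dataclass = model.forward([0.0])
as_tuple = model.forward([0.0], return_dict=False)
```

Because the wrapper only does this conversion, it composes cleanly with a base model that already captures outputs via hooks, without installing a second hook chain.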

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@vasqu
Contributor

vasqu commented Mar 5, 2026

run-slow: dinov3_convnext,dinov3_vit,zamba,layoutlmv2

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/dinov3_convnext", "models/dinov3_vit", "models/layoutlmv2", "models/zamba"]
quantizations: []

Contributor

@vasqu vasqu left a comment

Small self-review to point out some special things


# 2. Prepare encoder args and encoder kwargs from model kwargs and generation config.
- irrelevant_prefix = ["decoder_", "cross_attn", "use_cache"]
+ irrelevant_prefix = ["decoder_", "cross_attn", "use_cache", "past_key_values", "cache_params"]
Contributor

This is needed as encoders oftentimes share the same attention module with the decoder, meaning that if we pass the cache there, everything gets messy
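The filtering this diff extends can be sketched like so (the model_kwargs values here are placeholders, not real arguments):

```python
# Names follow the snippet above; values are placeholders for illustration.
irrelevant_prefix = ["decoder_", "cross_attn", "use_cache", "past_key_values", "cache_params"]

model_kwargs = {
    "attention_mask": "enc-mask",
    "decoder_input_ids": "dec-ids",
    "past_key_values": "cache",
    "use_cache": True,
}

# Keep only kwargs that do not start with an irrelevant prefix, so the cache
# (and other decoder-only arguments) never reaches the encoder forward.
encoder_kwargs = {
    k: v for k, v in model_kwargs.items()
    if not any(k.startswith(p) for p in irrelevant_prefix)
}
```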

Member

Humm, I don't get why we need this suddenly? Even encoder may want the cache with EncoderDecoder cache no?

Contributor

But the encoder module itself never wants a cache; it's only forwarded once and then "saved" for each subsequent step. Only the decoder needs the cache to properly overwrite states.

If the encoder also gets the cache, then it can update it as well, which breaks generate for certain methods (not sure which anymore, but CI was broken for a few tests on bart then).

Member

Ha yes, cause you made sure to propagate kwargs now as well! Makes sense then!

super().__init__()
self.config = config
self.layers = nn.ModuleList([Aimv2EncoderLayer(config) for _ in range(config.num_hidden_layers)])
self.gradient_checkpointing = False
Contributor

This was a wrong pattern, this is likely not needed in many cases - I tried to remove them whenever I found them

Member

@zucchini-nlp zucchini-nlp Mar 6, 2026

I don't know if this is used anywhere in training code, but I see that we set it to True after enabling GC. Could we split the deletion of GC (if not needed) into its own PR?

# Apply it on the top-level module in case the top-level module supports it,
# for example, LongT5Stack inherits from `PreTrainedModel`.
if hasattr(self, "gradient_checkpointing"):
    self._gradient_checkpointing_func = gradient_checkpointing_func
    self.gradient_checkpointing = enable
    is_gradient_checkpointing_set = True

Contributor

It's not needed because the GC layer handles this internally now

class GradientCheckpointingLayer(nn.Module):
    """Base class for layers with gradient checkpointing.

    This class enables gradient checkpointing functionality for a layer. By default, gradient checkpointing is disabled
    (`gradient_checkpointing = False`). When `model.set_gradient_checkpointing()` is called, gradient checkpointing is
    enabled by setting `gradient_checkpointing = True` and assigning a checkpointing function to `_gradient_checkpointing_func`.

    Important:
        When using gradient checkpointing with `use_reentrant=True`, inputs that require gradients (e.g. hidden states)
        must be passed as positional arguments (`*args`) rather than keyword arguments to properly propagate gradients.

        Example:

        ```python
        >>> # Correct - hidden_states passed as positional arg
        >>> out = self.layer(hidden_states, attention_mask=attention_mask)

        >>> # Incorrect - hidden_states passed as keyword arg
        >>> out = self.layer(hidden_states=hidden_states, attention_mask=attention_mask)
        ```
    """

    gradient_checkpointing = False

Agree tho that it probably makes more sense within its own PR, lemme revert these here and open a big PR for this instead

Comment on lines +425 to +437
class Dinov2Encoder(Dinov2PreTrainedModel):
    def __init__(self, config: Dinov2Config):
        super().__init__(config)
        self.layer = nn.ModuleList([Dinov2Layer(config) for _ in range(config.num_hidden_layers)])
        self.post_init()

    @merge_with_config_defaults
    @capture_outputs(tie_last_hidden_states=False)
    def forward(self, hidden_states: torch.Tensor, **kwargs: Unpack[TransformersKwargs]) -> BaseModelOutput:
        for layer_module in self.layer:
            hidden_states = layer_module(hidden_states)

        return BaseModelOutput(last_hidden_state=hidden_states)
Contributor

This is a new change to make this module the "collector": it allows us to have one entry point instead of duplicating efforts in the parent classes. We often had weird backbone structures, so some got a bit more of a refactor; see the conversion mapping for those that saw the most changes.

if output_hidden_states is None:
    output_hidden_states = self.config.output_hidden_states
kwargs["output_hidden_states"] = True  # required to extract layers for the stages
Contributor

Also, as a heads-up: I'm adding comments in the backbone utils, but this is properly wrapped thanks to the mixin

Comment on lines +169 to +176
@functools.wraps(forward_function)
def wrapper(self, *args, **kwargs):
    output_hidden_states = kwargs.get("output_hidden_states", getattr(self.config, "output_hidden_states", False))
    output = forward_function(self, *args, **kwargs)
    if not output_hidden_states:
        filtered_output_data = {k: v for k, v in output.items() if k not in ("hidden_states",)}
        output = type(output)(**filtered_output_data)
    return output
Contributor

This is the new wrapper to make backbones behave as expected: they always output hidden states, so we control this here.

output = type(output)(**filtered_output_data) is a bit weird, but it allows us to construct our modeling outputs properly, since there is no delete function and I don't think we want one
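The rebuild-instead-of-delete trick can be shown on a toy dataclass (ToyBackboneOutput is invented for the example; the real modeling outputs behave similarly):

```python
from dataclasses import dataclass, asdict
from typing import Optional

@dataclass
class ToyBackboneOutput:
    feature_maps: tuple
    hidden_states: Optional[tuple] = None

out = ToyBackboneOutput(feature_maps=(1, 2), hidden_states=(3, 4))

# There is no way to delete a field from a dataclass instance, so rebuild
# the output from its remaining fields instead.
filtered = {k: v for k, v in asdict(out).items() if k not in ("hidden_states",)}
out = type(out)(**filtered)
```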

Comment on lines +207 to +211
def __init_subclass__(cls, **kwargs):
    super().__init_subclass__(**kwargs)

    if "forward" in cls.__dict__:
        cls.forward = can_return_tuple(filter_output_hidden_states(cls.forward))
Contributor

return tuple to guarantee the proper dict type within the filter decorator
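The auto-wrapping pattern itself can be sketched with a toy decorator (log_calls and Backbone are invented for illustration; the transformers version wraps forward with the real decorators):

```python
def log_calls(fn):
    """Toy stand-in for the real wrapping decorators: counts invocations."""
    def wrapper(self, *args, **kwargs):
        self.calls = getattr(self, "calls", 0) + 1
        return fn(self, *args, **kwargs)
    return wrapper

class AutoWrapped:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Only wrap a forward defined on this subclass itself, not one
        # inherited from a parent, so wrapping never happens twice.
        if "forward" in cls.__dict__:
            cls.forward = log_calls(cls.forward)

class Backbone(AutoWrapped):
    def forward(self, x):
        return x * 2

backbone = Backbone()
result = backbone.forward(3)
```

The `"forward" in cls.__dict__` check is what makes the later suggestion (explicit per-model decorators) a drop-in alternative: either way, each forward is wrapped exactly once.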

Comment on lines +71 to +76
"dinov3_convnext": [WeightRenaming(r"(?<!model\.)stages", r"model.stages")],
"dinov3_vit": [WeightRenaming(r"layer_scale", r"scale"), WeightRenaming(r"(?<!model\.)layer", r"model.layer")],
"zamba": [
    WeightRenaming(r"layers.(\d+).mamba(?!_decoder)", r"layers.\1.mamba_decoder.mamba"),
    WeightRenaming(r"layers.(\d+).input_layernorm", r"layers.\1.mamba_decoder.input_layernorm"),
],
Contributor

These are the "special" models
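The rename patterns above can be exercised with plain `re` (the checkpoint key names below are illustrative, not taken from real checkpoints):

```python
import re

# dinov3_convnext: prefix `stages` with `model.` only when not already prefixed,
# via a negative lookbehind.
convnext_pat, convnext_repl = r"(?<!model\.)stages", r"model.stages"

# zamba: nest `mamba` under `mamba_decoder`, with a negative lookahead so
# already-converted keys are left alone.
zamba_pat, zamba_repl = r"layers.(\d+).mamba(?!_decoder)", r"layers.\1.mamba_decoder.mamba"

renamed = re.sub(convnext_pat, convnext_repl, "stages.0.weight")
already = re.sub(convnext_pat, convnext_repl, "model.stages.0.weight")
zamba = re.sub(zamba_pat, zamba_repl, "model.layers.3.mamba.in_proj.weight")
```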

Member

let's test all with slow CI before merging

Contributor

I was one step ahead of you #43590 (comment) 😂

Comment on lines +129 to +132
"Kosmos2TextTransformer",
"Kosmos2VisionTransformer",
"Kosmos2_5TextTransformer",
"XCLIPVisionTransformer",
Contributor

Same-ish issue as CLIP and similar models, which do (at least) one wrapper too many

@github-actions
Contributor

github-actions bot commented Mar 5, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 1287530f workflow commit (merge commit)
PR 433f8170 branch commit (from PR)
main e498b5bd base commit (on main)

✅ No failing test specific to this PR 🎉 👏 !

Member

@zucchini-nlp zucchini-nlp left a comment

Huge PR! Can we trigger slow tests, since some models were refactored along the way?

Also it would be great to check whether the GC attribute is needed for some BC behavior, because it's used in modeling files and the "source-of-truth" Llama also has the attribute

Comment on lines -938 to 897
if token_type_ids is None:
    if hasattr(self.embeddings, "token_type_ids"):
        buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
        buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
        token_type_ids = buffered_token_type_ids_expanded
    else:
        token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

# We can provide a self-attention mask of dimensions [batch_size, from_seq_length, to_seq_length]
Member

why is this deleted?

Contributor

Bad pattern from previous models, the embedding module already handles this

# Setting the token_type_ids to the registered buffer in constructor where it is all zeros, which usually occurs
# when its auto-generated, registered buffer helps users when tracing the model without passing token_type_ids, solves
# issue #5664
if token_type_ids is None:
    if hasattr(self, "token_type_ids"):
        buffered_token_type_ids = self.token_type_ids[:, :seq_length]
        buffered_token_type_ids_expanded = buffered_token_type_ids.expand(input_shape[0], seq_length)
        token_type_ids = buffered_token_type_ids_expanded
    else:
        token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=self.position_ids.device)

Comment on lines -988 to -995
if token_type_ids is None:
    if hasattr(self.embeddings, "token_type_ids"):
        buffered_token_type_ids = self.embeddings.token_type_ids[:, :seq_length]
        buffered_token_type_ids_expanded = buffered_token_type_ids.expand(batch_size, seq_length)
        token_type_ids = buffered_token_type_ids_expanded
    else:
        token_type_ids = torch.zeros(input_shape, dtype=torch.long, device=device)

Member

same here 👀

oh actually, let's also run slow CI with important model list before merging

Contributor

So the qwens etc? Or any preference on models (outside the ones I already checked on that one run-slow)

Member

I'd prefer models that were refactored here, plus a couple of multimodal models and backbones; those always return hidden states and re-use them further

Comment on lines -366 to -382
) -> tuple[torch.FloatTensor, tuple[torch.FloatTensor, torch.FloatTensor] | None]:
    """
    Args:
        hidden_states (`torch.FloatTensor`): input to the layer of shape `(batch, seq_len, embed_dim)`
        attention_mask (`torch.FloatTensor`): attention mask of size
            `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
        encoder_hidden_states (`torch.FloatTensor`):
            cross attention input to the layer of shape `(batch, seq_len, embed_dim)`
        encoder_attention_mask (`torch.FloatTensor`): encoder attention mask of size
            `(batch, 1, tgt_len, src_len)` where padding elements are indicated by very large negative values.
        past_key_values (`Cache`): cached past key and value projection states
        output_attentions (`bool`, *optional*):
            Whether or not to return the attentions tensors of all attention layers. See `attentions` under
            returned tensors for more detail.
        cache_position (`torch.LongTensor` of shape `(sequence_length)`, *optional*):
            Indices depicting the position of the input sequence tokens in the sequence. It is used to update the
            cache in the correct position and to infer the complete sequence length.
Member

@auto_docstring missing, I believe

Contributor

The parent modules already have proper auto docstrings; it doesn't really make sense to have docs on the layers (which mostly have the same signature), as they are not really user-facing

context_layer = torch.transpose(context_layer, 1, 2)

# this is just for visualizing; forward pass doesn't depend on following code
if output_attentions:
Member

too many lines changed, let's slow test this model

else:
self.mlp = DINOv3ViTMLP(config)
self.layer_scale2 = DINOv3ViTLayerScale(config)
self.scale2 = DINOv3ViTLayerScale(config)
Member

are we completely refactoring the model, I see layer names changed! Let's slow test as well

init.zeros_(module.position_embeddings)


class DPTViTEncoder(nn.Module):
Member

also to slow test

Comment on lines +118 to +127
elif (
    hasattr(module, "_get_pos_embed_values")
    and hasattr(module, "feat_shape")
    and module.feat_shape is not None
):
    module.pos_embed = module._get_pos_embed_values(
        feat_shape=module.feat_shape,
        device=module.pos_embed.device if module.pos_embed is not None else None,
        dtype=module.pos_embed.dtype if module.pos_embed is not None else torch.float32,
    )
Member

is it not supposed to be covered on timm-side when calling init_non_persistent_buffers? Prob we should ask for a fix from timm team

Contributor

Not sure tbh, this was already there when I took over 👀


@vasqu
Contributor

vasqu commented Mar 6, 2026

run-slow: aimv2,align,altclip,apertus,aria,audio_spectrogram_transformer,audioflamingo3,autoformer,aya_vision,bamba,bart,bert,bert_generation,big_bird,bigbird_pegasus,biogpt,blenderbot,blenderbot_small,blip,bridgetower,bros,camembert,chameleon,chinese_clip,clap,clip,clipseg,clvp,cohere,cohere2,cohere2_vision,colpali,convbert,convnext,convnextv2,data2vec,decision_transformer,deit,dia,dinov2,dinov2_with_registers,dinov3_convnext,dinov3_vit,dpt,electra,eomt_dinov3,ernie,ernie4_5_vl_moe,falcon_h1,fast_vlm,florence2,fuyu,gemma3n,git,glm46v,glm4v,glm4v_moe,glm_image,glm_ocr,got_ocr2,gpt_bigcode,gpt_neox,granite,groupvit,idefics,idefics2,idefics3,ijepa,informer,instructblipvideo,internvl,janus,kosmos2,kosmos2_5,layoutlm,layoutlmv2,layoutlmv3,lightglue,lighton_ocr,llava,llava_next,llava_next_video,llava_onevision,lw_detr,m2m_100,marian,markuplm,mbart,metaclip_2,mistral3,mobilebert,musicgen,musicgen_melody,nemotron,nemotron_h,opt,ovis2,owlv2,owlvit,paddleocr_vl,paligemma,pegasus,pegasus_x,perception_lm,persimmon,phi,phi4_multimodal,pixio,pixtral,plbart,pp_doclayout_v2,prompt_depth_anything,qwen2_5_omni,qwen2_5_vl,qwen2_audio,qwen2_vl,qwen3_5,qwen3_5_moe,qwen3_omni_moe,qwen3_vl,qwen3_vl_moe,roberta,roberta_prelayernorm,roc_bert,sam,siglip,siglip2,smolvlm,speech_to_text,splinter,stablelm,time_series_transformer,timesfm,timesfm2_5,timm_wrapper,video_llama_3,video_llava,videomae,vipllava,vit,vit_mae,vit_msn,vitpose_backbone,vivit,vjepa2,voxtral,voxtral_realtime,whisper,x_clip,xglm,xlm_roberta,xlm_roberta_xl,xlstm,xmod,yolos,zamba,zamba2

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

Workflow Run ⚙️

This comment contains run-slow, running the specified jobs:

models: ["models/aimv2", "models/align", "models/altclip", "models/apertus", "models/aria", "models/audio_spectrogram_transformer", "models/audioflamingo3", "models/autoformer", "models/aya_vision", "models/bamba", "models/bart", "models/bert", "models/bert_generation", "models/big_bird", "models/bigbird_pegasus", "models/biogpt", "models/blenderbot", "models/blenderbot_small", "models/blip", "models/bridgetower", "models/bros", "models/camembert", "models/chameleon", "models/chinese_clip", "models/clap", "models/clip", "models/clipseg", "models/clvp", "models/cohere", "models/cohere2", "models/cohere2_vision", "models/colpali", "models/convbert", "models/convnext", "models/convnextv2", "models/data2vec", "models/decision_transformer", "models/deit", "models/dia", "models/dinov2", "models/dinov2_with_registers", "models/dinov3_convnext", "models/dinov3_vit", "models/dpt", "models/electra", "models/eomt_dinov3", "models/ernie", "models/ernie4_5_vl_moe", "models/falcon_h1", "models/fast_vlm", "models/florence2", "models/fuyu", "models/gemma3n", "models/git", "models/glm46v", "models/glm4v", "models/glm4v_moe", "models/glm_image", "models/glm_ocr", "models/got_ocr2", "models/gpt_bigcode", "models/gpt_neox", "models/granite", "models/groupvit", "models/idefics", "models/idefics2", "models/idefics3", "models/ijepa", "models/informer", "models/instructblipvideo", "models/internvl", "models/janus", "models/kosmos2", "models/kosmos2_5", "models/layoutlm", "models/layoutlmv2", "models/layoutlmv3", "models/lightglue", "models/lighton_ocr", "models/llava", "models/llava_next", "models/llava_next_video", "models/llava_onevision", "models/lw_detr", "models/m2m_100", "models/marian", "models/markuplm", "models/mbart", "models/metaclip_2", "models/mistral3", "models/mobilebert", "models/musicgen", "models/musicgen_melody", "models/nemotron", "models/nemotron_h", "models/opt", "models/ovis2", "models/owlv2", "models/owlvit", "models/paddleocr_vl", "models/paligemma", 
"models/pegasus", "models/pegasus_x", "models/perception_lm", "models/persimmon", "models/phi", "models/phi4_multimodal", "models/pixio", "models/pixtral", "models/plbart", "models/pp_doclayout_v2", "models/prompt_depth_anything", "models/qwen2_5_omni", "models/qwen2_5_vl", "models/qwen2_audio", "models/qwen2_vl", "models/qwen3_5", "models/qwen3_5_moe", "models/qwen3_omni_moe", "models/qwen3_vl", "models/qwen3_vl_moe", "models/roberta", "models/roberta_prelayernorm", "models/roc_bert", "models/sam", "models/siglip", "models/siglip2", "models/smolvlm", "models/speech_to_text", "models/splinter", "models/stablelm", "models/time_series_transformer", "models/timesfm", "models/timesfm2_5", "models/timm_wrapper", "models/video_llama_3", "models/video_llava", "models/videomae", "models/vipllava", "models/vit", "models/vit_mae", "models/vit_msn", "models/vitpose_backbone", "models/vivit", "models/vjepa2", "models/voxtral", "models/voxtral_realtime", "models/whisper", "models/x_clip", "models/xglm", "models/xlm_roberta", "models/xlm_roberta_xl", "models/xlstm", "models/xmod", "models/yolos", "models/zamba", "models/zamba2"]
quantizations: []

Member

@Cyrilvallez Cyrilvallez left a comment

Ok, just reviewed the most critical parts, but in general I don't think we should need any structure change/conversion! It's only supposed to make code more readable, let's not overcomplicate IMO by adding conversions that should not be needed!



Comment on lines +207 to +210
def __init_subclass__(cls, **kwargs):
    super().__init_subclass__(**kwargs)

    if "forward" in cls.__dict__:
Member

Why not put it on individual models as decorators instead? Would be less surprising probably no?

Contributor

Yeah, true. I was struggling a bit to make things work properly here, and this was the final solution in the end. Lemme change it to explicit decorators

Comment on lines -725 to +703
class ZambaHybridLayer(GradientCheckpointingLayer):
def __init__(self, shared_transf: ZambaAttentionDecoderLayer, linear: nn.Linear, mamba: ZambaMambaDecoderLayer):
class ZambaMixedLayer(GradientCheckpointingLayer):
def __init__(
Member

Why change the name here?? Let's keep the same for BC

Comment on lines +861 to +836
else:
layers.append(mamba)
layers.append(ZambaMixedLayer(shared_transformer, linear, mamba))
Member

Why did we change the weight structure here? Should not be necessary for capture is it?

Comment on lines -205 to -206
self.config = config
self.stages = nn.ModuleList([DINOv3ConvNextStage(config, stage_idx) for stage_idx in range(config.num_stages)])
Member

Same here, it should not be necessary to change the structure?

Comment on lines -255 to +266
self.stages = nn.ModuleList([DINOv3ConvNextStage(config, s) for s in range(config.num_stages)])
self.model = DINOv3ConvNextEncoder(config)
Member

same

self.norm1 = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
self.attention = DINOv3ViTAttention(config)
self.layer_scale1 = DINOv3ViTLayerScale(config)
self.scale1 = DINOv3ViTLayerScale(config)
Member

Why do we change that???

Comment on lines -479 to +499
self.layer = nn.ModuleList([DINOv3ViTLayer(config) for _ in range(config.num_hidden_layers)])
self.model = DINOv3ViTEncoder(config)
Member

Same here, should not be needed I believe!

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

CI Results

Workflow Run ⚙️

Commit Info

Context Commit Description
RUN 5a789ed0 workflow commit (merge commit)
PR 433f8170 branch commit (from PR)
main 4f91111b base commit (on main)

⚠️ Model CI failed to report results

The test failure analysis could not be completed. Please check the workflow run for details.

@github-actions
Contributor

github-actions bot commented Mar 6, 2026

[For maintainers] Suggested jobs to run (before merge)

run-slow: aimv2, align, altclip, apertus, aria, audio_spectrogram_transformer, audioflamingo3, autoformer, aya_vision, bamba, bart, beit, bert, bert_generation, big_bird, bigbird_pegasus
