🚨 Refactor DETR to updated standards #41549

Merged
yonigozlan merged 53 commits into huggingface:main from
yonigozlan:refactor-detr
Feb 2, 2026

Conversation

@yonigozlan
Member

What does this PR do?

This PR refactors DETR as part of an effort to standardize vision models in the library, in the same vein as #41546.
Expect many more PRs like this for vision models as we approach v5!

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Comment on lines +31 to +32
if not isinstance(line, str):
line = line.decode()
Member Author

line was a str when I tried to use this, not sure why! I can open a separate PR for it though
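For context, the guard under discussion is the usual bytes-vs-str defensive pattern; a standalone sketch (the function name is hypothetical):

```python
def normalize_line(line):
    # Readers over binary streams yield bytes; text streams yield str.
    # Decode only when needed so both work.
    if not isinstance(line, str):
        line = line.decode()
    return line
```

If the source always yields str (as observed here), the isinstance branch is dead code and can be dropped.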

Contributor

👀

Member Author

Ah yes I can try to remove this, maybe it's not an issue anymore. Thanks for the reminder 😁

Contributor

no worries haha :D can be removed for sure?

Comment on lines +1088 to +1134
if pixel_values is None and inputs_embeds is None:
raise ValueError("You have to specify either pixel_values or inputs_embeds")

if inputs_embeds is None:
batch_size, num_channels, height, width = pixel_values.shape
device = pixel_values.device

if pixel_mask is None:
pixel_mask = torch.ones(((batch_size, height, width)), device=device)
vision_features = self.backbone(pixel_values, pixel_mask)
feature_map, mask = vision_features[-1]

# Apply 1x1 conv to map (N, C, H, W) -> (N, d_model, H, W), then flatten to (N, HW, d_model)
# (feature map and position embeddings are flattened and permuted to (batch_size, sequence_length, hidden_size))
projected_feature_map = self.input_projection(feature_map)
flattened_features = projected_feature_map.flatten(2).permute(0, 2, 1)
spatial_position_embeddings = (
self.position_embedding(shape=feature_map.shape, device=device, dtype=pixel_values.dtype, mask=mask)
.flatten(2)
.permute(0, 2, 1)
)
flattened_mask = mask.flatten(1)
else:
batch_size = inputs_embeds.shape[0]
device = inputs_embeds.device
flattened_features = inputs_embeds
# When using inputs_embeds, we need to infer spatial dimensions for position embeddings
# Assume square feature map
seq_len = inputs_embeds.shape[1]
feat_dim = int(seq_len**0.5)
# Create position embeddings for the inferred spatial size
spatial_position_embeddings = (
self.position_embedding(
shape=torch.Size([batch_size, self.config.d_model, feat_dim, feat_dim]),
device=device,
dtype=inputs_embeds.dtype,
)
.flatten(2)
.permute(0, 2, 1)
)
# If a pixel_mask is provided with inputs_embeds, interpolate it to feat_dim, then flatten.
if pixel_mask is not None:
mask = nn.functional.interpolate(pixel_mask[None].float(), size=(feat_dim, feat_dim)).to(torch.bool)[0]
flattened_mask = mask.flatten(1)
else:
# If no mask provided, assume all positions are valid
flattened_mask = torch.ones((batch_size, seq_len), device=device, dtype=torch.long)
Member Author

Now truly supports passing inputs_embeds instead of silently doing nothing with it
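The inputs_embeds branch above assumes a square feature map and infers its side length from the sequence length; a minimal sketch with hypothetical numbers:

```python
# A 7x7 feature map flattened to a sequence of 49 tokens.
seq_len = 49
feat_dim = int(seq_len ** 0.5)  # inferred spatial side length
# Holds only when the original feature map really was square:
assert feat_dim * feat_dim == seq_len
```

For non-square inputs this inference would be wrong, which is why the assumption is called out in the code comment.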

Comment on lines +1149 to +1152
if decoder_inputs_embeds is not None:
queries = decoder_inputs_embeds
else:
queries = torch.zeros_like(object_queries_position_embeddings)
Member Author

Same, truly supports decoder_inputs_embeds as input

attention_mask=None,
object_queries=object_queries,
query_position_embeddings=query_position_embeddings,
attention_mask=decoder_attention_mask,
Member Author

Supports masking of queries (as advertised)

Comment on lines 948 to +978
@@ -967,65 +960,36 @@ def forward(
intermediate = () if self.config.auxiliary_loss else None

# decoder layers
all_hidden_states = () if output_hidden_states else None
all_self_attns = () if output_attentions else None
all_cross_attentions = () if (output_attentions and encoder_hidden_states is not None) else None

for idx, decoder_layer in enumerate(self.layers):
# add LayerDrop (see https://huggingface.co/papers/1909.11556 for description)
if output_hidden_states:
all_hidden_states += (hidden_states,)
if self.training:
dropout_probability = torch.rand([])
if dropout_probability < self.layerdrop:
continue

layer_outputs = decoder_layer(
hidden_states = decoder_layer(
hidden_states,
combined_attention_mask,
object_queries,
query_position_embeddings,
attention_mask,
spatial_position_embeddings,
object_queries_position_embeddings,
encoder_hidden_states, # as a positional argument for gradient checkpointing
encoder_attention_mask=encoder_attention_mask,
output_attentions=output_attentions,
**kwargs,
Member Author

Truly supports attention mask on vision features (it was always None before)

@yonigozlan yonigozlan changed the title [WIP] Refactor DETR to updated standards Refactor DETR to updated standards Oct 14, 2025
@yonigozlan
Member Author

Hello @molbap @ArthurZucker!
The long overdue refactor of DETR is ready for a first review. I'm waiting for your reviews to run fix-copies, as this will have a lot of impact on other models (through # Copied from for now, modular later ;) )

@yonigozlan yonigozlan changed the title Refactor DETR to updated standards 🚨 Refactor DETR to updated standards Oct 14, 2025
Contributor

@vasqu vasqu left a comment

Some initial thoughts, focused on the masks/interface part

Comment thread src/transformers/models/detr/modeling_detr.py
Comment thread src/transformers/models/detr/modeling_detr.py Outdated
Comment thread src/transformers/models/detr/modeling_detr.py
Comment thread src/transformers/models/detr/modeling_detr.py
Comment thread src/transformers/models/detr/modeling_detr.py
Comment thread src/transformers/models/detr/modeling_detr.py Outdated
):
if use_attention_mask:
self.skipTest(
"This test uses attention masks which are not compatible with DETR. Skipping when use_attention_mask is True."
Contributor

Hmm, why tho? Are the attention masks perhaps 3D instead?

Member Author

It's more that _test_eager_matches_sdpa_inference is not adapted to the vision space (+ object queries here). It tries to add a "decoder_input_ids" entry to the inputs, and the seq lens created for the dummy masks were wrong. Seeing as the function is already quite cluttered and difficult to read, I figured trying to add support for vision models directly there would not be ideal. We can either override the tests in this model specifically, or try to have a more general test for vision models. Another option would be to parameterize the tests by providing how to find the correct seq len and input names.
I would love some help on this!

Contributor

I see, is this specific to detr or will we encounter it for other models in the vision family too? It's best not to skip too much if more come down the line. Depending on how many are affected, we should either

  • Fix the base test, e.g. with parametrization, splitting the test a bit (more models with similar problems)
  • Overwrite the test and make specific changes (low amount of models with similar problems)

Contributor

The problem is indeed with the test's base design. It will lead to more skipped tests down the line because the encoder/encoder-decoder/decoder division isn't made that clearly. The number of models with similar problems isn't "low" imo.

Member Author

Yes I think it will increase too with us fixing the attention masks for vision models, so we definitely need to improve the base test

@yonigozlan
Member Author

Thanks for the review @vasqu ! I standardized attention and masking following your advice :)

Contributor

@vasqu vasqu left a comment

Looking good from my side, amazing work! Just left some smaller comments but nothing crazy

Comment thread src/transformers/models/detr/modeling_detr.py
Comment thread src/transformers/models/detr/modeling_detr.py

_can_record_outputs = {
"hidden_states": DetrEncoderLayer,
"attentions": OutputRecorder(DetrSelfAttention, layer_name="self_attn", index=1),
Contributor

Do we need the explicit output recorder? iirc DetrSelfAttention should work fine by itself

Contributor

same question here out of curiosity :D

Member Author

No indeed I can remove it :)

Contributor

nudge as reminder


@ArthurZucker ArthurZucker removed their request for review October 16, 2025 14:11
Contributor

@molbap molbap left a comment

Looks niiiice
For the unhappy CI, let's throw the Check Copies away!

Comment thread src/transformers/models/detr/modeling_detr.py

Comment thread src/transformers/modeling_utils.py Outdated
"qwen2_5_vl",
"videollava",
"vipllava",
"detr",
Contributor

I'm not sure, do we need to add this here?

Member Author

Yes, that's what made me go crazy haha, otherwise _checkpoint_conversion_mapping doesn't work.
Note that this is temporary and will be replaced by the new way to convert weights on the fly that @ArthurZucker and @Cyrilvallez are working on.

def __init__(self, config: DetrConfig):
super().__init__()
self.embed_dim = config.d_model
self.hidden_size = config.d_model
Contributor

won't that break BC? (at least on the attribute names)

Member Author

In what way? If users access it directly? In any case I think we really need to standardize these types of variable names, it might be worth slightly breaking BC imo

Contributor

yeah in case of non-config access. I agree I prefer to standardize

Comment thread src/transformers/models/detr/modeling_detr.py
Comment on lines 962 to 965
if self.training:
dropout_probability = torch.rand([])
if dropout_probability < self.layerdrop:
continue
Contributor

not exactly the typical dropout interface, we can maybe take the occasion to update it?

Member Author

Yes 😫, I was scared of breaking BC in that case, but maybe it's not so important. It would be great to get rid of non-standard dropout elsewhere as well really

Contributor

I think it's ok to break it in here, it does not affect inference and clearly it would be an improvement to get rid of it haha
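For reference, LayerDrop (as in the snippet above) skips an entire layer with some probability during training, which is why it doesn't look like the usual nn.Dropout interface. A framework-free sketch of the pattern, with hypothetical names:

```python
import random

def run_with_layerdrop(layers, x, layerdrop=0.1, training=True):
    for layer in layers:
        # LayerDrop: with probability `layerdrop`, skip the whole layer at train time
        if training and random.random() < layerdrop:
            continue
        x = layer(x)
    return x
```

At eval time every layer runs; at train time each layer is dropped independently, so it doesn't affect inference outputs.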

Comment on lines 1026 to 1032
def freeze_backbone(self):
for name, param in self.backbone.conv_encoder.model.named_parameters():
for _, param in self.backbone.model.named_parameters():
param.requires_grad_(False)

def unfreeze_backbone(self):
for name, param in self.backbone.conv_encoder.model.named_parameters():
for _, param in self.backbone.model.named_parameters():
param.requires_grad_(True)
Contributor

these methods should really be user-side responsibilities 😨 I would be pro-removal! We can always communicate on it

Member Author

Yes agreed, we could start a deprecation cycle, or just remove it for v5. It's present in several other vision models

Contributor

Just asked @merveenoyan, who's an avid finetuner and is not using these methods anymore; I think they were good initially but they're ok to go now. Agreed it's out of scope for the current PR, will create another to remove all of it (cc @ariG23498 as we chatted on finetuning too)
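Once the helpers are gone, the user-side replacement is small; a sketch assuming a torch model exposing a `backbone` submodule:

```python
import torch.nn as nn

def set_requires_grad(module: nn.Module, flag: bool) -> None:
    # User-side equivalent of freeze_backbone / unfreeze_backbone
    for param in module.parameters():
        param.requires_grad_(flag)

# e.g. set_requires_grad(model.backbone, False)  # freeze
#      set_requires_grad(model.backbone, True)   # unfreeze
```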

Comment thread src/transformers/models/detr/modeling_detr.py Outdated
def forward(self, q, k, mask: Optional[torch.Tensor] = None):
q = self.q_linear(q)
k = nn.functional.conv2d(k, self.k_linear.weight.unsqueeze(-1).unsqueeze(-1), self.k_linear.bias)
queries_per_head = q.view(q.shape[0], q.shape[1], self.num_heads, self.hidden_dim // self.num_heads)
Contributor

on here my nit would be, if we can update a bit the single-letter variable names, that'd be great!

Member Author

Yes I think we could even try to refactor this to use the standard attention module and only take the attention weights! It could be interesting to compare the performance of eager attention vs this implementation (conv2d instead of linear for key proj, and no multiplication by value) vs other attention impl.

Contributor

ahah that's a tough one to benchmark but indeed sounds good, LMK if you want to do that in this PR or move to another

Contributor

@vasqu vasqu left a comment

Talked with @molbap internally and I think we agree that it doesn't make sense to force this merge just to split refactoring again. Let's aim for quality in this refactor

We will probably merge the model PR as is and add this to this refactor after merge. Otherwise, we will suffer on both sides - crunch time on the model PR and less quality on the refactor (e.g. another set of TODOs)

I've added a few smaller comments meanwhile

Comment thread src/transformers/models/detr/modeling_detr.py Outdated
Comment thread src/transformers/models/detr/modeling_detr.py Outdated
Comment thread src/transformers/models/detr/modeling_detr.py
batch_size, num_queries, self.n_heads, self.n_levels * self.n_points
)
attention_weights = F.softmax(attention_weights, -1).view(
attention_weights = softmax(attention_weights, -1).view(
Contributor

This is a bit weird, would like to not have a direct import

Member Author

Agreed, but I have issues with torch functional and torchvision functional aliases colliding in modular. I have this PR to fix it #43263, I'll change back when it's merged

Member Author

Fixed :)


hidden_states = inputs_embeds

encoder_states = () if output_hidden_states else None
Contributor

Hmm, can we not add this to _can_record_outputs

Member Author

@yonigozlan yonigozlan Jan 21, 2026

I haven't managed to get something clean that works for now, the issue is this line:
encoder_states = encoder_states + (hidden_states[enc_ind],)
So the encoder_states/hidden_states cannot automatically be recorded. I'll see if some refactoring of the code can fix this

Member Author

Fixed!

# https://github.com/lyuwenyu/RT-DETR/blob/94f5e16708329d2f2716426868ec89aa774af016/rtdetr_pytorch/src/zoo/rtdetr/rtdetr_decoder.py#L412
sources = []
for level, source in enumerate(encoder_outputs[0]):
for level, source in enumerate(encoder_outputs.last_hidden_state):
Contributor

We should force return_dict=True then for the encoder

Member Author

That would mean we need to force return_dict=True every time we want to access a named attribute of a submodule output? It doesn't look like that's what we do in the library. From my understanding, return_dict=False only applies to the top-module output, and submodules use return_dict=True by default.
Here, we pop return_dict in the top module call:

return_dict_passed = kwargs.pop("return_dict", return_dict)

@@ -1,15 +1,53 @@
import math
Member

@zucchini-nlp zucchini-nlp Jan 22, 2026

btw, let's bring back the header with license where missing

Member Author

Indeed thanks for the heads up!

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=41549&sha=7821c4

@yonigozlan
Member Author

Hey @molbap @vasqu ! I added a small refactor to rt_detr, and I think we can merge this for now before the PR gets too big. This should make for a good basis to continue the refactoring work on vision models :)

Comment on lines +996 to +999
class RTDetrV2AIFILayer(nn.Module):
"""
AIFI (Attention-based Intra-scale Feature Interaction) layer used in RT-DETR hybrid encoder.
"""
Contributor

nice, so this can be reused in other derived models?

Member Author

Yes that's the idea! Also it allows for automatically capturing hidden_states

x_max = x_coords_masked.flatten(start_dim=-2).max(dim=-1).values + 1
x_min = (
torch.where(mask, x_coords_masked, torch.tensor(1e8, device=mask.device, dtype=dtype))
torch.where(mask, x_coords_masked, torch.tensor(torch.finfo(dtype).max))
Member Author

Note: This was causing overflow issues in float16
Cc @zhang-prog @molbap
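To illustrate the overflow (a sketch, assuming torch): float16 tops out around 65504, so a hardcoded 1e8 sentinel becomes inf under that dtype, while torch.finfo(dtype).max stays representable:

```python
import torch

dtype = torch.float16
hardcoded = torch.tensor(1e8, dtype=dtype)                # overflows to inf in float16
safe = torch.tensor(torch.finfo(dtype).max, dtype=dtype)  # 65504.0, still finite
```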

@yonigozlan yonigozlan enabled auto-merge (squash) February 2, 2026 21:33
@yonigozlan yonigozlan disabled auto-merge February 2, 2026 23:05
@yonigozlan yonigozlan merged commit aefa23a into huggingface:main Feb 2, 2026
32 of 40 checks passed