
Conversation

@sbucaille
Contributor

What does this PR do?

Implements RF-DETR

Fixes #36879

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you read the contributor guideline,
    Pull Request section?
  • Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • Did you write any new necessary tests?

Who can review?

@qubvel

@sbucaille
Contributor Author

Just a small message to present the architecture and what it looks like from the 🤗 transformers point of view:
[image: RF-DETR architecture diagram]

RF-DETR is based on LW-DETR and DeformableDETR. LW-DETR is itself based on DETR but swaps the encoder from a CNN (like ResNet) to a ViT and adds the appropriate MultiScaleProjector to link the encoder and the decoder. RF-DETR then changes the LW-DETR encoder from a plain ViT to DinoV2WithRegisters with a "window" mechanism, and replaces the classical DETR decoder with a DeformableDETR decoder.

There are basically two things to write:

  • The RFDetrMultiScaleProjector, originally implemented in LW-DETR (which RF-DETR is based on) but not present in the library.
  • The RFDetrBackbone, with the underlying classes built on top of DinoV2WithRegisters.
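
To make the composition concrete, here is a stub-level sketch of how the pieces fit together. All class names below are placeholders I'm using for illustration, not the final API, and the sub-modules are nn.Identity stand-ins so the snippet actually runs:

import torch
from torch import nn


class RFDetrSketch(nn.Module):
    """Illustrates the RF-DETR composition described above; every sub-module is a stub."""

    def __init__(self):
        super().__init__()
        self.backbone = nn.Identity()   # DinoV2WithRegisters encoder with windowed attention
        self.projector = nn.Identity()  # MultiScaleProjector bridging encoder and decoder (from LW-DETR)
        self.decoder = nn.Identity()    # DeformableDETR-style decoder

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        features = self.backbone(pixel_values)  # multi-level ViT features
        multi_scale = self.projector(features)  # projected to the scales the decoder expects
        return self.decoder(multi_scale)        # object queries -> detections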

One difficulty I can see in advance is the following:
Am I right in saying that there should be only one XXPreTrainedModel class per modeling file?
In our case, we will need to create a RFDetrBackbone class, which requires a PreTrainedModel to be considered an AutoBackbone, presumably inheriting from DinoV2WithRegistersPreTrainedModel. We will also need to create RFDetrModel and RFDetrForObjectDetection, which both require a PreTrainedModel and will likely inherit from DeformableDetrPreTrainedModel.
If so, then I need to "merge" both classes, but _supports_flash_attn_2 is not the same for both: DinoV2 supports it but DeformableDetr does not.

I noticed your PR about refactoring attention in ViTs; is there any plan to add FlashAttention to other models such as Detr, RTDetr, etc.?
I guess for now I'll just set _supports_flash_attn_2 to False.
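
For reference, a minimal sketch of what a single merged base class could look like (class names and defaults here are assumptions for illustration, not necessarily what will land in the PR):

from torch import nn

from transformers import PretrainedConfig, PreTrainedModel


class RFDetrConfig(PretrainedConfig):
    model_type = "rf_detr"


class RFDetrPreTrainedModel(PreTrainedModel):
    config_class = RFDetrConfig
    base_model_prefix = "model"
    main_input_name = "pixel_values"
    # DeformableDetr does not support FlashAttention-2, so the shared base class
    # conservatively disables it even though the DinoV2 backbone could support it.
    _supports_flash_attn_2 = False

    def _init_weights(self, module):
        # placeholder init; real values would come from the config
        if isinstance(module, (nn.Linear, nn.Conv2d)):
            module.weight.data.normal_(mean=0.0, std=0.02)
            if module.bias is not None:
                module.bias.data.zero_()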

Let me know what you guys think

@qubvel
Contributor

qubvel commented Mar 22, 2025

Hi @sbucaille, thanks for the detailed write-up!

Am I right in saying that there should be only one XXPreTrainedModel class per modeling file? In our case, we will need to create a RFDetrBackbone class, which requires a PreTrainedModel to be considered an AutoBackbone, presumably inheriting from DinoV2WithRegistersPreTrainedModel. We will also need to create RFDetrModel and RFDetrForObjectDetection, which both require a PreTrainedModel and will likely inherit from DeformableDetrPreTrainedModel.

We can add a DinoV2WithRegistersBackbone class directly into the dino_v2_with_registers model, would that work?

I noticed your #36545 about refactoring attention in ViTs; is there any plan to add FlashAttention to other models such as Detr, RTDetr, etc.?

Not at the moment; from my experiments it was not required for detr-based models and did not give any speedup. However, it might be more relevant for a transformer-based encoder. Let's keep it simple initially and set it to False as you suggested.

@sbucaille
Contributor Author

We can't use the DinoV2WithRegistersBackbone as the "window" mechanism sits in the middle of the forward methods; here is an example (not final):

from typing import Optional, Tuple, Union

import torch

# Dinov2WithRegistersLayer comes from the dinov2_with_registers model


class RFDetrBackboneLayer(Dinov2WithRegistersLayer):
    def __init__(self, config):
        super().__init__(config)

        self.num_windows = config.num_windows

    def forward(
        self,
        hidden_states: torch.Tensor,
        head_mask: Optional[torch.Tensor] = None,
        output_attentions: bool = False,
        run_full_attention: bool = False,
    ) -> Union[Tuple[torch.Tensor, torch.Tensor], Tuple[torch.Tensor]]:
        assert head_mask is None, "head_mask is not supported for windowed attention"
        assert not output_attentions, "output_attentions is not supported for windowed attention"
        shortcut = hidden_states
        if run_full_attention:
            # reshape x to remove windows
            B, HW, C = hidden_states.shape
            num_windows_squared = self.num_windows**2
            hidden_states = hidden_states.view(B // num_windows_squared, num_windows_squared * HW, C)

        self_attention_outputs = self.attention(
            self.norm1(hidden_states),  # in Dinov2WithRegisters, layernorm is applied before self-attention
            head_mask,
            output_attentions=output_attentions,
        )
        attention_output = self_attention_outputs[0]

        if run_full_attention:
            # reshape x to add windows back
            B, HW, C = hidden_states.shape
            num_windows_squared = self.num_windows**2
            # hidden_states = hidden_states.view(B * num_windows_squared, HW // num_windows_squared, C)
            attention_output = attention_output.view(B * num_windows_squared, HW // num_windows_squared, C)

        attention_output = self.layer_scale1(attention_output)
        outputs = self_attention_outputs[1:]  # add self attentions if we output attention weights

        # first residual connection
        hidden_states = self.drop_path(attention_output) + shortcut

        # in Dinov2WithRegisters, layernorm is also applied after self-attention
        layer_output = self.norm2(hidden_states)
        layer_output = self.mlp(layer_output)
        layer_output = self.layer_scale2(layer_output)

        # second residual connection
        layer_output = self.drop_path(layer_output) + hidden_states

        outputs = (layer_output,) + outputs

        return outputs

That's why I think we necessarily need a custom Backbone class for that 🤔

@qubvel
Contributor

qubvel commented Mar 24, 2025

Hmm, am I correct that this part was added?

if run_full_attention:
    # reshape x to add windows back
    B, HW, C = hidden_states.shape
    num_windows_squared = self.num_windows**2
    # hidden_states = hidden_states.view(B * num_windows_squared, HW // num_windows_squared, C)
    attention_output = attention_output.view(B * num_windows_squared, HW // num_windows_squared, C)

It looks like it is a reshape-only operation; we can return attention_output as is and reshape all the layers' outputs later, right?
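
If it helps, a tiny standalone sketch of the idea of hoisting the view-only reshape out of the layer and applying it to the collected outputs in one place (shapes are made up for the example):

import torch

# pretend encoder outputs: 3 layers, batch of 2 images, a 2x2 window grid,
# 16 tokens per window, 8 channels -> each output is (batch * windows, tokens, channels)
num_windows_squared = 4
layer_outputs = [torch.randn(2 * num_windows_squared, 16, 8) for _ in range(3)]

# reshape everything back to (batch, windows * tokens, channels) in a single place
batch_size = layer_outputs[0].shape[0] // num_windows_squared
full_outputs = [
    out.view(batch_size, num_windows_squared * out.shape[1], out.shape[2])
    for out in layer_outputs
]
print(full_outputs[0].shape)  # torch.Size([2, 64, 8])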

@sbucaille
Contributor Author

You are right, but it is not the only example. I'll stick to my original plan until I have something running with actual results, then take care of refactoring this part later. I'll ping you when it's ready.

@sbucaille
Contributor Author

Hey @qubvel, in the end I made the modeling files follow the rt_detr folder structure: modeling_rf_detr_dinov2_with_registers.py plays the role of modeling_rt_detr_resnet.py, where the backbone is defined, and modeling_rf_detr.py plays the role of modeling_rt_detr.py, where the encoder/decoder is defined on top of any possible backbone.
What do you think? I'll continue tonight.

Also, I had issues with the modular mechanism: utils/modular_model_converter.py always enforced the name RfDetr... instead of RFDetr..., whereas I wanted to follow the naming convention used for RTDetr. So I ended up using the Copied from mechanism.

@qubvel
Contributor

qubvel commented Mar 25, 2025

Hey, let's use the RfDetr name + modular, it's ok! RfDetr is the correct naming format, while RTDetr is an exception made before modular was introduced.

@sbucaille
Contributor Author

Ok, sorry, I mixed up the problems I had: I didn't have a problem with the capitalization of Rf vs RF, but rather with the prefix RfDetr used in a modular file together with DeformableDetr.
Using:

class RfDetrModel(DeformableDetrModel):
    pass

generates a bunch of Rf... classes like RfConvEncoder instead of RfDetrConvEncoder. I suppose the fact that RfDetr and DeformableDetr share Detr in their names is what makes the modular script fail; I don't get this issue when using RfDetrDinov2WithRegisters(Dinov2WithRegisters), naturally. The way I can avoid this is by forcing the naming of these classes and overwriting the __init__ method like this:

class RfDetrConvEncoder(DeformableDetrConvEncoder):
    pass

class RfDetrModel(DeformableDetrModel):
    def __init__(self, config: RfDetrConfig):
        super().__init__(config)

        backbone = RfDetrConvEncoder(config)
        ...

But the problem also appears for ModelOutputs, so I'm forced to rewrite the whole forward method for many classes, which makes using modular a bit useless in my opinion...
So for modeling_rf_detr_dinov2_with_registers.py I can keep a modular file, but not for modeling_rf_detr.py; I think I'll need to use the Copied from mechanism.

Should I open an issue? Maybe @ArthurZucker has some insights on this problem?

@qubvel
Contributor

qubvel commented Mar 25, 2025

cc @Cyrilvallez re modular, you faced something similar

@konstantinos-p
Contributor

konstantinos-p commented Apr 8, 2025

I'm also facing an issue similar to @sbucaille's while working on DinoDetr.

@Cyrilvallez
Member

Hey! Super super sorry, I missed the ping! Indeed, models sharing the last part of their names can introduce issues. I think I have an idea that should fix it while keeping general prefix renaming sound (which is very hard in practice)! I'll try to tackle it asap and will come back to you!

@Cyrilvallez
Member

Cyrilvallez commented Apr 28, 2025

Hey @sbucaille @konstantinos-p! It will be solved by #37829 🤗 I will merge asap! Sorry for the wait on this!

EDIT: Just merged the PR!

@ZaraCook ZaraCook mentioned this pull request Jun 18, 2025
@sbucaille sbucaille mentioned this pull request Sep 19, 2025
@sbucaille sbucaille force-pushed the add_rf_detr branch 2 times, most recently from 2b88599 to 50829ac on September 26, 2025 22:54
@sbucaille sbucaille marked this pull request as ready for review September 26, 2025 22:57
@sbucaille
Contributor Author

@yonigozlan Ready for a first review; this branch is based on the LW-DETR one until it gets merged.

@sbucaille sbucaille force-pushed the add_rf_detr branch 2 times, most recently from 46bd06b to 105f3da on October 22, 2025 02:23
Member

@yonigozlan yonigozlan left a comment


Hey @sbucaille, very nice PR! I mentioned some small things to change, but once LW-DETR is merged we should be able to merge this quickly!

Comment on lines 116 to 120
self.register_tokens = (
    nn.Parameter(torch.zeros(1, config.num_register_tokens, config.hidden_size))
    if config.num_register_tokens > 0
    else None
)
Member

It looks like num_register_tokens is 0 for all models in the convert file. So do we really need all the logic associated with it? Can we inherit from dinov2 instead?

Contributor Author

Indeed, fixed in b89a4af

Comment on lines 103 to 106
window_block_indexes = set(range(self._out_indices[-1] + 1))
window_block_indexes.difference_update(self._out_indices)
window_block_indexes = list(window_block_indexes)
self.window_block_indexes = window_block_indexes
Member

this is a bit verbose and hard to read. Instead, let's hardcode the window_block_indices in the config like we do for vitdet for example

Contributor Author

Hmmm, in my opinion it is still better this way: window_block_indexes is just the inverse set of out_indices, so we don't have to check whether a user-provided window_block_indexes is valid and raise an error when it is not.
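
For illustration, the same inverse-set logic as a small standalone helper (just a readability sketch; the PR keeps it inline):

def compute_window_block_indexes(out_indices: list[int]) -> list[int]:
    """Blocks using windowed attention: every index up to the last out_index
    that is not itself an out_index (i.e. the inverse set of out_indices)."""
    return [i for i in range(out_indices[-1] + 1) if i not in out_indices]


print(compute_window_block_indexes([2, 5, 8, 11]))  # [0, 1, 3, 4, 6, 7, 9, 10]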

batch_size * num_windows**2, num_h_patches_per_window * num_w_patches_per_window, -1
)
windowed_cls_token_with_pos_embed = cls_token_with_pos_embed.repeat(num_windows**2, 1, 1)
embeddings = torch.cat((windowed_cls_token_with_pos_embed, windowed_pixel_tokens), dim=1)
Member

Let's put all that in a window_partition utility, like we do for vitdet

Contributor Author

Done in 68c8918

def forward(
    self,
    hidden_states: torch.Tensor,
    remove_windows: bool = False,
Member

let's add a use_global_attention attribute to the layers when we instantiate them instead of passing an arg to the forward

Contributor Author

Refactored in 6a345f8
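
For illustration, a simplified sketch of the pattern the refactor follows: the flag is fixed per layer at construction time instead of being threaded through every forward call (the layer below is a toy stand-in, not the actual RF-DETR layer):

import torch
from torch import nn


class WindowedBlockSketch(nn.Module):
    def __init__(self, hidden_size: int, use_global_attention: bool):
        super().__init__()
        self.use_global_attention = use_global_attention
        self.attention = nn.MultiheadAttention(hidden_size, num_heads=4, batch_first=True)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # a real layer would un-window / re-window around the attention call
        # when self.use_global_attention is True
        attn_output, _ = self.attention(hidden_states, hidden_states, hidden_states)
        return attn_output


# global attention only on the blocks whose outputs are used as feature maps
layers = nn.ModuleList(
    WindowedBlockSketch(hidden_size=64, use_global_attention=(i in {2, 5, 8, 11}))
    for i in range(12)
)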

Comment on lines 186 to 200
if remove_windows:
    # reshape x to remove windows
    B, HW, C = hidden_states.shape
    num_windows_squared = self.num_windows**2
    hidden_states = hidden_states.view(B // num_windows_squared, num_windows_squared * HW, C)

hidden_states_norm = self.norm1(hidden_states)
self_attention_output = self.attention(hidden_states_norm)

if remove_windows:
    # reshape x to add windows back
    B, HW, C = hidden_states.shape
    num_windows_squared = self.num_windows**2
    # hidden_states = hidden_states.view(B * num_windows_squared, HW // num_windows_squared, C)
    self_attention_output = self_attention_output.view(B * num_windows_squared, HW // num_windows_squared, C)
Member

Let's use window_partition and window_unpartition utilities here as well

Contributor Author

Done in 68c8918
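
For illustration, standalone sketches of what such utilities can look like for the sequence-shaped tokens used here, i.e. pure view-based reshapes as in the snippet above (names follow the suggestion; the exact signatures in the PR may differ):

import torch


def window_partition(hidden_states: torch.Tensor, num_windows: int) -> torch.Tensor:
    """(batch, num_windows**2 * seq, channels) -> (batch * num_windows**2, seq, channels)."""
    batch_size, seq_len, channels = hidden_states.shape
    num_windows_squared = num_windows**2
    return hidden_states.view(
        batch_size * num_windows_squared, seq_len // num_windows_squared, channels
    )


def window_unpartition(hidden_states: torch.Tensor, num_windows: int) -> torch.Tensor:
    """(batch * num_windows**2, seq, channels) -> (batch, num_windows**2 * seq, channels)."""
    batch_times_windows, seq_len, channels = hidden_states.shape
    num_windows_squared = num_windows**2
    return hidden_states.view(
        batch_times_windows // num_windows_squared, num_windows_squared * seq_len, channels
    )


tokens = torch.randn(2, 4 * 16, 8)  # 2 images, a 2x2 window grid of 16 tokens each, 8 channels
windowed = window_partition(tokens, num_windows=2)
assert window_unpartition(windowed, num_windows=2).shape == tokens.shape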

Comment on lines 1 to 18
import io

import requests
from PIL import Image

from transformers import AutoImageProcessor, RFDetrBackbone, RFDetrConfig


images = ["https://media.roboflow.com/notebooks/examples/dog-2.jpeg"]

images = [Image.open(io.BytesIO(requests.get(url).content)) for url in images]

processor = AutoImageProcessor.from_pretrained("facebook/detr-resnet-50")
inputs = processor(images, return_tensors="pt")

config = RFDetrConfig()
backbone = RFDetrBackbone(config=config.backbone_config)
# model = RFDetrForObjectDetection.from_config()
Member

To remove

Contributor Author

Oops, removed in 2151d0d

@@ -0,0 +1,357 @@
from typing import Optional
Member

Very nice use of modular 🤗

Contributor Author

Thanks, you guys have done insane work on this feature; the quality of life when adding models is on another level compared to before.

def num_key_value_heads(self) -> int:
    return self.decoder_self_attention_heads

@property
Member

Why is this needed?

Contributor Author

Leftovers from LW-DETR; I'll remove them in the other PR so they'll be gone when I rebase later.

logger = logging.get_logger(__name__)


class RfDetrConfig(PretrainedConfig):
Member

We can still inherit from LwDetrConfig for the properties

Contributor Author

Fixed in 0697a29

("llava_next_video", ("LlavaNextImageProcessor", "LlavaNextImageProcessorFast")),
("llava_next_video", ("LlavaNextVideoImageProcessor", None)),
("llava_onevision", ("LlavaOnevisionImageProcessor", "LlavaOnevisionImageProcessorFast")),
("lw_detr", ("LwDetrImageProcessor", "LwDetrImageProcessorFast")),
Member

we still need a mapping for RFDETR as I mentioned in the LW-DETR PR

Contributor Author

Added in a6d5e2c

@sbucaille
Contributor Author

sbucaille commented Jan 23, 2026

@molbap @vasqu nvm it's ready for a review
cc @stevhliu for the docs 🙂

I have a question regarding the WeightTransforms: should they be only in the convert script, with the weights saved on the Hub without a reverse mapping like I did, or should they be in the conversion_mapping file with the weights saved as in the original?

Member

@stevhliu stevhliu left a comment


docs lgtm, thanks!

@github-actions
Contributor

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, lw_detr, rf_detr

@github-actions
Contributor

View the CircleCI Test Summary for this PR:

https://huggingface.co/spaces/transformers-community/circle-ci-viz?pr=36895&sha=1599f8
