Add DINOv3 AutoBackbone #41276
Conversation
- Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
- Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
- Support get_intermediate_layers for Facebook compatibility
- Enable multi-scale feature extraction for detection/segmentation

Note: Tests and documentation coming in follow-up commits

Addresses huggingface#40323
@qubvel @merveenoyan I closed the previous PR and created a new one, since the old PR ended up with too many reviewers after I rebased from main and resolved merge conflicts.
DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs

- Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
- Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
- Clean up config docstring formatting (consistent indentation and use list[...] types).
yonigozlan left a comment
Hello @vijayabhaskar-ev! Thanks for working on this, I left a few comments on things to modify :)
```python
head_mask: Optional[torch.Tensor] = None,
output_hidden_states: Optional[bool] = None,

if output_hidden_states is None:
    output_hidden_states = self.config.output_hidden_states

if pixel_values is None:
```

pixel_values is an arg so not necessary, let's not clutter the modeling file
```python
if output_hidden_states:
    collected_hidden_states = [hidden_states]

for i, layer_module in enumerate(self.layer):
    hidden_states = layer_module(
        hidden_states,
        position_embeddings=position_embeddings,
    )
    if output_hidden_states:
        collected_hidden_states.append(hidden_states)
```

Not needed, hidden_states are captured by the check_model_inputs decorator
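The capture pattern this comment refers to can be illustrated with a toy example. This is plain Python and deliberately not the actual transformers implementation of @check_model_inputs; the names `capture_outputs` and `TinyEncoder` are made up for the sketch:

```python
import functools

def capture_outputs(forward):
    """Toy stand-in for an output-recording decorator: it wraps forward and
    collects per-layer outputs itself, so the modeling code needs no manual
    `collected_hidden_states` bookkeeping."""
    @functools.wraps(forward)
    def wrapper(self, x, output_hidden_states=False):
        recorded = []
        self._record = recorded.append if output_hidden_states else None
        result = forward(self, x)
        self._record = None
        if output_hidden_states:
            return result, tuple(recorded)
        return result
    return wrapper

class TinyEncoder:
    def __init__(self):
        self._record = None
        self.layers = [lambda v: v + 1, lambda v: v * 2]  # stand-in layers

    @capture_outputs
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
            if self._record is not None:
                self._record(x)  # recording lives in the decorator, not here
        return x

model = TinyEncoder()
print(model.forward(3))                             # 8
print(model.forward(3, output_hidden_states=True))  # (8, (4, 8))
```

The modeling file then only runs the layers; flag normalization and collection stay in one shared wrapper.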
```python
return BaseModelOutputWithPooling(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    hidden_states=tuple(collected_hidden_states) if output_hidden_states else None,
```

```python
def _tokens_to_bchw(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # tokens: [B, N, C] -> [B, C, H, W], where N == H*W
    B, N, C = tokens.shape
    return tokens.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
```

Let's just do it directly in forward, no need for a separate function
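Inlined, the helper's body reduces to one reshape-and-transpose. A minimal sketch of that tensor movement, with numpy standing in for torch and hypothetical sizes:

```python
import numpy as np

# Hypothetical sizes: 2 images, a 4x4 patch grid, 8 channels.
batch_size, grid_height, grid_width, channels = 2, 4, 4, 8
tokens = np.arange(batch_size * grid_height * grid_width * channels).reshape(
    batch_size, grid_height * grid_width, channels
)  # [B, N, C] with N == H*W

# numpy equivalent of tokens.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
feature_map = np.ascontiguousarray(
    tokens.reshape(batch_size, grid_height, grid_width, channels).transpose(0, 3, 1, 2)
)
print(feature_map.shape)  # (2, 8, 4, 4)
```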
```python
) -> BackboneOutput:
    return_dict = kwargs.get("return_dict", getattr(self.config, "use_return_dict", True))

    outputs = self.dinov3(pixel_values, output_hidden_states=True)
```

```python
B, C_in, H_img, W_img = pixel_values.shape
patch = self.config.patch_size
H = H_img // patch
W = W_img // patch
```

Let's use more explicit variable names, and no one-letter variables
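A sketch of the requested renaming (the names and `_ToyConfig` stand-in here are illustrative, not necessarily the ones that landed in the PR):

```python
class _ToyConfig:  # minimal stand-in for the model config
    patch_size = 16

pixel_values_shape = (2, 3, 224, 224)  # example [batch, channels, height, width]
batch_size, num_channels, image_height, image_width = pixel_values_shape
patch_size = _ToyConfig.patch_size
num_patches_height = image_height // patch_size
num_patches_width = image_width // patch_size
print(num_patches_height, num_patches_width)  # 14 14
```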
```python
if not return_dict:
    output = (tuple(feature_maps),)
    if output_hidden_states:
        output = output + (hidden_states,)
    return output
```

Let's use the can_return_tuple decorator on the forward function instead
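The decorator idea can be sketched in a few lines of toy Python (this is not the actual transformers can_return_tuple implementation; `can_return_tuple_sketch` and the tiny classes are invented for illustration): the wrapper converts the structured output to a plain tuple when return_dict=False, so forward always just builds the dataclass.

```python
import functools
from dataclasses import astuple, dataclass
from typing import Optional

def can_return_tuple_sketch(forward):
    @functools.wraps(forward)
    def wrapper(self, *args, return_dict=True, **kwargs):
        output = forward(self, *args, **kwargs)
        if not return_dict:
            # drop None fields and hand back a plain tuple
            return tuple(v for v in astuple(output) if v is not None)
        return output
    return wrapper

@dataclass
class BackboneOutputSketch:
    feature_maps: tuple
    hidden_states: Optional[tuple] = None

class TinyBackbone:
    @can_return_tuple_sketch
    def forward(self, feature_maps):
        # forward unconditionally builds the structured output
        return BackboneOutputSketch(feature_maps=feature_maps)

backbone = TinyBackbone()
print(backbone.forward((1, 2)).feature_maps)        # (1, 2)
print(backbone.forward((1, 2), return_dict=False))  # ((1, 2),)
```

This keeps the tuple/dataclass branching out of every individual forward body.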
```python
return BackboneOutput(
    feature_maps=tuple(feature_maps),
    hidden_states=hidden_states if output_hidden_states else None,
```

Not needed as captured by check_model_inputs
@yonigozlan I have addressed all the comments, updated the modeling file accordingly and pushed the latest changes.
yonigozlan left a comment
Hey @vijayabhaskar-ev! Thanks for iterating, almost there! There is some confusion on the layer norms, but once that's fixed, we'll be able to merge :)
```python
from typing import Optional

from ...configuration_utils import PreTrainedConfig
from ...configuration_utils import PretrainedConfig
```

Suggested change:

```diff
- from ...configuration_utils import PretrainedConfig
+ from ...configuration_utils import PreTrainedConfig
```
```python
class DINOv3ViTConfig(PreTrainedConfig):
class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):
```

Suggested change:

```diff
- class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):
+ class DINOv3ViTConfig(BackboneConfigMixin, PreTrainedConfig):
```
```python
self.gradient_checkpointing = False

self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
```

We already have self.norm here, we should use it instead of self.layernorm (the weights will be loaded in self.norm and not self.layernorm)
```python
hidden_states = layer_module(hidden_states, position_embeddings=position_embeddings)
stage_hidden_states.append(hidden_states)

sequence_output = self.norm(hidden_states)
```

This should go in the loop below, and use self.norm instead of self.layernorm.

Updated to use self.norm and applied it inside the loop. Please let me know if any further changes are required.
```python
for stage_name, hidden_state in zip(self.stage_names, stage_hidden_states):
    if stage_name in self.out_features:
        if self.config.apply_layernorm:
            hidden_state = self.layernorm(hidden_state)
```

Suggested change:

```diff
- hidden_state = self.layernorm(hidden_state)
+ hidden_state = self.norm(hidden_state)
```
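Putting this thread's fixes together, the stage-selection loop applies the shared, checkpoint-loaded norm per selected stage. A minimal runnable sketch with toy stand-ins (the `TinyBackbone` class and its toy norm are invented here, not the real nn.Module code):

```python
class TinyBackbone:
    def __init__(self):
        self.stage_names = ["stem", "stage1", "stage2"]
        self.out_features = ["stage1", "stage2"]
        self.apply_layernorm = True
        # toy stand-in for the single shared nn.LayerNorm held in self.norm
        self.norm = lambda value: value / max(abs(value), 1)

    def feature_maps(self, stage_hidden_states):
        maps = []
        for stage_name, hidden_state in zip(self.stage_names, stage_hidden_states):
            if stage_name in self.out_features:
                if self.apply_layernorm:
                    # shared norm, not a duplicate never-loaded layer
                    hidden_state = self.norm(hidden_state)
                maps.append(hidden_state)
        return tuple(maps)

backbone = TinyBackbone()
print(backbone.feature_maps([3.0, 4.0, -8.0]))  # (1.0, -1.0)
```

The point of the review: only one norm module exists in the checkpoint, so routing every stage through that same attribute is what makes the pretrained weights take effect.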
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
yonigozlan left a comment
Very nice @vijayabhaskar-ev, thanks for iterating, LGTM!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, dinov3_vit
* feat: Add DINOv3 support to AutoBackbone [DRAFT]
  - Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
  - Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
  - Support get_intermediate_layers for Facebook compatibility
  - Enable multi-scale feature extraction for detection/segmentation
  - Note: Tests and documentation coming in follow-up commits
  - Addresses huggingface#40323
* Updated import structure of get_aligned_output_features_output_indices
* Added test for DINOv3ViTBackbone
* Add DINOv3ViTBackbone to model documentation
* Refactored the code to adhere to the Transformers principles
* Generated modeling_dinov3_vit.py
* DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs
  - Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
  - Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
  - Clean up config docstring formatting (consistent indentation and use list[...] types).
* Restructure DINOv3 backbone and update its tests
* Resolved merge conflicts
* Resolved failing testcase
* Fix DINOv3 backbone to use self.norm for feature maps

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
This PR implements DINOv3 support for AutoBackbone.
What's Implemented:
- DINOv3ViTConfig: Complete configuration class with all DINOv3-specific parameters
- DINOv3ViTModel: Full model implementation with RoPE embeddings, register tokens, and SwiGLU MLP
- DINOv3ViTBackbone: AutoBackbone integration with multi-scale feature extraction
- AutoBackbone Integration: Added to MODEL_FOR_BACKBONE_MAPPING_NAMES in modeling_auto.py
- Facebook API Compatibility: get_intermediate_layers method matching the Facebook Research implementation
- Multi-scale Features: Support for object detection and segmentation downstream tasks
Fixes #40323