
Add dinov3 autobackbone #41276

Merged
yonigozlan merged 16 commits into huggingface:main from vijayabhaskar-ev:add-dinov3-autobackbone on Nov 11, 2025

Conversation

@vijayabhaskar-ev (Contributor)

This PR implements DINOv3 support for AutoBackbone.

What's Implemented:

- DINOv3ViTConfig: complete configuration class with all DINOv3-specific parameters
- DINOv3ViTModel: full model implementation with RoPE embeddings, register tokens, and a SwiGLU MLP
- DINOv3ViTBackbone: AutoBackbone integration with multi-scale feature extraction
- AutoBackbone integration: added to MODEL_FOR_BACKBONE_MAPPING_NAMES in modeling_auto.py
- Facebook API compatibility: a get_intermediate_layers method matching the Facebook Research implementation
- Multi-scale features: support for object detection and segmentation downstream tasks

Fixes #40323
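As a rough illustration of the multi-scale feature extraction listed above: a plain (non-hierarchical) ViT backbone exposes one candidate feature map per stage, all at a spatial resolution fixed by the patch size. The numbers below (patch size 16, hidden size 768, 12 layers, 224x224 input) are assumptions for illustration, not values read from the actual DINOv3 configuration:

```python
# Sketch: expected backbone feature-map shapes for a ViT-style backbone.
# All numbers here are illustrative assumptions, not the real DINOv3 config.
patch_size = 16
hidden_size = 768
num_hidden_layers = 12
image_height, image_width = 224, 224

grid_height = image_height // patch_size  # 14
grid_width = image_width // patch_size    # 14

# One candidate feature map per stage (embeddings + each transformer layer),
# all at the same resolution for a non-hierarchical ViT.
stage_shapes = [
    (hidden_size, grid_height, grid_width)
    for _ in range(num_hidden_layers + 1)
]
print(stage_shapes[0])  # (768, 14, 14)
```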

- Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
- Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
- Support get_intermediate_layers for Facebook compatibility
- Enable multi-scale feature extraction for detection/segmentation

Note: Tests and documentation coming in follow-up commits
Addresses huggingface#40323
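In DINO-style repositories, get_intermediate_layers conventionally accepts either a count n (take the last n blocks) or an explicit sequence of block indices. A pure-Python sketch of just that selection logic (the real method also handles reshaping and normalization; this is illustrative, not the merged implementation):

```python
# Sketch of the layer-selection convention assumed for get_intermediate_layers:
# n may be an int (take the last n blocks) or an iterable of block indices.
def select_intermediate(hidden_states, n):
    if isinstance(n, int):
        indices = range(len(hidden_states) - n, len(hidden_states))
    else:
        indices = n
    return [hidden_states[i] for i in indices]

layers = [f"block_{i}" for i in range(12)]
print(select_intermediate(layers, 2))       # ['block_10', 'block_11']
print(select_intermediate(layers, [0, 5]))  # ['block_0', 'block_5']
```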
@vijayabhaskar-ev (Contributor, Author)

@qubvel @merveenoyan I closed the previous PR and created a new one, since the old PR ended up with too many reviewers after I rebased from main and resolved merge conflicts.

DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs

- Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
- Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
- Clean up config docstring formatting (consistent indentation and use list[...] types).
@Rocketknight1 (Member)

cc @molbap @yonigozlan

@yonigozlan (Member) left a comment

Hello @vijayabhaskar-ev ! Thanks for working on this, I left a few comments on things to modify :)

Comment on lines +399 to +400
head_mask: Optional[torch.Tensor] = None,
output_hidden_states: Optional[bool] = None,

No need for these

Comment on lines +409 to +410
if output_hidden_states is None:
output_hidden_states = self.config.output_hidden_states

Not needed

if output_hidden_states is None:
output_hidden_states = self.config.output_hidden_states

if pixel_values is None:

pixel_values is a required argument, so this check isn't necessary; let's not clutter the modeling file

Comment on lines +420 to +430

if output_hidden_states:
collected_hidden_states = [hidden_states]

for i, layer_module in enumerate(self.layer):
hidden_states = layer_module(
hidden_states,
position_embeddings=position_embeddings,
)
if output_hidden_states:
collected_hidden_states.append(hidden_states)

Not needed; hidden_states are captured by the check_model_inputs decorator

return BaseModelOutputWithPooling(
last_hidden_state=sequence_output,
pooler_output=pooled_output,
hidden_states=tuple(collected_hidden_states) if output_hidden_states else None,

same as above

Comment on lines +457 to +460
def _tokens_to_bchw(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
# tokens: [B, N, C] -> [B, C, H, W], where N == H*W
B, N, C = tokens.shape
return tokens.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()

Let's just do it directly in forward, no need for a separate function
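The helper above only reshapes patch tokens into an image-like feature map, so inlined in forward it is a single expression. A numpy sketch of the same [B, N, C] -> [B, C, H, W] manipulation, with assumed shapes standing in for the real tensors:

```python
import numpy as np

# Assumed shapes: 2 images, a 14x14 patch grid, 768 channels.
batch_size, grid_height, grid_width, channels = 2, 14, 14, 768
tokens = np.arange(batch_size * grid_height * grid_width * channels, dtype=np.float32)
tokens = tokens.reshape(batch_size, grid_height * grid_width, channels)  # [B, N, C]

# [B, N, C] -> [B, H, W, C] -> [B, C, H, W]
feature_map = tokens.reshape(
    batch_size, grid_height, grid_width, channels
).transpose(0, 3, 1, 2)
print(feature_map.shape)  # (2, 768, 14, 14)
```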

) -> BackboneOutput:
return_dict = kwargs.get("return_dict", getattr(self.config, "use_return_dict", True))

outputs = self.dinov3(pixel_values, output_hidden_states=True)

To be unrolled then

Comment on lines +474 to +477
B, C_in, H_img, W_img = pixel_values.shape
patch = self.config.patch_size
H = H_img // patch
W = W_img // patch

Let's use more explicit variable names; no one-letter variables
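For instance, the snippet above could read as follows (the names and concrete numbers here are suggestions for illustration, not the names that ended up in the merged code):

```python
# Suggested renaming of the one-letter variables; concrete numbers stand in
# for pixel_values.shape and config.patch_size.
batch_size, num_channels, image_height, image_width = 2, 3, 224, 224  # pixel_values.shape
patch_size = 16  # config.patch_size

num_patches_height = image_height // patch_size
num_patches_width = image_width // patch_size
print(num_patches_height, num_patches_width)  # 14 14
```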

Comment on lines +495 to +499
if not return_dict:
output = (tuple(feature_maps),)
if output_hidden_states:
output = output + (hidden_states,)
return output

Let's use the can_return_tuple decorator on the forward function instead
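can_return_tuple is a transformers decorator that converts the model-output object to a plain tuple when return_dict=False, so the forward body never needs to branch on the flag. A simplified pure-Python sketch of the idea (not the real transformers implementation, which operates on ModelOutput dataclasses):

```python
import functools

def can_return_tuple_sketch(forward):
    """Simplified stand-in for transformers' can_return_tuple decorator (illustrative)."""
    @functools.wraps(forward)
    def wrapper(self, *args, return_dict=True, **kwargs):
        output = forward(self, *args, **kwargs)  # forward always builds the dict-like output
        if not return_dict:
            # Drop unset fields and hand back a plain tuple instead.
            return tuple(value for value in output.values() if value is not None)
        return output
    return wrapper

class TinyBackbone:
    @can_return_tuple_sketch
    def forward(self, pixel_values):
        return {"feature_maps": ("stage_1", "stage_2"), "hidden_states": None}

model = TinyBackbone()
print(model.forward("img"))                     # dict-like output
print(model.forward("img", return_dict=False))  # (('stage_1', 'stage_2'),)
```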


return BackboneOutput(
feature_maps=tuple(feature_maps),
hidden_states=hidden_states if output_hidden_states else None,

Not needed as captured by check_model_inputs


vijayabhaskar-ev commented Oct 21, 2025

@yonigozlan I have addressed all the comments, updated the modeling file accordingly and pushed the latest changes.
Please let me know if there’s anything else you would like adjusted.

@yonigozlan (Member) left a comment

Hey @vijayabhaskar-ev ! Thanks for iterating, almost there! There is some confusion on the layer norms, but once that's fixed, we'll be able to merge :)

from typing import Optional

from ...configuration_utils import PreTrainedConfig
from ...configuration_utils import PretrainedConfig

Suggested change:
- from ...configuration_utils import PretrainedConfig
+ from ...configuration_utils import PreTrainedConfig



class DINOv3ViTConfig(PreTrainedConfig):
class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):

Suggested change:
- class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):
+ class DINOv3ViTConfig(BackboneConfigMixin, PreTrainedConfig):

self.gradient_checkpointing = False

self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)

We already have self.norm here, we should use it instead of self.layernorm (the weights will be loaded in self.norm and not self.layernorm)

hidden_states = layer_module(hidden_states, position_embeddings=position_embeddings)
stage_hidden_states.append(hidden_states)

sequence_output = self.norm(hidden_states)

this should go in the loop below, and use self.norm instead of self.layernorm

@vijayabhaskar-ev (Contributor, Author)

Updated to use self.norm and moved it inside the loop. Please let me know if any further changes are required.

for stage_name, hidden_state in zip(self.stage_names, stage_hidden_states):
if stage_name in self.out_features:
if self.config.apply_layernorm:
hidden_state = self.layernorm(hidden_state)

Suggested change:
- hidden_state = self.layernorm(hidden_state)
+ hidden_state = self.norm(hidden_state)

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@yonigozlan (Member) left a comment

Very nice @vijayabhaskar-ev , thanks for iterating, LGTM!

@yonigozlan yonigozlan enabled auto-merge (squash) November 11, 2025 16:11
@github-actions (Contributor)

[For maintainers] Suggested jobs to run (before merge)

run-slow: auto, dinov3_vit

@yonigozlan yonigozlan merged commit 496c283 into huggingface:main Nov 11, 2025
23 checks passed
SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
* feat: Add DINOv3 support to AutoBackbone [DRAFT]

- Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
- Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
- Support get_intermediate_layers for Facebook compatibility
- Enable multi-scale feature extraction for detection/segmentation

Note: Tests and documentation coming in follow-up commits
Addresses huggingface#40323

* Updated import structure of get_aligned_output_features_output_indices

* Added test for DINOv3ViTBackbone

* Add DINOv3ViTBackbone to model documentation

* Refactored the code to adhere to the Transformers principles

* Generated modeling_dinov3_vit.py

* DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs

- Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
- Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
- Clean up config docstring formatting (consistent indentation and use list[...] types).

* Restructure DINOv3 backbone and update its tests

* Resolved merge conflicts

* Resolved failing testcase

* Fix DINOv3 backbone to use self.norm for feature maps

---------

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>

Labels: None yet

Projects: None yet

Development

Successfully merging this pull request may close these issues.

Is there a plan to add DINOv3 into AutoBackbone?

4 participants