Add DINOv3 AutoBackbone #41276
Conversation
- Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
- Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
- Support get_intermediate_layers for Facebook compatibility
- Enable multi-scale feature extraction for detection/segmentation

Note: Tests and documentation coming in follow-up commits

Addresses huggingface#40323
@qubvel @merveenoyan I closed the previous PR and created a new one, since the old PR ended up with too many reviewers after I rebased from main and resolved merge conflicts.
DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs

- Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
- Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
- Clean up config docstring formatting (consistent indentation and use list[...] types).
yonigozlan left a comment
Hello @vijayabhaskar-ev! Thanks for working on this, I left a few comments on things to modify :)
```python
head_mask: Optional[torch.Tensor] = None,
output_hidden_states: Optional[bool] = None,

if output_hidden_states is None:
    output_hidden_states = self.config.output_hidden_states

if pixel_values is None:
```

pixel_values is an arg so not necessary, let's not clutter the modeling file
```python
if output_hidden_states:
    collected_hidden_states = [hidden_states]

for i, layer_module in enumerate(self.layer):
    hidden_states = layer_module(
        hidden_states,
        position_embeddings=position_embeddings,
    )
    if output_hidden_states:
        collected_hidden_states.append(hidden_states)
```

Not needed, hidden_states are captured by the check_model_inputs decorator
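The capture pattern this comment refers to can be illustrated with a toy example. This is plain Python and deliberately not the actual transformers implementation of @check_model_inputs; the names `capture_outputs` and `TinyEncoder` are made up for the sketch:

```python
import functools

def capture_outputs(forward):
    """Toy stand-in for an output-recording decorator: it wraps forward and
    collects per-layer outputs itself, so the modeling code needs no manual
    `collected_hidden_states` bookkeeping."""
    @functools.wraps(forward)
    def wrapper(self, x, output_hidden_states=False):
        recorded = []
        self._record = recorded.append if output_hidden_states else None
        result = forward(self, x)
        self._record = None
        if output_hidden_states:
            return result, tuple(recorded)
        return result
    return wrapper

class TinyEncoder:
    def __init__(self):
        self._record = None
        self.layers = [lambda v: v + 1, lambda v: v * 2]  # stand-in layers

    @capture_outputs
    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
            if self._record is not None:
                self._record(x)  # recording lives in the decorator, not here
        return x

model = TinyEncoder()
print(model.forward(3))                             # 8
print(model.forward(3, output_hidden_states=True))  # (8, (4, 8))
```

The modeling file then only runs the layers; flag normalization and collection stay in one shared wrapper.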
```python
return BaseModelOutputWithPooling(
    last_hidden_state=sequence_output,
    pooler_output=pooled_output,
    hidden_states=tuple(collected_hidden_states) if output_hidden_states else None,
```

```python
def _tokens_to_bchw(self, tokens: torch.Tensor, H: int, W: int) -> torch.Tensor:
    # tokens: [B, N, C] -> [B, C, H, W], where N == H*W
    B, N, C = tokens.shape
    return tokens.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
```

Let's just do it directly in forward, no need for a separate function
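Inlined, the helper's body reduces to one reshape-and-transpose. A minimal sketch of that tensor movement, with numpy standing in for torch and hypothetical sizes:

```python
import numpy as np

# Hypothetical sizes: 2 images, a 4x4 patch grid, 8 channels.
batch_size, grid_height, grid_width, channels = 2, 4, 4, 8
tokens = np.arange(batch_size * grid_height * grid_width * channels).reshape(
    batch_size, grid_height * grid_width, channels
)  # [B, N, C] with N == H*W

# numpy equivalent of tokens.reshape(B, H, W, C).permute(0, 3, 1, 2).contiguous()
feature_map = np.ascontiguousarray(
    tokens.reshape(batch_size, grid_height, grid_width, channels).transpose(0, 3, 1, 2)
)
print(feature_map.shape)  # (2, 8, 4, 4)
```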
```python
) -> BackboneOutput:
    return_dict = kwargs.get("return_dict", getattr(self.config, "use_return_dict", True))

    outputs = self.dinov3(pixel_values, output_hidden_states=True)
```

```python
B, C_in, H_img, W_img = pixel_values.shape
patch = self.config.patch_size
H = H_img // patch
W = W_img // patch
```

Let's use more explicit variable names, and no one-letter variables
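A sketch of the requested renaming (the names and `_ToyConfig` stand-in here are illustrative, not necessarily the ones that landed in the PR):

```python
class _ToyConfig:  # minimal stand-in for the model config
    patch_size = 16

pixel_values_shape = (2, 3, 224, 224)  # example [batch, channels, height, width]
batch_size, num_channels, image_height, image_width = pixel_values_shape
patch_size = _ToyConfig.patch_size
num_patches_height = image_height // patch_size
num_patches_width = image_width // patch_size
print(num_patches_height, num_patches_width)  # 14 14
```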
```python
if not return_dict:
    output = (tuple(feature_maps),)
    if output_hidden_states:
        output = output + (hidden_states,)
    return output
```

Let's use the can_return_tuple decorator on the forward function instead
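The decorator idea can be sketched in a few lines of toy Python (this is not the actual transformers can_return_tuple implementation; `can_return_tuple_sketch` and the tiny classes are invented for illustration): the wrapper converts the structured output to a plain tuple when return_dict=False, so forward always just builds the dataclass.

```python
import functools
from dataclasses import astuple, dataclass
from typing import Optional

def can_return_tuple_sketch(forward):
    @functools.wraps(forward)
    def wrapper(self, *args, return_dict=True, **kwargs):
        output = forward(self, *args, **kwargs)
        if not return_dict:
            # drop None fields and hand back a plain tuple
            return tuple(v for v in astuple(output) if v is not None)
        return output
    return wrapper

@dataclass
class BackboneOutputSketch:
    feature_maps: tuple
    hidden_states: Optional[tuple] = None

class TinyBackbone:
    @can_return_tuple_sketch
    def forward(self, feature_maps):
        # forward unconditionally builds the structured output
        return BackboneOutputSketch(feature_maps=feature_maps)

backbone = TinyBackbone()
print(backbone.forward((1, 2)).feature_maps)        # (1, 2)
print(backbone.forward((1, 2), return_dict=False))  # ((1, 2),)
```

This keeps the tuple/dataclass branching out of every individual forward body.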
```python
return BackboneOutput(
    feature_maps=tuple(feature_maps),
    hidden_states=hidden_states if output_hidden_states else None,
```

Not needed as captured by check_model_inputs
@yonigozlan I have addressed all the comments, updated the modeling file accordingly and pushed the latest changes.
yonigozlan left a comment
Hey @vijayabhaskar-ev! Thanks for iterating, almost there! There is some confusion on the layer norms, but once that's fixed, we'll be able to merge :)
```python
from typing import Optional

from ...configuration_utils import PreTrainedConfig
from ...configuration_utils import PretrainedConfig
```

Suggested change:

```diff
- from ...configuration_utils import PretrainedConfig
+ from ...configuration_utils import PreTrainedConfig
```
```python
class DINOv3ViTConfig(PreTrainedConfig):
class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):
```

Suggested change:

```diff
- class DINOv3ViTConfig(BackboneConfigMixin, PretrainedConfig):
+ class DINOv3ViTConfig(BackboneConfigMixin, PreTrainedConfig):
```
```python
self.gradient_checkpointing = False

self.num_features = [config.hidden_size for _ in range(config.num_hidden_layers + 1)]
self.layernorm = nn.LayerNorm(config.hidden_size, eps=config.layer_norm_eps)
```

We already have self.norm here, we should use it instead of self.layernorm (the weights will be loaded in self.norm and not self.layernorm)
```python
hidden_states = layer_module(hidden_states, position_embeddings=position_embeddings)
stage_hidden_states.append(hidden_states)

sequence_output = self.norm(hidden_states)
```

This should go in the loop below, and use self.norm instead of self.layernorm.

Updated to use self.norm and applied it inside the loop. Please let me know if any further changes are required.
```python
for stage_name, hidden_state in zip(self.stage_names, stage_hidden_states):
    if stage_name in self.out_features:
        if self.config.apply_layernorm:
            hidden_state = self.layernorm(hidden_state)
```

Suggested change:

```diff
- hidden_state = self.layernorm(hidden_state)
+ hidden_state = self.norm(hidden_state)
```
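Putting this thread's fixes together, the stage-selection loop applies the shared, checkpoint-loaded norm per selected stage. A minimal runnable sketch with toy stand-ins (the `TinyBackbone` class and its toy norm are invented here, not the real nn.Module code):

```python
class TinyBackbone:
    def __init__(self):
        self.stage_names = ["stem", "stage1", "stage2"]
        self.out_features = ["stage1", "stage2"]
        self.apply_layernorm = True
        # toy stand-in for the single shared nn.LayerNorm held in self.norm
        self.norm = lambda value: value / max(abs(value), 1)

    def feature_maps(self, stage_hidden_states):
        maps = []
        for stage_name, hidden_state in zip(self.stage_names, stage_hidden_states):
            if stage_name in self.out_features:
                if self.apply_layernorm:
                    # shared norm, not a duplicate never-loaded layer
                    hidden_state = self.norm(hidden_state)
                maps.append(hidden_state)
        return tuple(maps)

backbone = TinyBackbone()
print(backbone.feature_maps([3.0, 4.0, -8.0]))  # (1.0, -1.0)
```

The point of the review: only one norm module exists in the checkpoint, so routing every stage through that same attribute is what makes the pretrained weights take effect.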
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
yonigozlan left a comment
Very nice @vijayabhaskar-ev, thanks for iterating, LGTM!
[For maintainers] Suggested jobs to run (before merge): run-slow: auto, dinov3_vit
* feat: Add DINOv3 support to AutoBackbone [DRAFT]
  - Implement DINOv3ViTConfig, DINOv3ViTModel, and DINOv3ViTBackbone
  - Add DINOv3 to MODEL_FOR_BACKBONE_MAPPING_NAMES
  - Support get_intermediate_layers for Facebook compatibility
  - Enable multi-scale feature extraction for detection/segmentation
  - Note: Tests and documentation coming in follow-up commits
  - Addresses huggingface#40323
* Updated import structure of get_aligned_output_features_output_indices
* Added test for DINOv3ViTBackbone
* Add DINOv3ViTBackbone to model documentation
* Refactored the code to adhere to the Transformers principles
* Generated modeling_dinov3_vit.py
* DINOv3ViT backbone: keep hidden_states with return_dict=False, add @check_model_inputs and polish docs
  - Add @check_model_inputs to DINOv3ViTBackbone.forward to normalize flags and enable output recording.
  - Preserve hidden_states when return_dict=False by appending them to the tuple output when requested.
  - Clean up config docstring formatting (consistent indentation and use list[...] types).
* Restructure DINOv3 backbone and update its tests
* Resolved merge conflicts
* Resolved failing testcase
* Fix DINOv3 backbone to use self.norm for feature maps

Co-authored-by: yonigozlan <yoni.gozlan@huggingface.co>
This PR implements DINOv3 support for AutoBackbone.
What's Implemented:
- DINOv3ViTConfig: Complete configuration class with all DINOv3-specific parameters
- DINOv3ViTModel: Full model implementation with RoPE embeddings, register tokens, and SwiGLU MLP
- DINOv3ViTBackbone: AutoBackbone integration with multi-scale feature extraction
- AutoBackbone Integration: Added to MODEL_FOR_BACKBONE_MAPPING_NAMES in modeling_auto.py
- Facebook API Compatibility: get_intermediate_layers method matching the Facebook Research implementation
- Multi-scale Features: Support for object detection and segmentation downstream tasks
Fixes #40323