diff --git a/README.md b/README.md index 416702f20b07..83ac0ba7a6df 100644 --- a/README.md +++ b/README.md @@ -336,6 +336,7 @@ Current number of checkpoints: ![](https://img.shields.io/endpoint?url=https://h 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. diff --git a/README_es.md b/README_es.md index 7aa72c2f3e34..b10c0c870a4a 100644 --- a/README_es.md +++ b/README_es.md @@ -336,6 +336,7 @@ Número actual de puntos de control: ![](https://img.shields.io/endpoint?url=htt 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. diff --git a/README_ja.md b/README_ja.md index 6f5a8d94a200..24410a273516 100644 --- a/README_ja.md +++ b/README_ja.md @@ -371,6 +371,7 @@ Flax、PyTorch、TensorFlowをcondaでインストールする方法は、それ 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. diff --git a/README_ko.md b/README_ko.md index 8d65e33a744f..f8506d322aaa 100644 --- a/README_ko.md +++ b/README_ko.md @@ -286,6 +286,7 @@ Flax, PyTorch, TensorFlow 설치 페이지에서 이들을 conda로 설치하는 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. diff --git a/README_zh-hans.md b/README_zh-hans.md index 6b053d7bfe8a..14138e3bbb6e 100644 --- a/README_zh-hans.md +++ b/README_zh-hans.md @@ -310,6 +310,7 @@ conda install -c huggingface transformers 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** 用 [OPUS](http://opus.nlpl.eu/) 数据训练的机器翻译模型由 Jörg Tiedemann 发布。[Marian Framework](https://marian-nmt.github.io/) 由微软翻译团队开发。 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (来自 Microsoft Research Asia) 伴随论文 [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) 由 Junlong Li, Yiheng Xu, Lei Cui, Furu Wei 发布。 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov >>>>>>> Fix rebase +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) 由 Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer 发布。 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (来自 Facebook) 伴随论文 [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) 由 Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan 发布。 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (来自 NVIDIA) 伴随论文 [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) 由 Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro 发布。 diff --git a/README_zh-hant.md b/README_zh-hant.md index 7c1a31eb7fa6..3c198278731e 100644 --- a/README_zh-hant.md +++ b/README_zh-hant.md @@ -322,6 +322,7 @@ conda install -c huggingface transformers 1. **[MarianMT](https://huggingface.co/docs/transformers/model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](https://huggingface.co/docs/transformers/model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](https://huggingface.co/docs/transformers/model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov +1. **[MaskformerSwin](https://huggingface.co/docs/transformers/main/model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](https://huggingface.co/docs/transformers/model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](https://huggingface.co/docs/transformers/model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. diff --git a/docs/source/en/index.mdx b/docs/source/en/index.mdx index 14856033c85c..b4d97e95ba54 100644 --- a/docs/source/en/index.mdx +++ b/docs/source/en/index.mdx @@ -124,6 +124,7 @@ The documentation is organized into five sections: 1. **[MarianMT](model_doc/marian)** Machine translation models trained using [OPUS](http://opus.nlpl.eu/) data by Jörg Tiedemann. The [Marian Framework](https://marian-nmt.github.io/) is being developed by the Microsoft Translator Team. 1. **[MarkupLM](model_doc/markuplm)** (from Microsoft Research Asia) released with the paper [MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding](https://arxiv.org/abs/2110.08518) by Junlong Li, Yiheng Xu, Lei Cui, Furu Wei. 1. **[MaskFormer](model_doc/maskformer)** (from Meta and UIUC) released with the paper [Per-Pixel Classification is Not All You Need for Semantic Segmentation](https://arxiv.org/abs/2107.06278) by Bowen Cheng, Alexander G. Schwing, Alexander Kirillov. +1. **[MaskformerSwin](model_doc/maskformer-swin)** (from ) released with the paper []() by . 1. **[mBART](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Denoising Pre-training for Neural Machine Translation](https://arxiv.org/abs/2001.08210) by Yinhan Liu, Jiatao Gu, Naman Goyal, Xian Li, Sergey Edunov, Marjan Ghazvininejad, Mike Lewis, Luke Zettlemoyer. 1. **[mBART-50](model_doc/mbart)** (from Facebook) released with the paper [Multilingual Translation with Extensible Multilingual Pretraining and Finetuning](https://arxiv.org/abs/2008.00401) by Yuqing Tang, Chau Tran, Xian Li, Peng-Jen Chen, Naman Goyal, Vishrav Chaudhary, Jiatao Gu, Angela Fan. 1. **[Megatron-BERT](model_doc/megatron-bert)** (from NVIDIA) released with the paper [Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism](https://arxiv.org/abs/1909.08053) by Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper and Bryan Catanzaro. @@ -278,6 +279,7 @@ Flax), PyTorch, and/or TensorFlow. | Marian | ✅ | ❌ | ✅ | ✅ | ✅ | | MarkupLM | ✅ | ✅ | ✅ | ❌ | ❌ | | MaskFormer | ❌ | ❌ | ✅ | ❌ | ❌ | +| MaskformerSwin | ❌ | ❌ | ✅ | ❌ | ❌ | | mBART | ✅ | ✅ | ✅ | ✅ | ✅ | | Megatron-BERT | ❌ | ❌ | ✅ | ❌ | ❌ | | MobileBERT | ✅ | ✅ | ✅ | ✅ | ❌ | diff --git a/docs/source/en/model_doc/auto.mdx b/docs/source/en/model_doc/auto.mdx index 79ad20bd8045..1e91e81d7cdf 100644 --- a/docs/source/en/model_doc/auto.mdx +++ b/docs/source/en/model_doc/auto.mdx @@ -186,6 +186,10 @@ Likewise, if your `NewModel` is a subclass of [`PreTrainedModel`], make sure its [[autodoc]] AutoModelForZeroShotObjectDetection +## AutoBackbone + +[[autodoc]] AutoBackbone + ## TFAutoModel [[autodoc]] TFAutoModel diff --git a/docs/source/en/model_doc/maskformer.mdx b/docs/source/en/model_doc/maskformer.mdx index 34414dbd8f2c..f674070c7a64 100644 --- a/docs/source/en/model_doc/maskformer.mdx +++ b/docs/source/en/model_doc/maskformer.mdx @@ -70,3 +70,17 @@ This model was contributed by [francesco](https://huggingface.co/francesco). The [[autodoc]] MaskFormerForInstanceSegmentation - forward + +## MaskFormerSwinConfig + +[[autodoc]] MaskFormerSwinConfig + +## MaskFormerSwinModel + +[[autodoc]] MaskFormerSwinModel + - forward + +## MaskFormerSwinBackbone + +[[autodoc]] MaskFormerSwinBackbone + - forward diff --git a/docs/source/en/model_doc/resnet.mdx b/docs/source/en/model_doc/resnet.mdx index 3c8af6227d19..6f306d601783 100644 --- a/docs/source/en/model_doc/resnet.mdx +++ b/docs/source/en/model_doc/resnet.mdx @@ -50,6 +50,12 @@ This model was contributed by [Francesco](https://huggingface.co/Francesco). The - forward +## ResNetBackbone + +[[autodoc]] ResNetBackbone + - forward + + ## TFResNetModel [[autodoc]] TFResNetModel diff --git a/src/transformers/__init__.py b/src/transformers/__init__.py index c07836075b3f..8e0fda565605 100644 --- a/src/transformers/__init__.py +++ b/src/transformers/__init__.py @@ -282,7 +282,7 @@ "MarkupLMProcessor", "MarkupLMTokenizer", ], - "models.maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig"], + "models.maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig", "MaskFormerSwinConfig"], "models.mbart": ["MBartConfig"], "models.mbart50": [], "models.mctct": ["MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP", "MCTCTConfig", "MCTCTProcessor"], @@ -929,6 +929,7 @@ "MODEL_WITH_LM_HEAD_MAPPING", "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING", "AutoModel", + "AutoBackbone", "AutoModelForAudioClassification", "AutoModelForAudioFrameClassification", "AutoModelForAudioXVector", @@ -1593,6 +1594,9 @@ "MaskFormerForInstanceSegmentation", "MaskFormerModel", "MaskFormerPreTrainedModel", + "MaskFormerSwinModel", + "MaskFormerSwinPreTrainedModel", + "MaskFormerSwinBackbone", ] ) _import_structure["models.markuplm"].extend( @@ -1865,6 +1869,7 @@ "ResNetForImageClassification", "ResNetModel", "ResNetPreTrainedModel", + "ResNetBackbone", ] ) _import_structure["models.retribert"].extend( @@ -3377,7 +3382,7 @@ MarkupLMProcessor, MarkupLMTokenizer, ) - from .models.maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig + from .models.maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig, MaskformerSwinConfig from .models.mbart import MBartConfig from .models.mctct import MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP, MCTCTConfig, MCTCTProcessor from .models.megatron_bert import MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP, MegatronBertConfig @@ -3911,6 +3916,7 @@ MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING, MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, + AutoBackbone, AutoModel, AutoModelForAudioClassification, AutoModelForAudioFrameClassification, @@ -4458,6 +4464,9 @@ MaskFormerForInstanceSegmentation, MaskFormerModel, MaskFormerPreTrainedModel, + MaskFormerSwinBackbone, + MaskFormerSwinModel, + MaskFormerSwinPreTrainedModel, ) from .models.mbart import ( MBartForCausalLM, @@ -4678,6 +4687,7 @@ ) from .models.resnet import ( RESNET_PRETRAINED_MODEL_ARCHIVE_LIST, + ResNetBackbone, ResNetForImageClassification, ResNetModel, ResNetPreTrainedModel, diff --git a/src/transformers/backbone.py b/src/transformers/backbone.py new file mode 100644 index 000000000000..bcfbf05210c1 --- /dev/null +++ b/src/transformers/backbone.py @@ -0,0 +1,71 @@ +# Copyright (c) Facebook, Inc. and its affiliates. +from abc import abstractmethod +from dataclasses import dataclass +from typing import Dict, Optional + +from torch import nn + + +@dataclass +class ShapeSpec: + """ + A simple structure that contains basic shape specification about a tensor. It is often used as the auxiliary + inputs/outputs of models, to complement the lack of shape inference ability among pytorch modules. + """ + + channels: Optional[int] = None + height: Optional[int] = None + width: Optional[int] = None + stride: Optional[int] = None + + +class Backbone(nn.Module): + """ + Abstract base class for network backbones. + """ + + @abstractmethod + def forward(self): + """ + Subclasses must override this method, but adhere to the same return type. + + Returns: + dict[str->Tensor]: mapping from feature name (e.g., "res2") to tensor + """ + pass + + @property + def size_divisibility(self) -> int: + """ + Some backbones require the input height and width to be divisible by a specific integer. This is typically true + for encoder / decoder type networks with lateral connection (e.g., FPN) for which feature maps need to match + dimension in the "bottom up" and "top down" paths. Set to 0 if no specific input size divisibility is required. + """ + return 0 + + @property + def padding_constraints(self) -> Dict[str, int]: + """ + This property is a generalization of size_divisibility. Some backbones and training recipes require specific + padding constraints, such as enforcing divisibility by a specific integer (e.g., FPN) or padding to a square + (e.g., ViTDet with large-scale jitter in :paper:vitdet). `padding_constraints` contains these optional items + like: { + "size_divisibility": int, "square_size": int, # Future options are possible + } `size_divisibility` will read from here if presented and `square_size` indicates the square padding size if + `square_size` > 0. + + TODO: use type of Dict[str, int] to avoid torchscipt issues. The type of padding_constraints could be + generalized as TypedDict (Python 3.8+) to support more types in the future. + """ + return {} + + def output_shape(self): + """ + Returns: + dict[str->ShapeSpec] + """ + # this is a backward-compatible default + return { + name: ShapeSpec(channels=self._out_feature_channels[name], stride=self._out_feature_strides[name]) + for name in self._out_features + } diff --git a/src/transformers/models/auto/__init__.py b/src/transformers/models/auto/__init__.py index a1a1356eb006..718f4a221454 100644 --- a/src/transformers/models/auto/__init__.py +++ b/src/transformers/models/auto/__init__.py @@ -73,6 +73,7 @@ "MODEL_WITH_LM_HEAD_MAPPING", "MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING", "AutoModel", + "AutoBackbone", "AutoModelForAudioClassification", "AutoModelForAudioFrameClassification", "AutoModelForAudioXVector", @@ -225,6 +226,7 @@ MODEL_FOR_ZERO_SHOT_OBJECT_DETECTION_MAPPING, MODEL_MAPPING, MODEL_WITH_LM_HEAD_MAPPING, + AutoBackbone, AutoModel, AutoModelForAudioClassification, AutoModelForAudioFrameClassification, diff --git a/src/transformers/models/auto/configuration_auto.py b/src/transformers/models/auto/configuration_auto.py index e92faa4d041a..af991b38d5a4 100644 --- a/src/transformers/models/auto/configuration_auto.py +++ b/src/transformers/models/auto/configuration_auto.py @@ -95,6 +95,7 @@ ("marian", "MarianConfig"), ("markuplm", "MarkupLMConfig"), ("maskformer", "MaskFormerConfig"), + ("maskformer-swin", "MaskFormerSwinConfig"), ("mbart", "MBartConfig"), ("mctct", "MCTCTConfig"), ("megatron-bert", "MegatronBertConfig"), @@ -234,6 +235,7 @@ ("m2m_100", "M2M_100_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("markuplm", "MARKUPLM_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("maskformer", "MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP"), + ("maskformer-swin", "MASKFORMER_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mbart", "MBART_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("mctct", "MCTCT_PRETRAINED_CONFIG_ARCHIVE_MAP"), ("megatron-bert", "MEGATRON_BERT_PRETRAINED_CONFIG_ARCHIVE_MAP"), @@ -378,6 +380,7 @@ ("marian", "Marian"), ("markuplm", "MarkupLM"), ("maskformer", "MaskFormer"), + ("maskformer-swin", "MaskformerSwin"), ("mbart", "mBART"), ("mbart50", "mBART-50"), ("mctct", "M-CTC-T"), @@ -470,6 +473,7 @@ ("data2vec-text", "data2vec"), ("data2vec-vision", "data2vec"), ("donut-swin", "donut"), + ("maskformer-swin", "maskformer"), ("xclip", "x_clip"), ] ) diff --git a/src/transformers/models/auto/modeling_auto.py b/src/transformers/models/auto/modeling_auto.py index 93355ddb5649..d2b825336448 100644 --- a/src/transformers/models/auto/modeling_auto.py +++ b/src/transformers/models/auto/modeling_auto.py @@ -94,6 +94,7 @@ ("marian", "MarianModel"), ("markuplm", "MarkupLMModel"), ("maskformer", "MaskFormerModel"), + ("maskformer-swin", "MaskFormerSwinModel"), ("mbart", "MBartModel"), ("mctct", "MCTCTModel"), ("megatron-bert", "MegatronBertModel"), @@ -827,6 +828,16 @@ ] ) +BACKBONE_MAPPING_NAMES = OrderedDict( + [ + # Model for Instance Segmentation mapping + ("maskformer-swin", "MaskFormerSwinBackbone"), + ("resnet", "ResNetBackbone"), + ("swin", "MaskFormerSwinBackbone"), # for backward compatibility + ] +) + +BACKBONE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, BACKBONE_MAPPING_NAMES) MODEL_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_MAPPING_NAMES) MODEL_FOR_PRETRAINING_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_PRETRAINING_MAPPING_NAMES) MODEL_WITH_LM_HEAD_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_WITH_LM_HEAD_MAPPING_NAMES) @@ -893,6 +904,11 @@ CONFIG_MAPPING_NAMES, MODEL_FOR_AUDIO_FRAME_CLASSIFICATION_MAPPING_NAMES ) MODEL_FOR_AUDIO_XVECTOR_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, MODEL_FOR_AUDIO_XVECTOR_MAPPING_NAMES) +BACKBOONE_MAPPING = _LazyAutoMapping(CONFIG_MAPPING_NAMES, BACKBONE_MAPPING_NAMES) + + +class AutoBackbone(_BaseAutoModelClass): + _model_mapping = BACKBOONE_MAPPING class AutoModel(_BaseAutoModelClass): diff --git a/src/transformers/models/maskformer/__init__.py b/src/transformers/models/maskformer/__init__.py index 4234f76dc565..4beb0b0a482f 100644 --- a/src/transformers/models/maskformer/__init__.py +++ b/src/transformers/models/maskformer/__init__.py @@ -20,7 +20,10 @@ from ...utils import OptionalDependencyNotAvailable, _LazyModule, is_torch_available, is_vision_available -_import_structure = {"configuration_maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig"]} +_import_structure = { + "configuration_maskformer": ["MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerConfig"], + "configuration_maskformer_swin": ["MASKFORMER_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP", "MaskFormerSwinConfig"], +} try: if not is_vision_available(): @@ -43,9 +46,15 @@ "MaskFormerModel", "MaskFormerPreTrainedModel", ] + _import_structure["modeling_maskformer_swin"] = [ + "MaskFormerSwinModel", + "MaskFormerSwinPreTrainedModel", + "MaskFormerSwinBackbone", + ] if TYPE_CHECKING: from .configuration_maskformer import MASKFORMER_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskFormerConfig + from .configuration_maskformer_swin import MASKFORMER_SWIN_PRETRAINED_CONFIG_ARCHIVE_MAP, MaskformerSwinConfig try: if not is_vision_available(): @@ -66,6 +75,11 @@ MaskFormerModel, MaskFormerPreTrainedModel, ) + from .modeling_maskformer_swin import ( + MaskFormerSwinBackbone, + MaskFormerSwinModel, + MaskFormerSwinPreTrainedModel, + ) else: diff --git a/src/transformers/models/maskformer/configuration_maskformer.py b/src/transformers/models/maskformer/configuration_maskformer.py index 8c0187d81b99..7b42515883fa 100644 --- a/src/transformers/models/maskformer/configuration_maskformer.py +++ b/src/transformers/models/maskformer/configuration_maskformer.py @@ -16,9 +16,10 @@ import copy from typing import Dict, Optional +from transformers import CONFIG_MAPPING + from ...configuration_utils import PretrainedConfig from ...utils import logging -from ..auto.configuration_auto import AutoConfig from ..detr import DetrConfig from ..swin import SwinConfig @@ -97,16 +98,16 @@ class MaskFormerConfig(PretrainedConfig): """ model_type = "maskformer" attribute_map = {"hidden_size": "mask_feature_size"} - backbones_supported = ["swin"] + backbones_supported = ["swin", "resnet"] decoders_supported = ["detr"] def __init__( self, + backbone_config: Optional[Dict] = None, fpn_feature_size: int = 256, mask_feature_size: int = 256, no_object_weight: float = 0.1, use_auxiliary_loss: bool = False, - backbone_config: Optional[Dict] = None, decoder_config: Optional[Dict] = None, init_std: float = 0.02, init_xavier_std: float = 1.0, @@ -129,25 +130,34 @@ def __init__( drop_path_rate=0.3, ) else: - backbone_model_type = backbone_config.pop("model_type") + # verify that the backbone is supported + backbone_model_type = ( + backbone_config.pop("model_type") if isinstance(backbone_config, dict) else backbone_config.model_type + ) if backbone_model_type not in self.backbones_supported: raise ValueError( f"Backbone {backbone_model_type} not supported, please use one of" f" {','.join(self.backbones_supported)}" ) - backbone_config = AutoConfig.for_model(backbone_model_type, **backbone_config) + if isinstance(backbone_config, dict): + config_class = CONFIG_MAPPING[backbone_model_type]() + backbone_config = config_class.from_dict(backbone_config) if decoder_config is None: # fall back to https://huggingface.co/facebook/detr-resnet-50 decoder_config = DetrConfig() else: - decoder_type = decoder_config.pop("model_type") + decoder_type = ( + decoder_config.pop("model_type") if isinstance(decoder_config, dict) else decoder_config.model_type + ) if decoder_type not in self.decoders_supported: raise ValueError( f"Transformer Decoder {decoder_type} not supported, please use one of" f" {','.join(self.decoders_supported)}" ) - decoder_config = AutoConfig.for_model(decoder_type, **decoder_config) + if isinstance(decoder_config, dict): + config_class = CONFIG_MAPPING[decoder_type]() + decoder_config = config_class.from_dict(decoder_config) self.backbone_config = backbone_config self.decoder_config = decoder_config @@ -186,8 +196,8 @@ def from_backbone_and_decoder_configs( [`MaskFormerConfig`]: An instance of a configuration object """ return cls( - backbone_config=backbone_config.to_dict(), - decoder_config=decoder_config.to_dict(), + backbone_config=backbone_config, + decoder_config=decoder_config, **kwargs, ) diff --git a/src/transformers/models/maskformer/configuration_maskformer_swin.py b/src/transformers/models/maskformer/configuration_maskformer_swin.py new file mode 100644 index 000000000000..be19804473e1 --- /dev/null +++ b/src/transformers/models/maskformer/configuration_maskformer_swin.py @@ -0,0 +1,140 @@ +# coding=utf-8 +# Copyright 2022 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" MaskFormer Swin Transformer model configuration""" + +from ...configuration_utils import PretrainedConfig +from ...utils import logging + + +logger = logging.get_logger(__name__) + + +class MaskFormerSwinConfig(PretrainedConfig): + r""" + This is the configuration class to store the configuration of a [`MaskFormerSwinModel`]. It is used to instantiate + a Donut model according to the specified arguments, defining the model architecture. Instantiating a configuration + with the defaults will yield a similar configuration to that of the Swin + [microsoft/swin-tiny-patch4-window7-224(https://huggingface.co/microsoft/swin-tiny-patch4-window7-224) + architecture. + + Configuration objects inherit from [`PretrainedConfig`] and can be used to control the model outputs. Read the + documentation from [`PretrainedConfig`] for more information. + + Args: + image_size (`int`, *optional*, defaults to 224): + The size (resolution) of each image. + patch_size (`int`, *optional*, defaults to 4): + The size (resolution) of each patch. + num_channels (`int`, *optional*, defaults to 3): + The number of input channels. + embed_dim (`int`, *optional*, defaults to 96): + Dimensionality of patch embedding. + depths (`list(int)`, *optional*, defaults to [2, 2, 6, 2]): + Depth of each layer in the Transformer encoder. + num_heads (`list(int)`, *optional*, defaults to [3, 6, 12, 24]): + Number of attention heads in each layer of the Transformer encoder. + window_size (`int`, *optional*, defaults to 7): + Size of windows. + mlp_ratio (`float`, *optional*, defaults to 4.0): + Ratio of MLP hidden dimensionality to embedding dimensionality. + qkv_bias (`bool`, *optional*, defaults to True): + Whether or not a learnable bias should be added to the queries, keys and values. + hidden_dropout_prob (`float`, *optional*, defaults to 0.0): + The dropout probability for all fully connected layers in the embeddings and encoder. + attention_probs_dropout_prob (`float`, *optional*, defaults to 0.0): + The dropout ratio for the attention probabilities. + drop_path_rate (`float`, *optional*, defaults to 0.1): + Stochastic depth rate. + hidden_act (`str` or `function`, *optional*, defaults to `"gelu"`): + The non-linear activation function (function or string) in the encoder. If string, `"gelu"`, `"relu"`, + `"selu"` and `"gelu_new"` are supported. + use_absolute_embeddings (`bool`, *optional*, defaults to False): + Whether or not to add absolute position embeddings to the patch embeddings. + patch_norm (`bool`, *optional*, defaults to True): + Whether or not to add layer normalization after patch embedding. + initializer_range (`float`, *optional*, defaults to 0.02): + The standard deviation of the truncated_normal_initializer for initializing all weight matrices. + layer_norm_eps (`float`, *optional*, defaults to 1e-12): + The epsilon used by the layer normalization layers. + out_features (`List[str]`, *optional*): + If used as a backbone, list of feature names to output, e.g. ["stem", "stage1"]. + + Example: + + ```python + >>> from transformers import MaskFormerSwinConfig, MaskFormerSwinModel + + >>> # Initializing a Donut naver-clova-ix/donut-base style configuration + >>> configuration = MaskFormerSwinConfig() + + >>> # Randomly initializing a model from the naver-clova-ix/donut-base style configuration + >>> model = MaskFormerSwinModel(configuration) + + >>> # Accessing the model configuration + >>> configuration = model.config + ```""" + model_type = "maskformer-swin" + + attribute_map = { + "num_attention_heads": "num_heads", + "num_hidden_layers": "num_layers", + } + + def __init__( + self, + image_size=224, + patch_size=4, + num_channels=3, + embed_dim=96, + depths=[2, 2, 6, 2], + num_heads=[3, 6, 12, 24], + window_size=7, + mlp_ratio=4.0, + qkv_bias=True, + hidden_dropout_prob=0.0, + attention_probs_dropout_prob=0.0, + drop_path_rate=0.1, + hidden_act="gelu", + use_absolute_embeddings=False, + patch_norm=True, + initializer_range=0.02, + layer_norm_eps=1e-5, + out_features=None, + **kwargs + ): + super().__init__(**kwargs) + + self.image_size = image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.embed_dim = embed_dim + self.depths = depths + self.num_layers = len(depths) + self.num_heads = num_heads + self.window_size = window_size + self.mlp_ratio = mlp_ratio + self.qkv_bias = qkv_bias + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.drop_path_rate = drop_path_rate + self.hidden_act = hidden_act + self.use_absolute_embeddings = use_absolute_embeddings + self.path_norm = patch_norm + self.layer_norm_eps = layer_norm_eps + self.initializer_range = initializer_range + # we set the hidden_size attribute in order to make Swin work with VisionEncoderDecoderModel + # this indicates the channel dimension after the last stage of the model + self.hidden_size = int(embed_dim * 2 ** (len(depths) - 1)) + self.out_features = out_features diff --git a/src/transformers/models/maskformer/convert_maskformer_resnet_to_pytorch.py b/src/transformers/models/maskformer/convert_maskformer_resnet_to_pytorch.py new file mode 100644 index 000000000000..d21565fba255 --- /dev/null +++ b/src/transformers/models/maskformer/convert_maskformer_resnet_to_pytorch.py @@ -0,0 +1,321 @@ +# coding=utf-8 +# Copyright 2022 The HuggingFace Inc. team. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +"""Convert MaskFormer checkpoints from the original repository. URL: https://github.com/facebookresearch/MaskFormer""" + + +import argparse +import pickle +from pathlib import Path + +import torch +from PIL import Image + +import requests +from transformers import MaskFormerConfig, MaskFormerFeatureExtractor, MaskFormerForInstanceSegmentation, ResNetConfig +from transformers.utils import logging + + +logging.set_verbosity_info() +logger = logging.get_logger(__name__) + + +def get_maskformer_config(model_name: str): + if "resnet101-c" in model_name: + # TODO use ResNet-C model here instead + backbone_config = ResNetConfig.from_pretrained("microsoft/resnet-101", use_deeplab_stem=True) + elif "resnet101" in model_name: + backbone_config = ResNetConfig.from_pretrained( + "microsoft/resnet-101", out_features=["stage1", "stage2", "stage3", "stage4"] + ) + else: + backbone_config = ResNetConfig.from_pretrained( + "microsoft/resnet-50", out_features=["stage1", "stage2", "stage3", "stage4"] + ) + config = MaskFormerConfig(backbone_config=backbone_config) + + # TODO id2label mappings + config.num_labels = 150 + + return config + + +def create_rename_keys(config): + rename_keys = [] + # stem + # fmt: off + rename_keys.append(("backbone.stem.conv1.weight", "model.pixel_level_module.encoder.resnet.embedder.embedder.convolution.weight")) + rename_keys.append(("backbone.stem.conv1.norm.weight", "model.pixel_level_module.encoder.resnet.embedder.embedder.normalization.weight")) + rename_keys.append(("backbone.stem.conv1.norm.bias", "model.pixel_level_module.encoder.resnet.embedder.embedder.normalization.bias")) + rename_keys.append(("backbone.stem.conv1.norm.running_mean", "model.pixel_level_module.encoder.resnet.embedder.embedder.normalization.running_mean")) + rename_keys.append(("backbone.stem.conv1.norm.running_var", "model.pixel_level_module.encoder.resnet.embedder.embedder.normalization.running_var")) + # fmt: on + # stages + for stage_idx in range(len(config.backbone_config.depths)): + for layer_idx in range(config.backbone_config.depths[stage_idx]): + # shortcut + if layer_idx == 0: + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.shortcut.weight", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.shortcut.convolution.weight", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.shortcut.norm.weight", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.shortcut.normalization.weight", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.shortcut.norm.bias", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.shortcut.normalization.bias", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.shortcut.norm.running_mean", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.shortcut.normalization.running_mean", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.shortcut.norm.running_var", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.shortcut.normalization.running_var", + ) + ) + # 3 convs + for i in range(3): + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.conv{i+1}.weight", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.layer.{i}.convolution.weight", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.conv{i+1}.norm.weight", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.layer.{i}.normalization.weight", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.conv{i+1}.norm.bias", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.layer.{i}.normalization.bias", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.conv{i+1}.norm.running_mean", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.layer.{i}.normalization.running_mean", + ) + ) + rename_keys.append( + ( + f"backbone.res{stage_idx + 2}.{layer_idx}.conv{i+1}.norm.running_var", + f"model.pixel_level_module.encoder.resnet.encoder.stages.{stage_idx}.layers.{layer_idx}.layer.{i}.normalization.running_var", + ) + ) + + # FPN + # fmt: off + rename_keys.append(("sem_seg_head.layer_4.weight", "model.pixel_level_module.decoder.fpn.stem.0.weight")) + rename_keys.append(("sem_seg_head.layer_4.norm.weight", "model.pixel_level_module.decoder.fpn.stem.1.weight")) + rename_keys.append(("sem_seg_head.layer_4.norm.bias", "model.pixel_level_module.decoder.fpn.stem.1.bias")) + for source_index, target_index in zip(range(3, 0, -1), range(0, 3)): + rename_keys.append((f"sem_seg_head.adapter_{source_index}.weight", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.proj.0.weight")) + rename_keys.append((f"sem_seg_head.adapter_{source_index}.norm.weight", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.proj.1.weight")) + rename_keys.append((f"sem_seg_head.adapter_{source_index}.norm.bias", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.proj.1.bias")) + rename_keys.append((f"sem_seg_head.layer_{source_index}.weight", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.block.0.weight")) + rename_keys.append((f"sem_seg_head.layer_{source_index}.norm.weight", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.block.1.weight")) + rename_keys.append((f"sem_seg_head.layer_{source_index}.norm.bias", f"model.pixel_level_module.decoder.fpn.layers.{target_index}.block.1.bias")) + rename_keys.append(("sem_seg_head.mask_features.weight", "model.pixel_level_module.decoder.mask_projection.weight")) + rename_keys.append(("sem_seg_head.mask_features.bias", "model.pixel_level_module.decoder.mask_projection.bias")) + # fmt: on + + # Transformer decoder + # fmt: off + for idx in range(config.decoder_config.decoder_layers): + # self-attention out projection + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.self_attn.out_proj.weight", f"model.transformer_module.decoder.layers.{idx}.self_attn.out_proj.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.self_attn.out_proj.bias", f"model.transformer_module.decoder.layers.{idx}.self_attn.out_proj.bias")) + # cross-attention out projection + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.multihead_attn.out_proj.weight", f"model.transformer_module.decoder.layers.{idx}.encoder_attn.out_proj.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.multihead_attn.out_proj.bias", f"model.transformer_module.decoder.layers.{idx}.encoder_attn.out_proj.bias")) + # MLP 1 + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.linear1.weight", f"model.transformer_module.decoder.layers.{idx}.fc1.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.linear1.bias", f"model.transformer_module.decoder.layers.{idx}.fc1.bias")) + # MLP 2 + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.linear2.weight", f"model.transformer_module.decoder.layers.{idx}.fc2.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.linear2.bias", f"model.transformer_module.decoder.layers.{idx}.fc2.bias")) + # layernorm 1 (self-attention layernorm) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm1.weight", f"model.transformer_module.decoder.layers.{idx}.self_attn_layer_norm.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm1.bias", f"model.transformer_module.decoder.layers.{idx}.self_attn_layer_norm.bias")) + # layernorm 2 (cross-attention layernorm) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm2.weight", f"model.transformer_module.decoder.layers.{idx}.encoder_attn_layer_norm.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm2.bias", f"model.transformer_module.decoder.layers.{idx}.encoder_attn_layer_norm.bias")) + # layernorm 3 (final layernorm) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm3.weight", f"model.transformer_module.decoder.layers.{idx}.final_layer_norm.weight")) + rename_keys.append((f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.norm3.bias", f"model.transformer_module.decoder.layers.{idx}.final_layer_norm.bias")) + + rename_keys.append(("sem_seg_head.predictor.transformer.decoder.norm.weight", "model.transformer_module.decoder.layernorm.weight")) + rename_keys.append(("sem_seg_head.predictor.transformer.decoder.norm.bias", "model.transformer_module.decoder.layernorm.bias")) + # fmt: on + + # heads on top + # fmt: off + rename_keys.append(("sem_seg_head.predictor.query_embed.weight", "model.transformer_module.queries_embedder.weight")) + + rename_keys.append(("sem_seg_head.predictor.input_proj.weight", "model.transformer_module.input_projection.weight")) + rename_keys.append(("sem_seg_head.predictor.input_proj.bias", "model.transformer_module.input_projection.bias")) + + rename_keys.append(("sem_seg_head.predictor.class_embed.weight", "class_predictor.weight")) + rename_keys.append(("sem_seg_head.predictor.class_embed.bias", "class_predictor.bias")) + + for i in range(3): + rename_keys.append((f"sem_seg_head.predictor.mask_embed.layers.{i}.weight", f"mask_embedder.{i}.0.weight")) + rename_keys.append((f"sem_seg_head.predictor.mask_embed.layers.{i}.bias", f"mask_embedder.{i}.0.bias")) + # fmt: on + + return rename_keys + + +def rename_key(dct, old, new): + val = dct.pop(old) + dct[new] = val + + +# we split up the matrix of each encoder layer into queries, keys and values +def read_in_decoder_q_k_v(state_dict, config): + # fmt: off + hidden_size = config.decoder_config.hidden_size + for idx in range(config.decoder_config.decoder_layers): + # read in weights + bias of self-attention input projection layer (in the original implementation, this is a single matrix + bias) + in_proj_weight = state_dict.pop(f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.self_attn.in_proj_weight") + in_proj_bias = state_dict.pop(f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.self_attn.in_proj_bias") + # next, add query, keys and values (in that order) to the state dict + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.q_proj.weight"] = in_proj_weight[: hidden_size, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.q_proj.bias"] = in_proj_bias[:config.hidden_size] + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.k_proj.weight"] = in_proj_weight[hidden_size : hidden_size * 2, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.k_proj.bias"] = in_proj_bias[hidden_size : hidden_size * 2] + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.v_proj.weight"] = in_proj_weight[-hidden_size :, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.self_attn.v_proj.bias"] = in_proj_bias[-hidden_size :] + # read in weights + bias of cross-attention input projection layer (in the original implementation, this is a single matrix + bias) + in_proj_weight = state_dict.pop(f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.multihead_attn.in_proj_weight") + in_proj_bias = state_dict.pop(f"sem_seg_head.predictor.transformer.decoder.layers.{idx}.multihead_attn.in_proj_bias") + # next, add query, keys and values (in that order) to the state dict + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.q_proj.weight"] = in_proj_weight[: hidden_size, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.q_proj.bias"] = in_proj_bias[:config.hidden_size] + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.k_proj.weight"] = in_proj_weight[hidden_size : hidden_size * 2, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.k_proj.bias"] = in_proj_bias[hidden_size : hidden_size * 2] + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.v_proj.weight"] = in_proj_weight[-hidden_size :, :] + state_dict[f"model.transformer_module.decoder.layers.{idx}.encoder_attn.v_proj.bias"] = in_proj_bias[-hidden_size :] + # fmt: on + + +# We will verify our results on an image of cute cats +def prepare_img() -> torch.Tensor: + url = "http://images.cocodataset.org/val2017/000000039769.jpg" + im = Image.open(requests.get(url, stream=True).raw) + return im + + +@torch.no_grad() +def convert_maskformer_checkpoint( + model_name: str, checkpoint_path: str, pytorch_dump_folder_path: str, push_to_hub: bool = False +): + """ + Copy/paste/tweak model's weights to our MaskFormer structure. + """ + config = get_maskformer_config(model_name) + + # load original state_dict + with open(checkpoint_path, "rb") as f: + data = pickle.load(f) + state_dict = data["model"] + + # rename keys + rename_keys = create_rename_keys(config) + for src, dest in rename_keys: + rename_key(state_dict, src, dest) + read_in_decoder_q_k_v(state_dict, config) + + # update to torch tensors + for key, value in state_dict.items(): + state_dict[key] = torch.from_numpy(value) + + # load 🤗 model + model = MaskFormerForInstanceSegmentation(config) + model.eval() + + for name, param in model.named_parameters(): + print(name, param.shape) + + model.load_state_dict(state_dict) + + # verify results + image = prepare_img() + feature_extractor = MaskFormerFeatureExtractor(ignore_index=255, reduce_labels=True) + + inputs = feature_extractor(image, return_tensors="pt") + + outputs = model(**inputs) + expected_logits = torch.tensor( + [[6.7710, -0.1452, -3.5687], [1.9165, -1.0010, -1.8614], [3.6209, -0.2950, -1.3813]] + ) + assert torch.allclose(outputs.class_queries_logits[0, :3, :3], expected_logits, atol=1e-4) + print("Looks ok!") + + if pytorch_dump_folder_path is not None: + Path(pytorch_dump_folder_path).mkdir(exist_ok=True) + print(f"Saving model {model_name} to {pytorch_dump_folder_path}") + model.save_pretrained(pytorch_dump_folder_path) + print(f"Saving feature extractor to {pytorch_dump_folder_path}") + # feature_extractor.save_pretrained(pytorch_dump_folder_path) + + if push_to_hub: + print("Pushing to the hub...") + model.push_to_hub(f"nielsr/{model_name}") + # feature_extractor.push_to_hub(model_name, organization="hustvl") + + +if __name__ == "__main__": + parser = argparse.ArgumentParser() + # Required parameters + parser.add_argument( + "--model_name", + default="maskformer-resnet50-ade", + type=str, + help=("Name of the MaskFormer model you'd like to convert",), + ) + parser.add_argument( + "--checkpoint_path", + default=( + "/Users/nielsrogge/Documents/MaskFormer_checkpoints/MaskFormer-ResNet-50-ADE20k/model_final_d8dbeb.pkl" + ), + type=str, + help="Path to the original state dict (.pth file).", + ) + parser.add_argument( + "--pytorch_dump_folder_path", default=None, type=str, help="Path to the output PyTorch model directory." + ) + parser.add_argument( + "--push_to_hub", action="store_true", help="Whether or not to push the converted model to the 🤗 hub." + ) + + args = parser.parse_args() + convert_maskformer_checkpoint( + args.model_name, args.checkpoint_path, args.pytorch_dump_folder_path, args.push_to_hub + ) diff --git a/src/transformers/models/maskformer/modeling_maskformer.py b/src/transformers/models/maskformer/modeling_maskformer.py index 9908bb29f34e..3e5a2eb8b2de 100644 --- a/src/transformers/models/maskformer/modeling_maskformer.py +++ b/src/transformers/models/maskformer/modeling_maskformer.py @@ -14,7 +14,6 @@ # limitations under the License. """ PyTorch MaskFormer model.""" -import collections.abc import math import random from dataclasses import dataclass @@ -25,12 +24,12 @@ import torch from torch import Tensor, nn +from transformers import AutoBackbone from transformers.utils import logging from ...activations import ACT2FN from ...modeling_outputs import BaseModelOutputWithCrossAttentions -from ...modeling_utils import ModuleUtilsMixin, PreTrainedModel -from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer +from ...modeling_utils import PreTrainedModel from ...utils import ( ModelOutput, add_code_sample_docstrings, @@ -43,6 +42,7 @@ from ..detr import DetrConfig from ..swin import SwinConfig from .configuration_maskformer import MaskFormerConfig +from .modeling_maskformer_swin import MaskFormerSwinEncoder if is_scipy_available(): @@ -61,71 +61,6 @@ ] -@dataclass -class MaskFormerSwinModelOutputWithPooling(ModelOutput): - """ - Class for MaskFormerSwinModel's outputs that also contains the spatial dimensions of the hidden states. - - Args: - last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): - Sequence of hidden-states at the output of the last layer of the model. - pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): - Last layer hidden-state after a mean pooling operation. - hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of - shape `(batch_size, sequence_length, hidden_size)`. - - Hidden-states of the model at the output of each layer plus the initial embedding outputs. - hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*): - A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to - `batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the - `forward` method. - attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, - sequence_length)`. - - Attentions weights after the attention softmax, used to compute the weighted average in the self-attention - heads. - """ - - last_hidden_state: torch.FloatTensor = None - pooler_output: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - hidden_states_spatial_dimensions: Tuple[Tuple[int, int]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None - - -@dataclass -class MaskFormerSwinBaseModelOutput(ModelOutput): - """ - Class for SwinEncoder's outputs. - - Args: - last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): - Sequence of hidden-states at the output of the last layer of the model. - hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): - Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of - shape `(batch_size, sequence_length, hidden_size)`. - - Hidden-states of the model at the output of each layer plus the initial embedding outputs. - hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*): - A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to - `batch, channels, height, width`. Due to padding, their spatial size cannot inferred before the `forward` - method. - attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): - Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, - sequence_length)`. - - Attentions weights after the attention softmax, used to compute the weighted average in the self-attention - heads. - """ - - last_hidden_state: torch.FloatTensor = None - hidden_states: Optional[Tuple[torch.FloatTensor]] = None - hidden_states_spatial_dimensions: Tuple[Tuple[int, int]] = None - attentions: Optional[Tuple[torch.FloatTensor]] = None - - @dataclass # Copied from transformers.models.detr.modeling_detr.DetrDecoderOutput class DetrDecoderOutput(BaseModelOutputWithCrossAttentions): @@ -472,714 +407,6 @@ def pair_wise_sigmoid_focal_loss(inputs: Tensor, labels: Tensor, alpha: float = return loss / height_and_width -# Copied from transformers.models.swin.modeling_swin.window_partition -def window_partition(input_feature, window_size): - """ - Partitions the given input into windows. - """ - batch_size, height, width, num_channels = input_feature.shape - input_feature = input_feature.view( - batch_size, height // window_size, window_size, width // window_size, window_size, num_channels - ) - windows = input_feature.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, num_channels) - return windows - - -# Copied from transformers.models.swin.modeling_swin.window_reverse -def window_reverse(windows, window_size, height, width): - """ - Merges windows to produce higher resolution features. - """ - num_channels = windows.shape[-1] - windows = windows.view(-1, height // window_size, width // window_size, window_size, window_size, num_channels) - windows = windows.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, height, width, num_channels) - return windows - - -# Copied from transformers.models.swin.modeling_swin.drop_path -def drop_path(input, drop_prob=0.0, training=False, scale_by_keep=True): - """ - Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). - - Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks, - however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... - See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the - layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the - argument. - """ - if drop_prob == 0.0 or not training: - return input - keep_prob = 1 - drop_prob - shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets - random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device) - random_tensor.floor_() # binarize - output = input.div(keep_prob) * random_tensor - return output - - -class MaskFormerSwinEmbeddings(nn.Module): - """ - Construct the patch and position embeddings. - """ - - def __init__(self, config): - super().__init__() - - self.patch_embeddings = MaskFormerSwinPatchEmbeddings(config) - num_patches = self.patch_embeddings.num_patches - self.patch_grid = self.patch_embeddings.grid_size - - if config.use_absolute_embeddings: - self.position_embeddings = nn.Parameter(torch.zeros(1, num_patches + 1, config.embed_dim)) - else: - self.position_embeddings = None - - self.norm = nn.LayerNorm(config.embed_dim) - self.dropout = nn.Dropout(config.hidden_dropout_prob) - - def forward(self, pixel_values): - embeddings, output_dimensions = self.patch_embeddings(pixel_values) - embeddings = self.norm(embeddings) - - if self.position_embeddings is not None: - embeddings = embeddings + self.position_embeddings - - embeddings = self.dropout(embeddings) - - return embeddings, output_dimensions - - -class MaskFormerSwinPatchEmbeddings(nn.Module): - """ - Image to Patch Embedding, including padding. - """ - - def __init__(self, config): - super().__init__() - image_size, patch_size = config.image_size, config.patch_size - num_channels, hidden_size = config.num_channels, config.embed_dim - - image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size) - patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size) - num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) - self.image_size = image_size - self.patch_size = patch_size - self.num_channels = num_channels - self.num_patches = num_patches - self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1]) - - self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size) - - def maybe_pad(self, pixel_values, height, width): - if width % self.patch_size[1] != 0: - pad_values = (0, self.patch_size[1] - width % self.patch_size[1]) - pixel_values = nn.functional.pad(pixel_values, pad_values) - if height % self.patch_size[0] != 0: - pad_values = (0, 0, 0, self.patch_size[0] - height % self.patch_size[0]) - pixel_values = nn.functional.pad(pixel_values, pad_values) - return pixel_values - - def forward(self, pixel_values): - _, num_channels, height, width = pixel_values.shape - if num_channels != self.num_channels: - raise ValueError( - "Make sure that the channel dimension of the pixel values match with the one set in the configuration." - ) - # pad the input to be divisible by self.patch_size, if needed - pixel_values = self.maybe_pad(pixel_values, height, width) - embeddings = self.projection(pixel_values) - _, _, height, width = embeddings.shape - output_dimensions = (height, width) - embeddings_flat = embeddings.flatten(2).transpose(1, 2) - - return embeddings_flat, output_dimensions - - -class MaskFormerSwinPatchMerging(nn.Module): - """ - Patch Merging Layer for maskformer model. - - Args: - input_resolution (`Tuple[int]`): - Resolution of input feature. - dim (`int`): - Number of input channels. - norm_layer (`nn.Module`, *optional*, defaults to `nn.LayerNorm`): - Normalization layer class. - """ - - def __init__(self, input_resolution, dim, norm_layer=nn.LayerNorm): - super().__init__() - self.dim = dim - self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False) - self.norm = norm_layer(4 * dim) - - def maybe_pad(self, input_feature, width, height): - should_pad = (height % 2 == 1) or (width % 2 == 1) - if should_pad: - pad_values = (0, 0, 0, width % 2, 0, height % 2) - input_feature = nn.functional.pad(input_feature, pad_values) - - return input_feature - - def forward(self, input_feature, input_dimensions): - height, width = input_dimensions - # `dim` is height * width - batch_size, dim, num_channels = input_feature.shape - - input_feature = input_feature.view(batch_size, height, width, num_channels) - # pad input to be disible by width and height, if needed - input_feature = self.maybe_pad(input_feature, height, width) - # [batch_size, height/2, width/2, num_channels] - input_feature_0 = input_feature[:, 0::2, 0::2, :] - # [batch_size, height/2, width/2, num_channels] - input_feature_1 = input_feature[:, 1::2, 0::2, :] - # [batch_size, height/2, width/2, num_channels] - input_feature_2 = input_feature[:, 0::2, 1::2, :] - # [batch_size, height/2, width/2, num_channels] - input_feature_3 = input_feature[:, 1::2, 1::2, :] - # batch_size height/2 width/2 4*num_channels - input_feature = torch.cat([input_feature_0, input_feature_1, input_feature_2, input_feature_3], -1) - input_feature = input_feature.view(batch_size, -1, 4 * num_channels) # batch_size height/2*width/2 4*C - - input_feature = self.norm(input_feature) - input_feature = self.reduction(input_feature) - - return input_feature - - -# Copied from transformers.models.swin.modeling_swin.SwinDropPath with Swin->MaskFormerSwin -class MaskFormerSwinDropPath(nn.Module): - """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).""" - - def __init__(self, drop_prob: Optional[float] = None) -> None: - super().__init__() - self.drop_prob = drop_prob - - def forward(self, x: torch.Tensor) -> torch.Tensor: - return drop_path(x, self.drop_prob, self.training) - - def extra_repr(self) -> str: - return "p={}".format(self.drop_prob) - - -# Copied from transformers.models.swin.modeling_swin.SwinSelfAttention with Swin->MaskFormerSwin -class MaskFormerSwinSelfAttention(nn.Module): - def __init__(self, config, dim, num_heads, window_size): - super().__init__() - if dim % num_heads != 0: - raise ValueError( - f"The hidden size ({dim}) is not a multiple of the number of attention heads ({num_heads})" - ) - - self.num_attention_heads = num_heads - self.attention_head_size = int(dim / num_heads) - self.all_head_size = self.num_attention_heads * self.attention_head_size - self.window_size = ( - window_size if isinstance(window_size, collections.abc.Iterable) else (window_size, window_size) - ) - - self.relative_position_bias_table = nn.Parameter( - torch.zeros((2 * self.window_size[0] - 1) * (2 * self.window_size[1] - 1), num_heads) - ) - - # get pair-wise relative position index for each token inside the window - coords_h = torch.arange(self.window_size[0]) - coords_w = torch.arange(self.window_size[1]) - coords = torch.stack(torch.meshgrid([coords_h, coords_w])) - coords_flatten = torch.flatten(coords, 1) - relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :] - relative_coords = relative_coords.permute(1, 2, 0).contiguous() - relative_coords[:, :, 0] += self.window_size[0] - 1 - relative_coords[:, :, 1] += self.window_size[1] - 1 - relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1 - relative_position_index = relative_coords.sum(-1) - self.register_buffer("relative_position_index", relative_position_index) - - self.query = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) - self.key = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) - self.value = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) - - self.dropout = nn.Dropout(config.attention_probs_dropout_prob) - - def transpose_for_scores(self, x): - new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) - x = x.view(new_x_shape) - return x.permute(0, 2, 1, 3) - - def forward( - self, - hidden_states: torch.Tensor, - attention_mask: Optional[torch.FloatTensor] = None, - head_mask: Optional[torch.FloatTensor] = None, - output_attentions: Optional[bool] = False, - ) -> Tuple[torch.Tensor]: - batch_size, dim, num_channels = hidden_states.shape - mixed_query_layer = self.query(hidden_states) - - key_layer = self.transpose_for_scores(self.key(hidden_states)) - value_layer = self.transpose_for_scores(self.value(hidden_states)) - query_layer = self.transpose_for_scores(mixed_query_layer) - - # Take the dot product between "query" and "key" to get the raw attention scores. - attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) - - attention_scores = attention_scores / math.sqrt(self.attention_head_size) - - relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)] - relative_position_bias = relative_position_bias.view( - self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1 - ) - - relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous() - attention_scores = attention_scores + relative_position_bias.unsqueeze(0) - - if attention_mask is not None: - # Apply the attention mask is (precomputed for all layers in MaskFormerSwinModel forward() function) - mask_shape = attention_mask.shape[0] - attention_scores = attention_scores.view( - batch_size // mask_shape, mask_shape, self.num_attention_heads, dim, dim - ) - attention_scores = attention_scores + attention_mask.unsqueeze(1).unsqueeze(0) - attention_scores = attention_scores.view(-1, self.num_attention_heads, dim, dim) - - # Normalize the attention scores to probabilities. - attention_probs = nn.functional.softmax(attention_scores, dim=-1) - - # This is actually dropping out entire tokens to attend to, which might - # seem a bit unusual, but is taken from the original Transformer paper. - attention_probs = self.dropout(attention_probs) - - # Mask heads if we want to - if head_mask is not None: - attention_probs = attention_probs * head_mask - - context_layer = torch.matmul(attention_probs, value_layer) - context_layer = context_layer.permute(0, 2, 1, 3).contiguous() - new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) - context_layer = context_layer.view(new_context_layer_shape) - - outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) - - return outputs - - -# Copied from transformers.models.swin.modeling_swin.SwinSelfOutput with Swin->MaskFormerSwin -class MaskFormerSwinSelfOutput(nn.Module): - def __init__(self, config, dim): - super().__init__() - self.dense = nn.Linear(dim, dim) - self.dropout = nn.Dropout(config.attention_probs_dropout_prob) - - def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: - hidden_states = self.dense(hidden_states) - hidden_states = self.dropout(hidden_states) - - return hidden_states - - -# Copied from transformers.models.swin.modeling_swin.SwinAttention with Swin->MaskFormerSwin -class MaskFormerSwinAttention(nn.Module): - def __init__(self, config, dim, num_heads, window_size): - super().__init__() - self.self = MaskFormerSwinSelfAttention(config, dim, num_heads, window_size) - self.output = MaskFormerSwinSelfOutput(config, dim) - self.pruned_heads = set() - - def prune_heads(self, heads): - if len(heads) == 0: - return - heads, index = find_pruneable_heads_and_indices( - heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads - ) - - # Prune linear layers - self.self.query = prune_linear_layer(self.self.query, index) - self.self.key = prune_linear_layer(self.self.key, index) - self.self.value = prune_linear_layer(self.self.value, index) - self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) - - # Update hyper params and store pruned heads - self.self.num_attention_heads = self.self.num_attention_heads - len(heads) - self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads - self.pruned_heads = self.pruned_heads.union(heads) - - def forward( - self, - hidden_states: torch.Tensor, - attention_mask: Optional[torch.FloatTensor] = None, - head_mask: Optional[torch.FloatTensor] = None, - output_attentions: Optional[bool] = False, - ) -> Tuple[torch.Tensor]: - self_outputs = self.self(hidden_states, attention_mask, head_mask, output_attentions) - attention_output = self.output(self_outputs[0], hidden_states) - outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them - return outputs - - -# Copied from transformers.models.swin.modeling_swin.SwinIntermediate with Swin->MaskFormerSwin -class MaskFormerSwinIntermediate(nn.Module): - def __init__(self, config, dim): - super().__init__() - self.dense = nn.Linear(dim, int(config.mlp_ratio * dim)) - if isinstance(config.hidden_act, str): - self.intermediate_act_fn = ACT2FN[config.hidden_act] - else: - self.intermediate_act_fn = config.hidden_act - - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: - hidden_states = self.dense(hidden_states) - hidden_states = self.intermediate_act_fn(hidden_states) - return hidden_states - - -# Copied from transformers.models.swin.modeling_swin.SwinOutput with Swin->MaskFormerSwin -class MaskFormerSwinOutput(nn.Module): - def __init__(self, config, dim): - super().__init__() - self.dense = nn.Linear(int(config.mlp_ratio * dim), dim) - self.dropout = nn.Dropout(config.hidden_dropout_prob) - - def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: - hidden_states = self.dense(hidden_states) - hidden_states = self.dropout(hidden_states) - return hidden_states - - -class MaskFormerSwinLayer(nn.Module): - def __init__(self, config, dim, input_resolution, num_heads, shift_size=0): - super().__init__() - self.chunk_size_feed_forward = config.chunk_size_feed_forward - self.shift_size = shift_size - self.window_size = config.window_size - self.input_resolution = input_resolution - self.layernorm_before = nn.LayerNorm(dim, eps=config.layer_norm_eps) - self.attention = MaskFormerSwinAttention(config, dim, num_heads, self.window_size) - self.drop_path = ( - MaskFormerSwinDropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity() - ) - self.layernorm_after = nn.LayerNorm(dim, eps=config.layer_norm_eps) - self.intermediate = MaskFormerSwinIntermediate(config, dim) - self.output = MaskFormerSwinOutput(config, dim) - - def get_attn_mask(self, input_resolution): - if self.shift_size > 0: - # calculate attention mask for SW-MSA - height, width = input_resolution - img_mask = torch.zeros((1, height, width, 1)) - height_slices = ( - slice(0, -self.window_size), - slice(-self.window_size, -self.shift_size), - slice(-self.shift_size, None), - ) - width_slices = ( - slice(0, -self.window_size), - slice(-self.window_size, -self.shift_size), - slice(-self.shift_size, None), - ) - count = 0 - for height_slice in height_slices: - for width_slice in width_slices: - img_mask[:, height_slice, width_slice, :] = count - count += 1 - - mask_windows = window_partition(img_mask, self.window_size) - mask_windows = mask_windows.view(-1, self.window_size * self.window_size) - attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) - attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0)) - else: - attn_mask = None - return attn_mask - - def maybe_pad(self, hidden_states, height, width): - pad_left = pad_top = 0 - pad_rigth = (self.window_size - width % self.window_size) % self.window_size - pad_bottom = (self.window_size - height % self.window_size) % self.window_size - pad_values = (0, 0, pad_left, pad_rigth, pad_top, pad_bottom) - hidden_states = nn.functional.pad(hidden_states, pad_values) - return hidden_states, pad_values - - def forward(self, hidden_states, input_dimensions, head_mask=None, output_attentions=False): - height, width = input_dimensions - batch_size, dim, channels = hidden_states.size() - shortcut = hidden_states - - hidden_states = self.layernorm_before(hidden_states) - hidden_states = hidden_states.view(batch_size, height, width, channels) - # pad hidden_states to multiples of window size - hidden_states, pad_values = self.maybe_pad(hidden_states, height, width) - - _, height_pad, width_pad, _ = hidden_states.shape - # cyclic shift - if self.shift_size > 0: - shifted_hidden_states = torch.roll(hidden_states, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2)) - else: - shifted_hidden_states = hidden_states - - # partition windows - hidden_states_windows = window_partition(shifted_hidden_states, self.window_size) - hidden_states_windows = hidden_states_windows.view(-1, self.window_size * self.window_size, channels) - attn_mask = self.get_attn_mask((height_pad, width_pad)) - if attn_mask is not None: - attn_mask = attn_mask.to(hidden_states_windows.device) - - self_attention_outputs = self.attention( - hidden_states_windows, attn_mask, head_mask, output_attentions=output_attentions - ) - - attention_output = self_attention_outputs[0] - - outputs = self_attention_outputs[1:] # add self attentions if we output attention weights - - attention_windows = attention_output.view(-1, self.window_size, self.window_size, channels) - shifted_windows = window_reverse( - attention_windows, self.window_size, height_pad, width_pad - ) # B height' width' C - - # reverse cyclic shift - if self.shift_size > 0: - attention_windows = torch.roll(shifted_windows, shifts=(self.shift_size, self.shift_size), dims=(1, 2)) - else: - attention_windows = shifted_windows - - was_padded = pad_values[3] > 0 or pad_values[5] > 0 - if was_padded: - attention_windows = attention_windows[:, :height, :width, :].contiguous() - - attention_windows = attention_windows.view(batch_size, height * width, channels) - - hidden_states = shortcut + self.drop_path(attention_windows) - - layer_output = self.layernorm_after(hidden_states) - layer_output = self.intermediate(layer_output) - layer_output = hidden_states + self.output(layer_output) - - outputs = (layer_output,) + outputs - - return outputs - - -class MaskFormerSwinStage(nn.Module): - # Copied from transformers.models.swin.modeling_swin.SwinStage.__init__ with Swin->MaskFormerSwin - def __init__(self, config, dim, input_resolution, depth, num_heads, drop_path, downsample): - super().__init__() - self.config = config - self.dim = dim - self.blocks = nn.ModuleList( - [ - MaskFormerSwinLayer( - config=config, - dim=dim, - input_resolution=input_resolution, - num_heads=num_heads, - shift_size=0 if (i % 2 == 0) else config.window_size // 2, - ) - for i in range(depth) - ] - ) - - # patch merging layer - if downsample is not None: - self.downsample = downsample(input_resolution, dim=dim, norm_layer=nn.LayerNorm) - else: - self.downsample = None - - self.pointing = False - - def forward( - self, hidden_states, input_dimensions, head_mask=None, output_attentions=False, output_hidden_states=False - ): - all_hidden_states = () if output_hidden_states else None - - height, width = input_dimensions - for i, block_module in enumerate(self.blocks): - if output_hidden_states: - all_hidden_states = all_hidden_states + (hidden_states,) - - layer_head_mask = head_mask[i] if head_mask is not None else None - - block_hidden_states = block_module(hidden_states, input_dimensions, layer_head_mask, output_attentions) - - hidden_states = block_hidden_states[0] - - if output_hidden_states: - all_hidden_states += (hidden_states,) - - if self.downsample is not None: - height_downsampled, width_downsampled = (height + 1) // 2, (width + 1) // 2 - output_dimensions = (height, width, height_downsampled, width_downsampled) - hidden_states = self.downsample(hidden_states, input_dimensions) - else: - output_dimensions = (height, width, height, width) - - return hidden_states, output_dimensions, all_hidden_states - - -class MaskFormerSwinEncoder(nn.Module): - # Copied from transformers.models.swin.modeling_swin.SwinEncoder.__init__ with Swin->MaskFormerSwin - def __init__(self, config, grid_size): - super().__init__() - self.num_layers = len(config.depths) - self.config = config - dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))] - self.layers = nn.ModuleList( - [ - MaskFormerSwinStage( - config=config, - dim=int(config.embed_dim * 2**i_layer), - input_resolution=(grid_size[0] // (2**i_layer), grid_size[1] // (2**i_layer)), - depth=config.depths[i_layer], - num_heads=config.num_heads[i_layer], - drop_path=dpr[sum(config.depths[:i_layer]) : sum(config.depths[: i_layer + 1])], - downsample=MaskFormerSwinPatchMerging if (i_layer < self.num_layers - 1) else None, - ) - for i_layer in range(self.num_layers) - ] - ) - - self.gradient_checkpointing = False - - def forward( - self, - hidden_states, - input_dimensions, - head_mask=None, - output_attentions=False, - output_hidden_states=False, - return_dict=True, - ): - all_hidden_states = () if output_hidden_states else None - all_input_dimensions = () - all_self_attentions = () if output_attentions else None - - if output_hidden_states: - all_hidden_states = all_hidden_states + (hidden_states,) - - for i, layer_module in enumerate(self.layers): - layer_head_mask = head_mask[i] if head_mask is not None else None - - if self.gradient_checkpointing and self.training: - - def create_custom_forward(module): - def custom_forward(*inputs): - return module(*inputs, output_attentions) - - return custom_forward - - layer_hidden_states, output_dimensions, layer_all_hidden_states = torch.utils.checkpoint.checkpoint( - create_custom_forward(layer_module), hidden_states, layer_head_mask - ) - else: - layer_hidden_states, output_dimensions, layer_all_hidden_states = layer_module( - hidden_states, - input_dimensions, - layer_head_mask, - output_attentions, - output_hidden_states, - ) - - input_dimensions = (output_dimensions[-2], output_dimensions[-1]) - all_input_dimensions += (input_dimensions,) - if output_hidden_states: - all_hidden_states += (layer_all_hidden_states,) - - hidden_states = layer_hidden_states - - if output_attentions: - all_self_attentions = all_self_attentions + (layer_all_hidden_states[1],) - - if not return_dict: - return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None) - - return MaskFormerSwinBaseModelOutput( - last_hidden_state=hidden_states, - hidden_states=all_hidden_states, - hidden_states_spatial_dimensions=all_input_dimensions, - attentions=all_self_attentions, - ) - - -class MaskFormerSwinModel(nn.Module, ModuleUtilsMixin): - def __init__(self, config, add_pooling_layer=True): - super().__init__() - self.config = config - self.num_layers = len(config.depths) - self.num_features = int(config.embed_dim * 2 ** (self.num_layers - 1)) - - self.embeddings = MaskFormerSwinEmbeddings(config) - self.encoder = MaskFormerSwinEncoder(config, self.embeddings.patch_grid) - - self.layernorm = nn.LayerNorm(self.num_features, eps=config.layer_norm_eps) - self.pooler = nn.AdaptiveAvgPool1d(1) if add_pooling_layer else None - - def get_input_embeddings(self): - return self.embeddings.patch_embeddings - - def _prune_heads(self, heads_to_prune): - """ - Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base - class PreTrainedModel - """ - for layer, heads in heads_to_prune.items(): - self.encoder.layer[layer].attention.prune_heads(heads) - - def forward( - self, - pixel_values=None, - head_mask=None, - output_attentions=None, - output_hidden_states=None, - return_dict=None, - ): - output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions - output_hidden_states = ( - output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states - ) - return_dict = return_dict if return_dict is not None else self.config.use_return_dict - - if pixel_values is None: - raise ValueError("You have to specify pixel_values") - - # Prepare head mask if needed - # 1.0 in head_mask indicate we keep the head - # attention_probs has shape bsz x n_heads x N x N - # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] - # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] - head_mask = self.get_head_mask(head_mask, len(self.config.depths)) - - embedding_output, input_dimensions = self.embeddings(pixel_values) - - encoder_outputs = self.encoder( - embedding_output, - input_dimensions, - head_mask=head_mask, - output_attentions=output_attentions, - output_hidden_states=output_hidden_states, - return_dict=return_dict, - ) - - sequence_output = encoder_outputs.last_hidden_state - sequence_output = self.layernorm(sequence_output) - - pooled_output = None - if self.pooler is not None: - pooled_output = self.pooler(sequence_output.transpose(1, 2)) - pooled_output = torch.flatten(pooled_output, 1) - - if not return_dict: - return (sequence_output, pooled_output) + encoder_outputs[1:] - - hidden_states_spatial_dimensions = (input_dimensions,) + encoder_outputs.hidden_states_spatial_dimensions - - return MaskFormerSwinModelOutputWithPooling( - last_hidden_state=sequence_output, - pooler_output=pooled_output, - hidden_states=encoder_outputs.hidden_states, - hidden_states_spatial_dimensions=hidden_states_spatial_dimensions, - attentions=encoder_outputs.attentions, - ) - - # Copied from transformers.models.detr.modeling_detr.DetrAttention class DetrAttention(nn.Module): """ @@ -1924,52 +1151,6 @@ def get_num_masks(self, class_labels: torch.Tensor, device: torch.device) -> tor return num_masks_pt -class MaskFormerSwinTransformerBackbone(nn.Module): - """ - This class uses [`MaskFormerSwinModel`] to reshape its `hidden_states` from (`batch_size, sequence_length, - hidden_size)` to (`batch_size, num_channels, height, width)`). - - Args: - config (`SwinConfig`): - The configuration used by [`MaskFormerSwinModel`]. - """ - - def __init__(self, config: SwinConfig): - super().__init__() - self.model = MaskFormerSwinModel(config) - self.hidden_states_norms = nn.ModuleList([nn.LayerNorm(out_shape) for out_shape in self.outputs_shapes]) - - def forward(self, *args, **kwargs) -> List[Tensor]: - output = self.model(*args, **kwargs, output_hidden_states=True) - hidden_states_permuted: List[Tensor] = [] - # we need to reshape the hidden state to their original spatial dimensions - # skipping the embeddings - hidden_states: Tuple[Tuple[Tensor]] = output.hidden_states[1:] - # spatial dimensions contains all the heights and widths of each stage, including after the embeddings - spatial_dimensions: Tuple[Tuple[int, int]] = output.hidden_states_spatial_dimensions - for i, (hidden_state, (height, width)) in enumerate(zip(hidden_states, spatial_dimensions)): - norm = self.hidden_states_norms[i] - # the last element corespond to the layer's last block output but before patch merging - hidden_state_unpolled = hidden_state[-1] - hidden_state_norm = norm(hidden_state_unpolled) - # our pixel decoder (FPN) expect 3D tensors (features) - batch_size, _, hidden_size = hidden_state_norm.shape - # reshape our tensor "b (h w) d -> b d h w" - hidden_state_permuted = ( - hidden_state_norm.permute(0, 2, 1).view((batch_size, hidden_size, height, width)).contiguous() - ) - hidden_states_permuted.append(hidden_state_permuted) - return hidden_states_permuted - - @property - def input_resolutions(self) -> List[int]: - return [layer.input_resolution for layer in self.model.encoder.layers] - - @property - def outputs_shapes(self) -> List[int]: - return [layer.dim for layer in self.model.encoder.layers] - - class MaskFormerFPNConvLayer(nn.Module): def __init__(self, in_features: int, out_features: int, kernel_size: int = 3, padding: int = 1): """ @@ -2066,7 +1247,7 @@ class MaskFormerPixelDecoder(nn.Module): def __init__(self, *args, feature_size: int = 256, mask_feature_size: int = 256, **kwargs): """ Pixel Decoder Module proposed in [Per-Pixel Classification is Not All You Need for Semantic - Segmentation](https://arxiv.org/abs/2107.06278). It first runs the backbone's feature into a Feature Pyramid + Segmentation](https://arxiv.org/abs/2107.06278). It first runs the backbone's features into a Feature Pyramid Network creating a list of feature maps. Then, it projects the last one to the correct `mask_size`. Args: @@ -2080,7 +1261,7 @@ def __init__(self, *args, feature_size: int = 256, mask_feature_size: int = 256, self.mask_projection = nn.Conv2d(feature_size, mask_feature_size, kernel_size=3, padding=1) def forward(self, features: List[Tensor], output_hidden_states: bool = False) -> MaskFormerPixelDecoderOutput: - fpn_features: List[Tensor] = self.fpn(features) + fpn_features = self.fpn(features) # we use the last feature map last_feature_projected = self.mask_projection(fpn_features[-1]) return MaskFormerPixelDecoderOutput( @@ -2194,17 +1375,28 @@ def __init__(self, config: MaskFormerConfig): The configuration used to instantiate this model. """ super().__init__() - self.encoder = MaskFormerSwinTransformerBackbone(config.backbone_config) + # TODD: add method to load pretrained weights of backbone + backbone_config = config.backbone_config + if isinstance(backbone_config, SwinConfig): + # for backwards compatibility + backbone_config.out_features = ["stage1", "stage2", "stage3", "stage4"] + self.encoder = AutoBackbone.from_config(config.backbone_config) + + input_shape = self.encoder.output_shape() + feature_channels = [v.channels for k, v in input_shape.items()] + self.decoder = MaskFormerPixelDecoder( - in_features=self.encoder.outputs_shapes[-1], + in_features=feature_channels[-1], feature_size=config.fpn_feature_size, mask_feature_size=config.mask_feature_size, - lateral_widths=self.encoder.outputs_shapes[:-1], + lateral_widths=feature_channels[:-1], ) def forward(self, pixel_values: Tensor, output_hidden_states: bool = False) -> MaskFormerPixelLevelModuleOutput: - features: List[Tensor] = self.encoder(pixel_values) - decoder_output: MaskFormerPixelDecoderOutput = self.decoder(features, output_hidden_states) + features = self.encoder(pixel_values) + features = list(features.values()) + decoder_output = self.decoder(features, output_hidden_states) + return MaskFormerPixelLevelModuleOutput( # the last feature is actually the output from the last layer encoder_last_hidden_state=features[-1], @@ -2350,8 +1542,9 @@ class MaskFormerModel(MaskFormerPreTrainedModel): def __init__(self, config: MaskFormerConfig): super().__init__(config) self.pixel_level_module = MaskFormerPixelLevelModule(config) + last_stage_name = self.pixel_level_module.encoder.stage_names[-1] self.transformer_module = MaskFormerTransformerModule( - in_features=self.pixel_level_module.encoder.outputs_shapes[-1], config=config + in_features=self.pixel_level_module.encoder.output_shape()[last_stage_name].channels, config=config ) self.post_init() @@ -2387,9 +1580,7 @@ def forward( if pixel_mask is None: pixel_mask = torch.ones((batch_size, height, width), device=pixel_values.device) - pixel_level_module_output: MaskFormerPixelLevelModuleOutput = self.pixel_level_module( - pixel_values, output_hidden_states - ) + pixel_level_module_output = self.pixel_level_module(pixel_values, output_hidden_states) image_features = pixel_level_module_output.encoder_last_hidden_state pixel_embeddings = pixel_level_module_output.decoder_last_hidden_state diff --git a/src/transformers/models/maskformer/modeling_maskformer_swin.py b/src/transformers/models/maskformer/modeling_maskformer_swin.py new file mode 100644 index 000000000000..2ab03a9b71f7 --- /dev/null +++ b/src/transformers/models/maskformer/modeling_maskformer_swin.py @@ -0,0 +1,908 @@ +# coding=utf-8 +# Copyright 2022 Meta Platforms, Inc. and The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. + +"""MaskFormer Swin backbone. The reason Swin is implemented here is because MaskFormer uses the hidden states before +downsampling.""" + +import collections.abc +import math +from dataclasses import dataclass +from typing import List, Optional, Tuple + +import torch +from torch import Tensor, nn + +from ...activations import ACT2FN +from ...backbone import Backbone, ShapeSpec +from ...file_utils import ModelOutput +from ...modeling_utils import PreTrainedModel +from ...pytorch_utils import find_pruneable_heads_and_indices, prune_linear_layer +from .configuration_maskformer_swin import MaskFormerSwinConfig + + +@dataclass +class MaskFormerSwinModelOutputWithPooling(ModelOutput): + """ + Class for MaskFormerSwinModel's outputs that also contains the spatial dimensions of the hidden states. + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + pooler_output (`torch.FloatTensor` of shape `(batch_size, hidden_size)`): + Last layer hidden-state after a mean pooling operation. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of + shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*): + A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to + `batch, channels, height, width`. Due to padding, their spatial size cannot be inferred before the + `forward` method. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + last_hidden_state: torch.FloatTensor = None + pooler_output: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + hidden_states_spatial_dimensions: Tuple[Tuple[int, int]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +@dataclass +class MaskFormerSwinBaseModelOutput(ModelOutput): + """ + Class for SwinEncoder's outputs. + + Args: + last_hidden_state (`torch.FloatTensor` of shape `(batch_size, sequence_length, hidden_size)`): + Sequence of hidden-states at the output of the last layer of the model. + hidden_states (`tuple(torch.FloatTensor)`, *optional*, returned when `output_hidden_states=True` is passed or when `config.output_hidden_states=True`): + Tuple of `torch.FloatTensor` (one for the output of the embeddings + one for the output of each layer) of + shape `(batch_size, sequence_length, hidden_size)`. + + Hidden-states of the model at the output of each layer plus the initial embedding outputs. + hidden_states_spatial_dimensions (`tuple(tuple(int, int))`, *optional*): + A tuple containing the spatial dimension of each `hidden_state` needed to reshape the `hidden_states` to + `batch, channels, height, width`. Due to padding, their spatial size cannot inferred before the `forward` + method. + attentions (`tuple(torch.FloatTensor)`, *optional*, returned when `output_attentions=True` is passed or when `config.output_attentions=True`): + Tuple of `torch.FloatTensor` (one for each layer) of shape `(batch_size, num_heads, sequence_length, + sequence_length)`. + + Attentions weights after the attention softmax, used to compute the weighted average in the self-attention + heads. + """ + + last_hidden_state: torch.FloatTensor = None + hidden_states: Optional[Tuple[torch.FloatTensor]] = None + hidden_states_spatial_dimensions: Tuple[Tuple[int, int]] = None + attentions: Optional[Tuple[torch.FloatTensor]] = None + + +# Copied from transformers.models.swin.modeling_swin.window_partition +def window_partition(input_feature, window_size): + """ + Partitions the given input into windows. + """ + batch_size, height, width, num_channels = input_feature.shape + input_feature = input_feature.view( + batch_size, height // window_size, window_size, width // window_size, window_size, num_channels + ) + windows = input_feature.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, window_size, window_size, num_channels) + return windows + + +# Copied from transformers.models.swin.modeling_swin.window_reverse +def window_reverse(windows, window_size, height, width): + """ + Merges windows to produce higher resolution features. + """ + num_channels = windows.shape[-1] + windows = windows.view(-1, height // window_size, width // window_size, window_size, window_size, num_channels) + windows = windows.permute(0, 1, 3, 2, 4, 5).contiguous().view(-1, height, width, num_channels) + return windows + + +# Copied from transformers.models.swin.modeling_swin.drop_path +def drop_path(input, drop_prob=0.0, training=False, scale_by_keep=True): + """ + Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks). + + Comment by Ross Wightman: This is the same as the DropConnect impl I created for EfficientNet, etc networks, + however, the original name is misleading as 'Drop Connect' is a different form of dropout in a separate paper... + See discussion: https://github.com/tensorflow/tpu/issues/494#issuecomment-532968956 ... I've opted for changing the + layer and argument names to 'drop path' rather than mix DropConnect as a layer name and use 'survival rate' as the + argument. + """ + if drop_prob == 0.0 or not training: + return input + keep_prob = 1 - drop_prob + shape = (input.shape[0],) + (1,) * (input.ndim - 1) # work with diff dim tensors, not just 2D ConvNets + random_tensor = keep_prob + torch.rand(shape, dtype=input.dtype, device=input.device) + random_tensor.floor_() # binarize + output = input.div(keep_prob) * random_tensor + return output + + +class MaskFormerSwinEmbeddings(nn.Module): + """ + Construct the patch and position embeddings. + """ + + def __init__(self, config): + super().__init__() + + self.patch_embeddings = MaskFormerSwinPatchEmbeddings(config) + num_patches = self.patch_embeddings.num_patches + self.patch_grid = self.patch_embeddings.grid_size + + if config.use_absolute_embeddings: + self.position_embeddings = nn.Parameter(torch.zeros(1, num_patches + 1, config.embed_dim)) + else: + self.position_embeddings = None + + self.norm = nn.LayerNorm(config.embed_dim) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, pixel_values): + embeddings, output_dimensions = self.patch_embeddings(pixel_values) + embeddings = self.norm(embeddings) + + if self.position_embeddings is not None: + embeddings = embeddings + self.position_embeddings + + embeddings = self.dropout(embeddings) + + return embeddings, output_dimensions + + +# Copied from transformers.models.swin.modeling_swin.SwinPatchEmbeddings +class MaskFormerSwinPatchEmbeddings(nn.Module): + """ + This class turns `pixel_values` of shape `(batch_size, num_channels, height, width)` into the initial + `hidden_states` (patch embeddings) of shape `(batch_size, seq_length, hidden_size)` to be consumed by a + Transformer. + """ + + def __init__(self, config): + super().__init__() + image_size, patch_size = config.image_size, config.patch_size + num_channels, hidden_size = config.num_channels, config.embed_dim + image_size = image_size if isinstance(image_size, collections.abc.Iterable) else (image_size, image_size) + patch_size = patch_size if isinstance(patch_size, collections.abc.Iterable) else (patch_size, patch_size) + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + self.image_size = image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.num_patches = num_patches + self.grid_size = (image_size[0] // patch_size[0], image_size[1] // patch_size[1]) + + self.projection = nn.Conv2d(num_channels, hidden_size, kernel_size=patch_size, stride=patch_size) + + def maybe_pad(self, pixel_values, height, width): + if width % self.patch_size[1] != 0: + pad_values = (0, self.patch_size[1] - width % self.patch_size[1]) + pixel_values = nn.functional.pad(pixel_values, pad_values) + if height % self.patch_size[0] != 0: + pad_values = (0, 0, 0, self.patch_size[0] - height % self.patch_size[0]) + pixel_values = nn.functional.pad(pixel_values, pad_values) + return pixel_values + + def forward(self, pixel_values: Optional[torch.FloatTensor]) -> Tuple[torch.Tensor, Tuple[int]]: + _, num_channels, height, width = pixel_values.shape + if num_channels != self.num_channels: + raise ValueError( + "Make sure that the channel dimension of the pixel values match with the one set in the configuration." + ) + # pad the input to be divisible by self.patch_size, if needed + pixel_values = self.maybe_pad(pixel_values, height, width) + embeddings = self.projection(pixel_values) + _, _, height, width = embeddings.shape + output_dimensions = (height, width) + embeddings = embeddings.flatten(2).transpose(1, 2) + + return embeddings, output_dimensions + + +# Copied from transformers.models.swin.modeling_swin.SwinPatchMerging +class MaskFormerSwinPatchMerging(nn.Module): + """ + Patch Merging Layer. + + Args: + input_resolution (`Tuple[int]`): + Resolution of input feature. + dim (`int`): + Number of input channels. + norm_layer (`nn.Module`, *optional*, defaults to `nn.LayerNorm`): + Normalization layer class. + """ + + def __init__(self, input_resolution: Tuple[int], dim: int, norm_layer: nn.Module = nn.LayerNorm) -> None: + super().__init__() + self.input_resolution = input_resolution + self.dim = dim + self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False) + self.norm = norm_layer(4 * dim) + + def maybe_pad(self, input_feature, height, width): + should_pad = (height % 2 == 1) or (width % 2 == 1) + if should_pad: + pad_values = (0, 0, 0, width % 2, 0, height % 2) + input_feature = nn.functional.pad(input_feature, pad_values) + + return input_feature + + def forward(self, input_feature: torch.Tensor, input_dimensions: Tuple[int, int]) -> torch.Tensor: + height, width = input_dimensions + # `dim` is height * width + batch_size, dim, num_channels = input_feature.shape + + input_feature = input_feature.view(batch_size, height, width, num_channels) + # pad input to be disible by width and height, if needed + input_feature = self.maybe_pad(input_feature, height, width) + # [batch_size, height/2, width/2, num_channels] + input_feature_0 = input_feature[:, 0::2, 0::2, :] + # [batch_size, height/2, width/2, num_channels] + input_feature_1 = input_feature[:, 1::2, 0::2, :] + # [batch_size, height/2, width/2, num_channels] + input_feature_2 = input_feature[:, 0::2, 1::2, :] + # [batch_size, height/2, width/2, num_channels] + input_feature_3 = input_feature[:, 1::2, 1::2, :] + # batch_size height/2 width/2 4*num_channels + input_feature = torch.cat([input_feature_0, input_feature_1, input_feature_2, input_feature_3], -1) + input_feature = input_feature.view(batch_size, -1, 4 * num_channels) # batch_size height/2*width/2 4*C + + input_feature = self.norm(input_feature) + input_feature = self.reduction(input_feature) + + return input_feature + + +# Copied from transformers.models.swin.modeling_swin.SwinDropPath with Swin->MaskFormerSwin +class MaskFormerSwinDropPath(nn.Module): + """Drop paths (Stochastic Depth) per sample (when applied in main path of residual blocks).""" + + def __init__(self, drop_prob: Optional[float] = None) -> None: + super().__init__() + self.drop_prob = drop_prob + + def forward(self, x: torch.Tensor) -> torch.Tensor: + return drop_path(x, self.drop_prob, self.training) + + def extra_repr(self) -> str: + return "p={}".format(self.drop_prob) + + +# Copied from transformers.models.swin.modeling_swin.SwinSelfAttention with Swin->MaskFormerSwin +class MaskFormerSwinSelfAttention(nn.Module): + def __init__(self, config, dim, num_heads, window_size): + super().__init__() + if dim % num_heads != 0: + raise ValueError( + f"The hidden size ({dim}) is not a multiple of the number of attention heads ({num_heads})" + ) + + self.num_attention_heads = num_heads + self.attention_head_size = int(dim / num_heads) + self.all_head_size = self.num_attention_heads * self.attention_head_size + self.window_size = ( + window_size if isinstance(window_size, collections.abc.Iterable) else (window_size, window_size) + ) + + self.relative_position_bias_table = nn.Parameter( + torch.zeros((2 * self.window_size[0] - 1) * (2 * self.window_size[1] - 1), num_heads) + ) + + # get pair-wise relative position index for each token inside the window + coords_h = torch.arange(self.window_size[0]) + coords_w = torch.arange(self.window_size[1]) + coords = torch.stack(torch.meshgrid([coords_h, coords_w])) + coords_flatten = torch.flatten(coords, 1) + relative_coords = coords_flatten[:, :, None] - coords_flatten[:, None, :] + relative_coords = relative_coords.permute(1, 2, 0).contiguous() + relative_coords[:, :, 0] += self.window_size[0] - 1 + relative_coords[:, :, 1] += self.window_size[1] - 1 + relative_coords[:, :, 0] *= 2 * self.window_size[1] - 1 + relative_position_index = relative_coords.sum(-1) + self.register_buffer("relative_position_index", relative_position_index) + + self.query = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) + self.key = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) + self.value = nn.Linear(self.all_head_size, self.all_head_size, bias=config.qkv_bias) + + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def transpose_for_scores(self, x): + new_x_shape = x.size()[:-1] + (self.num_attention_heads, self.attention_head_size) + x = x.view(new_x_shape) + return x.permute(0, 2, 1, 3) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + batch_size, dim, num_channels = hidden_states.shape + mixed_query_layer = self.query(hidden_states) + + key_layer = self.transpose_for_scores(self.key(hidden_states)) + value_layer = self.transpose_for_scores(self.value(hidden_states)) + query_layer = self.transpose_for_scores(mixed_query_layer) + + # Take the dot product between "query" and "key" to get the raw attention scores. + attention_scores = torch.matmul(query_layer, key_layer.transpose(-1, -2)) + + attention_scores = attention_scores / math.sqrt(self.attention_head_size) + + relative_position_bias = self.relative_position_bias_table[self.relative_position_index.view(-1)] + relative_position_bias = relative_position_bias.view( + self.window_size[0] * self.window_size[1], self.window_size[0] * self.window_size[1], -1 + ) + + relative_position_bias = relative_position_bias.permute(2, 0, 1).contiguous() + attention_scores = attention_scores + relative_position_bias.unsqueeze(0) + + if attention_mask is not None: + # Apply the attention mask is (precomputed for all layers in MaskFormerSwinModel forward() function) + mask_shape = attention_mask.shape[0] + attention_scores = attention_scores.view( + batch_size // mask_shape, mask_shape, self.num_attention_heads, dim, dim + ) + attention_scores = attention_scores + attention_mask.unsqueeze(1).unsqueeze(0) + attention_scores = attention_scores.view(-1, self.num_attention_heads, dim, dim) + + # Normalize the attention scores to probabilities. + attention_probs = nn.functional.softmax(attention_scores, dim=-1) + + # This is actually dropping out entire tokens to attend to, which might + # seem a bit unusual, but is taken from the original Transformer paper. + attention_probs = self.dropout(attention_probs) + + # Mask heads if we want to + if head_mask is not None: + attention_probs = attention_probs * head_mask + + context_layer = torch.matmul(attention_probs, value_layer) + context_layer = context_layer.permute(0, 2, 1, 3).contiguous() + new_context_layer_shape = context_layer.size()[:-2] + (self.all_head_size,) + context_layer = context_layer.view(new_context_layer_shape) + + outputs = (context_layer, attention_probs) if output_attentions else (context_layer,) + + return outputs + + +# Copied from transformers.models.swin.modeling_swin.SwinSelfOutput with Swin->MaskFormerSwin +class MaskFormerSwinSelfOutput(nn.Module): + def __init__(self, config, dim): + super().__init__() + self.dense = nn.Linear(dim, dim) + self.dropout = nn.Dropout(config.attention_probs_dropout_prob) + + def forward(self, hidden_states: torch.Tensor, input_tensor: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + + return hidden_states + + +# Copied from transformers.models.swin.modeling_swin.SwinAttention with Swin->MaskFormerSwin +class MaskFormerSwinAttention(nn.Module): + def __init__(self, config, dim, num_heads, window_size): + super().__init__() + self.self = MaskFormerSwinSelfAttention(config, dim, num_heads, window_size) + self.output = MaskFormerSwinSelfOutput(config, dim) + self.pruned_heads = set() + + def prune_heads(self, heads): + if len(heads) == 0: + return + heads, index = find_pruneable_heads_and_indices( + heads, self.self.num_attention_heads, self.self.attention_head_size, self.pruned_heads + ) + + # Prune linear layers + self.self.query = prune_linear_layer(self.self.query, index) + self.self.key = prune_linear_layer(self.self.key, index) + self.self.value = prune_linear_layer(self.self.value, index) + self.output.dense = prune_linear_layer(self.output.dense, index, dim=1) + + # Update hyper params and store pruned heads + self.self.num_attention_heads = self.self.num_attention_heads - len(heads) + self.self.all_head_size = self.self.attention_head_size * self.self.num_attention_heads + self.pruned_heads = self.pruned_heads.union(heads) + + def forward( + self, + hidden_states: torch.Tensor, + attention_mask: Optional[torch.FloatTensor] = None, + head_mask: Optional[torch.FloatTensor] = None, + output_attentions: Optional[bool] = False, + ) -> Tuple[torch.Tensor]: + self_outputs = self.self(hidden_states, attention_mask, head_mask, output_attentions) + attention_output = self.output(self_outputs[0], hidden_states) + outputs = (attention_output,) + self_outputs[1:] # add attentions if we output them + return outputs + + +# Copied from transformers.models.swin.modeling_swin.SwinIntermediate with Swin->MaskFormerSwin +class MaskFormerSwinIntermediate(nn.Module): + def __init__(self, config, dim): + super().__init__() + self.dense = nn.Linear(dim, int(config.mlp_ratio * dim)) + if isinstance(config.hidden_act, str): + self.intermediate_act_fn = ACT2FN[config.hidden_act] + else: + self.intermediate_act_fn = config.hidden_act + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.intermediate_act_fn(hidden_states) + return hidden_states + + +# Copied from transformers.models.swin.modeling_swin.SwinOutput with Swin->MaskFormerSwin +class MaskFormerSwinOutput(nn.Module): + def __init__(self, config, dim): + super().__init__() + self.dense = nn.Linear(int(config.mlp_ratio * dim), dim) + self.dropout = nn.Dropout(config.hidden_dropout_prob) + + def forward(self, hidden_states: torch.Tensor) -> torch.Tensor: + hidden_states = self.dense(hidden_states) + hidden_states = self.dropout(hidden_states) + return hidden_states + + +class MaskFormerSwinLayer(nn.Module): + def __init__(self, config, dim, input_resolution, num_heads, shift_size=0): + super().__init__() + self.chunk_size_feed_forward = config.chunk_size_feed_forward + self.shift_size = shift_size + self.window_size = config.window_size + self.input_resolution = input_resolution + self.layernorm_before = nn.LayerNorm(dim, eps=config.layer_norm_eps) + self.attention = MaskFormerSwinAttention(config, dim, num_heads, self.window_size) + self.drop_path = ( + MaskFormerSwinDropPath(config.drop_path_rate) if config.drop_path_rate > 0.0 else nn.Identity() + ) + self.layernorm_after = nn.LayerNorm(dim, eps=config.layer_norm_eps) + self.intermediate = MaskFormerSwinIntermediate(config, dim) + self.output = MaskFormerSwinOutput(config, dim) + + def get_attn_mask(self, input_resolution): + if self.shift_size > 0: + # calculate attention mask for SW-MSA + height, width = input_resolution + img_mask = torch.zeros((1, height, width, 1)) + height_slices = ( + slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None), + ) + width_slices = ( + slice(0, -self.window_size), + slice(-self.window_size, -self.shift_size), + slice(-self.shift_size, None), + ) + count = 0 + for height_slice in height_slices: + for width_slice in width_slices: + img_mask[:, height_slice, width_slice, :] = count + count += 1 + + mask_windows = window_partition(img_mask, self.window_size) + mask_windows = mask_windows.view(-1, self.window_size * self.window_size) + attn_mask = mask_windows.unsqueeze(1) - mask_windows.unsqueeze(2) + attn_mask = attn_mask.masked_fill(attn_mask != 0, float(-100.0)).masked_fill(attn_mask == 0, float(0.0)) + else: + attn_mask = None + return attn_mask + + def maybe_pad(self, hidden_states, height, width): + pad_left = pad_top = 0 + pad_rigth = (self.window_size - width % self.window_size) % self.window_size + pad_bottom = (self.window_size - height % self.window_size) % self.window_size + pad_values = (0, 0, pad_left, pad_rigth, pad_top, pad_bottom) + hidden_states = nn.functional.pad(hidden_states, pad_values) + return hidden_states, pad_values + + def forward(self, hidden_states, input_dimensions, head_mask=None, output_attentions=False): + height, width = input_dimensions + batch_size, dim, channels = hidden_states.size() + shortcut = hidden_states + + hidden_states = self.layernorm_before(hidden_states) + hidden_states = hidden_states.view(batch_size, height, width, channels) + # pad hidden_states to multiples of window size + hidden_states, pad_values = self.maybe_pad(hidden_states, height, width) + + _, height_pad, width_pad, _ = hidden_states.shape + # cyclic shift + if self.shift_size > 0: + shifted_hidden_states = torch.roll(hidden_states, shifts=(-self.shift_size, -self.shift_size), dims=(1, 2)) + else: + shifted_hidden_states = hidden_states + + # partition windows + hidden_states_windows = window_partition(shifted_hidden_states, self.window_size) + hidden_states_windows = hidden_states_windows.view(-1, self.window_size * self.window_size, channels) + attn_mask = self.get_attn_mask((height_pad, width_pad)) + if attn_mask is not None: + attn_mask = attn_mask.to(hidden_states_windows.device) + + self_attention_outputs = self.attention( + hidden_states_windows, attn_mask, head_mask, output_attentions=output_attentions + ) + + attention_output = self_attention_outputs[0] + + outputs = self_attention_outputs[1:] # add self attentions if we output attention weights + + attention_windows = attention_output.view(-1, self.window_size, self.window_size, channels) + shifted_windows = window_reverse( + attention_windows, self.window_size, height_pad, width_pad + ) # B height' width' C + + # reverse cyclic shift + if self.shift_size > 0: + attention_windows = torch.roll(shifted_windows, shifts=(self.shift_size, self.shift_size), dims=(1, 2)) + else: + attention_windows = shifted_windows + + was_padded = pad_values[3] > 0 or pad_values[5] > 0 + if was_padded: + attention_windows = attention_windows[:, :height, :width, :].contiguous() + + attention_windows = attention_windows.view(batch_size, height * width, channels) + + hidden_states = shortcut + self.drop_path(attention_windows) + + layer_output = self.layernorm_after(hidden_states) + layer_output = self.intermediate(layer_output) + layer_output = hidden_states + self.output(layer_output) + + outputs = (layer_output,) + outputs + + return outputs + + +class MaskFormerSwinStage(nn.Module): + # Copied from transformers.models.swin.modeling_swin.SwinStage.__init__ with Swin->MaskFormerSwin + def __init__(self, config, dim, input_resolution, depth, num_heads, drop_path, downsample): + super().__init__() + self.config = config + self.dim = dim + self.blocks = nn.ModuleList( + [ + MaskFormerSwinLayer( + config=config, + dim=dim, + input_resolution=input_resolution, + num_heads=num_heads, + shift_size=0 if (i % 2 == 0) else config.window_size // 2, + ) + for i in range(depth) + ] + ) + + # patch merging layer + if downsample is not None: + self.downsample = downsample(input_resolution, dim=dim, norm_layer=nn.LayerNorm) + else: + self.downsample = None + + self.pointing = False + + def forward( + self, hidden_states, input_dimensions, head_mask=None, output_attentions=False, output_hidden_states=False + ): + all_hidden_states = () if output_hidden_states else None + + height, width = input_dimensions + for i, block_module in enumerate(self.blocks): + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + layer_head_mask = head_mask[i] if head_mask is not None else None + + block_hidden_states = block_module(hidden_states, input_dimensions, layer_head_mask, output_attentions) + + hidden_states = block_hidden_states[0] + + if output_hidden_states: + all_hidden_states += (hidden_states,) + + if self.downsample is not None: + height_downsampled, width_downsampled = (height + 1) // 2, (width + 1) // 2 + output_dimensions = (height, width, height_downsampled, width_downsampled) + hidden_states = self.downsample(hidden_states, input_dimensions) + else: + output_dimensions = (height, width, height, width) + + return hidden_states, output_dimensions, all_hidden_states + + +class MaskFormerSwinEncoder(nn.Module): + # Copied from transformers.models.swin.modeling_swin.SwinEncoder.__init__ with Swin->MaskFormerSwin + def __init__(self, config, grid_size): + super().__init__() + self.num_layers = len(config.depths) + self.config = config + dpr = [x.item() for x in torch.linspace(0, config.drop_path_rate, sum(config.depths))] + self.layers = nn.ModuleList( + [ + MaskFormerSwinStage( + config=config, + dim=int(config.embed_dim * 2**i_layer), + input_resolution=(grid_size[0] // (2**i_layer), grid_size[1] // (2**i_layer)), + depth=config.depths[i_layer], + num_heads=config.num_heads[i_layer], + drop_path=dpr[sum(config.depths[:i_layer]) : sum(config.depths[: i_layer + 1])], + downsample=MaskFormerSwinPatchMerging if (i_layer < self.num_layers - 1) else None, + ) + for i_layer in range(self.num_layers) + ] + ) + + self.gradient_checkpointing = False + + def forward( + self, + hidden_states, + input_dimensions, + head_mask=None, + output_attentions=False, + output_hidden_states=False, + return_dict=True, + ): + all_hidden_states = () if output_hidden_states else None + all_input_dimensions = () + all_self_attentions = () if output_attentions else None + + if output_hidden_states: + all_hidden_states = all_hidden_states + (hidden_states,) + + for i, layer_module in enumerate(self.layers): + layer_head_mask = head_mask[i] if head_mask is not None else None + + if self.gradient_checkpointing and self.training: + + def create_custom_forward(module): + def custom_forward(*inputs): + return module(*inputs, output_attentions) + + return custom_forward + + layer_hidden_states, output_dimensions, layer_all_hidden_states = torch.utils.checkpoint.checkpoint( + create_custom_forward(layer_module), hidden_states, layer_head_mask + ) + else: + layer_hidden_states, output_dimensions, layer_all_hidden_states = layer_module( + hidden_states, + input_dimensions, + layer_head_mask, + output_attentions, + output_hidden_states, + ) + + input_dimensions = (output_dimensions[-2], output_dimensions[-1]) + all_input_dimensions += (input_dimensions,) + if output_hidden_states: + all_hidden_states += (layer_all_hidden_states,) + + hidden_states = layer_hidden_states + + if output_attentions: + all_self_attentions = all_self_attentions + (layer_all_hidden_states[1],) + + if not return_dict: + return tuple(v for v in [hidden_states, all_hidden_states, all_self_attentions] if v is not None) + + return MaskFormerSwinBaseModelOutput( + last_hidden_state=hidden_states, + hidden_states=all_hidden_states, + hidden_states_spatial_dimensions=all_input_dimensions, + attentions=all_self_attentions, + ) + + +# Copied from transformers.models.swin.modeling_swin.SwinPreTrainedModel with Swin->MaskFormerSwin, swin->model +class MaskFormerSwinPreTrainedModel(PreTrainedModel): + """ + An abstract class to handle weights initialization and a simple interface for downloading and loading pretrained + models. + """ + + config_class = MaskFormerSwinConfig + base_model_prefix = "model" + main_input_name = "pixel_values" + supports_gradient_checkpointing = True + + def _init_weights(self, module): + """Initialize the weights""" + if isinstance(module, (nn.Linear, nn.Conv2d)): + # Slightly different from the TF version which uses truncated_normal for initialization + # cf https://github.com/pytorch/pytorch/pull/5617 + module.weight.data.normal_(mean=0.0, std=self.config.initializer_range) + if module.bias is not None: + module.bias.data.zero_() + elif isinstance(module, nn.LayerNorm): + module.bias.data.zero_() + module.weight.data.fill_(1.0) + + def _set_gradient_checkpointing(self, module, value=False): + if isinstance(module, MaskFormerSwinEncoder): + module.gradient_checkpointing = value + + +class MaskFormerSwinModel(MaskFormerSwinPreTrainedModel): + def __init__(self, config, add_pooling_layer=True): + super().__init__(config) + self.config = config + self.num_layers = len(config.depths) + self.num_features = int(config.embed_dim * 2 ** (self.num_layers - 1)) + + self.embeddings = MaskFormerSwinEmbeddings(config) + self.encoder = MaskFormerSwinEncoder(config, self.embeddings.patch_grid) + + self.layernorm = nn.LayerNorm(self.num_features, eps=config.layer_norm_eps) + self.pooler = nn.AdaptiveAvgPool1d(1) if add_pooling_layer else None + + def get_input_embeddings(self): + return self.embeddings.patch_embeddings + + def _prune_heads(self, heads_to_prune): + """ + Prunes heads of the model. heads_to_prune: dict of {layer_num: list of heads to prune in this layer} See base + class PreTrainedModel + """ + for layer, heads in heads_to_prune.items(): + self.encoder.layer[layer].attention.prune_heads(heads) + + def forward( + self, + pixel_values=None, + head_mask=None, + output_attentions=None, + output_hidden_states=None, + return_dict=None, + ): + output_attentions = output_attentions if output_attentions is not None else self.config.output_attentions + output_hidden_states = ( + output_hidden_states if output_hidden_states is not None else self.config.output_hidden_states + ) + return_dict = return_dict if return_dict is not None else self.config.use_return_dict + + if pixel_values is None: + raise ValueError("You have to specify pixel_values") + + # Prepare head mask if needed + # 1.0 in head_mask indicate we keep the head + # attention_probs has shape bsz x n_heads x N x N + # input head_mask has shape [num_heads] or [num_hidden_layers x num_heads] + # and head_mask is converted to shape [num_hidden_layers x batch x num_heads x seq_length x seq_length] + head_mask = self.get_head_mask(head_mask, len(self.config.depths)) + + embedding_output, input_dimensions = self.embeddings(pixel_values) + + encoder_outputs = self.encoder( + embedding_output, + input_dimensions, + head_mask=head_mask, + output_attentions=output_attentions, + output_hidden_states=output_hidden_states, + return_dict=return_dict, + ) + + sequence_output = encoder_outputs.last_hidden_state if return_dict else encoder_outputs[0] + sequence_output = self.layernorm(sequence_output) + + pooled_output = None + if self.pooler is not None: + pooled_output = self.pooler(sequence_output.transpose(1, 2)) + pooled_output = torch.flatten(pooled_output, 1) + + if not return_dict: + return (sequence_output, pooled_output) + encoder_outputs[1:] + + hidden_states_spatial_dimensions = (input_dimensions,) + encoder_outputs.hidden_states_spatial_dimensions + + return MaskFormerSwinModelOutputWithPooling( + last_hidden_state=sequence_output, + pooler_output=pooled_output, + hidden_states=encoder_outputs.hidden_states, + hidden_states_spatial_dimensions=hidden_states_spatial_dimensions, + attentions=encoder_outputs.attentions, + ) + + +class MaskFormerSwinBackbone(MaskFormerSwinPreTrainedModel, Backbone): + """ + This class converts [`MaskFormerSwinModel`] into a generic backbone to be consumed by frameworks like DETR and + MaskFormer. + + This classes reshapes `hidden_states` from (`batch_size, sequence_length, hidden_size)` to (`batch_size, + num_channels, height, width)`). It also adds additional layernorms after each stage. + + Args: + config (`MaskFormerSwinConfig`): + The configuration used by [`MaskFormerSwinModel`]. + """ + + def __init__(self, config: MaskFormerSwinConfig): + super().__init__(config) + + self.model = MaskFormerSwinModel(config) + + self.stage_names = [f"stage{i+1}" for i in range(len(config.depths))] + + self.out_features = config.out_features + + self.out_feature_strides = {f"stage{i+1}": 2 ** (i + 2) for i in range(len(config.depths))} + num_features = [int(config.embed_dim * 2**i) for i in range(len(config.depths))] + self.num_features = num_features + self.out_feature_channels = { + "stage1": self.num_features[0], + "stage2": self.num_features[1], + "stage3": self.num_features[2], + "stage4": self.num_features[3], + } + + self.hidden_states_norms = nn.ModuleList([nn.LayerNorm(x.channels) for x in self.output_shape().values()]) + + def forward(self, *args, **kwargs) -> List[Tensor]: + output = self.model(*args, **kwargs, output_hidden_states=True) + + hidden_states = output.hidden_states + + outputs = {} + + # we need to reshape the hidden state to their original spatial dimensions + # we skip the embeddings + hidden_states: Tuple[Tuple[Tensor]] = output.hidden_states[1:] + # spatial dimensions contains all the heights and widths of each stage, including after the embeddings + spatial_dimensions: Tuple[Tuple[int, int]] = output.hidden_states_spatial_dimensions + for i, (hidden_state, (height, width)) in enumerate(zip(hidden_states, spatial_dimensions)): + norm = self.hidden_states_norms[i] + # the last element corespond to the layer's last block output but before patch merging + hidden_state_unpolled = hidden_state[-1] + hidden_state_norm = norm(hidden_state_unpolled) + # the pixel decoder (FPN) expects 3D tensors (features) + batch_size, _, hidden_size = hidden_state_norm.shape + # reshape "b (h w) d -> b d h w" + hidden_state_permuted = ( + hidden_state_norm.permute(0, 2, 1).view((batch_size, hidden_size, height, width)).contiguous() + ) + outputs[f"stage{i+1}"] = hidden_state_permuted + + return outputs + + def output_shape(self): + """A dictionary that maps feature map names to their shapes (number of channels + stride).""" + + return { + name: ShapeSpec(channels=self.out_feature_channels[name], stride=self.out_feature_strides[name]) + for name in self.out_features + } diff --git a/src/transformers/models/maskformer/test.py b/src/transformers/models/maskformer/test.py new file mode 100644 index 000000000000..0e7ad9180583 --- /dev/null +++ b/src/transformers/models/maskformer/test.py @@ -0,0 +1,20 @@ +from transformers import MaskFormerConfig, MaskFormerForInstanceSegmentation, ResNetConfig, SwinConfig + + +# Instantiating from a config will always randomly initialize all the weights + +# option 1: use default config +config = MaskFormerConfig() +model = MaskFormerForInstanceSegmentation(config) + +# option 2: use custom Swin backbone +backbone_config = SwinConfig.from_pretrained("microsoft/swin-base-patch4-window12-384-in22k") +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) + +# option 3: use custom ResNet backbone +backbone_config = ResNetConfig(depths=[3, 2, 2, 3]) +config = MaskFormerConfig(backbone_config=backbone_config) +model = MaskFormerForInstanceSegmentation(config) + +# Initializing using from_pretrained will load the weights from the pretrained model diff --git a/src/transformers/models/resnet/__init__.py b/src/transformers/models/resnet/__init__.py index f62c2999671d..be110e9c50ab 100644 --- a/src/transformers/models/resnet/__init__.py +++ b/src/transformers/models/resnet/__init__.py @@ -36,6 +36,7 @@ "ResNetForImageClassification", "ResNetModel", "ResNetPreTrainedModel", + "ResNetBackbone", ] try: @@ -63,6 +64,7 @@ else: from .modeling_resnet import ( RESNET_PRETRAINED_MODEL_ARCHIVE_LIST, + ResNetBackbone, ResNetForImageClassification, ResNetModel, ResNetPreTrainedModel, diff --git a/src/transformers/models/resnet/configuration_resnet.py b/src/transformers/models/resnet/configuration_resnet.py index 8a863e6ae278..1137e17a728b 100644 --- a/src/transformers/models/resnet/configuration_resnet.py +++ b/src/transformers/models/resnet/configuration_resnet.py @@ -58,6 +58,8 @@ class ResNetConfig(PretrainedConfig): are supported. downsample_in_first_stage (`bool`, *optional*, defaults to `False`): If `True`, the first stage will downsample the inputs using a `stride` of 2. + out_features (`List[str]`, *optional*) + If used as a backbone, list of feature names to output, e.g. ["stem", "stage1"]. Example: ```python @@ -85,6 +87,7 @@ def __init__( layer_type="bottleneck", hidden_act="relu", downsample_in_first_stage=False, + out_features=None, **kwargs ): super().__init__(**kwargs) @@ -97,6 +100,7 @@ def __init__( self.layer_type = layer_type self.hidden_act = hidden_act self.downsample_in_first_stage = downsample_in_first_stage + self.out_features = out_features class ResNetOnnxConfig(OnnxConfig): diff --git a/src/transformers/models/resnet/modeling_resnet.py b/src/transformers/models/resnet/modeling_resnet.py index b771aa3e3125..059b38aad358 100644 --- a/src/transformers/models/resnet/modeling_resnet.py +++ b/src/transformers/models/resnet/modeling_resnet.py @@ -14,7 +14,7 @@ # limitations under the License. """ PyTorch ResNet model.""" -from typing import Optional +from typing import List, Optional import torch import torch.utils.checkpoint @@ -22,6 +22,7 @@ from torch.nn import BCEWithLogitsLoss, CrossEntropyLoss, MSELoss from ...activations import ACT2FN +from ...backbone import Backbone, ShapeSpec from ...modeling_outputs import ( BaseModelOutputWithNoAttention, BaseModelOutputWithPoolingAndNoAttention, @@ -416,3 +417,70 @@ def forward( return (loss,) + output if loss is not None else output return ImageClassifierOutputWithNoAttention(loss=loss, logits=logits, hidden_states=outputs.hidden_states) + + +class ResNetBackbone(ResNetPreTrainedModel, Backbone): + """ + This class converts [`ResNetModel`] into a generic backbone to be consumed by frameworks like DETR and MaskFormer. + + Args: + config (`ResNetConfig`): + The configuration used by [`ResNetModel`]. + """ + + def __init__(self, config: ResNetConfig): + super().__init__(config) + + self.resnet = ResNetModel(config) + + # note that this assumes there are always 4 stages + self.stage_names = ["stage1", "stage2", "stage3", "stage4"] + self.out_features = config.out_features + + # TODO calculate strides appropriately + current_stride = self.resnet.embedder.embedder.convolution.stride[0] + self.out_feature_strides = { + "stem": current_stride, + "stage1": current_stride * 2, + "stage2": current_stride * 2, + "stage3": current_stride * 2, + "stage4": current_stride * 2, + } + # for idx, feat in enumerate(self.out_features): + # # current_stride *= self.model.encoder.stages[idx].layers[-1].layer[-1].convolution.stride[0] + # for layer_seq in self.model.encoder.stages[idx].layers: + # current_stride *= layer_seq.layer.convolution.stride[0] + # # current_stride *= layer.layer[-1].convolution.stride[0] + # # current_stride = int(current_stride * np.prod([layer.layer.convolution.stride for layer in self.model.encoder.stages[idx].layers])) + # print("Current stride:", current_stride) + # self.out_feature_strides[feat] = current_stride + + self.out_feature_channels = { + "stem": config.embedding_size, + "stage1": config.hidden_sizes[0], + "stage2": config.hidden_sizes[1], + "stage3": config.hidden_sizes[2], + "stage4": config.hidden_sizes[3], + } + + def forward(self, *args, **kwargs) -> List[Tensor]: + output = self.resnet(*args, **kwargs, output_hidden_states=True) + + hidden_states = output.hidden_states + + outputs = {} + if "stem" in self.out_features: + outputs["stem"] = hidden_states[0] + + for idx, stage in enumerate(self.stage_names): + if stage in self.out_features: + outputs[stage] = hidden_states[idx + 1] + + return outputs + + def output_shape(self): + output_shape = { + name: ShapeSpec(channels=self.out_feature_channels[name], stride=self.out_feature_strides[name]) + for name in self.out_features + } + return output_shape diff --git a/src/transformers/models/resnet/test.py b/src/transformers/models/resnet/test.py new file mode 100644 index 000000000000..f50399ad73f6 --- /dev/null +++ b/src/transformers/models/resnet/test.py @@ -0,0 +1,13 @@ +import torch + +from transformers import ResNetBackbone, ResNetConfig + + +model = ResNetBackbone(ResNetConfig(out_features=["stem", "stage1", "stage2", "stage3", "stage4"])) + +pixel_values = torch.rand(1, 3, 224, 224) + +outputs = model(pixel_values) + +for key, value in outputs.items(): + print(key, value.shape) diff --git a/src/transformers/models/swin/configuration_swin.py b/src/transformers/models/swin/configuration_swin.py index b169e49a5738..a11704dd9515 100644 --- a/src/transformers/models/swin/configuration_swin.py +++ b/src/transformers/models/swin/configuration_swin.py @@ -125,6 +125,7 @@ def __init__( initializer_range=0.02, layer_norm_eps=1e-5, encoder_stride=32, + out_features=None, **kwargs ): super().__init__(**kwargs) @@ -151,6 +152,7 @@ def __init__( # we set the hidden_size attribute in order to make Swin work with VisionEncoderDecoderModel # this indicates the channel dimension after the last stage of the model self.hidden_size = int(embed_dim * 2 ** (len(depths) - 1)) + self.out_features = out_features class SwinOnnxConfig(OnnxConfig): diff --git a/src/transformers/utils/dummy_pt_objects.py b/src/transformers/utils/dummy_pt_objects.py index 08db45a62ab0..b504fb759015 100644 --- a/src/transformers/utils/dummy_pt_objects.py +++ b/src/transformers/utils/dummy_pt_objects.py @@ -430,6 +430,13 @@ def load_tf_weights_in_albert(*args, **kwargs): MODEL_WITH_LM_HEAD_MAPPING = None +class AutoBackbone(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + class AutoModel(metaclass=DummyObject): _backends = ["torch"] @@ -3251,6 +3258,27 @@ def __init__(self, *args, **kwargs): requires_backends(self, ["torch"]) +class MaskFormerSwinBackbone(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class MaskFormerSwinModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + +class MaskFormerSwinPreTrainedModel(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + class MBartForCausalLM(metaclass=DummyObject): _backends = ["torch"] @@ -4436,6 +4464,13 @@ def load_tf_weights_in_rembert(*args, **kwargs): RESNET_PRETRAINED_MODEL_ARCHIVE_LIST = None +class ResNetBackbone(metaclass=DummyObject): + _backends = ["torch"] + + def __init__(self, *args, **kwargs): + requires_backends(self, ["torch"]) + + class ResNetForImageClassification(metaclass=DummyObject): _backends = ["torch"] diff --git a/tests/models/maskformer/test_modeling_maskformer_swin.py b/tests/models/maskformer/test_modeling_maskformer_swin.py new file mode 100644 index 000000000000..d3f70972a43e --- /dev/null +++ b/tests/models/maskformer/test_modeling_maskformer_swin.py @@ -0,0 +1,348 @@ +# coding=utf-8 +# Copyright 2022 The HuggingFace Inc. team. All rights reserved. +# +# Licensed under the Apache License, Version 2.0 (the "License"); +# you may not use this file except in compliance with the License. +# You may obtain a copy of the License at +# +# http://www.apache.org/licenses/LICENSE-2.0 +# +# Unless required by applicable law or agreed to in writing, software +# distributed under the License is distributed on an "AS IS" BASIS, +# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +# See the License for the specific language governing permissions and +# limitations under the License. +""" Testing suite for the PyTorch MaskFormer Swin model. """ + +import collections +import inspect +import unittest +from typing import Dict, List, Tuple + +from transformers import SwinConfig +from transformers.testing_utils import require_torch, torch_device +from transformers.utils import is_torch_available + +from ...test_configuration_common import ConfigTester +from ...test_modeling_common import ModelTesterMixin, floats_tensor, ids_tensor + + +if is_torch_available(): + import torch + from torch import nn + + from transformers import MaskFormerSwinModel + + +class MaskFormerSwinModelTester: + def __init__( + self, + parent, + batch_size=13, + image_size=32, + patch_size=2, + num_channels=3, + embed_dim=16, + depths=[1, 2, 1], + num_heads=[2, 2, 4], + window_size=2, + mlp_ratio=2.0, + qkv_bias=True, + hidden_dropout_prob=0.0, + attention_probs_dropout_prob=0.0, + drop_path_rate=0.1, + hidden_act="gelu", + use_absolute_embeddings=False, + patch_norm=True, + initializer_range=0.02, + layer_norm_eps=1e-5, + is_training=True, + scope=None, + use_labels=True, + type_sequence_label_size=10, + encoder_stride=8, + ): + self.parent = parent + self.batch_size = batch_size + self.image_size = image_size + self.patch_size = patch_size + self.num_channels = num_channels + self.embed_dim = embed_dim + self.depths = depths + self.num_heads = num_heads + self.window_size = window_size + self.mlp_ratio = mlp_ratio + self.qkv_bias = qkv_bias + self.hidden_dropout_prob = hidden_dropout_prob + self.attention_probs_dropout_prob = attention_probs_dropout_prob + self.drop_path_rate = drop_path_rate + self.hidden_act = hidden_act + self.use_absolute_embeddings = use_absolute_embeddings + self.patch_norm = patch_norm + self.layer_norm_eps = layer_norm_eps + self.initializer_range = initializer_range + self.is_training = is_training + self.scope = scope + self.use_labels = use_labels + self.type_sequence_label_size = type_sequence_label_size + self.encoder_stride = encoder_stride + + def prepare_config_and_inputs(self): + pixel_values = floats_tensor([self.batch_size, self.num_channels, self.image_size, self.image_size]) + + labels = None + if self.use_labels: + labels = ids_tensor([self.batch_size], self.type_sequence_label_size) + + config = self.get_config() + + return config, pixel_values, labels + + def get_config(self): + return SwinConfig( + image_size=self.image_size, + patch_size=self.patch_size, + num_channels=self.num_channels, + embed_dim=self.embed_dim, + depths=self.depths, + num_heads=self.num_heads, + window_size=self.window_size, + mlp_ratio=self.mlp_ratio, + qkv_bias=self.qkv_bias, + hidden_dropout_prob=self.hidden_dropout_prob, + attention_probs_dropout_prob=self.attention_probs_dropout_prob, + drop_path_rate=self.drop_path_rate, + hidden_act=self.hidden_act, + use_absolute_embeddings=self.use_absolute_embeddings, + path_norm=self.patch_norm, + layer_norm_eps=self.layer_norm_eps, + initializer_range=self.initializer_range, + encoder_stride=self.encoder_stride, + ) + + def create_and_check_model(self, config, pixel_values, labels): + model = MaskFormerSwinModel(config=config) + model.to(torch_device) + model.eval() + result = model(pixel_values) + + expected_seq_len = ((config.image_size // config.patch_size) ** 2) // (4 ** (len(config.depths) - 1)) + expected_dim = int(config.embed_dim * 2 ** (len(config.depths) - 1)) + + self.parent.assertEqual(result.last_hidden_state.shape, (self.batch_size, expected_seq_len, expected_dim)) + + def prepare_config_and_inputs_for_common(self): + config_and_inputs = self.prepare_config_and_inputs() + ( + config, + pixel_values, + labels, + ) = config_and_inputs + inputs_dict = {"pixel_values": pixel_values} + return config, inputs_dict + + +@require_torch +class MaskFormerSwinModelTest(ModelTesterMixin, unittest.TestCase): + + all_model_classes = (MaskFormerSwinModel,) if is_torch_available() else () + fx_compatible = False + test_torchscript = False + test_pruning = False + test_resize_embeddings = False + test_head_masking = False + + def setUp(self): + self.model_tester = MaskFormerSwinModelTester(self) + self.config_tester = ConfigTester(self, config_class=SwinConfig, embed_dim=37) + + def test_config(self): + self.create_and_test_config_common_properties() + self.config_tester.create_and_test_config_to_json_string() + self.config_tester.create_and_test_config_to_json_file() + self.config_tester.create_and_test_config_from_and_save_pretrained() + self.config_tester.create_and_test_config_with_num_labels() + self.config_tester.check_config_can_be_init_without_params() + self.config_tester.check_config_arguments_init() + + def create_and_test_config_common_properties(self): + return + + def test_model(self): + config_and_inputs = self.model_tester.prepare_config_and_inputs() + self.model_tester.create_and_check_model(*config_and_inputs) + + def test_inputs_embeds(self): + # Swin does not use inputs_embeds + pass + + def test_model_common_attributes(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + self.assertIsInstance(model.get_input_embeddings(), (nn.Module)) + x = model.get_output_embeddings() + self.assertTrue(x is None or isinstance(x, nn.Linear)) + + def test_forward_signature(self): + config, _ = self.model_tester.prepare_config_and_inputs_for_common() + + for model_class in self.all_model_classes: + model = model_class(config) + signature = inspect.signature(model.forward) + # signature.parameters is an OrderedDict => so arg_names order is deterministic + arg_names = [*signature.parameters.keys()] + + expected_arg_names = ["pixel_values"] + self.assertListEqual(arg_names[:1], expected_arg_names) + + @unittest.skip(reason="MaskFormerSwin is only used as backbone and doesn't support output_attentions") + def test_attention_outputs(self): + pass + + def check_hidden_states_output(self, inputs_dict, config, model_class, image_size): + model = model_class(config) + model.to(torch_device) + model.eval() + + with torch.no_grad(): + outputs = model(**self._prepare_for_class(inputs_dict, model_class)) + + hidden_states = outputs.hidden_states + + expected_num_layers = getattr( + self.model_tester, "expected_num_hidden_layers", len(self.model_tester.depths) + 1 + ) + self.assertEqual(len(hidden_states), expected_num_layers) + + # Swin has a different seq_length + patch_size = ( + config.patch_size + if isinstance(config.patch_size, collections.abc.Iterable) + else (config.patch_size, config.patch_size) + ) + + num_patches = (image_size[1] // patch_size[1]) * (image_size[0] // patch_size[0]) + + self.assertListEqual( + list(hidden_states[0].shape[-2:]), + [num_patches, self.model_tester.embed_dim], + ) + + def test_hidden_states_output(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + image_size = ( + self.model_tester.image_size + if isinstance(self.model_tester.image_size, collections.abc.Iterable) + else (self.model_tester.image_size, self.model_tester.image_size) + ) + + for model_class in self.all_model_classes: + inputs_dict["output_hidden_states"] = True + self.check_hidden_states_output(inputs_dict, config, model_class, image_size) + + # check that output_hidden_states also work using config + del inputs_dict["output_hidden_states"] + config.output_hidden_states = True + + self.check_hidden_states_output(inputs_dict, config, model_class, image_size) + + def test_hidden_states_output_with_padding(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + config.patch_size = 3 + + image_size = ( + self.model_tester.image_size + if isinstance(self.model_tester.image_size, collections.abc.Iterable) + else (self.model_tester.image_size, self.model_tester.image_size) + ) + patch_size = ( + config.patch_size + if isinstance(config.patch_size, collections.abc.Iterable) + else (config.patch_size, config.patch_size) + ) + + padded_height = image_size[0] + patch_size[0] - (image_size[0] % patch_size[0]) + padded_width = image_size[1] + patch_size[1] - (image_size[1] % patch_size[1]) + + for model_class in self.all_model_classes: + inputs_dict["output_hidden_states"] = True + self.check_hidden_states_output(inputs_dict, config, model_class, (padded_height, padded_width)) + + # check that output_hidden_states also work using config + del inputs_dict["output_hidden_states"] + config.output_hidden_states = True + self.check_hidden_states_output(inputs_dict, config, model_class, (padded_height, padded_width)) + + @unittest.skip(reason="MaskFormerSwin doesn't have pretrained checkpoints") + def test_model_from_pretrained(self): + pass + + @unittest.skip(reason="This will be fixed once MaskFormerSwin is replaced by native Swin") + def test_initialization(self): + pass + + @unittest.skip(reason="This will be fixed once MaskFormerSwin is replaced by native Swin") + def test_gradient_checkpointing_backward_compatibility(self): + pass + + def test_model_outputs_equivalence(self): + config, inputs_dict = self.model_tester.prepare_config_and_inputs_for_common() + + def set_nan_tensor_to_zero(t): + t[t != t] = 0 + return t + + def check_equivalence(model, tuple_inputs, dict_inputs, additional_kwargs={}): + with torch.no_grad(): + tuple_output = model(**tuple_inputs, return_dict=False, **additional_kwargs) + dict_output = model(**dict_inputs, return_dict=True, **additional_kwargs).to_tuple() + + def recursive_check(tuple_object, dict_object): + if isinstance(tuple_object, (List, Tuple)): + for tuple_iterable_value, dict_iterable_value in zip(tuple_object, dict_object): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif isinstance(tuple_object, Dict): + for tuple_iterable_value, dict_iterable_value in zip( + tuple_object.values(), dict_object.values() + ): + recursive_check(tuple_iterable_value, dict_iterable_value) + elif tuple_object is None: + return + else: + self.assertTrue( + torch.allclose( + set_nan_tensor_to_zero(tuple_object), set_nan_tensor_to_zero(dict_object), atol=1e-5 + ), + msg=( + "Tuple and dict output are not equal. Difference:" + f" {torch.max(torch.abs(tuple_object - dict_object))}. Tuple has `nan`:" + f" {torch.isnan(tuple_object).any()} and `inf`: {torch.isinf(tuple_object)}. Dict has" + f" `nan`: {torch.isnan(dict_object).any()} and `inf`: {torch.isinf(dict_object)}." + ), + ) + + recursive_check(tuple_output, dict_output) + + for model_class in self.all_model_classes: + model = model_class(config) + model.to(torch_device) + model.eval() + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class) + dict_inputs = self._prepare_for_class(inputs_dict, model_class) + check_equivalence(model, tuple_inputs, dict_inputs) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class) + dict_inputs = self._prepare_for_class(inputs_dict, model_class) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) + + tuple_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + dict_inputs = self._prepare_for_class(inputs_dict, model_class, return_labels=True) + check_equivalence(model, tuple_inputs, dict_inputs, {"output_hidden_states": True}) diff --git a/utils/check_copies.py b/utils/check_copies.py index f5d1b1c4169b..ca36a90460db 100644 --- a/utils/check_copies.py +++ b/utils/check_copies.py @@ -494,6 +494,7 @@ def check_model_list_copy(overwrite=False, max_per_line=119): "Data2VecVision": "Data2Vec", "DonutSwin": "Donut", "Marian": "MarianMT", + "MaskFormerSwin": "MaskFormer", "OpenAI GPT-2": "GPT-2", "OpenAI GPT": "GPT", "Perceiver": "Perceiver IO", diff --git a/utils/check_repo.py b/utils/check_repo.py index 8b02185fa9bd..b5d194bfafeb 100644 --- a/utils/check_repo.py +++ b/utils/check_repo.py @@ -46,6 +46,8 @@ # Being in this list is an exception and should **not** be the rule. IGNORE_NON_TESTED = PRIVATE_MODELS.copy() + [ # models to ignore for not tested + "MaskFormerSwinBackbone", # we should write separate tests for backbones + "ResNetBackbone", # we should write separate tests for backbones "CLIPSegDecoder", # Building part of bigger (tested) model. "TableTransformerEncoder", # Building part of bigger (tested) model. "TableTransformerDecoder", # Building part of bigger (tested) model. @@ -240,6 +242,7 @@ ("data2vec-audio", "data2vec"), ("data2vec-vision", "data2vec"), ("donut-swin", "donut"), + ("maskformer-swin", "maskformer"), ] )