Refactor swin transformer so later we can reuse components for the 3d version #6088
Conversation
… without head by specifying num_heads=None
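The headless-model idea can be sketched with a toy module (a hypothetical `TinyBackbone`, not torchvision's actual `SwinTransformer`, assuming the pattern the PR describes: passing `num_classes=None` skips building the classification head entirely):

```python
import torch
from torch import nn
from typing import Optional

class TinyBackbone(nn.Module):
    def __init__(self, embed_dim: int = 8, num_classes: Optional[int] = 10):
        super().__init__()
        self.features = nn.Linear(4, embed_dim)
        # With num_classes=None, no head is built and forward returns raw features.
        self.head = nn.Linear(embed_dim, num_classes) if num_classes is not None else None

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.features(x)
        if self.head is not None:
            x = self.head(x)
        return x
```

For example, `TinyBackbone(num_classes=None)(torch.randn(2, 4))` yields the `(2, 8)` feature tensor, while the default returns `(2, 10)` logits. (As the later discussion notes, the optional-`None` attribute needs care to stay jit-scriptable.)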
I noticed the failed test on jit / fx; I will investigate it.
@xiaohu2015 I wonder if you have the bandwidth to check the proposed updates in this PR. To provide some context, Yosua is making some small adjustments to the original implementation so that we can more easily support the 3D version of the Swin model. Please let us know if you have any thoughts.
…swin-transformer-breaking
…rap with torch.fx.wrap so it is excluded from tracing
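A minimal sketch of the `torch.fx.wrap` pattern that commit describes, using a toy helper (the name `_clamp_window` is hypothetical, not torchvision's code): wrapping a free function makes symbolic tracing record it as a single opaque `call_function` node instead of tracing into its data-dependent Python logic.

```python
import torch
import torch.fx

def _clamp_window(window_size, input_hw):
    # Data-dependent Python control flow that would otherwise
    # be baked into (or break) the symbolically traced graph.
    return [min(w, s) for w, s in zip(window_size, input_hw)]

# Exclude the helper from tracing; it shows up as one opaque node.
torch.fx.wrap("_clamp_window")

class Toy(torch.nn.Module):
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        ws = _clamp_window([7, 7], [x.shape[-2], x.shape[-1]])
        # narrow() accepts the traced values as runtime ints.
        return x.narrow(-2, 0, ws[0]).narrow(-1, 0, ws[1])

gm = torch.fx.symbolic_trace(Toy())
```

At runtime the `GraphModule` calls the real Python function with concrete shapes, so the `min(...)` logic still executes per input.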
…mer-breaking [WIP] Part of refactoring swin transformer that requires re-training the model
Hi @xiaohu2015 , it would be great if you could take a look at this PR :)
We make the following changes:
- Use `List[int]` for params like `window_size`, `shift_size`, and `patch_size`
- Make `PatchMerging` and `SwinTransformerBlock` usable in both 2d and 3d
- Separate `PatchEmbed` from the `SwinTransformer` class and enable the user to change it via a constructor param
- Make `num_classes` optional, and enable the user to get the model without a head by setting it to `None`
- Update the method to handle edge cases where `window_size` is larger than the input
- Change the weight url to the ported weights (just ported, not retrained)
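The window-size edge-case handling can be sketched as a standalone helper (a hypothetical simplification, not torchvision's actual `_fix_window_and_shift_size`): when a window dimension is at least as large as the input dimension, clamp the window to the input size and disable the cyclic shift in that dimension.

```python
from typing import List, Tuple

def fix_window_and_shift_size(
    input_size: List[int], window_size: List[int], shift_size: List[int]
) -> Tuple[List[int], List[int]]:
    # Works for any number of dims, which is what lets the same
    # logic serve both the 2d and 3d (video) variants.
    new_window = list(window_size)
    new_shift = list(shift_size)
    for i in range(len(input_size)):
        if input_size[i] <= window_size[i]:
            # Window covers the whole input: shrink it and skip the shift.
            new_window[i] = input_size[i]
            new_shift[i] = 0
    return new_window, new_shift
```

For example, a 6x6 feature map with a 7x7 window and 3x3 shift becomes a 6x6 window with no shift, while a 56x56 map is left untouched.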
@@ -428,7 +506,7 @@ class Swin_T_Weights(WeightsEnum):
     "recipe": "https://github.com/pytorch/vision/tree/main/references/classification#swintransformer",
     "_metrics": {
         "ImageNet-1K": {
-            "acc@1": 81.474,
+            "acc@1": 81.470,
The acc@1 of swin_t decreased by 0.004%, but I think this is due to randomness, as the acc@5 for swin_s and the acc@1 for swin_b each increased by 0.002%.
I don't think you need to change this. Every time you run the analysis you might get a slightly different number on the 3rd decimal and there is not much you can do about it.
@YosuaMichael All your modifications look good to me. But I have one question: the two ways may be different and could have an impact on performance (this doesn't affect the current model, since we don't have such a situation).
Hi @xiaohu2015 , thanks for taking a look :)
You are right, I also think the second way is more reasonable (for a 192x192 input, we should use window_size=6 in the last stage in SwinV2).
@YosuaMichael I think you may want to implement https://arxiv.org/pdf/2106.13230.pdf? It should use 3D windows, or maybe I missed something?
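The window-size arithmetic behind that remark: Swin embeds with a patch size of 4 and then halves the resolution at each of the three patch-merging stages, a 32x total reduction by the last stage (patch size and stage count here follow the standard Swin configuration).

```python
def last_stage_resolution(input_size: int, patch_size: int = 4, num_merges: int = 3) -> int:
    # Total downsampling factor = patch_size * 2**num_merges (4 * 8 = 32 for Swin).
    return input_size // (patch_size * 2 ** num_merges)
```

So a 224x224 input gives a 7x7 last-stage feature map, where a window of 7 fits exactly, while a 192x192 input gives 6x6, where a window of 7 must be clamped to 6.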
@xiaohu2015 Yes, you are right! I want to implement |
Thanks @YosuaMichael, just a few comments below
…chael/vision into models/refactor-swin-transformer
LGTM, thanks @YosuaMichael.
Can you please provide proof from new inference runs to confirm the accuracy of the models remains unaffected?
We can merge on Green CI.
@datumbox I have rerun the validation script and it returned the exact same results as before (the ones in the PR description). Will merge now.
Hey @YosuaMichael! You merged this PR, but no labels were added. The list of valid labels is available at https://github.com/pytorch/vision/blob/main/.github/process_commit.py
…on (pytorch#6088)

* Use List[int] instead of int for window_size and shift_size
* Make PatchMerging and SwinTransformerBlock able to handle 2d and 3d cases
* Separate patch embedding from SwinTransformer and enable to get model without head by specifying num_heads=None
* Dont use if before padding so it is fx friendly
* Put the handling on window_size edge cases on separate function and wrap with torch.fx.wrap so it is excluded from tracing
* Update the weight url to the converted weight with new structure
* Update the accuracy of swin_transformer
* Change assert to Exception and nit
* Make num_classes optional
* Add typing output for _fix_window_and_shift_size function
* init head to None to make it jit scriptable
* Revert the change to make num_classes optional
* Revert unneccesarry changes that might be risky
* Remove self.head declaration
Why did you revert this modification?
…on (#6088) (#6100)
… 3d version (#6088)

Reviewed By: NicolasHug

Differential Revision: D36760917

fbshipit-source-id: 920177e069913775773c45e19abeb32017faaaee
It may be important for the swin_v2 model: since it uses a 192x192 input, the window_size in the last stage will be 6 rather than 7.
@xiaohu2015 Thanks for the feedback, and sorry we missed your message! @YosuaMichael could you please provide the details?
@xiaohu2015 Sorry for the late reply, and thanks for the feedback. The reason we reverted the changes is that we wanted to cherry-pick them into the torchvision 0.13.0 release, and at that time we decided not to change any previous behaviour, only to make "safer" changes aimed at reusing components for the 3d swin_transformer. I think we can put it back now with another PR; I will create one and tag you!
@xiaohu2015 I created the changes in this PR: #6222
Refactoring swin_transformer so we can reuse some components for video swin transformer.
Discussion is in: facebookresearch/multimodal#43 (comment)
In this change, we need to convert the previous weights to adapt to the new structure. After converting, I tested all the weights; here are the validation script and results:
Overall there are some minor changes (±0.004%) that I think are still acceptable.