
Parameter Groups / Transfer Learning #514

Closed
DrClick opened this issue Nov 15, 2019 · 22 comments
Labels
example, feature (Is an improvement or enhancement), help wanted (Open to be worked on), question (Further information is requested), won't fix (This will not be worked on)

Comments

@DrClick

DrClick commented Nov 15, 2019

I am trying to train a pre-trained resnet50 (frozen) with a small network on top of it, and then unfreeze it and train the whole network.

My questions are:
1 - Where is the most appropriate place in the framework to create parameter groups?
2 - Does it make sense to add options to freeze/unfreeze to support selectively freezing groups?

Regarding the current implementation of freeze/unfreeze: it has the side effect of setting the model to eval/train mode, which seems inappropriate. If this is of interest, I am happy to make a pull request.

@williamFalcon
Contributor

Great questions.

  1. I would structure transfer learning like this:

def __init__(self, ...):
    # load the pretrained backbone and freeze it right away
    self.pretrained_model = SomeModel.load_from_...()
    self.pretrained_model.freeze()

    # new layers trained on top of the frozen backbone
    self.finetune_model = ...


def configure_optimizers(self):
    # optimize only the new layers; the backbone stays frozen
    return Adam(self.finetune_model.parameters(), ...)

  2. .eval() disables dropout and switches batchnorm to its running statistics, which you don't get just from disabling grad. This means both are necessary to use a pretrained model as a feature extractor.
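For illustration, a minimal sketch of a freeze helper along those lines (a hypothetical function, not Lightning's actual implementation), showing that both steps are needed:

import torch.nn as nn

def freeze_as_feature_extractor(module: nn.Module) -> None:
    """Hypothetical helper: stop gradients AND switch dropout/batchnorm to inference behaviour."""
    for param in module.parameters():
        param.requires_grad = False  # the optimizer will never update these weights
    module.eval()                    # dropout off, batchnorm uses its running statistics

# usage: freeze_as_feature_extractor(self.pretrained_model)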

@DrClick
Author

DrClick commented Nov 15, 2019

Hey William, I am a bit confused by that response. Let me be clearer about the process.
I have defined my model as:

        self.layer_groups = OrderedDict()
        
        self.resnet = nn.Sequential(
            *list(models.resnet50(pretrained=True).children())[:-2]
        )
        self.layer_groups["resnet"] = self.resnet

        self.classifier_head = nn.Sequential(
            *[
                nn.AdaptiveAvgPool2d(output_size=1),
                nn.AdaptiveMaxPool2d(output_size=1),
                nn.Flatten(),
                nn.BatchNorm1d(
                    2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
                ),
                nn.Dropout(p=0.25),
                nn.Linear(2048, 512, bias=True),
                nn.ReLU(),
                nn.BatchNorm1d(
                    512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
                ),
                nn.Dropout(p=0.5),
                nn.Linear(512, 2, bias=True),
            ]
        )
        self.layer_groups["classifier_head"] = self.classifier_head

I would like to train this full network with self.resnet frozen for a few epochs. Then unfreeze this network and train the whole model some more.

To accomplish this, I have added the following methods:

def freeze_to(self, n: int, exclude_types=(nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) -> None:
    """Freeze layer groups up to (but not including) group `n`.

    Look at each group and freeze each parameter, except for the excluded types.
    """
    print("freezing", f"freeze_to called, level requested {n}")

    def set_requires_grad_for_module(module: nn.Module, requires_grad: bool):
        """Set `requires_grad` on every parameter in the module."""
        for param in module.parameters():
            param.requires_grad = requires_grad

    # layer_groups is an OrderedDict; freeze the groups before index n
    for group_key in list(self.layer_groups)[:n]:
        group = self.layer_groups[group_key]
        for layer in group:
            if not isinstance(layer, exclude_types):
                set_requires_grad_for_module(layer, False)

    # and unfreeze the groups from index n onwards
    for group_key in list(self.layer_groups)[n:]:
        group = self.layer_groups[group_key]
        set_requires_grad_for_module(group, True)

def freeze(self) -> None:
    self.freeze_to(len(self.layer_groups))

def unfreeze(self) -> None:
    self.freeze_to(0)
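For context, a possible call sequence for the methods above (hypothetical usage, not from the original post):

# stage 1: freeze the first group ("resnet"), keep "classifier_head" trainable
model.freeze_to(1)
# ... train for a few epochs ...

# stage 2: unfreeze everything and keep training
model.unfreeze()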

It seems to me, in the example you have given, that when you try to train, the model will be stuck in eval mode because of the call to model.eval(). I think you may have the idea that the pre-trained model is a LightningModule that has already been trained. Is this the case? If so, that means pre-trained networks need to be wrapped in a LightningModule, which I think goes against the idea of "just PyTorch."

@LucFrachon

Hi, I too would be interested in a step-by-step 'tutorial' for doing transfer learning with Pytorch-Lightning. Is this something that might be added to the docs?

@Borda added the feature, help wanted, question, and example labels on Jan 24, 2020
@Borda
Member

Borda commented Mar 26, 2020

@LucFrachon would you be interested in creating such a tutorial?
@jeremyjordan @awaelchli pls ^^

@LucFrachon

LucFrachon commented Mar 31, 2020

I'm not sure I am fully qualified for this :-)
I believe such a tutorial should at least cover:

  • Freezing parts of a model for a few epochs
  • Unfreezing and fine-tuning (in line with @DrClick's ideas), possibly with a different learning rate
  • Setting up different learning rates/weight decay factors AND different learning rate schedules for different parameter groups

I can see how to do these things individually, but I struggle to see how to integrate them elegantly in a LightningModule...
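For what it's worth, a rough sketch of how per-group learning rates, weight decay, and a scheduler could sit together in configure_optimizers (hypothetical self.backbone / self.head attributes, not from any official example):

from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

def configure_optimizers(self):
    # one parameter group per part of the network, each with its own lr and weight decay
    optimizer = AdamW([
        {"params": self.backbone.parameters(), "lr": 1e-4, "weight_decay": 1e-4},
        {"params": self.head.parameters(), "lr": 1e-3, "weight_decay": 1e-2},
    ])
    # a single scheduler scales every group's lr; per-group schedules can be built with LambdaLR
    scheduler = StepLR(optimizer, step_size=5, gamma=0.1)
    return [optimizer], [scheduler]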

@jbschiratti
Contributor

@Borda @LucFrachon I was working on something very similar these last few days.
What I want is to have a pre-trained feature extractor (say ResNet50) and be able to do the following (part of ongoing work to reproduce the results of a research paper):

  • keep the feature extractor frozen with lr = 1e-2 for a few epochs
  • unfreeze the feature extractor and train with lr = 1e-3 for a few epochs
  • keep training with lr = 1e-4.

I wrote this gist. Here, the network is training on dummy data (just noise). Still, if you think this is relevant, I would be happy to discuss it with you and make it into a pytorch-lightning tutorial.
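This is not the gist itself, but a rough sketch of the schedule described above (hypothetical class and attribute names, assuming a recent Lightning API with the on_train_epoch_start hook):

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torchvision import models
import pytorch_lightning as pl

class ThreeStageFineTuner(pl.LightningModule):
    def __init__(self, num_classes=2, unfreeze_epoch=5, final_lr_epoch=10):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        # stage 1: backbone weights frozen (note: batchnorm running stats still update in train mode)
        self.feature_extractor.requires_grad_(False)
        self.classifier = nn.Linear(backbone.fc.in_features, num_classes)
        self.unfreeze_epoch = unfreeze_epoch
        self.final_lr_epoch = final_lr_epoch

    def forward(self, x):
        return self.classifier(self.feature_extractor(x).flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        # stage 1: only the classifier head is trainable, lr = 1e-2
        return SGD(self.classifier.parameters(), lr=1e-2, momentum=0.9)

    def on_train_epoch_start(self):
        optimizer = self.trainer.optimizers[0]
        if self.current_epoch == self.unfreeze_epoch:
            # stage 2: unfreeze the backbone and drop every group's lr to 1e-3
            self.feature_extractor.requires_grad_(True)
            optimizer.add_param_group({"params": self.feature_extractor.parameters()})
            for group in optimizer.param_groups:
                group["lr"] = 1e-3
        elif self.current_epoch == self.final_lr_epoch:
            # stage 3: keep training with lr = 1e-4
            for group in optimizer.param_groups:
                group["lr"] = 1e-4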

@Borda
Member

Borda commented Apr 8, 2020

@jbschiratti
Contributor

from torchvision.datasets import FakeData

I used this fake dataset just as a proof of concept.

@Borda
Member

Borda commented Apr 8, 2020

Cool, mind sending a PR as a lightning example?
Just thinking about having it run on a real (small) dataset...
cc: @PyTorchLightning/core-contributors @williamFalcon

@jbschiratti
Contributor

Sure! But I think it would be more relevant with some real images (the dataset used in this example, for instance). Don't you think?

@Borda
Member

Borda commented Apr 9, 2020

Well, real-world examples would be nice, but we still need to stay in a kind of minimal mode; we do not want a user to download a couple of GB of data just for an example... BUT your Ants/Bees dataset looks good.

@sairahul

Hi, I created a similar example using fastai and pytorch-lightning. It might be useful for someone: https://github.com/sairahul/mlexperiments/blob/master/pytorch-lightning/fine_tuning_example.py

@jbschiratti
Contributor

@Borda I eventually made a PR with a slightly modified version of the example I proposed.

@lizhitwo

@Borda I eventually made a PR with a slightly modified version of the example I proposed.

Thank you for the tutorial! This is much needed. I have a small question -- is there a reason why the frozen parameters cannot be added to the optimizer from the beginning, and have to be added at specific epochs? AFAIK, if .requires_grad is False, the optimizer will ignore the parameter.

@jbschiratti
Contributor

@lizhitwo Sure! I added the parameters separately to emphasize the idea that these parameters were not trained before a given epoch. AFAIK, what you're proposing would work as well!
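For reference, a minimal sketch of the alternative being discussed (hypothetical self.backbone / self.head attributes): frozen parameters can be handed to the optimizer up front, because torch.optim skips any parameter whose .grad is None.

from torch.optim import Adam

def configure_optimizers(self):
    # hand over all parameters once; the frozen backbone has requires_grad=False,
    # so its .grad stays None and optimizer.step() simply skips those tensors
    return Adam([
        {"params": self.backbone.parameters(), "lr": 1e-4},  # frozen at first
        {"params": self.head.parameters(), "lr": 1e-3},
    ])

# later, unfreezing is just a flag flip; no add_param_group is needed:
# for p in self.backbone.parameters():
#     p.requires_grad = True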

@lizhitwo

@jbschiratti Thanks for the explanation!

@hcjghr

hcjghr commented Apr 27, 2020

@jbschiratti (and maybe someone else), is there a big advantage to either approach? I understand you made it this way to emphasize the approach, but in general, which would be more optimal?

I came across a discussion on the pytorch forum (https://discuss.pytorch.org/t/passing-a-subset-of-the-parameters-to-an-optimizer-equivalent-to-setting-requires-grad-of-subset-only-to-true/42866) where it is suggested that passing all the parameters to the optimizer and then marking the frozen ones with the requires_grad=False flag prevents their gradients from being computed, and subsequently saves some memory. Not sure if this is still relevant as the discussion is a year old...

@jbschiratti
Contributor

@hcjghr The point of having parameter groups (and adding the parameters sequentially as we unfreeze them) is to allow for different learning rates. I updated the example in #1564. If you had a single parameter group in your optimizer, you might not be able to use such training strategies.

@hcjghr

hcjghr commented Apr 28, 2020

@jbschiratti I completely understand it now and I definitely agree this is the best way to have it in the example to show different possibilities. Thanks for the quick explanation.

@stale

stale bot commented Jun 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Jun 27, 2020
stale bot closed this as completed on Jul 6, 2020
@uwaisiqbal

Hey, I'm trying to do something similar with some categorical data, where I want to freeze the pre-trained model for the initial training and then train the full network slightly. I struggled to find the example, as it's not mentioned in the docs but is hidden away in the GitHub repo. The README.md for the parent directory doesn't mention this example either, so it's not very visible. That said, I think it is a super useful example for transfer learning!

@IemProg

IemProg commented Mar 2, 2023

Hi,

I'm trying to fine-tune a subset of parameters of a pre-defined model (only the bias parameters, as done in the BitFit paper), but I'm getting this error:

"One of the differentiated Tensors does not require grad"

I'm looping through all the parameters (named_parameters) and changing the requires_grad value when "bias" is in the parameter's name.

Has anyone faced this issue?

Thanks,
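For anyone hitting the same question, a minimal sketch of a bias-only (BitFit-style) setup along the lines described above; this is generic PyTorch with a hypothetical resnet50 backbone, not a diagnosis of the specific error:

from torch.optim import AdamW
from torchvision import models

model = models.resnet50(pretrained=True)

# train only the bias terms, freeze everything else
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

# pass only the trainable parameters to the optimizer
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)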
