
Parameter Groups / Transfer Learning #514

Closed
DrClick opened this issue Nov 15, 2019 · 22 comments
Labels
example, feature (Is an improvement or enhancement), help wanted (Open to be worked on), question (Further information is requested), won't fix (This will not be worked on)

Comments

@DrClick

DrClick commented Nov 15, 2019

I am trying to train a pre-trained resnet50 (frozen) with a small network on top of it, and then unfreeze it and train the whole network.

My questions are:
1 - Where is the most appropriate place in the framework to create parameter groups?
2 - Does it make sense to add options to freeze/unfreeze to support selectively freezing groups?

Regarding the current implementation of freeze/unfreeze: it has the side effect of setting the model to eval/train mode, which seems inappropriate. If this is of interest, I am happy to make a pull request.

@williamFalcon
Contributor

Great questions.

  1. I would structure transfer learning like this:

def __init__(self, ...):
    # load the pretrained backbone and freeze it right away
    self.pretrained_model = SomeModel.load_from_...()
    self.pretrained_model.freeze()

    # new layers trained on top of the frozen backbone
    self.finetune_model = ...


def configure_optimizers(self):
    # optimize only the new layers; the backbone stays frozen
    return Adam(self.finetune_model.parameters(), ...)

  2. .eval() disables dropout and switches batchnorm to its running statistics, which you don't get just from disabling grad. This means both are necessary to use a pretrained model as a feature extractor.
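For illustration, a minimal sketch of a freeze helper along those lines (a hypothetical function, not Lightning's actual implementation), showing that both steps are needed:

import torch.nn as nn

def freeze_as_feature_extractor(module: nn.Module) -> None:
    """Hypothetical helper: stop gradients AND switch dropout/batchnorm to inference behaviour."""
    for param in module.parameters():
        param.requires_grad = False  # the optimizer will never update these weights
    module.eval()                    # dropout off, batchnorm uses its running statistics

# usage: freeze_as_feature_extractor(self.pretrained_model)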

@DrClick
Author

DrClick commented Nov 15, 2019

Hey William, I am a bit confused by that response. Let me be clearer about the process.
I have defined my model as:

        self.layer_groups = OrderedDict()
        
        self.resnet = nn.Sequential(
            *list(models.resnet50(pretrained=True).children())[:-2]
        )
        self.layer_groups["resnet"] = self.resnet

        self.classifier_head = nn.Sequential(
            *[
                nn.AdaptiveAvgPool2d(output_size=1),
                nn.AdaptiveMaxPool2d(output_size=1),
                nn.Flatten(),
                nn.BatchNorm1d(
                    2048, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
                ),
                nn.Dropout(p=0.25),
                nn.Linear(2048, 512, bias=True),
                nn.ReLU(),
                nn.BatchNorm1d(
                    512, eps=1e-05, momentum=0.1, affine=True, track_running_stats=True
                ),
                nn.Dropout(p=0.5),
                nn.Linear(512, 2, bias=True),
            ]
        )
        self.layer_groups["classifier_head"] = self.classifier_head

I would like to train this full network with self.resnet frozen for a few epochs. Then unfreeze this network and train the whole model some more.

To accomplish this, I have added the following methods:

def freeze_to(self, n: int, exclude_types=(nn.BatchNorm1d, nn.BatchNorm2d, nn.BatchNorm3d)) -> None:
    """Freeze layer groups up to (but not including) group `n`.

    Look at each group and freeze each parameter, except for the excluded types.
    """
    print("freezing", f"freeze_to called, level requested {n}")

    def set_requires_grad_for_module(module: nn.Module, requires_grad: bool):
        """Set `requires_grad` on every parameter in the module."""
        for param in module.parameters():
            param.requires_grad = requires_grad

    # layer_groups is an OrderedDict; freeze the groups before index n
    for group_key in list(self.layer_groups)[:n]:
        group = self.layer_groups[group_key]
        for layer in group:
            if not isinstance(layer, exclude_types):
                set_requires_grad_for_module(layer, False)

    # and unfreeze the groups from index n onwards
    for group_key in list(self.layer_groups)[n:]:
        group = self.layer_groups[group_key]
        set_requires_grad_for_module(group, True)

def freeze(self) -> None:
    self.freeze_to(len(self.layer_groups))

def unfreeze(self) -> None:
    self.freeze_to(0)
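For context, a possible call sequence for the methods above (hypothetical usage, not from the original post):

# stage 1: freeze the first group ("resnet"), keep "classifier_head" trainable
model.freeze_to(1)
# ... train for a few epochs ...

# stage 2: unfreeze everything and keep training
model.unfreeze()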

It seems to me, in the example you have given, that when you try to train, the model will be stuck in eval mode because of the call to model.eval(). I think you may have the idea that the pre-trained model is a LightningModule that has already been trained. Is this the case? If so, that means pre-trained networks need to be wrapped in a LightningModule, which I think goes against the idea of "just PyTorch."

@LucFrachon

Hi, I too would be interested in a step-by-step 'tutorial' for doing transfer learning with Pytorch-Lightning. Is this something that might be added to the docs?

@Borda added the feature, help wanted, question, and example labels on Jan 24, 2020
@Borda
Member

Borda commented Mar 26, 2020

@LucFrachon would you be interested in creating such a tutorial?
@jeremyjordan @awaelchli pls ^^

@LucFrachon

LucFrachon commented Mar 31, 2020

I'm not sure I am fully qualified for this :-)
I believe such a tutorial should at least cover:

  • Freezing parts of a model for a few epochs
  • Unfreezing and fine-tuning (in line with @DrClick's ideas), possibly with a different learning rate
  • Setting up different learning rates/weight decay factors AND different learning rate schedules for different parameter groups

I can see how to do these things individually, but I struggle to see how to integrate them elegantly in a LightningModule...
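For what it's worth, a rough sketch of how per-group learning rates, weight decay, and a scheduler could sit together in configure_optimizers (hypothetical self.backbone / self.head attributes, not from any official example):

from torch.optim import AdamW
from torch.optim.lr_scheduler import StepLR

def configure_optimizers(self):
    # one parameter group per part of the network, each with its own lr and weight decay
    optimizer = AdamW([
        {"params": self.backbone.parameters(), "lr": 1e-4, "weight_decay": 1e-4},
        {"params": self.head.parameters(), "lr": 1e-3, "weight_decay": 1e-2},
    ])
    # a single scheduler scales every group's lr; per-group schedules can be built with LambdaLR
    scheduler = StepLR(optimizer, step_size=5, gamma=0.1)
    return [optimizer], [scheduler]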

@jbschiratti
Contributor

@Borda @LucFrachon I was working on something very similar these last few days.
What I want is to have a pre-trained feature extractor (say ResNet50) and be able to do the following (part of ongoing work to reproduce the results of a research paper):

  • keep the feature extractor frozen with lr = 1e-2 for a few epochs
  • unfreeze the feature extractor and train with lr = 1e-3 for a few epochs
  • keep training with lr = 1e-4.

I wrote this gist. Here, the network is training on dummy data (just noise). Still, if you think this is relevant, I would be happy to discuss it with you and make it into a pytorch-lightning tutorial.
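This is not the gist itself, but a rough sketch of the schedule described above (hypothetical class and attribute names, assuming a recent Lightning API with the on_train_epoch_start hook):

import torch.nn as nn
import torch.nn.functional as F
from torch.optim import SGD
from torchvision import models
import pytorch_lightning as pl

class ThreeStageFineTuner(pl.LightningModule):
    def __init__(self, num_classes=2, unfreeze_epoch=5, final_lr_epoch=10):
        super().__init__()
        backbone = models.resnet50(pretrained=True)
        self.feature_extractor = nn.Sequential(*list(backbone.children())[:-1])
        # stage 1: backbone weights frozen (note: batchnorm running stats still update in train mode)
        self.feature_extractor.requires_grad_(False)
        self.classifier = nn.Linear(backbone.fc.in_features, num_classes)
        self.unfreeze_epoch = unfreeze_epoch
        self.final_lr_epoch = final_lr_epoch

    def forward(self, x):
        return self.classifier(self.feature_extractor(x).flatten(1))

    def training_step(self, batch, batch_idx):
        x, y = batch
        return F.cross_entropy(self(x), y)

    def configure_optimizers(self):
        # stage 1: only the classifier head is trainable, lr = 1e-2
        return SGD(self.classifier.parameters(), lr=1e-2, momentum=0.9)

    def on_train_epoch_start(self):
        optimizer = self.trainer.optimizers[0]
        if self.current_epoch == self.unfreeze_epoch:
            # stage 2: unfreeze the backbone and drop every group's lr to 1e-3
            self.feature_extractor.requires_grad_(True)
            optimizer.add_param_group({"params": self.feature_extractor.parameters()})
            for group in optimizer.param_groups:
                group["lr"] = 1e-3
        elif self.current_epoch == self.final_lr_epoch:
            # stage 3: keep training with lr = 1e-4
            for group in optimizer.param_groups:
                group["lr"] = 1e-4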

@Borda
Member

Borda commented Apr 8, 2020

@jbschiratti
Contributor

from torchvision.datasets import FakeData

I used this fake dataset just as a proof of concept.

@Borda
Member

Borda commented Apr 8, 2020

Cool, mind sending a PR as a lightning example?
Just thinking about having it run on a real (small) dataset...
cc: @PyTorchLightning/core-contributors @williamFalcon

@jbschiratti
Contributor

Sure! But I think it would be more relevant with some real images (the dataset used in this example, for instance). Don't you think?

@Borda
Member

Borda commented Apr 9, 2020

Well, real-world examples would be nice, but we still need to stay in a kind of minimal mode; we do not want a user to download a couple of GB of data just for an example... BUT your Ants/Bees dataset looks good.

@sairahul

Hi, I created a similar example using fastai and pytorch-lightning. It might be useful for someone: https://github.com/sairahul/mlexperiments/blob/master/pytorch-lightning/fine_tuning_example.py

@jbschiratti
Contributor

@Borda I eventually made a PR with a slightly modified version of the example I proposed.

@lizhitwo

@Borda I eventually made a PR with a slightly modified version of the example I proposed.

Thank you for the tutorial! This is much needed. I have a small question -- is there a reason why the frozen parameters cannot be added to the optimizer from the beginning, and have to be added at specific epochs? AFAIK, if .requires_grad is False, the optimizer will ignore the parameter.

@jbschiratti
Contributor

@lizhitwo Sure! I added the parameters separately to emphasize the idea that these parameters were not trained before a given epoch. AFAIK, what you're proposing would work as well!
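For reference, a minimal sketch of the alternative being discussed (hypothetical self.backbone / self.head attributes): frozen parameters can be handed to the optimizer up front, because torch.optim skips any parameter whose .grad is None.

from torch.optim import Adam

def configure_optimizers(self):
    # hand over all parameters once; the frozen backbone has requires_grad=False,
    # so its .grad stays None and optimizer.step() simply skips those tensors
    return Adam([
        {"params": self.backbone.parameters(), "lr": 1e-4},  # frozen at first
        {"params": self.head.parameters(), "lr": 1e-3},
    ])

# later, unfreezing is just a flag flip; no add_param_group is needed:
# for p in self.backbone.parameters():
#     p.requires_grad = True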

@lizhitwo

@jbschiratti Thanks for the explanation!

@hcjghr

hcjghr commented Apr 27, 2020

@jbschiratti (and maybe someone else), is there a big advantage to either approach? I understand you made it this way to emphasize the approach, but in general, which would be more optimal?

I came across a discussion on the pytorch forum (https://discuss.pytorch.org/t/passing-a-subset-of-the-parameters-to-an-optimizer-equivalent-to-setting-requires-grad-of-subset-only-to-true/42866) where it is suggested that passing all the parameters to the optimizer and then marking the frozen ones with the requires_grad=False flag prevents their gradients from being computed, and subsequently saves some memory. Not sure if this is still relevant as the discussion is a year old...

@jbschiratti
Contributor

@hcjghr The point of having parameter groups (and adding the parameters sequentially as we unfreeze them) is to allow for different learning rates. I updated the example in #1564. If you had a single parameter group in your optimizer, you might not be able to use such training strategies.

@hcjghr

hcjghr commented Apr 28, 2020

@jbschiratti I completely understand it now and I definitely agree this is the best way to have it in the example to show different possibilities. Thanks for the quick explanation.

@stale

stale bot commented Jun 27, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale bot added the won't fix label on Jun 27, 2020
stale bot closed this as completed on Jul 6, 2020
@uwaisiqbal

Hey, I'm trying to do something similar with some categorical data, where I want to freeze the pre-trained model for the initial training and then train the full network slightly. I struggled to find the example, as it's not mentioned in the docs but is hidden away in the GitHub repo. The README.md for the parent directory doesn't mention this example either, so it's not very visible. That said, I think it is a super useful example for transfer learning!

@IemProg

IemProg commented Mar 2, 2023

Hi,

I'm trying to fine-tune a subset of parameters of a pre-defined model (only the bias parameters, as done in the BitFit paper), but I'm getting this error:

"One of the differentiated Tensors does not require grad"

I'm looping through all the parameters (named_parameters) and changing the requires_grad value when "bias" is in the parameter's name.

Has anyone faced this issue?

Thanks,
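For anyone hitting the same question, a minimal sketch of a bias-only (BitFit-style) setup along the lines described above; this is generic PyTorch with a hypothetical resnet50 backbone, not a diagnosis of the specific error:

from torch.optim import AdamW
from torchvision import models

model = models.resnet50(pretrained=True)

# train only the bias terms, freeze everything else
for name, param in model.named_parameters():
    param.requires_grad = "bias" in name

# pass only the trainable parameters to the optimizer
optimizer = AdamW((p for p in model.parameters() if p.requires_grad), lr=1e-3)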
