
resnet26 is worse than resnet20? #1

Closed
ZHAIXINGZHAIYUE opened this issue Dec 20, 2019 · 6 comments
Labels: question (Further information is requested)

Comments

@ZHAIXINGZHAIYUE
Using Imagenette: resnet26 is worse than resnet20?

@akshaykulkarni07 (Member)

Which particular experiment are you talking about?

@ZHAIXINGZHAIYUE (Author)

Results using Imagenette:

| Student Model | Validation Accuracy without Teacher (%) | Validation Accuracy with Simultaneous Training (%) | Validation Accuracy with Stagewise Training (%) | Difference between Teacher and Student (for stagewise) (%) |
| --- | --- | --- | --- | --- |
| ResNet10 | 91.8 | 92.2 | 97.4 | 1.8 |
| ResNet14 | 91.2 | 93.2 | 98.8 | 0.4 |
| ResNet18 | 91.4 | 92.4 | 98.8 | 0.4 |
| ResNet20 | 91.6 | 92.4 | 98.8 | 0.4 |
| ResNet26 | 90.6 | 91.8 | 99.0 | 0.2 |

@akshaykulkarni07 (Member) commented Dec 20, 2019

We trained for 100 epochs using the Adam optimizer with LR 1e-4. In that setting, we find ResNet26 marginally worse than ResNet20. Possible reasons:

  • ResNet26 has more parameters, so it tends to overfit when trained with the same amount of data as ResNet20. A similar argument could be made for ResNet20 vs. ResNet18/14, but we don't observe the same drop there. This may be because fewer parameters also imply less learning capacity, so ResNet18/20 may sit in a sort of 'sweet spot' between too few and too many parameters.
  • The first point becomes more apparent when we see that the stagewise training results improve with the number of parameters. In stagewise training there are fewer parameters to optimize in each stage, which improves the overall training because all parameters are not trained together (see the sketch after this list).
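To make the stagewise point concrete, here is a minimal PyTorch sketch of training one stage at a time. The toy two-stage model, the `train_stage` helper, and the MSE feature-matching loss against teacher activations are illustrative assumptions, not this repo's exact code:

```python
import torch
import torch.nn as nn

# Toy two-stage student; stands in for the ResNet blocks
# (shapes and structure are illustrative, not the repo's classes).
student = nn.Sequential(
    nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
    nn.Sequential(nn.Conv2d(16, 16, 3, padding=1), nn.BatchNorm2d(16), nn.ReLU()),
)

def train_stage(model, stage_idx, loader, epochs=100):
    """Optimize only one stage's parameters; everything else stays frozen."""
    for i, stage in enumerate(model):
        for p in stage.parameters():
            p.requires_grad = (i == stage_idx)
    # Only the current stage's parameters are handed to the optimizer,
    # matching the setup described above (Adam, LR 1e-4, 100 epochs).
    params = [p for p in model.parameters() if p.requires_grad]
    optimizer = torch.optim.Adam(params, lr=1e-4)
    for _ in range(epochs):
        for x, teacher_feat in loader:  # teacher_feat: matching teacher activations
            optimizer.zero_grad()
            feat = x
            for stage in model[: stage_idx + 1]:  # forward up to the current stage
                feat = stage(feat)
            loss = nn.functional.mse_loss(feat, teacher_feat)
            loss.backward()
            optimizer.step()
```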

akshaykulkarni07 added the question label Dec 20, 2019
@ZHAIXINGZHAIYUE (Author)

@akshaykvnit Thank you very much.

@ZHAIXINGZHAIYUE (Author)

@akshaykvnit Should I fix the running mean and variance in the BN layers of the first stage when I train the second stage?

@akshaykulkarni07 (Member)

Yes, ideally we should freeze all parameters that are not in the stage currently being trained, including the BN parameters. According to this answer, setting requires_grad = False will do the job (as we have done).
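A minimal sketch of that freezing, assuming PyTorch `nn.Module` stages (the `freeze_stage` helper is illustrative). One caveat worth noting: `requires_grad = False` freezes BN's learnable affine parameters, but the running mean/variance are buffers that are still updated during forward passes in train mode, so calling `.eval()` on the frozen modules is also needed if those statistics should stay fixed:

```python
import torch.nn as nn

def freeze_stage(stage: nn.Module) -> None:
    """Freeze every parameter of a stage, including BN weight/bias."""
    for p in stage.parameters():
        p.requires_grad = False
    # requires_grad=False stops gradient updates, but BN running
    # mean/var are buffers refreshed in train mode; eval() fixes them.
    stage.eval()

# Illustrative usage: freeze stage 1 while stage 2 trains.
first_stage = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.BatchNorm2d(16))
freeze_stage(first_stage)
```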
