
Implementation of ATKD #9

Merged
merged 2 commits into master from atkd on Jun 7, 2020
Conversation

akshaykulkarni07
Member

@akshaykulkarni07 akshaykulkarni07 commented May 31, 2020

Adding an implementation of Attention Transfer KD (ATKD), from an ICLR '17 paper. Some of the source code takes inspiration from their official implementation.

Please go through the code (in case there are any obvious mistakes).

I ran one experiment with ResNet10 on full data, and the result is 92% validation accuracy (comparable to the 92.2% of simultaneous KD with the same settings).

@akshaykulkarni07 akshaykulkarni07 added the enhancement New feature or request label May 31, 2020
@akshaykulkarni07 akshaykulkarni07 self-assigned this May 31, 2020
@akshaykulkarni07 akshaykulkarni07 marked this pull request as ready for review May 31, 2020 07:00
@akshaykulkarni07
Member Author

Loss Function of ATKD
The authors mention that they use beta = 1000 / (batch size * number of elements in the attention map), which comes out to around 0.1. They also say that they decay this beta parameter when using it along with KD.
Now, we have 2 choices:

  1. Use beta = 1, since our implementation doesn't rely on such weighting.
  2. Use beta as in the paper. However, they don't mention how they actually decay it, and their code is not exactly readable (but if you can go through it and find out, that would be good).

Please advise on which option to take; as far as implementation is concerned, both are equally straightforward. A rough sketch of the loss is below.
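
As a reference point only, here is a minimal sketch of an attention-transfer loss plus the paper's beta formula in PyTorch. This is not the code in this PR; the function names (attention_map, at_loss, paper_beta), the mean-over-channels attention definition, and the example tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def attention_map(feat):
    # One common formulation: mean of squared activations over channels,
    # flattened and L2-normalized per sample.
    return F.normalize(feat.pow(2).mean(dim=1).flatten(start_dim=1), dim=1)

def at_loss(feat_s, feat_t):
    # L2 distance between student and teacher attention maps.
    return (attention_map(feat_s) - attention_map(feat_t)).pow(2).mean()

def paper_beta(feat):
    # Option 2 above: beta = 1000 / (batch size * number of elements in the attention map).
    batch_size = feat.size(0)
    num_elements = feat.size(2) * feat.size(3)  # H * W of the attention map
    return 1000.0 / (batch_size * num_elements)

# Hypothetical student/teacher feature maps of shape (N, C, H, W).
feat_s = torch.randn(128, 64, 8, 8)
feat_t = torch.randn(128, 256, 8, 8)
loss = paper_beta(feat_s) * at_loss(feat_s, feat_t)  # option 1 would use beta = 1
```

With a batch size of 128 and an 8x8 attention map, paper_beta gives 1000 / (128 * 64) ≈ 0.12, consistent with the "around 0.1" figure mentioned above.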

Member

@navidpanchi navidpanchi left a comment


Everything looks good to me.

Member

@SharathRaparthy SharathRaparthy left a comment


Not completely aware of this method, but approving.

@SharathRaparthy
Member

Loss Function of ATKD
The authors mention that they use beta = 1000 / (batch size * number of elements in the attention map), which comes out to around 0.1. They also say that they decay this beta parameter when using it along with KD.
Now, we have 2 choices:

  1. Use beta = 1, since our implementation doesn't rely on such weighting.
  2. Use beta as in the paper. However, they don't mention how they actually decay it, and their code is not exactly readable (but if you can go through it and find out, that would be good).

Please advise on which option to take; as far as implementation is concerned, both are equally straightforward.

Implement option 1 for now and let's see how the experiments go. In the meantime, you can look into the decay. Are they using any schedulers?

@akshaykulkarni07
Member Author

akshaykulkarni07 commented Jun 7, 2020

@navidpanchi @SharathRaparthy
Some more information about their experiments:

  1. They use SGD with weight decay of 1e-4 (i.e. L2 regularization); we use Adam without weight decay.
  2. They use step learning rate scheduling: multiply the LR by 0.1 at epochs 30, 60 and 90 (we don't use any LR scheduling in the image classification experiments). A sketch of such a schedule is below.
  3. There is no information about decaying the beta parameter. They mention in their README that they plan to add the code for it, but the last commit was in July 2018, and there is an open issue about this with no reply.
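
For context, a minimal sketch of what that optimizer and step schedule look like in PyTorch. Only the 1e-4 weight decay and the 0.1x LR steps at epochs 30/60/90 come from the points above; the placeholder model, base LR of 0.1, and momentum of 0.9 are assumed values for illustration.

```python
import torch

# Placeholder model; the optimizer/scheduler setup is the point here.
model = torch.nn.Linear(512, 10)

# SGD with weight decay 1e-4 (L2 regularization), as in point 1 above.
# lr=0.1 and momentum=0.9 are assumed, not taken from this thread.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                            momentum=0.9, weight_decay=1e-4)

# Point 2 above: multiply the LR by 0.1 at epochs 30, 60 and 90.
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[30, 60, 90], gamma=0.1)

for epoch in range(100):
    # ... run one training epoch here ...
    scheduler.step()  # advance the LR schedule once per epoch
```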

@akshaykulkarni07 akshaykulkarni07 merged commit 0b7e95d into master Jun 7, 2020
@akshaykulkarni07 akshaykulkarni07 deleted the atkd branch June 7, 2020 18:51