Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

by Antti Tarvainen (Curious AI), Harri Valpola (Curious AI)

ArXiv:1703, NIPS 2017

This paper proposes a new semi-supervised learning model that builds on Student-Teacher models and uses an exponential moving average (EMA) of the student model weights as the teacher model, hence the name "Mean Teacher."

Importantly, the accompanying blog post and code (plus tips for choosing hyperparameters and other tuning) are awesome. They explain recent advances in semi-supervised classification methods very well. I really like this kind of well-organized, well-surveyed, well-explained material. It makes life much, much easier.

Approach

The detailed approach is organized very nicely in the blog post. Below is my attempt to summarize it more compactly:

Entropy Minimization (2004) : pull the unlabeled sample predictions toward the nearest class, which amounts to just increasing the prediction confidence.
Student-Teacher models : either make the student's task harder or the teacher's task easier, so that the student learns something useful
- Harder student task
  - $\Gamma$ version of Ladder Networks (2015) : make a perturbed input and train it to mimic the clean prediction
  - Virtual Adversarial Training (2015, 2017) : use an adversarially perturbed sample as the perturbed input
- Easier teacher task
  - Pseudo-Ensemble Agreement (2014) : ensemble over noises, use an ensemble of two perturbed predictions as the clean prediction
  - $\Pi$ Model (2017) : same as above
  - CT-GAN (2017) : same as above
  - Temporal Ensembling (2017) : in addition, ensemble over models: take an exponential moving average of the student model predictions and use it as the teacher model prediction (which then pulls the student model again)
  - Mean Teacher (2017) : exponential moving average on the student model weights instead of on predictions. This allows online learning updates, better memory usage, and better performance.
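
To make the last bullet concrete, here is a minimal sketch of the weight-averaging step in plain Python. It assumes the teacher update has the usual EMA form, teacher = alpha * teacher + (1 - alpha) * student, and that the consistency cost is a mean squared difference between student and teacher class probabilities. The names `ema_update`, `consistency_cost`, and the toy one-weight "models" are hypothetical and only for illustration; this is not the authors' code.

```python
# Toy illustration of the Mean Teacher idea (hypothetical helper names).

def ema_update(teacher, student, alpha=0.99):
    """Return new teacher weights: alpha * teacher + (1 - alpha) * student."""
    return {
        name: alpha * teacher[name] + (1.0 - alpha) * student[name]
        for name in teacher
    }


def consistency_cost(student_probs, teacher_probs):
    """Mean squared difference between two class-probability vectors."""
    diffs = [(s - t) ** 2 for s, t in zip(student_probs, teacher_probs)]
    return sum(diffs) / len(diffs)


# Toy "models" with a single scalar weight each.
student = {"w": 1.0}
teacher = {"w": 0.0}

for step in range(3):
    # In real training the student would take a gradient step here,
    # using the classification cost plus the consistency cost.
    teacher = ema_update(teacher, student, alpha=0.5)

print(teacher["w"])  # 0.875
```

Because the teacher is just a running average of the weights, it can be updated after every training step without storing per-sample predictions, which is where the online-learning and memory advantages over averaging predictions come from.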

Experiments

Settings:

CIFAR-10 with { 1000, 2000, 4000, 50000 } labels (out of 50000)
SVHN with { 250, 500, 1000, 73275 } labels (out of 73275)
- 500 labels with { 100000, 500000 } extra unlabeled images are also tested
ImageNet 2012 with 128000 labels (10%, out of 1280000)

My Thoughts

Self-ensembling techniques seem to show surprisingly promising results... But why??
Can this averaging of model weights (Polyak averaging) also help in applications other than the semi-supervised classification setting? Need to do some experiments.
Student-Teacher models are very prominent these days. It seems like they somehow regularize(?) models well and boost generalization performance. We should all think about whether they will also help in our own problems of interest.

Jan. 10, 2018 Note by Myungsub
