This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Safe LayerNorm #14699

Closed
wants to merge 3 commits into from

sxjscience
Member

Description

Change every reduction in the LayerNorm operator to the "safe" version introduced in #14616. In essence, the safe version stores the result of the reduction in a higher-precision dtype. This may avoid the non-convergence problem when training BERT with float16.
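To illustrate why the accumulator dtype matters, here is a minimal sketch using only the Python standard library. It is not the operator's implementation: the helper names (`to_fp16`, `mean_fp16_accumulator`, `mean_wide_accumulator`), the sequential summation order, and the 4096-element row are assumptions chosen for illustration, and a Python float (float64) stands in for the wider accumulator.

```python
import struct


def to_fp16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def mean_fp16_accumulator(values):
    """Mean where the running sum is rounded to float16 after every add."""
    total = 0.0
    for v in values:
        total = to_fp16(total + v)
    return to_fp16(total / len(values))


def mean_wide_accumulator(values):
    """Mean where the running sum stays in a wider type.

    Only the final result is rounded back to float16.
    """
    total = 0.0  # Python float (float64) stands in for the wider accumulator
    for v in values:
        total += v
    return to_fp16(total / len(values))


# Hypothetical row of float16 activations, all equal to 1.0.
row = [to_fp16(1.0)] * 4096

naive = mean_fp16_accumulator(row)
safe = mean_wide_accumulator(row)
print(naive, safe)
```

With the float16 accumulator the running sum stops growing at 2048, because 2049 is not representable in float16 and rounds back to 2048, so the computed mean is 0.5 instead of 1.0. Keeping the sum in a wider type and rounding only the final result gives 1.0.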

Partially solves #14073.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
      • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
      • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
      • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
      • For user-facing API changes, API doc string has been updated.
      • For new C++ functions in header files, their functionalities and arguments are documented.
      • For new examples, README.md is added to explain what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
      • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the best of my knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Always use the safe reduction in LayerNorm

Comments

@Roshrini
Member

@anirudh2290 @apeforest Can you help review this PR?

@Roshrini added the pr-awaiting-review label Apr 16, 2019
Member

@eric-haibin-lin eric-haibin-lin left a comment

Are we relying on any existing unit tests?

@roywei
Member

roywei commented Apr 29, 2019

@sxjscience Could you rebase your PR? The Julia CI failure should be fixed now. Thanks!

Member

@eric-haibin-lin eric-haibin-lin left a comment

We need to make it backward compatible, like #14830.

@sxjscience mentioned this pull request May 20, 2019
@sxjscience closed this May 20, 2019