
fix: model weight updates with automatic_optimization=False in mixed precision training #20460

Conversation

@iamarunbrahma commented Dec 2, 2024

What does this PR do?

Fixes an issue where model weights were not being properly updated when using automatic_optimization=False with mixed precision training (16-mixed precision). The core issues were:

  1. Improper gradient scaling/unscaling in the AMP plugin
  2. Missing validation of gradient values before optimizer steps
  3. Lack of proper error handling for repeated unscale operations
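For context, a minimal sketch of the affected setup: manual optimization under 16-mixed precision. The model, loss, and class name here are illustrative, not code from this PR.

```python
import torch
import lightning.pytorch as pl

class ManualOptModel(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.automatic_optimization = False  # manual optimization
        self.layer = torch.nn.Linear(10, 1)

    def training_step(self, batch, batch_idx):
        opt = self.optimizers()
        opt.zero_grad()
        loss = self.layer(batch).pow(2).mean()
        self.manual_backward(loss)  # the AMP plugin scales the loss here
        opt.step()                  # gradients should be unscaled before this step

    def configure_optimizers(self):
        return torch.optim.SGD(self.parameters(), lr=0.1)

# trainer = pl.Trainer(precision="16-mixed")
# trainer.fit(ManualOptModel(), train_dataloaders=...)  # dataloader omitted
```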

The changes include:

  • Enhanced AMP plugin implementation with proper gradient scaling workflow
  • Added explicit gradient unscaling before optimizer steps
  • Improved error handling for edge cases
  • Added warning when gradients become NaN/inf during training

Fixes #20215
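To illustrate the scale/unscale/validate/step workflow the bullets above refer to, here is a minimal sketch in plain PyTorch with torch.cuda.amp.GradScaler, not the Lightning plugin code itself; the model, data, and loop are illustrative. With automatic_optimization=False and precision="16-mixed", Lightning routes manual_backward and the optimizer step through roughly this scaler machinery.

```python
import torch

model = torch.nn.Linear(10, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)
# GradScaler degrades to a no-op when disabled, so the sketch also runs on CPU
scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())

for _ in range(3):
    batch = torch.randn(8, 10)
    optimizer.zero_grad()
    loss = model(batch).pow(2).mean()
    scaler.scale(loss).backward()  # backprop the scaled loss
    scaler.unscale_(optimizer)     # unscale once; a second unscale_ raises a RuntimeError
    if any(p.grad is not None and not torch.isfinite(p.grad).all()
           for p in model.parameters()):
        print("warning: non-finite gradients detected")  # the NaN/inf warning described above
    scaler.step(optimizer)  # skips the step when gradients are inf/NaN
    scaler.update()         # adjusts the loss scale for the next iteration
```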

Before submitting
  • Was this discussed/agreed via a GitHub issue? (not for typos and docs)
  • Did you read the contributor guideline, Pull Request section?
  • Did you make sure your PR does only one thing, instead of bundling different changes together?
  • Did you make sure to update the documentation with your changes? (if necessary)
  • Did you write any new necessary tests? (not for typos and docs)
  • Did you verify new and existing tests pass locally with your changes?
  • Did you list all the breaking changes introduced by this pull request?
  • Did you update the CHANGELOG? (not for typos, docs, test updates, or minor internal changes/refactors)

PR review

Anyone in the community is welcome to review the PR.
Before you start reviewing, make sure you have read the review guidelines. In short, see the following checklist:

Reviewer checklist
  • Is this pull request ready for review? (if not, please submit in draft mode)
  • Check that all items from Before submitting are resolved
  • Make sure the title is self-explanatory and the description concisely explains the PR
  • Add labels and milestones (and optionally projects) to the PR so it can be classified

📚 Documentation preview 📚: https://pytorch-lightning--20460.org.readthedocs.build/en/20460/

@github-actions github-actions bot added the pl Generic label for PyTorch Lightning package label Dec 2, 2024
@lantiga (Collaborator) commented Dec 3, 2024

Thank you for the contribution, will review today

@lantiga lantiga added the precision: amp Automatic Mixed Precision label Dec 3, 2024
codecov bot commented Dec 3, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 79%. Comparing base (be608fa) to head (4e6faf5).

❗ There is a different number of reports uploaded between BASE (be608fa) and HEAD (4e6faf5):

HEAD has 542 uploads less than BASE

Flag                BASE (be608fa)   HEAD (4e6faf5)
cpu                            146               24
lightning_fabric                21                0
pytest                          76                0
python3.9                       36                6
lightning                      109               18
python3.11                      36                6
python3.10                      19                3
gpu                              2                0
python3.12                      55                9
pytorch2.1                      27                9
pytest-full                     72               24
pytorch_lightning               18                6
pytorch2.2.2                     9                3
pytorch2.3                       9                3
pytorch2.5.1                    18                6
pytorch2.4.1                     9                3
Additional details and impacted files
@@            Coverage Diff            @@
##           master   #20460     +/-   ##
=========================================
- Coverage      88%      79%     -9%     
=========================================
  Files         267      265      -2     
  Lines       23276    23288     +12     
=========================================
- Hits        20383    18322   -2061     
- Misses       2893     4966   +2073     

@lantiga (Collaborator) commented Dec 4, 2024

@iamarunbrahma looking at the signature for optimizer_step in this PR, are you working off a pre-2.0 version of PyTorch Lightning? optimizer_idx was removed before 2.0.0 in #16539
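For reference, the 2.x hook no longer takes optimizer_idx. A sketch of the current override pattern, assuming the Lightning 2.x API (the class name is illustrative):

```python
import lightning.pytorch as pl

class MyModel(pl.LightningModule):
    # Lightning 2.x signature: no optimizer_idx parameter
    def optimizer_step(self, epoch, batch_idx, optimizer, optimizer_closure):
        # default behavior: run the closure (forward + backward), then step
        optimizer.step(closure=optimizer_closure)
```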

@lantiga (Collaborator) commented Dec 4, 2024

OK, the more I look at this the more puzzled I am. Can you clarify the code in the PR, @iamarunbrahma: how does it fit with the plugin's optimizer_step, and how does it solve the original issue?

@lantiga (Collaborator) commented Dec 4, 2024

BTW the title contradicts the issue: the issue describes something happening with automatic_optimization=True, not False. I'm closing this one.

@lantiga lantiga closed this Dec 4, 2024
@iamarunbrahma iamarunbrahma deleted the automatic_optimization_fix branch December 6, 2024 18:26