Skip to content

fix: checkpoint saving with distributed optimizer + overlap param gather#949

Merged
terrykong merged 1 commit intoNVIDIA-NeMo:mainfrom
ananthsub:ananthsub/fix-ckpt-save-dist-opt-overlap-param-gather
Aug 21, 2025
Merged

fix: checkpoint saving with distributed optimizer + overlap param gather#949
terrykong merged 1 commit intoNVIDIA-NeMo:mainfrom
ananthsub:ananthsub/fix-ckpt-save-dist-opt-overlap-param-gather

Conversation

@ananthsub
Copy link
Contributor

@ananthsub ananthsub commented Aug 20, 2025

What does this PR do ?

Fixes checkpoint saving when distributed optimizer + overlap param gather are used together. We need to disable/re-enable the pre-forward hooks around checkpoint saving as done here to force the parameter sync to ensure consistency before checkpoint saving: https://github.com/NVIDIA-NeMo/Megatron-Bridge/blob/742aaf84d9a3c4108b675968cc5b5ed59e4e4921/src/megatron/bridge/training/train.py#L787-L805

Currently state: the loss spikes on resume without this:
Screenshot 2025-08-20 at 1 54 41 AM

With this PR:
Screenshot 2025-08-20 at 2 31 35 AM

Screenshot 2025-08-20 at 1 54 30 AM

Issues

List issues that this PR closes (syntax):

Usage

  • You can potentially add a usage example below
# Add a code snippet demonstrating how to use this

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you run the unit tests and functional tests locally? Visit our Testing Guide for how to run tests
  • Did you add or update any necessary documentation? Visit our Document Development Guide for how to write, build and test the docs.

Additional Information

  • ...

… for checkpoint saving

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub ananthsub requested a review from ashors1 August 20, 2025 09:38
@ananthsub ananthsub added t-mcore CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Aug 20, 2025
Copy link
Contributor

@ashors1 ashors1 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the fix @ananthsub!

@terrykong terrykong added CI:docs Run doctest and removed CI:L2 Run doctests, unit tests, functional tests, and convergence tests labels Aug 20, 2025
@terrykong terrykong enabled auto-merge August 20, 2025 16:27
@terrykong terrykong added this pull request to the merge queue Aug 20, 2025
Merged via the queue into NVIDIA-NeMo:main with commit da69573 Aug 21, 2025
64 of 66 checks passed
@ananthsub ananthsub deleted the ananthsub/fix-ckpt-save-dist-opt-overlap-param-gather branch August 22, 2025 05:57
jveronvialard pushed a commit that referenced this pull request Aug 27, 2025
…her (#949)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Julien Veron Vialard <jveronvialar@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Aug 28, 2025
…her (NVIDIA-NeMo#949)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
soodoshll pushed a commit to soodoshll/RL that referenced this pull request Sep 4, 2025
…her (NVIDIA-NeMo#949)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: Qidong Su <qidongs@nvidia.com>
PrinsYin pushed a commit to PrinsYin/RL that referenced this pull request Nov 30, 2025
…her (NVIDIA-NeMo#949)

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants