
Allow custom checkpoint impl via checkpoint manager interface#1955

Merged
yaoyu-33 merged 7 commits into NVIDIA-NeMo:main from ananthsub:ckpt-interface
Mar 25, 2026
Conversation


@ananthsub ananthsub commented Jan 15, 2026

What does this PR do ?

Currently, checkpointing save/load is implemented purely through functional calls, making it impossible for users to customize without directly modifying or forking the code.

This PR introduces a CheckpointManager protocol to allow users to inject custom save/load implementations without needing to fork the code.

Protocol overview:

  • CheckpointManager: the main protocol through which save/load is implemented. Finalizing async calls is also routed through this protocol now, for cleanup
  • Auxiliary dataclasses to encapsulate save/load argument inputs

CheckpointConfig is augmented with a field to specify the path to the custom manager class. If specified, the class is imported and used during training. A factory function is added to instantiate the manager to use for training. A default implementation is provided, which wraps around the existing functions. The default implementation also stores the checkpoint context map, used for local checkpointing. There are no changes made to these functions in this PR, so the core checkpoint saving/loading logic is untouched.
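A rough sketch of what such a protocol and default wrapper could look like (illustrative Python; the method signatures, dataclass fields, and bodies are assumptions, not the exact merged API):

```python
# Minimal sketch of the CheckpointManager protocol described above.
# Signatures and fields are illustrative stand-ins for the real PR API.
from dataclasses import dataclass
from typing import Any, Protocol, runtime_checkable


@dataclass
class CheckpointSaveContext:
    """Auxiliary dataclass bundling save arguments (illustrative fields)."""
    state: Any = None
    model: Any = None
    optimizer: Any = None


@runtime_checkable
class CheckpointManager(Protocol):
    """Main protocol: save/load, plus async finalization for cleanup."""

    def save(self, ctx: CheckpointSaveContext) -> None: ...

    def load(self, path: str) -> Any: ...

    def finalize_async_saves(self, terminate: bool = False) -> None: ...


class DefaultCheckpointManager:
    """Default implementation that would wrap the existing functional calls."""

    def __init__(self, config: Any) -> None:
        self.config = config
        # context map used for local checkpointing, per the PR description
        self.checkpoint_context: dict = {}

    def save(self, ctx: CheckpointSaveContext) -> None:
        pass  # would delegate to the existing save function

    def load(self, path: str) -> Any:
        return None  # would delegate to the existing load function

    def finalize_async_saves(self, terminate: bool = False) -> None:
        pass  # would flush/terminate the async save worker
```

Because the protocol is `runtime_checkable`, the factory can verify a user-supplied class with a simple `isinstance` check before using it.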

Integration points:

  • setup.py is updated to create the checkpoint manager and use it for the initial checkpoint load
  • in train.py in the main training loop, the checkpoint manager is used for all saves + async finalization
  • in pretrain.py, the checkpoint manager is used for the end of train checkpoint save

Changelog

  • Enable custom checkpoint implementations via CheckpointManager interface

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items you can still open "Draft" PR.

Additional Information

  • Related to # (issue)

Summary by CodeRabbit

  • New Features

    • Custom checkpoint manager configuration - users can now specify a custom checkpoint manager class through configuration to customize checkpoint save/load behavior.
  • Documentation

    • Comprehensive documentation added for creating and using custom checkpoint managers, including configuration guidance, usage examples, and implementation patterns.


copy-pr-bot bot commented Jan 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.


@ananthsub

/ok to test 7d3064c

@ananthsub

/ok to test 0425c2a


@maanug-nv maanug-nv left a comment


I think this will work great for pretraining, but I have some questions outside of that use case.

  1. I think several functions in model_load_save.py would fail if using a custom checkpoint. Probably we don't need to support HF <-> custom format, but what about loading for inference? I think we would either have to make all the functions that model_load_save.py imports from checkpointing part of the protocol (load model weights, get run config), or just document/warn that it is undefined behavior.

  2. if a user wants to do PEFT on a custom checkpoint, seems like loading would fail because we call _load_checkpoint_from_path() in the peft_pre_wrap_hook(). How would I replace that hook after setup? With a callback? Generally less familiar with PEFT, so maybe there's other concerns.

```diff
@@ -247,14 +247,13 @@ def modelopt_pre_wrap_hook(model):
```

What if the user's custom logic does not save the same train_state.pt, or changes its name, location, etc.? Then we never load, right?


similar q's -

  1. I suspect cfg.checkpoint.load is expected to be the same as the save prop? i.e. the parent dir, not a specific checkpoint version? Is there any defaulting logic for it to be the same as save, or do users have to explicitly set it?

  2. Is it possible to make this condition customizable, so that the checkpoint manager can determine if it can load or not, and then how to load? E.g. add an API to CheckpointManager like should_load(...) -> bool or something. That way it can also take into account any custom configuration, plus specific logic it wants to run.

    • There are 2 conditions I'm thinking of for loading a checkpoint - whether the user has configured loading (usually yes to auto handle recovering after an interruption, for example) and whether there is an available checkpoint to load from. Both have to be true to successfully load, otherwise either train from scratch or fail if the user wants to force loading from a checkpoint.
  3. Where is the downstream logic that will use this? How does it know whether to train from scratch or from a pre-loaded checkpoint?
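The `should_load(...)` API suggested in point 2 might look roughly like this (a hypothetical helper; the directory scan is a stand-in for however the manager would actually detect an available checkpoint):

```python
# Hedged sketch of the suggested should_load(...) idea: resume only when
# loading is configured AND a checkpoint is available; optionally fail hard
# instead of silently training from scratch.
import os


def should_load(load_dir, force_load=False):
    """Return True only if loading is configured and a checkpoint exists."""
    configured = load_dir is not None
    # Stand-in availability check: any entry under the load directory.
    available = configured and os.path.isdir(load_dir) and bool(os.listdir(load_dir))
    if force_load and not available:
        raise FileNotFoundError(f"no checkpoint available under {load_dir!r}")
    return available
```

This covers both conditions from the comment: "user configured loading" and "a checkpoint actually exists", with `force_load` expressing the fail-instead-of-scratch case.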

@ananthsub

> I think several functions in model_load_save.py would fail if using a custom checkpoint. Probably we don't need to support HF <-> custom format, but what about loading for inference? I think we would either have to make all the functions that model_load_save.py imports from checkpointing part of the protocol (load model weights, get run config), or just document/warn that it is undefined behavior.

> If a user wants to do PEFT on a custom checkpoint, seems like loading would fail because we call _load_checkpoint_from_path() in the peft_pre_wrap_hook(). How would I replace that hook after setup? With a callback? Generally less familiar with PEFT, so maybe there's other concerns.

I think supporting a custom checkpoint format is beyond the initial scope at the moment. This customization is intended for users to be able to handle the save/load operations, but we should expect to have the same checkpoint format on disk. Otherwise there are too many dependencies both within bridge and outside that would then need to access the checkpoint protocol to figure out how to load the checkpoint, which will be very complicated for users to deal with.

I've updated the docs with a limitations section to make this clear

@ananthsub

/ok to test b3d9164

yaoyu-33 previously approved these changes Feb 5, 2026

```python
barrier_and_log("after training is done")
ckpt_config = config.checkpoint
if ckpt_config.save and state.train_state.step != 0 and ckpt_config.save_interval != 0:
```

Hey @ananthsub , question: will this checkpoint every step so long as these properties are truthy? I'm wondering where the evaluation of something like `if curr_step % save_interval == 0: do save` is happening, to know when to call the save.

Alternatively, the step info can be passed in to checkpoint_manager.save (I think it is in state.training_state), and then the manager can make the decision. That would be ideal for us actually, but doesn't look like the default impl does that check, so I presume it is happening elsewhere, unless I missed it.

(I'm still reviewing to see if I have any other q's, but wanted to post this q in the meantime to avoid a long delay)
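For reference, the interval condition being asked about boils down to something like this tiny hypothetical helper, regardless of whether it ends up living in the training loop or inside the manager:

```python
# Illustrative save-interval check; names are assumptions, not the real API.
def is_save_step(step: int, save_interval: int) -> bool:
    """True when a checkpoint should be written at this step."""
    return save_interval > 0 and step > 0 and step % save_interval == 0
```

If the step is passed into `checkpoint_manager.save` via the training state, the manager itself could apply this check, which is the alternative the comment proposes.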


@g-husam g-husam left a comment


Thanks again!

````markdown
)
```

```{note}
````

Btw unrelated, but when reading the rendered doc I noticed this seems to throw off the formatting (I think the backticks are not closed off at the end)

Could also use

> [!NOTE]
> The `load` parameter should ...

```python
except AttributeError as e:
    raise AttributeError(f"Module '{module_path}' does not have class '{class_name}': {e}") from e

manager = custom_manager_class(checkpoint_config)
```

We would need to provide some additional config props to our custom checkpoint managers, which we can do by creating a subtype of CheckpointConfig and using that in the recipe rather than CheckpointConfig itself. That looks like it should work fine, but it would be great to get some confirmation/guarantee that subclassing will be supported, perhaps with documentation and a test case.


@g-husam yes, that is the intended way to customize this. I will update the documentation to make that clearer
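That subclassing pattern could look like this (a sketch only; the base-class fields shown here are illustrative stand-ins for the real CheckpointConfig dataclass):

```python
# Hedged sketch of extending CheckpointConfig with custom props.
# All field names below are assumptions for illustration.
from dataclasses import dataclass
from typing import Optional


@dataclass
class CheckpointConfig:
    save: Optional[str] = None
    load: Optional[str] = None
    save_interval: int = 0
    checkpoint_manager_cls: Optional[str] = None  # dotted path to custom manager


@dataclass
class MyCheckpointConfig(CheckpointConfig):
    """Custom subtype carrying extra props for a custom checkpoint manager."""
    upload_bucket: str = "s3://my-bucket/ckpts"  # hypothetical extra prop
    keep_last_n: int = 3                         # hypothetical extra prop
```

The recipe would then construct `MyCheckpointConfig` instead of `CheckpointConfig`, and the custom manager would receive the subtype through the factory.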



g-husam commented Feb 26, 2026

👋 Hey @ananthsub , just checking in on my prior questions and to see when this could be merged. Thanks!


g-husam commented Mar 11, 2026

> 👋 Hey @ananthsub , just checking in on my prior questions and to see when this could be merged. Thanks!

Hi @ananthsub, just checking in again on this. Thanks!

@yaoyu-33

/ok to test 61c17d7

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@yaoyu-33

/ok to test 7695c03

@yaoyu-33

/claude review

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

/ok to test 3e86223

Comment on lines 160 to 171
```python
                CheckpointSaveContext(
                    state=state,
                    model=model,
                    optimizer=optimizer,
                    opt_param_scheduler=scheduler,
                    num_floating_point_operations_so_far=int(state.train_state.floating_point_operations_so_far),
                    train_data_iterator=train_data_iterator,
                )
            )

    else:
        print_rank_0("skipping training ...")
```

Bug: train() already calls checkpoint_manager.finalize_async_saves(terminate=True) before returning (train.py L626), which shuts down the async save worker. This subsequent save() may fail or behave unexpectedly if async checkpointing is enabled.

Also, if the final training step happens to align with save_interval, this duplicates the checkpoint that train() already saved.

```python
mock_init_ctx.return_value = {"context": "data"}
manager = DefaultCheckpointManager(config)
manager.save(ctx)
```


Missing test: The TypeError branch in create_checkpoint_manager (when a custom class doesn't implement the CheckpointManager protocol) has no test coverage. Consider adding a test with a class that's missing one of the required methods (e.g., save) to verify the error is raised.
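Such a test could look roughly like this (the factory below is a simplified stand-in for the real create_checkpoint_manager protocol check, and the incomplete class is hypothetical):

```python
# Hypothetical coverage for the TypeError branch: a class that omits save()
# should be rejected by the factory's protocol check.
def create_checkpoint_manager(manager_cls):
    """Simplified stand-in: reject classes missing CheckpointManager methods."""
    required = ("save", "load", "finalize_async_saves")
    missing = [name for name in required if not callable(getattr(manager_cls, name, None))]
    if missing:
        raise TypeError(
            f"{manager_cls.__name__} does not implement CheckpointManager: missing {missing}"
        )
    return manager_cls()


class IncompleteManager:
    """Implements everything except save(), to exercise the error branch."""

    def load(self, path):
        return None

    def finalize_async_saves(self, terminate=False):
        pass
```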

- Add exception chaining (from err) to ValueError in create_checkpoint_manager
- Rename unused ctx -> _ctx in CustomTestManager test stub
- Add pytestmark = pytest.mark.unit to test_train.py
- Remove redundant final checkpoint save in _pretrain: train() already saves
  the final checkpoint and terminates the async worker before returning; the
  duplicate save in pretrain.py was both redundant and unsafe for async ckpt

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

/ok to test 26e6b08

…checkpoint_manager context

Signed-off-by: yaoyu-33 <yaoyu.094@gmail.com>
@yaoyu-33

/ok to test da25dec


Labels

  • needs-more-tests — Requires additional L0 and L1 test coverage before merge
  • ready-to-merge — PR is approved, current, and only waiting for CI to pass before merge
  • x-googlevertex


5 participants