
[checkpoint] Expose public API to load model weights #2239

Open
ananthsub wants to merge 9 commits into NVIDIA-NeMo:main from ananthsub:ckpt-utils-consolidate


@ananthsub (Contributor) commented Feb 5, 2026

What does this PR do ?

Expose a public load_model_weights function that loads only the model state from a checkpoint. This avoids going through the checkpoint config and the interactions between the load and pretrained_checkpoint directories, and removes the requirement to use ckpt_set to load from a specific step.


Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for FSDP DTensor checkpoint loading format with enhanced state dict conversion capabilities.
    • Improved checkpoint path resolution to intelligently detect and select the correct iteration directory when multiple versions exist.
  • Bug Fixes

    • Enhanced checkpoint discovery logic to reliably locate configuration files across various checkpoint directory structures.
    • Improved error handling for missing checkpoint paths and malformed iteration directories.
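The iteration-directory resolution mentioned in the summary can be illustrated with a small sketch. This is not the code from this PR: the helper name `resolve_iteration_dir` is hypothetical, and the `iter_<step>` directory naming is an assumption based on the usual Megatron-style checkpoint layout. It shows one way to pick the highest step when several iteration directories exist, skip malformed directory names, and fail clearly on a missing path.

```python
import re
from pathlib import Path

# Assumed naming convention: Megatron-style "iter_0000100" directories.
_ITER_DIR = re.compile(r"^iter_(\d+)$")


def resolve_iteration_dir(checkpoint_path):
    """Return the directory that holds the weights to load.

    Hypothetical helper for illustration; not the function added in this PR.
    """
    path = Path(checkpoint_path)
    if not path.is_dir():
        raise FileNotFoundError(f"Checkpoint path does not exist: {path}")

    # The caller already pointed at a specific iteration directory.
    if _ITER_DIR.match(path.name):
        return path

    # Collect (step, directory) pairs, ignoring names that do not parse.
    candidates = []
    for child in path.iterdir():
        match = _ITER_DIR.match(child.name)
        if child.is_dir() and match:
            candidates.append((int(match.group(1)), child))

    if not candidates:
        raise FileNotFoundError(f"No iter_* directories found under {path}")

    # Select the directory with the highest step number.
    return max(candidates, key=lambda pair: pair[0])[1]
```

Passing a top-level checkpoint directory returns its latest iteration directory, while passing an iteration directory directly returns it unchanged.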

copy-pr-bot bot commented Feb 5, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@ananthsub
Contributor Author

/ok to test d9bab4f

@ananthsub
Contributor Author

/ok to test d96819a

@ananthsub requested a review from yaoyu-33 February 5, 2026 17:38
@ananthsub marked this pull request as ready for review February 23, 2026 15:32
Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub
Contributor Author

/ok to test 9c0c19c

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub
Contributor Author

/ok to test f3b360d

Signed-off-by: Ananth Subramaniam <ansubramania@nvidia.com>
@ananthsub
Contributor Author

/ok to test 9c54439
