fix hf model loading : TBD #1449
yiakwy-xpu-ml-framework-team wants to merge 1 commit into NVIDIA-NeMo:main
Force-pushed from 368ad00 to 92d2892.
  # Final check on whether all original tensors were written.
- unsaved_keys = all_expected_keys - all_saved_keys
+ unsaved_keys = all_expected_keys.intersection(all_saved_keys)
Could you share the error you were seeing without this change? The change does not look correct: we should be calculating the set difference to determine the unsaved keys, not the intersection.
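For reference, a minimal sketch of the distinction raised here. The tensor names and the helper `find_unsaved_keys` are hypothetical and are not taken from the PR; only the variable names `all_expected_keys` and `all_saved_keys` come from the diff. The set difference yields the keys that were expected but never written, while the intersection yields the keys that were written.

```python
def find_unsaved_keys(all_expected_keys, all_saved_keys):
    """Return the keys that were expected but never written (set difference)."""
    return set(all_expected_keys) - set(all_saved_keys)


# Hypothetical tensor names, for illustration only.
all_expected_keys = {"embed.weight", "layer0.weight", "layer0.bias"}
all_saved_keys = {"embed.weight", "layer0.weight"}

# Difference: what is still missing from the checkpoint.
unsaved = find_unsaved_keys(all_expected_keys, all_saved_keys)

# Intersection: what was successfully written, which is a different question.
overlap = all_expected_keys.intersection(all_saved_keys)

print(sorted(unsaved))  # ['layer0.bias']
print(sorted(overlap))  # ['embed.weight', 'layer0.weight']
```

With the intersection, a fully successful save would produce a non-empty set, so a check of the form `if not unsaved_keys` would no longer mean "everything was written".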
OK, I see.
It seems that Megatron added some empty tensors, and then saving failed.
if not unsaved_keys:
    ...
else:
    print(
        f"\nError: {len(unsaved_keys)} tensors from the original checkpoint were not written. See warnings above for details."
    )
Using the intersection just lets the program pass; it does not generate a correct distributed checkpoint.
Anyway, the old Megatron code works well,
but Megatron Core and Megatron Bridge cannot generate a distributed checkpoint.
@ananthsub I will take a closer look at this file. Note that the current functions do not work for the old small models used for testing. I will check with gpt-oss later.
@ananthsub the original code works with gpt-oss.
So the problem only appears with the testing model. Let me check what happened with the testing model.
Full gpt-oss continued-training example:
NVIDIA/Megatron-LM#2383
What does this PR do ?
Add a one line overview of what this PR aims to accomplish.
HF model loading is broken.
Reason: set intersection is not unified
Test
Snapshot
Changelog
GitHub Actions CI
See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.
Before your PR is "Ready for review"
Pre checks:
If you haven't finished some of the above items, you can still open a "Draft" PR.
Additional Information