4 changes: 2 additions & 2 deletions src/megatron/bridge/models/hf_pretrained/state.py
@@ -795,9 +795,9 @@ def save_generator(
        )

    # Final check on whether all original tensors were written.
-   unsaved_keys = all_expected_keys - all_saved_keys
+   unsaved_keys = all_expected_keys.intersection(all_saved_keys)
Contributor
Could you share the error you were seeing without this change? This change does not look correct: we should be calculating the set difference to determine the unsaved keys, not the intersection.

Author
@yiakwy-xpu-ml-framework-team Nov 22, 2025

OK, I see.

Anyway, it seems that Megatron added some empty tensors, and then saving failed:

        if not unsaved_keys:
            ...
        else:
            print(
                f"\nError: {len(unsaved_keys)} tensors from the original checkpoint were not written. See warnings above for details."
            )

Using intersection just lets the program pass, but it does not generate a correct distributed checkpoint.
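For what it's worth, here is a minimal sketch (with hypothetical key names, not the real checkpoint keys) of why set difference, not intersection, identifies the unsaved keys:

```python
# Hypothetical stand-ins for the real checkpoint key sets.
all_expected_keys = {"embed.weight", "layer0.weight", "layer0.bias"}
all_saved_keys = {"embed.weight", "layer0.weight"}  # one tensor was never written

# Set difference: keys that were expected but never saved -- what the check needs.
unsaved_keys = all_expected_keys - all_saved_keys
print(unsaved_keys)  # {'layer0.bias'}

# Intersection: keys that WERE saved. Non-empty whenever anything was written,
# so the error branch would fire even on a fully successful save, and an
# incomplete save with some overlap would not be reported as missing keys.
saved_overlap = all_expected_keys.intersection(all_saved_keys)
print(sorted(saved_overlap))  # ['embed.weight', 'layer0.weight']
```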

#1466


Anyway, the old Megatron code works well, but Megatron Core and Megatron Bridge cannot generate a distributed checkpoint.


@ananthsub I will take a closer look at this file. Note that the current functions do not work for the old small models used for testing purposes. I will check with gpt-oss later.

Author
@yiakwy-xpu-ml-framework-team Nov 22, 2025

@ananthsub The original code works with gpt-oss.

So the problem only appears in the testing model. Let me check what happened with the testing model.

Full GptOSS continue training example:
NVIDIA/Megatron-LM#2383

        if not unsaved_keys:
-           extra_keys = all_yielded_keys - all_expected_keys
+           extra_keys = all_yielded_keys.intersection(all_expected_keys)
            if extra_keys:
                print(
                    f"\nSuccess: All tensors from the original checkpoint were written. "