
fix hf model loading : TBD #1449

Draft
yiakwy-xpu-ml-framework-team wants to merge 1 commit into NVIDIA-NeMo:main from yiakwy-xpu-ml-framework-team:fix_hf_model_loading

Conversation

@yiakwy-xpu-ml-framework-team

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 21, 2025

What does this PR do ?


HF model loading is broken.

Reason: the set intersection operation is not unified.
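For context, the save path ends with a check that every expected tensor was actually written. A minimal sketch of that check, where the variable names follow the diff in this PR but the tensor keys are hypothetical:

```python
# Minimal sketch of the final save check; variable names follow the PR
# diff, the tensor keys here are made up for illustration.
all_expected_keys = {"embedding.weight", "layers.0.attn.qkv", "layers.0.mlp.fc1"}
all_saved_keys = {"embedding.weight", "layers.0.attn.qkv"}

# Set difference: tensors that were expected but never written.
unsaved_keys = all_expected_keys - all_saved_keys

if not unsaved_keys:
    print("All expected tensors were written.")
else:
    print(f"Error: {len(unsaved_keys)} tensors were not written: {sorted(unsaved_keys)}")
```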

Test

from megatron.bridge import AutoBridge
from megatron.bridge.training.model_load_save import load_megatron_model, load_tokenizer

from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model info
MODEL="Llama-2-7b-hf"
MODEL_TYPE="llama2-7B"

# create soft links to /workspace/models
MODEL_DIR="/workspace/models"

HF_MODEL_DIR=f"{MODEL_DIR}/{MODEL}"

# Specify model partitions
TP=1
PP=1

SAVER="mcore_bridge"

def export():
    # fp8 recipe
    dtype="fp8"

    # using Megatron Bridge provider API
    bridge = AutoBridge.from_hf_pretrained(f"{HF_MODEL_DIR}", trust_remote_code=True)

    provider = bridge.to_megatron_provider()

    provider.tensor_model_parallel_size = TP
    provider.pipeline_model_parallel_size = PP

    provider.finalize()

    model = provider.provide_distributed_model(wrap_with_ddp=False)

    # output info
    OUTPUT=f"{MODEL_DIR}/{MODEL}-to-{SAVER}-tp{TP}-pp{PP}-{dtype}"

    bridge.save_hf_pretrained(model, f"{OUTPUT}")

    return model
    
model = export()

Snapshot

[Screenshot: 2025-11-21 15:41:42]

Changelog

  • Unify the set operation in the final checkpoint save check

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


# Final check on whether all original tensors were written.
- unsaved_keys = all_expected_keys - all_saved_keys
+ unsaved_keys = all_expected_keys.intersection(all_saved_keys)
Contributor

Could you share the error you were seeing without this change? This change does not look correct: we should be calculating the set difference to determine the unsaved keys, not the intersection.
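The reviewer's distinction can be shown with a toy example (key names hypothetical): difference yields the keys that were never saved, while intersection yields the opposite, the keys that were saved.

```python
expected = {"a", "b", "c"}
saved = {"a", "b"}

# Difference answers "what is still unsaved?"
assert expected - saved == {"c"}

# Intersection answers "what was saved?", the opposite question.
assert expected & saved == {"a", "b"}
```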

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Nov 22, 2025

OK, I see.

Anyway, it seems that Megatron added some empty tensors, and then saving failed.

        if not unsaved_keys:
            ...
        else:
            print(
                f"\nError: {len(unsaved_keys)} tensors from the original checkpoint were not written. See warnings above for details."
            )

Intersection just lets the program pass, but it does not generate a correct distributed checkpoint.
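A sketch of the failure mode hypothesized above, assuming Megatron registers an extra expected key (for an empty tensor) that the saver never writes; the key names are made up:

```python
# Hypothetical: an extra expected key for an empty tensor is never written.
expected = {"layers.0.mlp.fc1", "layers.0.mlp.fc2", "extra_empty_tensor"}
saved = {"layers.0.mlp.fc1", "layers.0.mlp.fc2"}

# With difference, the unwritten key is reported.
assert expected - saved == {"extra_empty_tensor"}

# With intersection, an unwritten key can never appear in the result,
# so the check no longer sees it and the missing tensor goes unreported.
assert "extra_empty_tensor" not in expected.intersection(saved)
```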

#1466


Anyway, the old Megatron code works well.

But Megatron Core and Megatron Bridge cannot generate a distributed checkpoint.


@ananthsub I will take a closer look at this file. Note that the current functions do not work for the old small models used for testing. I will check with gpt-oss later.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Nov 22, 2025

@ananthsub the original code works with gpt-oss.

So the problem only appears in the testing model. Let me check what happened with the testing model.

Full gpt-oss continued-training example:
NVIDIA/Megatron-LM#2383

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team changed the title fix hf model loading : unifying set intersection operations fix hf model loading : TBD Nov 22, 2025
@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as draft November 25, 2025 10:21
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 6, 2026
