
fix hf model loading : TBD #1449

Draft
yiakwy-xpu-ml-framework-team wants to merge 1 commit into NVIDIA-NeMo:main from yiakwy-xpu-ml-framework-team:fix_hf_model_loading

Conversation

@yiakwy-xpu-ml-framework-team

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 21, 2025

What does this PR do ?


HF model loading is broken.

Reason: the set intersection operation is not unified.
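For context, the save path ends with a check that every expected tensor was actually written. A minimal sketch of that check, where the variable names follow the diff in this PR but the tensor keys are hypothetical:

```python
# Minimal sketch of the final save check; variable names follow the PR
# diff, the tensor keys here are made up for illustration.
all_expected_keys = {"embedding.weight", "layers.0.attn.qkv", "layers.0.mlp.fc1"}
all_saved_keys = {"embedding.weight", "layers.0.attn.qkv"}

# Set difference: tensors that were expected but never written.
unsaved_keys = all_expected_keys - all_saved_keys

if not unsaved_keys:
    print("All expected tensors were written.")
else:
    print(f"Error: {len(unsaved_keys)} tensors were not written: {sorted(unsaved_keys)}")
```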

Test

from megatron.bridge import AutoBridge
from megatron.bridge.training.model_load_save import load_megatron_model, load_tokenizer

from megatron.core import parallel_state
from megatron.core.tensor_parallel.random import model_parallel_cuda_manual_seed

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# model info
MODEL="Llama-2-7b-hf"
MODEL_TYPE="llama2-7B"

# create soft links to /workspace/models
MODEL_DIR="/workspace/models"

HF_MODEL_DIR=f"{MODEL_DIR}/{MODEL}"

# Specify model partitions
TP=1
PP=1

SAVER="mcore_bridge"

def export():
    # fp8 recipe
    dtype="fp8"

    # using Megatron Bridge provider API
    bridge = AutoBridge.from_hf_pretrained(f"{HF_MODEL_DIR}", trust_remote_code=True)

    provider = bridge.to_megatron_provider()

    provider.tensor_model_parallel_size = TP
    provider.pipeline_model_parallel_size = PP

    provider.finalize()

    model = provider.provide_distributed_model(wrap_with_ddp=False)

    # output info
    OUTPUT=f"{MODEL_DIR}/{MODEL}-to-{SAVER}-tp{TP}-pp{PP}-{dtype}"

    bridge.save_hf_pretrained(model, f"{OUTPUT}")

    return model
    
model = export()

Snapshot

[Screenshot: 2025-11-21 15:41:42]

Changelog

  • Unify the set operation in the final checkpoint save check

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information

  • Related to # (issue)

@copy-pr-bot

copy-pr-bot bot commented Nov 21, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.


# Final check on whether all original tensors were written.
- unsaved_keys = all_expected_keys - all_saved_keys
+ unsaved_keys = all_expected_keys.intersection(all_saved_keys)
Contributor

Could you share the error you were seeing without this change? This change does not look correct: we should be calculating the set difference to determine the unsaved keys, not the intersection.
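The reviewer's distinction can be shown with a toy example (key names hypothetical): difference yields the keys that were never saved, while intersection yields the opposite, the keys that were saved.

```python
expected = {"a", "b", "c"}
saved = {"a", "b"}

# Difference answers "what is still unsaved?"
assert expected - saved == {"c"}

# Intersection answers "what was saved?", the opposite question.
assert expected & saved == {"a", "b"}
```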

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Nov 22, 2025

OK, I see.

Anyway, it seems that Megatron added some empty tensors, and then saving failed.

        if not unsaved_keys:
            ...
        else:
            print(
                f"\nError: {len(unsaved_keys)} tensors from the original checkpoint were not written. See warnings above for details."
            )

Intersection just lets the program pass, but it does not generate a correct distributed checkpoint.
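A sketch of the failure mode hypothesized above, assuming Megatron registers an extra expected key (for an empty tensor) that the saver never writes; the key names are made up:

```python
# Hypothetical: an extra expected key for an empty tensor is never written.
expected = {"layers.0.mlp.fc1", "layers.0.mlp.fc2", "extra_empty_tensor"}
saved = {"layers.0.mlp.fc1", "layers.0.mlp.fc2"}

# With difference, the unwritten key is reported.
assert expected - saved == {"extra_empty_tensor"}

# With intersection, an unwritten key can never appear in the result,
# so the check no longer sees it and the missing tensor goes unreported.
assert "extra_empty_tensor" not in expected.intersection(saved)
```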

#1466


Anyway, the old Megatron code works well.

But Megatron Core and Megatron Bridge cannot generate a distributed checkpoint.


@ananthsub I will take a closer look at this file. Note that the current functions do not work for the old small models used for testing. I will check with gpt-oss later.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Nov 22, 2025

@ananthsub the original code works with gpt-oss.

So the problem only appears in the testing model. Let me check what happened with the testing model.

Full gpt-oss continued-training example:
NVIDIA/Megatron-LM#2383

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team changed the title fix hf model loading : unifying set intersection operations fix hf model loading : TBD Nov 22, 2025
@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team marked this pull request as draft November 25, 2025 10:21
@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Mar 6, 2026
