
[Fix] gptoss yarn parameter#1491

Closed
yiakwy-xpu-ml-framework-team wants to merge 1 commit into NVIDIA-NeMo:main from yiakwy-xpu-ml-framework-team:fix_gptoss_ckpt_generation

Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 24, 2025

What does this PR do ?

Fix the GptOSS YaRN default parameters when passing them to mcore, which does not accept None as a valid value.

Changelog

  • Fix GptOSS YaRN default parameters: mcore cannot read None values
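The shape of the fix can be sketched as a small coalescing helper that replaces None YaRN fields with numeric defaults before the config reaches mcore. The field names yarn_mscale and yarn_mscale_all_dim come from the discussion below; the helper name and the default values of 1.0 are illustrative assumptions, not the actual bridge code:

```python
def coalesce_yarn_params(config):
    """Replace None YaRN fields with numeric defaults so mcore never sees None.

    Hypothetical sketch: the real GptOSS-to-mcore conversion code and its
    default values may differ.
    """
    defaults = {
        "yarn_mscale": 1.0,          # assumed default, for illustration only
        "yarn_mscale_all_dim": 1.0,  # assumed default, for illustration only
    }
    out = dict(config)
    for key, default in defaults.items():
        if out.get(key) is None:
            out[key] = default
    return out
```

Note that the maintainers later state these fields are intentionally None in MCore, so a helper like this may be unnecessary; it only illustrates the approach this PR takes.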

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information


copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yiakwy-xpu-ml-framework-team
Author

@sergiopperez Could you have a look at this?

This is essential for supporting GptOSS training.

"""

use_gloo_process_groups: bool = True
use_gloo_process_groups: bool = False # True NOTE (yiakwy)
Contributor

@yaoyu-33 yaoyu-33 Dec 2, 2025
Why this change? Please attach a note in the comment if the change is necessary.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Dec 3, 2025

By default, Gloo is used to create TCP connections on the CPU side.

But we don't need it here, so it should default to False.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Dec 3, 2025

Will roll back to True

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@maanug-nv
Contributor

Updated version of this PR at #2413.

@maanug-nv maanug-nv closed this Feb 17, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 17, 2026
@maanug-nv
Contributor

Sorry, I misinterpreted some settings in MCore, so #2413 is invalid. But yarn_mscale and yarn_mscale_all_dim are intentionally None (cc @cuichenx), and use_gloo_process_groups should stay True to match the initialize_model_parallel() defaults in MCore. So I don't think any of the changes in this PR are desired.
