
[Fix] gptoss yarn parameter#1491

Closed
yiakwy-xpu-ml-framework-team wants to merge 1 commit into NVIDIA-NeMo:main from yiakwy-xpu-ml-framework-team:fix_gptoss_ckpt_generation

Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 24, 2025

What does this PR do ?

Fix the GptOSS YaRN default parameters when passing them to mcore, which does not accept None as a valid value.

Changelog

  • Fix GptOSS YaRN default parameters: mcore cannot read None values
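The shape of the fix can be sketched as a small coalescing helper that replaces None YaRN fields with numeric defaults before the config reaches mcore. The field names yarn_mscale and yarn_mscale_all_dim come from the discussion below; the helper name and the default values of 1.0 are illustrative assumptions, not the actual bridge code:

```python
def coalesce_yarn_params(config):
    """Replace None YaRN fields with numeric defaults so mcore never sees None.

    Hypothetical sketch: the real GptOSS-to-mcore conversion code and its
    default values may differ.
    """
    defaults = {
        "yarn_mscale": 1.0,          # assumed default, for illustration only
        "yarn_mscale_all_dim": 1.0,  # assumed default, for illustration only
    }
    out = dict(config)
    for key, default in defaults.items():
        if out.get(key) is None:
            out[key] = default
    return out
```

Note that the maintainers later state these fields are intentionally None in MCore, so a helper like this may be unnecessary; it only illustrates the approach this PR takes.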

GitHub Actions CI

See the CI section in the Contributing doc for how to trigger the CI. An NVIDIA developer will need to approve and trigger the CI for external contributors.

Before your PR is "Ready for review"

Pre checks:

  • Make sure you read and followed Contributor guidelines
  • Did you write any new necessary tests?
  • Did you add or update any necessary documentation?
  • Does the PR affect components that are optional to install? (Ex: Numba, Pynini, Apex etc)
    • Reviewer: Does the PR have correct import guards for all optional libraries?

If you haven't finished some of the above items, you can still open a "Draft" PR.

Additional Information


copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@yiakwy-xpu-ml-framework-team
Author

@sergiopperez Could you have a look at this?

This is essential for supporting GptOSS training.

"""

use_gloo_process_groups: bool = True
use_gloo_process_groups: bool = False # True NOTE (yiakwy)
Contributor

@yaoyu-33 yaoyu-33 Dec 2, 2025
Why this change? Please attach a note in the comment if the change is necessary.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Dec 3, 2025

By default, Gloo is used to create TCP connections on the CPU side.

But we don't need it here, so it should default to False.

Author

@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Dec 3, 2025

Will roll back to True

@chtruong814 chtruong814 added the needs-follow-up Issue needs follow-up label Jan 11, 2026
@maanug-nv
Contributor

Updated version of this PR at #2413.

@maanug-nv maanug-nv closed this Feb 17, 2026
@chtruong814 chtruong814 removed the needs-follow-up Issue needs follow-up label Feb 17, 2026
@maanug-nv
Contributor

Sorry, I misinterpreted some settings in MCore, so #2413 is invalid. But yarn_mscale and yarn_mscale_all_dim are intentionally None (cc @cuichenx), and use_gloo_process_groups should stay True to match the initialize_model_parallel() defaults in MCore. So I don't think any of the changes in this PR are desired.
