
[Feature] add gptoss continue train bf16-fp8 (sft) example [part1 - mcore] #2383

Closed
yiakwy-xpu-ml-framework-team wants to merge 4 commits into NVIDIA:main from yiakwy-xpu-ml-framework-team:add_gptoss_example

Conversation


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team commented Nov 24, 2025

What does this PR do ?

⚠️ For major changes (either in lines of code or in impact), please make sure to first share and discuss a design doc with the team.

Add a GPT-OSS 20B training example on the Hopper platform:

Env :

  • CUDA 12.8 + torch 2.9.1
  • FlashAttention 2.8.3, FlashAttention 3 (latest)
  • TransformerEngine (PyTorch, latest: 2.9.0)
  • Triton 3.4 (to work with the Triton sorting kernel)
  • Base image: SGLang 0.5.0rc2

Supported parallel schemes:

  • PP=1, EP=8
  • PP=2, EP=4
  • TP=8, ETP=1, EP=8, PP=1

ETP must be 1, since ETP cannot be > 1 when add_bias is True (required by GPT-OSS).
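The constraint above can be sketched as a simple validation check (hypothetical function and argument names; Megatron-Core's actual assertion may differ):

```python
# Hypothetical sketch of the constraint stated above (illustrative names,
# not the actual Megatron-Core assertion): expert tensor parallelism (ETP)
# must stay at 1 when the expert linear layers add a bias, as GPT-OSS requires,
# because a bias cannot be partitioned across ETP ranks here.
def validate_moe_parallelism(expert_tensor_parallel_size: int,
                             add_bias_linear: bool) -> None:
    if add_bias_linear and expert_tensor_parallel_size > 1:
        raise ValueError(
            "ETP must be 1 when add_bias is True (required by GPT-OSS)"
        )

validate_moe_parallelism(1, True)  # OK: GPT-OSS with ETP=1
```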

Snapshot

Full GPT-OSS (24 layers) with TP8-EP8:

[Screenshots: training-log captures, 2025-11-27]

Steps to reproduce

  • Generate distributed checkpoint:

    torchrun $DISTRIBUTED_ARGS convert_mcore_bf16_checkpoint_from_hf.py 2>&1 | tee megatron_fwd.log
    
  • Start the training jobs:

    # slurm args
    bash training_gptoss_20b_120b_h100_bf16_fp8.sh   
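For reference, a minimal single-node DISTRIBUTED_ARGS for the torchrun command above might look like this (illustrative values; adjust nnodes, nproc_per_node, and master_addr for your cluster):

```shell
# Illustrative single-node DISTRIBUTED_ARGS (hypothetical values; the PR's
# actual launch configuration may differ):
DISTRIBUTED_ARGS="--nnodes=1 --nproc_per_node=8 --node_rank=0 --master_addr=localhost --master_port=6000"
echo "$DISTRIBUTED_ARGS"
```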
    

Changes

  • megatron core:

    • experts: the bias should be applied to the output using the unpadded tokens_per_expert
  • GPT-OSS YaRN config:

    • add the GPT-OSS YaRN config before model construction
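The experts change can be illustrated with a small NumPy sketch (illustrative only, not the actual Megatron-Core code): the per-expert bias must be expanded using the unpadded tokens_per_expert counts, so each output row receives the bias of the expert that actually produced it rather than a bias aligned to padded slots.

```python
import numpy as np

# Illustrative sketch of the fix described above (not Megatron-Core code):
# expand the per-expert bias with the *unpadded* tokens_per_expert so each
# output row gets the bias of the expert that produced it.
def add_expert_bias(output: np.ndarray, tokens_per_expert: np.ndarray,
                    bias: np.ndarray) -> np.ndarray:
    # output: [sum(tokens_per_expert), hidden]; bias: [num_experts, hidden]
    expert_ids = np.repeat(np.arange(len(tokens_per_expert)), tokens_per_expert)
    return output + bias[expert_ids]

# Two experts with 2 and 3 (unpadded) tokens respectively
out = np.zeros((5, 4))
bias = np.arange(8, dtype=float).reshape(2, 4)
res = add_expert_bias(out, np.array([2, 3]), bias)
```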

Other Related Issues:

Contribution process

flowchart LR
    A[Pre-checks] --> B[PR Tests]
    subgraph Code Review/Approval
        C1[Expert Review] --> C2[Final Review]
    end
    B --> C1
    C2 --> D[Merge]

Pre-checks

  • I want this PR in a versioned release and have added the appropriate Milestone (e.g., Core 0.8)
  • I have added relevant unit tests
  • I have added relevant functional tests
  • I have added proper typing to my code (see Typing guidelines)
  • I have added relevant documentation
  • I have run the autoformatter.sh on my PR

Code review

The following process is enforced via the CODEOWNERS file for changes into megatron/core. For changes outside of megatron/core, it is up to the PR author whether or not to tag the Final Reviewer team.

For MRs into `main` branch

(Step 1): Add PR label Expert Review

(Step 2): Collect the expert reviewers reviews

  1. Attach the Expert Review label when your PR is ready for review.
  2. GitHub auto-assigns expert reviewers based on your changes. They will get notified and pick up your PR soon.

⚠️ Only proceed to the next step once all reviewers have approved, merge conflicts are resolved, and the CI is passing.
Final Review might get declined if these requirements are not fulfilled.

(Step 3): Final Review

  1. Add Final Review label
  2. GitHub auto-assigns final reviewers based on your changes. They will get notified and pick up your PR soon.

(Optional Step 4): Cherry-pick into release branch

If this PR also needs to be merged into core_r* release branches, after this PR has been merged, select Cherry-pick to open a new PR into the release branch.

For MRs into `dev` branch

The proposed review process for the `dev` branch is under active discussion.

MRs are mergeable after one approval by either eharper@nvidia.com or zijiey@nvidia.com.

Merging your PR

Any member of core-adlr and core-nemo will be able to merge your PR.


copy-pr-bot bot commented Nov 24, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Contributor

sbhavani commented Dec 1, 2025

@yiakwy-xpu-ml-framework-team thanks for creating this example!

We are currently refactoring config management to improve validation and reduce the number of default args just like in Megatron Bridge. I'd recommend we wait until the refactor is done this month and then merge the example.

I also noticed you require other changes outside of examples/ and arguments.py. I think it'd be better to split those changes into separate PRs (experts.py bug fix, doc fixes, etc.) from the GPT-OSS example.


from megatron.bridge import AutoBridge
from megatron.bridge.utils.common_utils import get_last_rank, print_rank_0
from megatron.bridge.training.model_load_save import load_megatron_model, save_megatron_model, load_tokenizer

Hi, it's not ideal to import bridge from the megatron-lm side; it will cause a cyclic dependency. Also, the megatron-lm env doesn't require users to install bridge.

do you think it's okay to put the example in bridge?


@yaoyu-33 I see, this is a tool in examples independent of Megatron-Core. I didn't encounter the cyclic reference problem.

I use this script to generate megatron distributed checkpoint and verify it.


it's more like we are not asking megatron-lm users to install bridge, so it's a bit confusing to put things here. I think it needs other people's opinions to see what's the best way.

@@ -0,0 +1,43 @@
#/usr/bin/bash

The llama changes should be in another PR, if this PR is mostly GPT-OSS related.


I noticed we only added verification for Llama 3. I just updated it from my local repo.

The main focus of this script is GPT-OSS. Yes, it would be better in a separate PR, but if it is not too inconvenient for you, we can add it to the repo in this PR :D

get_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]] = None,
get_position_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]] = None,
create_gloo_process_groups: bool = True,
create_gloo_process_groups: bool = False,
This is a major change that affects all training, and it will increase the memory footprint. What's the reason?


@yiakwy-xpu-ml-framework-team yiakwy-xpu-ml-framework-team Dec 3, 2025


OK, I will disable it if the memory footprint is increased. I didn't observe side effects; I just found that by default many Gloo connections were created on the CPU side.

@fengxy-03

@yiakwy-xpu-ml-framework-team @yaoyu-33 Hi, I've implemented a GPT-OSS model (0.11B) based on your guidelines using Megatron-LM 0.16.0rc0. During training, the throughput per GPU is only around 1.0 TFLOP/s, which seems abnormal for this configuration.

Could you please take a look at my script and logs to see if there are any obvious misconfigurations?

Training Script

SEQ_LENGTH=8192
MAX_LENGTH=8192
TRAIN_SAMPLES=1518124
LAST_TRAIN_SAMPLES=0
LR_DECAY_SAMPLES=$(((TRAIN_SAMPLES - LAST_TRAIN_SAMPLES) * 80 / 100))
CHECKPOINT_PATH=
# ================end================
TOKENIZER_TYPE=SentencePieceTokenizer
TOKENIZER_MODEL=
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=128

DISTRIBUTED_ARGS=" \
    --nnodes=1 \
    --nproc_per_node=8 \
    --node_rank=0 \
    --master_addr=localhost \
    --master_port=6000"

MODEL_ARGS=" \
    --no-masked-softmax-fusion \
    --transformer-impl transformer_engine \
    --disable-bias-linear \
    --untie-embeddings-and-output-weights \
    --no-rope-fusion \
    --normalization RMSNorm \
    --num-layers 12 \
    --hidden-size 512 \
    --ffn-hidden-size 2048 \
    --num-attention-heads 64 \
    --group-query-attention \
    --num-query-groups 8 \
    --seq-length 8192 \
    --max-position-embeddings 8192 \
    --use-mcore-models \
    --rotary-percent 1.0 \
    --rope-type yarn \
    --position-embedding-type yarn \
    --rotary-base 10000 \
    --no-bias-gelu-fusion \
    --export-force-local-attention \
    --no-bias-dropout-fusion \
    --quick-geglu \
    --glu-linear-offset 1.0 \
    --softmax-type learnable \
    --window-attn-skip-freq 2 \
    --activation-func-clamp-value 7.0 \
    --window-size 128,0 \
    --enable-gpt-oss"

MOE_ARGS=" \
    --num-experts 4 \
    --moe-router-topk 2 \
    --moe-router-load-balancing-type aux_loss \
    --moe-aux-loss-coeff 1e-3 \
    --moe-grouped-gemm \
    --moe-token-dispatcher-type alltoall \
    --overlap-param-gather \
    --overlap-grad-reduce \
    --moe-ffn-hidden-size 2048 \
    --moe-router-dtype fp32 \
    --moe-z-loss-coeff 1e-3 \
    --moe-permute-fusion"

DATA_ARGS=" \
    --num-workers 8 \
    --dataloader-type cyclic \
    --tokenizer-type ${TOKENIZER_TYPE} \
    --tokenizer-model ${TOKENIZER_MODEL} \
    --data-path \
    --split 1000,0,0 \
    --no-create-attention-mask-in-dataloader"

TRAINING_ARGS=" \
    --micro-batch-size ${MICRO_BATCH_SIZE} \
    --global-batch-size ${GLOBAL_BATCH_SIZE} \
    --lr 1.0e-5 \
    --train-samples ${TRAIN_SAMPLES} \
    --lr-decay-samples ${LR_DECAY_SAMPLES} \
    --lr-decay-style cosine \
    --min-lr 1.0e-6 \
    --weight-decay 0.1 \
    --lr-warmup-fraction 0.05 \
    --clip-grad 1.0 \
    --bf16 \
    --use-flash-attn \
    --attention-softmax-in-fp32 \
    --accumulate-allreduce-grads-in-fp32 \
    --disable-bf16-reduced-precision-matmul \
    --recompute-activations"

MODEL_PARALLEL_ARGS=" \
    --tensor-model-parallel-size 4 \
    --pipeline-model-parallel-size 1 \
    --expert-model-parallel-size 2 \
    --sequence-parallel \
    --context-parallel-size 1 \
    --use-distributed-optimizer \
    --fp8-format hybrid \
    --fp8-param-gather \
    --fp8-amax-compute-algo max \
    --fp8-amax-history-len 1024"
    
LOGGING_ARGS=" \
    --log-interval 1 \
    --save-interval 10000 \
    --eval-interval 50000000 \
    --eval-iters 0 \
    --save $CHECKPOINT_PATH \
    --tensorboard-dir ${CHECKPOINT_PATH}/tensorboard \
    --wandb-project ${WANDB_PROJECT:-"gpt-oss"} \
    --wandb-exp-name ${WANDB_NAME:-"gpt-oss-test"} \
    --moe-per-layer-logging \
    --no-load-optim \
    --no-load-rng \
    --log-throughput"


python -m torch.distributed.run ${DISTRIBUTED_ARGS} pretrain_gpt.py \
    ${MODEL_ARGS} \
    ${MOE_ARGS} \
    ${DATA_ARGS} \
    ${TRAINING_ARGS} \
    ${MODEL_PARALLEL_ARGS} \
    ${LOGGING_ARGS}

Logs

 [2026-01-16 15:01:05] iteration        3/   11860 | consumed samples:          384 | elapsed time per iteration (ms): 75210.0 | throughput per GPU (TFLOP/s/GPU): 1.1 | learning rate: 6.323595E-08 | global batch size:   128 | lm loss: 6.102232E+00 | z_loss: 2.145113E+00 | load_balancing_loss: 1.155455E+00 | loss scale: 1.0 | grad norm: 75.384 | number of skipped iterations:   0 | number of nan iterations:   0 |
 [2026-01-16 15:02:20] iteration        4/   11860 | consumed samples:          512 | elapsed time per iteration (ms): 75126.0 | throughput per GPU (TFLOP/s/GPU): 1.1 | learning rate: 8.431460E-08 | global batch size:   128 | lm loss: 6.109521E+00 | z_loss: 2.139967E+00 | load_balancing_loss: 1.154540E+00 | loss scale: 1.0 | grad norm: 74.716 | number of skipped iterations:   0 | number of nan iterations:   0 |
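A rough sanity check of the reported throughput, assuming the standard ~6·N FLOPs-per-token approximation for forward+backward and that the "0.11B" figure is the active parameter count per token (both assumptions, not confirmed by the thread):

```python
# Back-of-envelope estimate of achieved TFLOP/s per GPU from the log numbers
# above. Assumes ~6 FLOPs per active parameter per token (fwd + bwd) and that
# 0.11e9 is the active parameter count -- both are assumptions.
def est_tflops_per_gpu(active_params: float, global_batch: int,
                       seq_len: int, iter_ms: float, num_gpus: int) -> float:
    tokens_per_s = global_batch * seq_len / (iter_ms / 1000.0)
    total_flops_per_s = tokens_per_s * 6 * active_params
    return total_flops_per_s / num_gpus / 1e12

# Numbers from the log: GBS=128, seq=8192, ~75210 ms/iter, 8 GPUs
print(round(est_tflops_per_gpu(0.11e9, 128, 8192, 75210.0, 8), 2))  # → 1.15
```

Under these assumptions the estimate roughly matches the logged 1.1 TFLOP/s/GPU, so the low FLOP rate may reflect the model's small size as much as any misconfiguration; the 75 s iteration time itself may be the more useful number to investigate.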


Phlip79 commented Mar 4, 2026

Please reference Megatron-Bridge for how to use bf16 and fp8 for GPT-OSS.

@Phlip79 Phlip79 closed this Mar 4, 2026


7 participants