[Feature] add gptoss continue train bf16-fp8 (sft) example [part1 - mcore] #2383
yiakwy-xpu-ml-framework-team wants to merge 4 commits into NVIDIA:main
Conversation
Force-pushed from 9cd908f to 7b7d0b3
@yiakwy-xpu-ml-framework-team thanks for creating this example! We are currently refactoring config management to improve validation and reduce the number of default args, just like in Megatron Bridge. I'd recommend we wait until the refactor is done this month and then merge the example. I also noticed you require other changes outside of examples/ and arguments.py. I think it'd be better to split those changes (experts.py bug fix, doc fixes, etc.) into separate PRs from the GPT-OSS example.
from megatron.bridge import AutoBridge
from megatron.bridge.utils.common_utils import get_last_rank, print_rank_0
from megatron.bridge.training.model_load_save import load_megatron_model, save_megatron_model, load_tokenizer
Hi, it's not ideal to import bridge from the megatron-lm side; it will cause a cyclic dependency. Also, the megatron-lm env doesn't require users to install bridge.
do you think it's okay to put the example in bridge?
@yaoyu-33 I see. This is a tool in examples, independent from Megatron-Core. I didn't encounter the cyclic reference problem.
I use this script to generate megatron distributed checkpoint and verify it.
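As a rough illustration of the "verify it" step, here is a minimal structural sanity check of a Megatron save directory (a hedged sketch: the `latest_checkpointed_iteration.txt` tracker file and `iter_XXXXXXX` subdirectory layout follow Megatron-LM's conventional checkpoint format, and the demo directory here is fabricated purely for illustration):

```shell
#!/usr/bin/bash
# Hedged sketch: structural check of a Megatron-LM save directory.
# Megatron-LM conventionally writes latest_checkpointed_iteration.txt plus
# per-iteration iter_XXXXXXX subdirectories under the save path.
CKPT_DIR="${CKPT_DIR:-/tmp/gptoss_ckpt_demo}"

# Demo setup so the check below has something to look at (fabricated data).
mkdir -p "${CKPT_DIR}/iter_0000001"
echo 1 > "${CKPT_DIR}/latest_checkpointed_iteration.txt"

check_ckpt() {
  local dir="$1"
  [ -f "${dir}/latest_checkpointed_iteration.txt" ] || { echo "missing tracker file"; return 1; }
  local it
  it=$(cat "${dir}/latest_checkpointed_iteration.txt")
  # The per-iteration directory name is zero-padded to seven digits.
  [ -d "$(printf '%s/iter_%07d' "${dir}" "${it}")" ] || { echo "missing iter dir"; return 1; }
  echo "ok: iteration ${it}"
}
check_ckpt "${CKPT_DIR}"
```

This only checks directory layout; actually validating tensor shapes and values still needs the Bridge-based load path described above.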
It's more that we are not asking megatron-lm users to install bridge, so it's a bit confusing to put things here. I think it needs other people's opinions to see what's the best way.
@@ -0,0 +1,43 @@
#!/usr/bin/bash
llama changes should be in another PR, if this PR is mostly gpt-oss related.
I noticed we only added verification for llama3, so I just updated it from my local repo.
The main focus of this script is gptoss. Yes, it would be better as a separate PR, but if it is not too inconvenient for you, we can add it to the repo in this PR :D
  get_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]] = None,
  get_position_embedding_ranks: Optional[Callable[[List[int], Optional[int]], List[int]]] = None,
- create_gloo_process_groups: bool = True,
+ create_gloo_process_groups: bool = False,
This is a major change that affects all training and will increase the memory footprint. What's the reason?
OK, I will revert it if the memory footprint is increased. I didn't observe side effects; I just found that, by default, many Gloo connections were created on the CPU side.
@yiakwy-xpu-ml-framework-team @yaoyu-33 Hi, I've implemented a GPT-OSS model (0.11B) based on your guidelines using Megatron-LM 0.16.0rc0. During training, the throughput per GPU is only around 1.0 TFLOP/s, which seems abnormal for this configuration. Could you please take a look at my script and logs to see if there are any obvious misconfigurations?

Training Script

SEQ_LENGTH=8192
MAX_LENGTH=8192
TRAIN_SAMPLES=1518124
LAST_TRAIN_SAMPLES=0
LR_DECAY_SAMPLES=$(((TRAIN_SAMPLES - LAST_TRAIN_SAMPLES) * 80 / 100))
CHECKPOINT_PATH=
# ================end================
TOKENIZER_TYPE=SentencePieceTokenizer
TOKENIZER_MODEL=
MICRO_BATCH_SIZE=1
GLOBAL_BATCH_SIZE=128
DISTRIBUTED_ARGS=" \
--nnodes=1 \
--nproc_per_node=8 \
--node_rank=0 \
--master_addr=localhost \
--master_port=6000"
MODEL_ARGS=" \
--no-masked-softmax-fusion \
--transformer-impl transformer_engine \
--disable-bias-linear \
--untie-embeddings-and-output-weights \
--no-rope-fusion \
--normalization RMSNorm \
--num-layers 12 \
--hidden-size 512 \
--ffn-hidden-size 2048 \
--num-attention-heads 64 \
--group-query-attention \
--num-query-groups 8 \
--seq-length 8192 \
--max-position-embeddings 8192 \
--use-mcore-models \
--rotary-percent 1.0 \
--rope-type yarn \
    --position-embedding-type yarn \
--rotary-base 10000 \
--no-bias-gelu-fusion \
--export-force-local-attention \
--no-bias-dropout-fusion \
--quick-geglu \
--glu-linear-offset 1.0 \
--softmax-type learnable \
--window-attn-skip-freq 2 \
--activation-func-clamp-value 7.0 \
--window-size 128,0 \
--enable-gpt-oss"
MOE_ARGS=" \
--num-experts 4 \
--moe-router-topk 2 \
--moe-router-load-balancing-type aux_loss \
--moe-aux-loss-coeff 1e-3 \
--moe-grouped-gemm \
--moe-token-dispatcher-type alltoall \
--overlap-param-gather \
--overlap-grad-reduce \
--moe-ffn-hidden-size 2048 \
--moe-router-dtype fp32 \
--moe-z-loss-coeff 1e-3 \
--moe-permute-fusion"
DATA_ARGS=" \
--num-workers 8 \
--dataloader-type cyclic \
--tokenizer-type ${TOKENIZER_TYPE} \
--tokenizer-model ${TOKENIZER_MODEL} \
--data-path \
--split 1000,0,0 \
--no-create-attention-mask-in-dataloader"
TRAINING_ARGS=" \
--micro-batch-size ${MICRO_BATCH_SIZE} \
--global-batch-size ${GLOBAL_BATCH_SIZE} \
--lr 1.0e-5 \
--train-samples ${TRAIN_SAMPLES} \
--lr-decay-samples ${LR_DECAY_SAMPLES} \
--lr-decay-style cosine \
--min-lr 1.0e-6 \
--weight-decay 0.1 \
--lr-warmup-fraction 0.05 \
--clip-grad 1.0 \
--bf16 \
--use-flash-attn \
--attention-softmax-in-fp32 \
--accumulate-allreduce-grads-in-fp32 \
--disable-bf16-reduced-precision-matmul \
--recompute-activations"
MODEL_PARALLEL_ARGS=" \
--tensor-model-parallel-size 4 \
--pipeline-model-parallel-size 1 \
--expert-model-parallel-size 2 \
--sequence-parallel \
--context-parallel-size 1 \
--use-distributed-optimizer \
--fp8-format hybrid \
--fp8-param-gather \
--fp8-amax-compute-algo max \
--fp8-amax-history-len 1024"
LOGGING_ARGS=" \
--log-interval 1 \
--save-interval 10000 \
--eval-interval 50000000 \
--eval-iters 0 \
--save $CHECKPOINT_PATH \
    --tensorboard-dir ${CHECKPOINT_PATH}/tensorboard \
--wandb-project ${WANDB_PROJECT:-"gpt-oss"} \
--wandb-exp-name ${WANDB_NAME:-"gpt-oss-test"} \
--moe-per-layer-logging \
--no-load-optim \
--no-load-rng \
--log-throughput"
python -m torch.distributed.run ${DISTRIBUTED_ARGS} pretrain_gpt.py \
${MODEL_ARGS} \
${MOE_ARGS} \
${DATA_ARGS} \
${TRAINING_ARGS} \
${MODEL_PARALLEL_ARGS} \
    ${LOGGING_ARGS}

Logs
Please reference Megatron-Bridge for how to use bf16 and fp8 for GPT-OSS.
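For reference, the fp8-related flags already shown in the script above can be grouped into a single toggle, so a run can be switched between plain bf16 and bf16+fp8 (a sketch based only on the flags in that script; whether hybrid fp8 is appropriate for GPT-OSS is exactly what Megatron-Bridge should be consulted on):

```shell
#!/usr/bin/bash
# Sketch: toggle fp8 on top of a bf16 run. Flags mirror the script above;
# set USE_FP8=0 to fall back to plain bf16.
USE_FP8=${USE_FP8:-1}
PRECISION_ARGS="--bf16"
if [ "${USE_FP8}" = "1" ]; then
  PRECISION_ARGS="${PRECISION_ARGS} \
    --fp8-format hybrid \
    --fp8-param-gather \
    --fp8-amax-compute-algo max \
    --fp8-amax-history-len 1024"
fi
echo "${PRECISION_ARGS}"
```

The resulting `PRECISION_ARGS` string is then appended to the training command in place of the hard-coded flags.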
What does this PR do?
Add a gptoss 20b training example on the Hopper platform:
Env:
Supported parallel schemes:
ETP must be 1, since ETP should not be > 1 when add_bias is True (required by GptOss).
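To make the constraint concrete, here is a hedged sketch of a parallel-args block that pins `--expert-tensor-parallel-size` to 1 (the variable names and the 8-GPU assumption are illustrative, not from this PR; the flags themselves are standard Megatron-LM arguments):

```shell
#!/usr/bin/bash
# Illustrative parallel scheme for an 8-GPU node (hypothetical values);
# ETP is pinned to 1 because ETP > 1 is rejected when add_bias is True,
# which GPT-OSS requires.
GPUS=8
TP=8
PP=1
EP=8
ETP=1
# Sanity check: TP * PP must divide the number of GPUs.
if [ $(( GPUS % (TP * PP) )) -ne 0 ]; then
  echo "invalid parallel config" >&2
  exit 1
fi
MODEL_PARALLEL_ARGS=" \
    --tensor-model-parallel-size ${TP} \
    --pipeline-model-parallel-size ${PP} \
    --expert-model-parallel-size ${EP} \
    --expert-tensor-parallel-size ${ETP} \
    --sequence-parallel \
    --use-distributed-optimizer"
echo "${MODEL_PARALLEL_ARGS}"
```

With this shape, changing the scheme (e.g. the TP8-EP8 snapshot below) only means editing the five variables at the top.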
Snapshot
full gptoss 24 layers with TP8-EP8:

Steps to reproduce
Generate distributed checkpoint:
Start training jobs
Changes
megatron core:
GptOss Yarn Config:
Other Related Issues:
Megatron Bridge:
TransformerEngine
SGlang Rollout
VeRL