[megatron] support qwen3.5 models for megatron, bump mbridge + megatron-core to latest #1425
erictang000 wants to merge 2 commits into main from
Conversation
Code Review
This pull request adds a Megatron training script for Qwen 3.5, updates dependencies, and introduces a monkey-patch for transformers v5 compatibility. Review feedback identifies a likely version typo in pyproject.toml, an undefined variable and inconsistent model naming in the shell script, and suggests more specific exception handling for the vLLM engine workaround.
```
trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.use_sample_packing=false \
trainer.flash_attn=$FLASH_ATTN \
```
```
@@ -0,0 +1,77 @@
set -x

# Colocated GRPO training+generation for Moonlight-16B-A3B-Instruct on GSM8K with Megatron.
generator.inference_engine.gpu_memory_utilization=0.6 \
trainer.logger="$LOGGER" \
trainer.project_name="gsm8k_megatron" \
trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_moonlight16b-a3b" \
```
The run_name suffix refers to moonlight16b-a3b. It should be updated to match the Qwen 3.5 model being used.
```diff
- trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_moonlight16b-a3b" \
+ trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_qwen3.5-0.8b" \
```
```python
except Exception:
    pass
```
Catching a broad Exception and passing silently is generally discouraged. While this is a monkey-patch workaround, it would be safer to catch specific errors (like ImportError or AttributeError) or at least log a warning if the patch fails, to aid in debugging if the library structure changes unexpectedly.
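A minimal sketch of the suggested pattern, assuming the patch probes and replaces an attribute on an imported module (the helper name `try_patch` and its signature are hypothetical, not part of this PR):

```python
import importlib
import logging

logger = logging.getLogger(__name__)

def try_patch(module_name: str, attr: str, replacement) -> bool:
    """Hypothetical helper: apply a monkey-patch, catching only the errors
    we actually expect instead of a bare `except Exception: pass`."""
    try:
        module = importlib.import_module(module_name)
        getattr(module, attr)  # raises AttributeError if the target moved
        setattr(module, attr, replacement)
        return True
    except (ImportError, AttributeError) as exc:
        # Log instead of passing silently, so a changed library layout is visible.
        logger.warning("monkey-patch for %s.%s skipped: %s", module_name, attr, exc)
        return False
```

This keeps the workaround best-effort while leaving a trace in the logs when the patched library's structure changes.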
```
trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.use_sample_packing=false \
trainer.flash_attn=$FLASH_ATTN \
```
🔴 Undefined $FLASH_ATTN variable used in script; an empty value will be passed to the config
The script references $FLASH_ATTN on line 49 (trainer.flash_attn=$FLASH_ATTN), but its definition on line 30 is commented out (# FLASH_ATTN=false). Because the script does not use set -u, the undefined variable silently expands to an empty string, so trainer.flash_attn= is passed to the training entrypoint. This will either cause a configuration parsing error or set an unexpected value. Every other Megatron script in the same directory (e.g., run_megatron_moonlight.sh:33, run_megatron_qwen3-30b-a3b.sh:27) defines FLASH_ATTN before using it.
Prompt for agents
In examples/train/megatron/run_megatron_qwen3.5.sh, the FLASH_ATTN variable on line 30 is commented out (# FLASH_ATTN=false) but is referenced on line 49 as trainer.flash_attn=$FLASH_ATTN. Either:
1. Uncomment line 30 to define FLASH_ATTN (e.g., change line 30 from '# FLASH_ATTN=false' to 'FLASH_ATTN=true' or 'FLASH_ATTN=false' depending on whether flash attention is supported for Qwen3.5), or
2. Remove line 49 entirely if the flash_attn config should not be set for this model (similar to how run_megatron_nemotron_mini_4b.sh omits it).
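If option 1 is chosen, a defensive variant is to give the variable a default with shell parameter expansion, so the config value can never expand to empty (a sketch; whether `false` is the right default for Qwen3.5 is an assumption):

```shell
# Fall back to "false" when FLASH_ATTN is unset or empty, so
# `trainer.flash_attn=$FLASH_ATTN` always receives a concrete value.
FLASH_ATTN=${FLASH_ATTN:-false}
echo "trainer.flash_attn=$FLASH_ATTN"
```

Adding `set -u` near the top of the script would instead turn this class of bug into an immediate error.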
GPU CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23869520430
Megatron GPU CI: https://github.com/NovaSky-AI/SkyRL/actions/runs/23869278330
WandB run for Qwen3.5-0.8B:

https://wandb.ai/sky-posttraining-uc-berkeley/gsm8k_megatron/runs/5cm9tg0j