[megatron] support qwen3.5 models for megatron, bump mbridge + megatron-core to latest#1425

Open
erictang000 wants to merge 2 commits into main from
qwen3_5_megatron
Conversation


@erictang000 erictang000 commented Apr 1, 2026


@gemini-code-assist bot left a comment


Code Review

This pull request adds a Megatron training script for Qwen 3.5, updates dependencies, and introduces a monkey-patch for transformers v5 compatibility. Review feedback identifies a likely version typo in pyproject.toml, an undefined variable and inconsistent model naming in the shell script, and suggests more specific exception handling for the vLLM engine workaround.

trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.use_sample_packing=false \
trainer.flash_attn=$FLASH_ATTN \


high

The variable $FLASH_ATTN is used here, but its definition on line 30 is commented out. As a result, an empty value is passed to the trainer, which may cause a parsing error in the entrypoint.

Suggested change
trainer.flash_attn=$FLASH_ATTN \
trainer.flash_attn=false \

@@ -0,0 +1,77 @@
set -x

# Colocated GRPO training+generation for Moonlight-16B-A3B-Instruct on GSM8K with Megatron.


medium

The comment mentions Moonlight-16B-A3B-Instruct, but the script is configured for Qwen/Qwen3.5-0.8B (line 12). This should be updated to reflect the correct model.

generator.inference_engine.gpu_memory_utilization=0.6 \
trainer.logger="$LOGGER" \
trainer.project_name="gsm8k_megatron" \
trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_moonlight16b-a3b" \


medium

The run_name suffix refers to moonlight16b-a3b. It should be updated to match the Qwen 3.5 model being used.

Suggested change
trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_moonlight16b-a3b" \
trainer.run_name="gsm8k_megatron_tp${MEGATRON_TP}_pp${MEGATRON_PP}_cp${MEGATRON_CP}_ep${MEGATRON_EP}_etp${MEGATRON_ETP}_qwen3.5-0.8b" \

Comment on lines +57 to +58
except Exception:
pass


medium

Catching a broad Exception and passing silently is generally discouraged. While this is a monkey-patch workaround, it would be safer to catch specific errors (like ImportError or AttributeError) or at least log a warning if the patch fails, to aid in debugging if the library structure changes unexpectedly.


@devin-ai-integration bot left a comment


Devin Review found 1 potential issue.

View 4 additional findings in Devin Review.


trainer.policy.megatron_config.expert_model_parallel_size=$MEGATRON_EP \
trainer.policy.megatron_config.expert_tensor_parallel_size=$MEGATRON_ETP \
trainer.use_sample_packing=false \
trainer.flash_attn=$FLASH_ATTN \

🔴 Undefined $FLASH_ATTN variable used in script, will pass empty value to config

The script references $FLASH_ATTN on line 49 (trainer.flash_attn=$FLASH_ATTN), but the variable definition on line 30 is commented out (# FLASH_ATTN=false). Since the script does not use set -u, the undefined variable silently expands to an empty string, resulting in trainer.flash_attn= being passed to the training entrypoint. This will either cause a configuration parsing error or set an unexpected value. Every other megatron script in the same directory (e.g., run_megatron_moonlight.sh:33, run_megatron_qwen3-30b-a3b.sh:27) properly defines FLASH_ATTN before using it.

Prompt for agents
In examples/train/megatron/run_megatron_qwen3.5.sh, the FLASH_ATTN variable on line 30 is commented out (# FLASH_ATTN=false) but is referenced on line 49 as trainer.flash_attn=$FLASH_ATTN. Either:
1. Uncomment line 30 to define FLASH_ATTN (e.g., change line 30 from '# FLASH_ATTN=false' to 'FLASH_ATTN=true' or 'FLASH_ATTN=false' depending on whether flash attention is supported for Qwen3.5), or
2. Remove line 49 entirely if the flash_attn config should not be set for this model (similar to how run_megatron_nemotron_mini_4b.sh omits it).