[doc] fix: set use_dist_checkpointing to False for ref model in qwen3moe-30b script #3198

Merged
vermouth1992 merged 1 commit into verl-project:main from none0663:fix_qwen3moe_30b_script
Aug 25, 2025

Conversation

@none0663
Contributor

What does this PR do?

Set use_dist_checkpointing to False for the ref model in the qwen3moe-30b script, because no dist_megatron_ckpt model path is available for the ref model.
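The reasoning above (no distributed checkpoint path, so the flag must be off) can be sketched as a small guard. This is not what the PR does; it is an illustrative alternative. The helper `ref_dist_ckpt_flag` and the variable `REF_DIST_CKPT_PATH` are hypothetical names, not part of the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints True only when a checkpoint directory
# was supplied and actually exists, otherwise prints False.
ref_dist_ckpt_flag() {
    local ckpt_dir="${1:-}"
    if [ -n "$ckpt_dir" ] && [ -d "$ckpt_dir" ]; then
        echo True
    else
        echo False
    fi
}

# REF_DIST_CKPT_PATH is an assumed, illustrative variable; it is empty by default,
# which mirrors the situation described in this PR.
REF_DIST_CKPT_PATH=${REF_DIST_CKPT_PATH:-}
echo "actor_rollout_ref.ref.megatron.use_dist_checkpointing=$(ref_dist_ckpt_flag "$REF_DIST_CKPT_PATH")"
```

With no path set, the printed override ends in `False`, which matches the value this PR aims for.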

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to fix an incorrect configuration for the reference model in the qwen3moe-30b training script by setting use_dist_checkpointing to False. While the intent is correct, the implementation introduces a potential issue by using a shared variable ${USE_DIST_CKPT}. As the pull request description notes, the reference model does not support distributed checkpointing, so this setting should be hardcoded to False to prevent future misconfigurations that could lead to runtime errors. The other change, which fixes a trailing backslash and adds a newline at the end of the file, is a good correction.
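As a rough illustration of why a trailing backslash on the last override line is fragile: a backslash followed by a newline is a line continuation, so anything appended on the next line is folded into the same command. The function `count_args` and the override `some.key=1` below are hypothetical placeholders, not taken from the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper that reports how many arguments it received.
count_args() { echo "$#"; }

# Intended: one override, then a separate command on the next line.
# The trailing backslash joins both lines, so "echo" and "finished"
# are passed as two extra arguments instead of running as a command.
n=$(count_args some.key=1 \
echo finished)
echo "$n"
```

Dropping the backslash from the final override line keeps any later addition to the script from being swallowed as arguments.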

  actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=${infer_ppo_micro_batch_size_per_gpu} \
  actor_rollout_ref.ref.log_prob_max_token_len_per_gpu=${infer_ppo_max_token_len} \
- actor_rollout_ref.ref.megatron.use_dist_checkpointing=True \
+ actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
critical

Based on the pull request description, use_dist_checkpointing must be False for the reference model because it lacks a distributed checkpoint path. Using the ${USE_DIST_CKPT} variable makes this setting configurable. If a user sets USE_DIST_CKPT=True (e.g., for the actor model), it would also be incorrectly enabled for the reference model, likely causing a runtime error. To ensure the script's robustness and prevent misconfiguration, this value should be hardcoded to False for the reference model.

Suggested change
- actor_rollout_ref.ref.megatron.use_dist_checkpointing=${USE_DIST_CKPT} \
+ actor_rollout_ref.ref.megatron.use_dist_checkpointing=False \
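A minimal sketch of the suggestion, assuming the actor model reads a matching `actor_rollout_ref.actor.megatron.use_dist_checkpointing` key (that key and the helper `build_overrides` are assumptions for illustration; only the ref key and `${USE_DIST_CKPT}` appear in this PR). The actor stays configurable while the ref value is pinned:

```shell
#!/usr/bin/env bash
# User-configurable; in this sketch it is applied to the actor only.
USE_DIST_CKPT=${USE_DIST_CKPT:-False}

# Hypothetical helper: prints one override per line.
build_overrides() {
    printf '%s\n' \
        "actor_rollout_ref.actor.megatron.use_dist_checkpointing=${USE_DIST_CKPT}" \
        "actor_rollout_ref.ref.megatron.use_dist_checkpointing=False"
}

build_overrides
```

Setting `USE_DIST_CKPT=True` changes only the actor line here; the ref line stays at `False`, which is the behavior the review comment asks for.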

@vermouth1992 vermouth1992 merged commit 58c847b into verl-project:main Aug 25, 2025
4 checks passed
@none0663 none0663 deleted the fix_qwen3moe_30b_script branch August 25, 2025 04:47
PopSoda2002 pushed a commit to PopSoda2002/verl that referenced this pull request Aug 26, 2025
…moe-30b script (verl-project#3198)

yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Aug 27, 2025
…moe-30b script (verl-project#3198)

cczitong123 pushed a commit to cczitong123/verl that referenced this pull request Sep 5, 2025
…moe-30b script (verl-project#3198)

DDVD233 pushed a commit to DDVD233/mirl that referenced this pull request Sep 5, 2025
…moe-30b script (verl-project#3198)

WncFht pushed a commit to WncFht/verl that referenced this pull request Oct 10, 2025
…moe-30b script (verl-project#3198)

techkang pushed a commit to techkang/verl that referenced this pull request Oct 31, 2025
…moe-30b script (verl-project#3198)

chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
…moe-30b script (verl-project#3198)

TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
…moe-30b script (verl-project#3198)

vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026
…moe-30b script (verl-project#3198)
