
[megatron, sglang, rollout, validation, doc] feat: Add Validation StateMachine to async-rl and Support async ref_logp#3

Closed
ziqi-wlb wants to merge 34 commits into rednote-hilab:main from ziqi-wlb:feat/async-ref-logp

Conversation


@ziqi-wlb ziqi-wlb commented Aug 29, 2025

What does this PR do?

  1. Add a validation StateMachine
  2. Support async ref_logp: remove all offloads (param/grad/optimizer, ...)
  3. Update the async-rl documentation

Performance: compared with the previous async-rl, performance improves by a further ~20% (170s -> 140s; after tuning the engines' TP, 140s -> 112s).

[image: performance comparison]

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Async RL Configuration
+actor_rollout_ref.async_pipeline=True \
 
# Resource Management
+trainer.sperated_node_ratios=[0.5,0.5] \
# means: each task group uses 0.5 of the total nodes,
# i.e. train/logp/ref_logp use 0.5 of the GPUs and generate uses the other 0.5

# Performance tuning: enable async param update (dual buffer)
+actor_rollout_ref.rollout.enable_dual_buffer=True \
# Sender-side bucket granularity on the actor training nodes during parameter update
+actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=512 \
# Receiver-side bucket granularity on the rollout inference nodes; too large a value causes GPU OOM
+actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
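The two bucket sizes above cap how much parameter data is in flight at once on each side of the transfer. As a rough illustration (a hypothetical helper, not the verl implementation), size-capped bucketing can be sketched like this:

```python
def split_into_buckets(param_sizes_mb, bucket_size_mb):
    """Group named parameters into buckets whose total size stays under
    bucket_size_mb, flushing a bucket whenever the cap would be exceeded.
    Hypothetical sketch of what preduce/consume bucket sizes control."""
    buckets, current, current_mb = [], [], 0
    for name, size_mb in param_sizes_mb:
        if current and current_mb + size_mb > bucket_size_mb:
            buckets.append(current)  # flush the full bucket
            current, current_mb = [], 0
        current.append(name)
        current_mb += size_mb
    if current:
        buckets.append(current)
    return buckets

# Sender flushes at a coarse 512 MB; the receiver re-chunks at a finer
# 128 MB so a single in-flight bucket cannot exhaust rollout GPU memory.
split_into_buckets([("w1", 300), ("w2", 300), ("w3", 100)], 512)
# -> [['w1'], ['w2', 'w3']]
```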
 
# Off-policy granularity: 2 means generate may run 2 steps ahead of the train nodes, i.e. one-step off-policy
+trainer.generate_ahead_steps=2 \
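The effect of `generate_ahead_steps` can be sketched as a bounded queue between the generate and train nodes: generation blocks once it is the configured number of steps ahead. This is a minimal toy model with hypothetical names, not the verl scheduler:

```python
import queue
import threading

def run_pipeline(total_steps, generate_ahead_steps):
    """Toy async pipeline: the generator may run at most
    generate_ahead_steps ahead of the trainer."""
    rollouts = queue.Queue(maxsize=generate_ahead_steps)  # bounds the off-policy gap
    order = []

    def generator():
        for step in range(total_steps):
            rollouts.put(step)  # blocks when the queue (the allowed gap) is full
            order.append(("generate", step))

    def trainer():
        for step in range(total_steps):
            batch = rollouts.get()  # consumes the oldest rollout, FIFO
            order.append(("train", batch))

    g = threading.Thread(target=generator)
    t = threading.Thread(target=trainer)
    g.start(); t.start(); g.join(); t.join()
    return order
```

With `generate_ahead_steps=2`, the generator produces steps 0 and 1 before the trainer has consumed anything, then stays at most two steps ahead.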

Task Group Configuration Examples

Example 1: Complete Separation

+trainer.sperated_node_tasks=[logp,ref_logp,actor-train,generate] \
+trainer.sperated_node_ratios=[0.25,0.25,0.25,0.25] \

Explanation: Each task gets 25% of total nodes

  • logp: 25% nodes
  • ref_logp: 25% nodes
  • actor-train: 25% nodes
  • generate: 25% nodes

Example 2: Hybrid Mode (logp + actor-train grouped)

+trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
+trainer.sperated_node_ratios=[0.5,0.25,0.25] \

Explanation:

  • First group [logp,actor-train]: 50% nodes (shared)
  • ref_logp: 25% nodes
  • generate: 25% nodes

Example 3: Hybrid Mode (logp + actor-train + ref_logp grouped)

+trainer.sperated_node_tasks=[[logp,actor-train,ref_logp],generate] \
+trainer.sperated_node_ratios=[0.5,0.5] \

Explanation:

  • First group [logp,actor-train,ref_logp]: 50% nodes (shared)
  • generate: 50% nodes
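The three examples above all follow the same rule: each entry of `sperated_node_tasks` (a task or a group of tasks) is paired with one ratio, and grouped tasks share the same node slice. A hypothetical helper mirroring that rule:

```python
def allocate_nodes(tasks, ratios, total_nodes):
    """Map each task to a node count per trainer.sperated_node_tasks /
    trainer.sperated_node_ratios semantics (illustrative sketch only).
    A list entry in `tasks` is a group whose members share the slice."""
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    allocation = {}
    for group, ratio in zip(tasks, ratios):
        nodes = int(total_nodes * ratio)
        members = group if isinstance(group, list) else [group]
        for task in members:
            allocation[task] = nodes  # grouped tasks share these nodes
    return allocation

# Example 2 above on a 16-node cluster:
allocate_nodes([["logp", "actor-train"], "ref_logp", "generate"],
               [0.5, 0.25, 0.25], 16)
# -> {'logp': 8, 'actor-train': 8, 'ref_logp': 4, 'generate': 4}
```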

Script for red-moe

  python3 -m verl.trainer.main_ppo --config-path=$ROOT_PATH/run_verl --config-name='redmoe_megatron' \
      ++hydra.run.dir=outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}-${env:RANK,0} \
      algorithm.adv_estimator=grpo \
      data.train_files="$TRAIN_DATA_PATH" \
      data.val_files="$TEST_DATA_PATH" \
      data.train_batch_size=128 \
      data.max_prompt_length=$max_prompt_length \
      data.max_response_length=$max_response_length \
      data.filter_overlong_prompts=True \
      data.filter_overlong_prompts_workers=32 \
      data.truncation='error' \
      actor_rollout_ref.hybrid_engine=False \
      actor_rollout_ref.model.path=$MODEL_PATH \
      actor_rollout_ref.model.trust_remote_code=True \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.load_weight=True \
      actor_rollout_ref.actor.ppo_mini_batch_size=128 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.actor.megatron.param_offload=False \
      actor_rollout_ref.actor.megatron.grad_offload=False \
      actor_rollout_ref.actor.megatron.optimizer_offload=False \
      actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1 \
      actor_rollout_ref.actor.megatron.tensor_model_parallel_size=1 \
      actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
      actor_rollout_ref.ref.megatron.param_offload=False \
      actor_rollout_ref.ref.megatron.grad_offload=False \
      actor_rollout_ref.ref.megatron.optimizer_offload=False \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=1 \
      actor_rollout_ref.ref.megatron.tensor_model_parallel_size=1 \
      actor_rollout_ref.ref.megatron.expert_model_parallel_size=1 \
      actor_rollout_ref.actor.use_kl_loss=True \
      actor_rollout_ref.rollout.n=16 \
      +actor_rollout_ref.rollout.enable_dual_buffer=True \
      +actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=256 \
      +actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.rollout.name=sglang \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
      actor_rollout_ref.rollout.free_cache_engine=False \
      actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
      algorithm.use_kl_in_reward=False \
      +trainer.async_pipeline=True \
      +trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
      +trainer.sperated_node_ratios=[0.5,0.25,0.25] \
      +trainer.generate_ahead_steps=1 \
      trainer.val_only=False \
      trainer.critic_warmup=0 \
      trainer.resume_mode=disable \
      trainer.logger=['console','tensorboard'] \
      trainer.project_name="verl_async_rl_redmoe16b" \
      trainer.experiment_name=$EXP_NAME \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=${WORLD_SIZE} \
      trainer.save_freq=200 \
      trainer.test_freq=64 \
      trainer.total_epochs=100 2>&1 | tee log_async_rl_n16_${RANK}.txt $@

zuijiang and others added 17 commits August 19, 2025 06:21
Add state-machine for async-rl

Add async param-update overlap with logp and generate
…ate-machine and add red-moe model (verl-project#1)

* add xdg ulysses

* add grpo scripts

* Adapt redmoe + mcore (by 光速)

* Bump from guangsu

* [feat] Add async-rl with param-sync and async-pipeline

Add state-machine for async-rl

Add async param-update overlap with logp and generate

* Update README

* Refine code

* rebase to main

* add offload-grad for megatron-worker

* Refine code

* Refine code

* Refine code

---------

Co-authored-by: zuijiang <jiangjiangzuijiang@gmail.com>
Co-authored-by: root <liuyanjiang601@xiaohongshu.com>
Co-authored-by: weishi <bushou@xiaohongshu.com>
Support separated validation and ref_logp
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 55d3e96 to 56a34c1 on September 2, 2025 04:28
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 28eeded to c2f4988 on September 4, 2025 04:18
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 9adcdd9 to 0838101 on September 8, 2025 03:04
@ziqi-wlb force-pushed the feat/async-ref-logp branch 3 times, most recently from 4709fd1 to fda6381 on September 9, 2025 02:21
@ziqi-wlb force-pushed the feat/async-ref-logp branch from fda6381 to 473669d on September 9, 2025 02:28