
[megatron, sglang, rollout, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync#4

Merged
zhouheyun merged 36 commits into rednote-hilab:dots.rl from ziqi-wlb:feat/async-ref-logp
Sep 12, 2025

Conversation

@ziqi-wlb

What does this PR do?

  1. Add a validation StateMachine
  2. Support async ref_logp: remove all offloads (param/grad/optimizer, ...)
  3. Update the async-rl docs

Performance:

  1. async-ref-logp: compared with the previous async-rl, performance improves by a further 20% (170s -> 140s; after tuning the engines' TP, 140s -> 112s)
  2. nccl-sync: end-to-end performance is 50% higher than Verl's hybrid-engine. The (actor-gather / broadcast / engine-load) overlap plus bucket-fused nccl-sync optimization reduces parameter-synchronization time by 60% (dots2.0 300B, nGPUs=512)
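The bucket-fused nccl-sync mentioned above amortizes per-tensor collective launch overhead by packing many parameters into size-capped buckets and sending one fused buffer per bucket. Below is a minimal, framework-free Python sketch of the packing step only; the function name and parameter list shape are hypothetical illustrations, not verl's actual API:

```python
def fuse_into_buckets(params, bucket_size_mb=512):
    """Greedily pack (name, nbytes) parameter entries into buckets capped at
    bucket_size_mb, so each sync sends one fused buffer per bucket instead of
    one small message per tensor. A single oversized parameter still gets its
    own bucket."""
    cap = bucket_size_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        # Close the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + nbytes > cap:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

With a 1 MB cap, three tensors of 600 KB, 600 KB, and 300 KB fuse into two buckets: the first tensor alone, then the second and third together. The sender/receiver bucket sizes can differ (as in the `param_update_preduce_bucket_size_mb` / `param_update_consume_bucket_size_mb` flags below) because the receiver typically has less free memory.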

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Async RL Configuration
+actor_rollout_ref.async_pipeline=True \
 
# Resource Management
+trainer.sperated_node_ratios=[0.5,0.5] \
# means: each task group gets 0.5 of the total nodes,
# i.e. train/logp/ref_logp use 0.5 of the GPUs and generate uses the other 0.5

# Performance tuning: enable the async parameter update path
+actor_rollout_ref.rollout.enable_dual_buffer=True \
# choose the update transport: async-cpu or sync-nccl
+actor_rollout_ref.rollout.enable_param_async=False \
# sender-side bucket granularity on the actor training nodes during parameter update
+actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=512 \
# receiver-side bucket granularity on the rollout inference nodes; too large a value causes GPU OOM
+actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \

# Off-policy granularity: 2 means generation runs up to 2 steps ahead of the train node, i.e. one-step-offpolicy
+trainer.generate_ahead_steps=2 \
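`generate_ahead_steps` bounds how far rollout generation may run ahead of training. A minimal Python sketch of that bounded lookahead, modeled as a capacity-limited queue between the generator and the trainer (class and method names are hypothetical illustrations, not verl's API):

```python
from collections import deque


class AheadStepPipeline:
    """Generation may enqueue at most `ahead_steps` rollout batches that the
    trainer has not yet consumed; the trainer always trains on the oldest
    batch, so data is off-policy by up to ahead_steps - 1 steps at steady
    state."""

    def __init__(self, ahead_steps):
        self.ahead_steps = ahead_steps
        self.queue = deque()

    def can_generate(self):
        # The generator stalls once it is ahead_steps ahead of training.
        return len(self.queue) < self.ahead_steps

    def generate(self, batch):
        assert self.can_generate(), "generator must wait for the trainer"
        self.queue.append(batch)

    def train(self):
        # Consume the oldest pending rollout batch.
        return self.queue.popleft()
```

With `ahead_steps=2`, the generator can produce batches 0 and 1 before the trainer touches batch 0; once the trainer consumes it, the generator may proceed to batch 2.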

Task Group Configuration Examples

Example 1: Complete Separation

+trainer.sperated_node_tasks=[logp,ref_logp,actor-train,generate] \
+trainer.sperated_node_ratios=[0.25,0.25,0.25,0.25] \

Explanation: Each task gets 25% of total nodes

  • logp: 25% nodes
  • ref_logp: 25% nodes
  • actor-train: 25% nodes
  • generate: 25% nodes

Example 2: Hybrid Mode (logp + actor-train grouped)

+trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
+trainer.sperated_node_ratios=[0.5,0.25,0.25] \

Explanation:

  • First group [logp,actor-train]: 50% nodes (shared)
  • ref_logp: 25% nodes
  • generate: 25% nodes

Example 3: Hybrid Mode (logp + actor-train + ref_logp grouped)

+trainer.sperated_node_tasks=[[logp,actor-train,ref_logp],generate] \
+trainer.sperated_node_ratios=[0.5,0.5] \

Explanation:

  • First group [logp,actor-train,ref_logp]: 50% nodes (shared)
  • generate: 50% nodes
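Across the three examples, `sperated_node_tasks` and `sperated_node_ratios` jointly map each task group onto a disjoint slice of the node pool, with the tasks inside one group sharing their slice. A hypothetical Python sketch of that mapping (not verl's actual allocator; names are illustrative):

```python
def allocate_nodes(task_groups, ratios, total_nodes):
    """Assign each task group a contiguous slice of node indices sized by its
    ratio. A group may be a single task name or a list of tasks that share
    one slice."""
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    allocation, start = {}, 0
    for group, ratio in zip(task_groups, ratios):
        count = int(round(ratio * total_nodes))
        tasks = group if isinstance(group, list) else [group]
        for task in tasks:
            # Tasks in the same group share the same node slice.
            allocation[task] = list(range(start, start + count))
        start += count
    return allocation
```

For Example 2 on 8 nodes, `[[logp,actor-train],ref_logp,generate]` with ratios `[0.5,0.25,0.25]` gives logp and actor-train the shared slice 0-3, ref_logp nodes 4-5, and generate nodes 6-7.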

Launch script for red-moe

  python3 -m verl.trainer.main_ppo --config-path=$ROOT_PATH/run_verl --config-name='redmoe_megatron' \
    ++hydra.run.dir=outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}-${env:RANK,0} \
    algorithm.adv_estimator=grpo \
    data.train_files="$TRAIN_DATA_PATH" \
    data.val_files="$TEST_DATA_PATH" \
    data.train_batch_size=128 \
    data.max_prompt_length=$max_prompt_length \
    data.max_response_length=$max_response_length \
    data.filter_overlong_prompts=True \
    data.filter_overlong_prompts_workers=32 \
    data.truncation='error' \
    actor_rollout_ref.hybrid_engine=False \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.load_weight=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.param_offload=False \
    actor_rollout_ref.actor.megatron.grad_offload=False \
    actor_rollout_ref.actor.megatron.optimizer_offload=False \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1 \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=1 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
    actor_rollout_ref.ref.megatron.param_offload=False \
    actor_rollout_ref.ref.megatron.grad_offload=False \
    actor_rollout_ref.ref.megatron.optimizer_offload=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=1 \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=1 \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.rollout.n=16 \
    +actor_rollout_ref.rollout.enable_dual_buffer=True \
    +actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=256 \
    +actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    algorithm.use_kl_in_reward=False \
    +trainer.async_pipeline=True \
    +trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
    +trainer.sperated_node_ratios=[0.5,0.25,0.25] \
    +trainer.generate_ahead_steps=1 \
    trainer.val_only=False \
    trainer.critic_warmup=0 \
    trainer.resume_mode=disable \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name="verl_async_rl_redmoe16b" \
    trainer.experiment_name=$EXP_NAME \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=${WORLD_SIZE} \
    trainer.save_freq=200 \
    trainer.test_freq=64 \
    trainer.total_epochs=100 2>&1 | tee log_async_rl_n16_${RANK}.txt $@

@ziqi-wlb ziqi-wlb changed the title [megatron, sglang, rollout, validation, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync [megatron, sglang, rollout, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync Sep 10, 2025
@zhouheyun zhouheyun merged commit bc3ef7a into rednote-hilab:dots.rl Sep 12, 2025
4 checks passed