
[megatron, sglang, rollout, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync#4

Merged
zhouheyun merged 36 commits into rednote-hilab:dots.rl from ziqi-wlb:feat/async-ref-logp
Sep 12, 2025

Conversation

@ziqi-wlb

What does this PR do?

  1. Add a validation StateMachine
  2. Support async ref_logp: remove all offloads (param/grad/optimizer, ...)
  3. Update the async-rl docs

Performance:

  1. async-ref-logp: compared with the previous async-rl, performance improves by a further 20% (170s -> 140s; after tuning the engines' TP, 140s -> 112s)
  2. nccl-sync: end-to-end performance is 50% higher than Verl's hybrid-engine. The (actor-gather / broadcast / engine-load) overlap plus bucket-fused nccl-sync optimization reduces parameter-synchronization time by 60% (dots2.0 300B, nGPUs=512)
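The bucket-fused nccl-sync mentioned above amortizes per-tensor collective launch overhead by packing many parameters into size-capped buckets and sending one fused buffer per bucket. Below is a minimal, framework-free Python sketch of the packing step only; the function name and parameter list shape are hypothetical illustrations, not verl's actual API:

```python
def fuse_into_buckets(params, bucket_size_mb=512):
    """Greedily pack (name, nbytes) parameter entries into buckets capped at
    bucket_size_mb, so each sync sends one fused buffer per bucket instead of
    one small message per tensor. A single oversized parameter still gets its
    own bucket."""
    cap = bucket_size_mb * 1024 * 1024
    buckets, current, current_bytes = [], [], 0
    for name, nbytes in params:
        # Close the current bucket if adding this tensor would exceed the cap.
        if current and current_bytes + nbytes > cap:
            buckets.append(current)
            current, current_bytes = [], 0
        current.append(name)
        current_bytes += nbytes
    if current:
        buckets.append(current)
    return buckets
```

With a 1 MB cap, three tensors of 600 KB, 600 KB, and 300 KB fuse into two buckets: the first tensor alone, then the second and third together. The sender/receiver bucket sizes can differ (as in the `param_update_preduce_bucket_size_mb` / `param_update_consume_bucket_size_mb` flags below) because the receiver typically has less free memory.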

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Async RL Configuration
+actor_rollout_ref.async_pipeline=True \
 
# Resource Management
+trainer.sperated_node_ratios=[0.5,0.5] \
# means: each task group gets 0.5 of the total nodes,
# i.e. train/logp/ref_logp use 0.5 of the GPUs and generate uses the other 0.5

# Performance tuning: enable the async parameter update path
+actor_rollout_ref.rollout.enable_dual_buffer=True \
# choose the update transport: async-cpu or sync-nccl
+actor_rollout_ref.rollout.enable_param_async=False \
# sender-side bucket granularity on the actor training nodes during parameter update
+actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=512 \
# receiver-side bucket granularity on the rollout inference nodes; too large a value causes GPU OOM
+actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \

# Off-policy granularity: 2 means generation runs up to 2 steps ahead of the train node, i.e. one-step-offpolicy
+trainer.generate_ahead_steps=2 \
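`generate_ahead_steps` bounds how far rollout generation may run ahead of training. A minimal Python sketch of that bounded lookahead, modeled as a capacity-limited queue between the generator and the trainer (class and method names are hypothetical illustrations, not verl's API):

```python
from collections import deque


class AheadStepPipeline:
    """Generation may enqueue at most `ahead_steps` rollout batches that the
    trainer has not yet consumed; the trainer always trains on the oldest
    batch, so data is off-policy by up to ahead_steps - 1 steps at steady
    state."""

    def __init__(self, ahead_steps):
        self.ahead_steps = ahead_steps
        self.queue = deque()

    def can_generate(self):
        # The generator stalls once it is ahead_steps ahead of training.
        return len(self.queue) < self.ahead_steps

    def generate(self, batch):
        assert self.can_generate(), "generator must wait for the trainer"
        self.queue.append(batch)

    def train(self):
        # Consume the oldest pending rollout batch.
        return self.queue.popleft()
```

With `ahead_steps=2`, the generator can produce batches 0 and 1 before the trainer touches batch 0; once the trainer consumes it, the generator may proceed to batch 2.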

Task Group Configuration Examples

Example 1: Complete Separation

+trainer.sperated_node_tasks=[logp,ref_logp,actor-train,generate] \
+trainer.sperated_node_ratios=[0.25,0.25,0.25,0.25] \

Explanation: Each task gets 25% of total nodes

  • logp: 25% nodes
  • ref_logp: 25% nodes
  • actor-train: 25% nodes
  • generate: 25% nodes

Example 2: Hybrid Mode (logp + actor-train grouped)

+trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
+trainer.sperated_node_ratios=[0.5,0.25,0.25] \

Explanation:

  • First group [logp,actor-train]: 50% nodes (shared)
  • ref_logp: 25% nodes
  • generate: 25% nodes

Example 3: Hybrid Mode (logp + actor-train + ref_logp grouped)

+trainer.sperated_node_tasks=[[logp,actor-train,ref_logp],generate] \
+trainer.sperated_node_ratios=[0.5,0.5] \

Explanation:

  • First group [logp,actor-train,ref_logp]: 50% nodes (shared)
  • generate: 50% nodes
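Across the three examples, `sperated_node_tasks` and `sperated_node_ratios` jointly map each task group onto a disjoint slice of the node pool, with the tasks inside one group sharing their slice. A hypothetical Python sketch of that mapping (not verl's actual allocator; names are illustrative):

```python
def allocate_nodes(task_groups, ratios, total_nodes):
    """Assign each task group a contiguous slice of node indices sized by its
    ratio. A group may be a single task name or a list of tasks that share
    one slice."""
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    allocation, start = {}, 0
    for group, ratio in zip(task_groups, ratios):
        count = int(round(ratio * total_nodes))
        tasks = group if isinstance(group, list) else [group]
        for task in tasks:
            # Tasks in the same group share the same node slice.
            allocation[task] = list(range(start, start + count))
        start += count
    return allocation
```

For Example 2 on 8 nodes, `[[logp,actor-train],ref_logp,generate]` with ratios `[0.5,0.25,0.25]` gives logp and actor-train the shared slice 0-3, ref_logp nodes 4-5, and generate nodes 6-7.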

Launch script for red-moe

  python3 -m verl.trainer.main_ppo --config-path=$ROOT_PATH/run_verl --config-name='redmoe_megatron' \
    ++hydra.run.dir=outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}-${env:RANK,0} \
    algorithm.adv_estimator=grpo \
    data.train_files="$TRAIN_DATA_PATH" \
    data.val_files="$TEST_DATA_PATH" \
    data.train_batch_size=128 \
    data.max_prompt_length=$max_prompt_length \
    data.max_response_length=$max_response_length \
    data.filter_overlong_prompts=True \
    data.filter_overlong_prompts_workers=32 \
    data.truncation='error' \
    actor_rollout_ref.hybrid_engine=False \
    actor_rollout_ref.model.path=$MODEL_PATH \
    actor_rollout_ref.model.trust_remote_code=True \
    actor_rollout_ref.actor.optim.lr=1e-6 \
    actor_rollout_ref.actor.load_weight=True \
    actor_rollout_ref.actor.ppo_mini_batch_size=128 \
    actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.actor.megatron.param_offload=False \
    actor_rollout_ref.actor.megatron.grad_offload=False \
    actor_rollout_ref.actor.megatron.optimizer_offload=False \
    actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1 \
    actor_rollout_ref.actor.megatron.tensor_model_parallel_size=1 \
    actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
    actor_rollout_ref.ref.megatron.param_offload=False \
    actor_rollout_ref.ref.megatron.grad_offload=False \
    actor_rollout_ref.ref.megatron.optimizer_offload=False \
    actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=1 \
    actor_rollout_ref.ref.megatron.tensor_model_parallel_size=1 \
    actor_rollout_ref.ref.megatron.expert_model_parallel_size=1 \
    actor_rollout_ref.actor.use_kl_loss=True \
    actor_rollout_ref.rollout.n=16 \
    +actor_rollout_ref.rollout.enable_dual_buffer=True \
    +actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=256 \
    +actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
    actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
    actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
    actor_rollout_ref.rollout.name=sglang \
    actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
    actor_rollout_ref.rollout.free_cache_engine=False \
    actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
    algorithm.use_kl_in_reward=False \
    +trainer.async_pipeline=True \
    +trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
    +trainer.sperated_node_ratios=[0.5,0.25,0.25] \
    +trainer.generate_ahead_steps=1 \
    trainer.val_only=False \
    trainer.critic_warmup=0 \
    trainer.resume_mode=disable \
    trainer.logger=['console','tensorboard'] \
    trainer.project_name="verl_async_rl_redmoe16b" \
    trainer.experiment_name=$EXP_NAME \
    trainer.n_gpus_per_node=8 \
    trainer.nnodes=${WORLD_SIZE} \
    trainer.save_freq=200 \
    trainer.test_freq=64 \
    trainer.total_epochs=100 2>&1 | tee log_async_rl_n16_${RANK}.txt $@

@ziqi-wlb ziqi-wlb changed the title [megatron, sglang, rollout, validation, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync [megatron, sglang, rollout, doc] feat: Add Validation StateMachine to async-rl, async ref_logp and support nccl-sync Sep 10, 2025
@zhouheyun zhouheyun merged commit bc3ef7a into rednote-hilab:dots.rl Sep 12, 2025
4 checks passed