
[megatron, sglang, rollout, validation, doc] feat: Add Validation StateMachine to async-rl and Support async ref_logp#3

Closed
ziqi-wlb wants to merge 34 commits into rednote-hilab:main from ziqi-wlb:feat/async-ref-logp

Conversation


@ziqi-wlb ziqi-wlb commented Aug 29, 2025

What does this PR do?

  1. Add a validation StateMachine
  2. Support async ref_logp: remove all offloads (param/grad/optimizer, ...)
  3. Update the async-rl documentation

Performance: compared with the previous async-rl, performance improves by a further ~20% (170s -> 140s; after tuning the engines' TP, 140s -> 112s).

[image: performance comparison]

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Async RL Configuration
+actor_rollout_ref.async_pipeline=True \
 
# Resource Management
+trainer.sperated_node_ratios=[0.5,0.5] \
# means: each task group uses 0.5 of the total nodes,
# i.e. train/logp/ref_logp use 0.5 of the GPUs and generate uses the other 0.5

# Performance tuning: enable async param update (dual buffer)
+actor_rollout_ref.rollout.enable_dual_buffer=True \
# Sender-side bucket granularity on the actor training nodes during parameter update
+actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=512 \
# Receiver-side bucket granularity on the rollout inference nodes; too large a value causes GPU OOM
+actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
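The two bucket sizes above cap how much parameter data is in flight at once on each side of the transfer. As a rough illustration (a hypothetical helper, not the verl implementation), size-capped bucketing can be sketched like this:

```python
def split_into_buckets(param_sizes_mb, bucket_size_mb):
    """Group named parameters into buckets whose total size stays under
    bucket_size_mb, flushing a bucket whenever the cap would be exceeded.
    Hypothetical sketch of what preduce/consume bucket sizes control."""
    buckets, current, current_mb = [], [], 0
    for name, size_mb in param_sizes_mb:
        if current and current_mb + size_mb > bucket_size_mb:
            buckets.append(current)  # flush the full bucket
            current, current_mb = [], 0
        current.append(name)
        current_mb += size_mb
    if current:
        buckets.append(current)
    return buckets

# Sender flushes at a coarse 512 MB; the receiver re-chunks at a finer
# 128 MB so a single in-flight bucket cannot exhaust rollout GPU memory.
split_into_buckets([("w1", 300), ("w2", 300), ("w3", 100)], 512)
# -> [['w1'], ['w2', 'w3']]
```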
 
# Off-policy granularity: 2 means generate may run 2 steps ahead of the train nodes, i.e. one-step off-policy
+trainer.generate_ahead_steps=2 \
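The effect of `generate_ahead_steps` can be sketched as a bounded queue between the generate and train nodes: generation blocks once it is the configured number of steps ahead. This is a minimal toy model with hypothetical names, not the verl scheduler:

```python
import queue
import threading

def run_pipeline(total_steps, generate_ahead_steps):
    """Toy async pipeline: the generator may run at most
    generate_ahead_steps ahead of the trainer."""
    rollouts = queue.Queue(maxsize=generate_ahead_steps)  # bounds the off-policy gap
    order = []

    def generator():
        for step in range(total_steps):
            rollouts.put(step)  # blocks when the queue (the allowed gap) is full
            order.append(("generate", step))

    def trainer():
        for step in range(total_steps):
            batch = rollouts.get()  # consumes the oldest rollout, FIFO
            order.append(("train", batch))

    g = threading.Thread(target=generator)
    t = threading.Thread(target=trainer)
    g.start(); t.start(); g.join(); t.join()
    return order
```

With `generate_ahead_steps=2`, the generator produces steps 0 and 1 before the trainer has consumed anything, then stays at most two steps ahead.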

Task Group Configuration Examples

Example 1: Complete Separation

+trainer.sperated_node_tasks=[logp,ref_logp,actor-train,generate] \
+trainer.sperated_node_ratios=[0.25,0.25,0.25,0.25] \

Explanation: Each task gets 25% of total nodes

  • logp: 25% nodes
  • ref_logp: 25% nodes
  • actor-train: 25% nodes
  • generate: 25% nodes

Example 2: Hybrid Mode (logp + actor-train grouped)

+trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
+trainer.sperated_node_ratios=[0.5,0.25,0.25] \

Explanation:

  • First group [logp,actor-train]: 50% nodes (shared)
  • ref_logp: 25% nodes
  • generate: 25% nodes

Example 3: Hybrid Mode (logp + actor-train + ref_logp grouped)

+trainer.sperated_node_tasks=[[logp,actor-train,ref_logp],generate] \
+trainer.sperated_node_ratios=[0.5,0.5] \

Explanation:

  • First group [logp,actor-train,ref_logp]: 50% nodes (shared)
  • generate: 50% nodes
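The three examples above all follow the same rule: each entry of `sperated_node_tasks` (a task or a group of tasks) is paired with one ratio, and grouped tasks share the same node slice. A hypothetical helper mirroring that rule:

```python
def allocate_nodes(tasks, ratios, total_nodes):
    """Map each task to a node count per trainer.sperated_node_tasks /
    trainer.sperated_node_ratios semantics (illustrative sketch only).
    A list entry in `tasks` is a group whose members share the slice."""
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    allocation = {}
    for group, ratio in zip(tasks, ratios):
        nodes = int(total_nodes * ratio)
        members = group if isinstance(group, list) else [group]
        for task in members:
            allocation[task] = nodes  # grouped tasks share these nodes
    return allocation

# Example 2 above on a 16-node cluster:
allocate_nodes([["logp", "actor-train"], "ref_logp", "generate"],
               [0.5, 0.25, 0.25], 16)
# -> {'logp': 8, 'actor-train': 8, 'ref_logp': 4, 'generate': 4}
```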

Script for red-moe

  python3 -m verl.trainer.main_ppo --config-path=$ROOT_PATH/run_verl --config-name='redmoe_megatron' \
      ++hydra.run.dir=outputs/${now:%Y-%m-%d}/${now:%H-%M-%S}-${env:RANK,0} \
      algorithm.adv_estimator=grpo \
      data.train_files="$TRAIN_DATA_PATH" \
      data.val_files="$TEST_DATA_PATH" \
      data.train_batch_size=128 \
      data.max_prompt_length=$max_prompt_length \
      data.max_response_length=$max_response_length \
      data.filter_overlong_prompts=True \
      data.filter_overlong_prompts_workers=32 \
      data.truncation='error' \
      actor_rollout_ref.hybrid_engine=False \
      actor_rollout_ref.model.path=$MODEL_PATH \
      actor_rollout_ref.model.trust_remote_code=True \
      actor_rollout_ref.actor.optim.lr=1e-6 \
      actor_rollout_ref.actor.load_weight=True \
      actor_rollout_ref.actor.ppo_mini_batch_size=128 \
      actor_rollout_ref.actor.ppo_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.actor.megatron.param_offload=False \
      actor_rollout_ref.actor.megatron.grad_offload=False \
      actor_rollout_ref.actor.megatron.optimizer_offload=False \
      actor_rollout_ref.actor.megatron.pipeline_model_parallel_size=1 \
      actor_rollout_ref.actor.megatron.tensor_model_parallel_size=1 \
      actor_rollout_ref.actor.megatron.expert_model_parallel_size=8 \
      actor_rollout_ref.ref.megatron.param_offload=False \
      actor_rollout_ref.ref.megatron.grad_offload=False \
      actor_rollout_ref.ref.megatron.optimizer_offload=False \
      actor_rollout_ref.ref.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.ref.megatron.pipeline_model_parallel_size=1 \
      actor_rollout_ref.ref.megatron.tensor_model_parallel_size=1 \
      actor_rollout_ref.ref.megatron.expert_model_parallel_size=1 \
      actor_rollout_ref.actor.use_kl_loss=True \
      actor_rollout_ref.rollout.n=16 \
      +actor_rollout_ref.rollout.enable_dual_buffer=True \
      +actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=256 \
      +actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
      actor_rollout_ref.rollout.tensor_model_parallel_size=1 \
      actor_rollout_ref.rollout.log_prob_micro_batch_size_per_gpu=1 \
      actor_rollout_ref.rollout.name=sglang \
      actor_rollout_ref.rollout.gpu_memory_utilization=0.7 \
      actor_rollout_ref.rollout.free_cache_engine=False \
      actor_rollout_ref.rollout.max_num_batched_tokens=$((max_prompt_length + max_response_length)) \
      algorithm.use_kl_in_reward=False \
      +trainer.async_pipeline=True \
      +trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
      +trainer.sperated_node_ratios=[0.5,0.25,0.25] \
      +trainer.generate_ahead_steps=1 \
      trainer.val_only=False \
      trainer.critic_warmup=0 \
      trainer.resume_mode=disable \
      trainer.logger=['console','tensorboard'] \
      trainer.project_name="verl_async_rl_redmoe16b" \
      trainer.experiment_name=$EXP_NAME \
      trainer.n_gpus_per_node=8 \
      trainer.nnodes=${WORLD_SIZE} \
      trainer.save_freq=200 \
      trainer.test_freq=64 \
      trainer.total_epochs=100 2>&1 | tee log_async_rl_n16_${RANK}.txt $@

zuijiang and others added 17 commits August 19, 2025 06:21
Add state-machine for async-rl

Add async param-update overlap with logp and generate
…ate-machine and add red-moe model (verl-project#1)

* add xdg ulysses

* add grpo scripts

* Adapt redmoe + mcore (by 光速)

* Bump from guangsu

* [feat] Add async-rl with param-sync and async-pipeline

Add state-machine for async-rl

Add async param-update overlap with logp and generate

* Update README

* Refine code

* rebase to main

* add offload-grad for megatron-worker

* Refine code

* Refine code

* Refine code

---------

Co-authored-by: zuijiang <jiangjiangzuijiang@gmail.com>
Co-authored-by: root <liuyanjiang601@xiaohongshu.com>
Co-authored-by: weishi <bushou@xiaohongshu.com>
Support separated validation and ref_logp
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 55d3e96 to 56a34c1 on September 2, 2025 04:28
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 28eeded to c2f4988 on September 4, 2025 04:18
@ziqi-wlb force-pushed the feat/async-ref-logp branch from 9adcdd9 to 0838101 on September 8, 2025 03:04
@ziqi-wlb force-pushed the feat/async-ref-logp branch 3 times, most recently from 4709fd1 to fda6381 on September 9, 2025 02:21
@ziqi-wlb force-pushed the feat/async-ref-logp branch from fda6381 to 473669d on September 9, 2025 02:28