
[megatron, sglang, rollout, validation, doc] feat: Add Validation StateMachine to async-rl and Support async ref_logp#2

Closed
ziqi-wlb wants to merge 15 commits into rednote-hilab:main from ziqi-wlb:async-rl

Conversation

@ziqi-wlb

What does this PR do?

  1. Add validation StateMachine
  2. Support async ref_logp: remove all offloads (param/grad/optimizer, etc.)
  3. Change doc for async-rl

Performance: compared with the previous async-rl, end-to-end performance improves by a further 20% (170s -> 140s); after tuning the engines' TP, 140s -> 112s.

[screenshot: performance comparison]

Checklist Before Starting

  • Search for similar PRs. Paste at least one query link here: ...
  • Format the PR title as [{modules}] {type}: {description} (This will be checked by the CI)
    • {modules} include fsdp, megatron, sglang, vllm, rollout, trainer, ci, training_utils, recipe, hardware, deployment, ray, worker, single_controller, misc, perf, model, algo, env, tool, ckpt, doc, data
    • If this PR involves multiple modules, separate them with , like [megatron, fsdp, doc]
    • {type} is in feat, fix, refactor, chore, test
    • If this PR breaks any API (CLI arguments, config, function signature, etc.), add [BREAKING] to the beginning of the title.
    • Example: [BREAKING][fsdp, megatron] feat: dynamic batching

API and Usage Example

Demonstrate how the API changes if any, and provide usage example(s) if possible.

# Async RL Configuration
+actor_rollout_ref.async_pipeline=True \
 
# Resource Management
+trainer.sperated_node_ratios=[0.5,0.5] \
# Each task group gets 0.5 of the total nodes:
# train/logp/ref_logp run on 0.5 of the GPUs, generate runs on the other 0.5

# Performance tuning: enable async parameter update (dual buffer)
+actor_rollout_ref.rollout.enable_dual_buffer=True \
# Sender-side bucket granularity on the actor training node during parameter update
+actor_rollout_ref.rollout.param_update_preduce_bucket_size_mb=512 \
# Receiver-side bucket granularity on the rollout inference node; too large a value can cause GPU OOM
+actor_rollout_ref.rollout.param_update_consume_bucket_size_mb=128 \
 
# Off-policy depth: 2 means generate may run up to 2 steps ahead of the train node, i.e. one-step off-policy
+trainer.generate_ahead_steps=2 \
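The `generate_ahead_steps` budget above can be pictured as a small gate between the rollout loop and the train loop: generation is allowed only while it is fewer than `generate_ahead_steps` steps ahead of training. The sketch below is hypothetical (the `AheadStepGate` class is not from this PR); it only models the counting rule, not the actual scheduling.

```python
class AheadStepGate:
    """Allow generation to run at most `ahead_steps` training steps ahead.

    Hypothetical illustration of the generate_ahead_steps semantics; the
    real implementation coordinates distributed workers, not counters.
    """

    def __init__(self, ahead_steps):
        self.ahead_steps = ahead_steps
        self.generated = 0  # rollout steps completed
        self.trained = 0    # training steps completed

    def can_generate(self):
        # Generation may proceed while it is fewer than `ahead_steps`
        # steps ahead of training.
        return self.generated - self.trained < self.ahead_steps

    def on_generate(self):
        assert self.can_generate(), "rollout must wait for the trainer"
        self.generated += 1

    def on_train(self):
        self.trained += 1
```

With `ahead_steps=2`, the rollout can complete two batches before training finishes any step, then blocks until the trainer catches up by one step.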

Task Group Configuration Examples

Example 1: Complete Separation

+trainer.sperated_node_tasks=[logp,ref_logp,actor-train,generate] \
+trainer.sperated_node_ratios=[0.25,0.25,0.25,0.25] \

Explanation: Each task gets 25% of total nodes

  • logp: 25% nodes
  • ref_logp: 25% nodes
  • actor-train: 25% nodes
  • generate: 25% nodes

Example 2: Hybrid Mode (logp + actor-train grouped)

+trainer.sperated_node_tasks=[[logp,actor-train],ref_logp,generate] \
+trainer.sperated_node_ratios=[0.5,0.25,0.25] \

Explanation:

  • First group [logp,actor-train]: 50% nodes (shared)
  • ref_logp: 25% nodes
  • generate: 25% nodes

Example 3: Hybrid Mode (logp + actor-train + ref_logp grouped)

+trainer.sperated_node_tasks=[[logp,actor-train,ref_logp],generate] \
+trainer.sperated_node_ratios=[0.5,0.5] \

Explanation:

  • First group [logp,actor-train,ref_logp]: 50% nodes (shared)
  • generate: 50% nodes
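The ratio-to-GPU mapping in the three examples above can be sketched as a small helper. `allocate_gpus` is a hypothetical illustration, not code from this PR; it assumes the ratios sum to 1 and that every task in a grouped entry shares the same GPUs.

```python
def allocate_gpus(task_groups, ratios, total_gpus):
    """Map (sperated_node_tasks, sperated_node_ratios) to per-task GPU counts.

    Hypothetical sketch: a nested list in `task_groups` is a group of tasks
    that share one slice of the cluster, mirroring the config examples.
    """
    assert len(task_groups) == len(ratios)
    assert abs(sum(ratios) - 1.0) < 1e-6, "ratios must sum to 1"
    allocation = {}
    for group, ratio in zip(task_groups, ratios):
        ngpus = int(total_gpus * ratio)
        # A grouped entry (e.g. [logp, actor-train]) shares the same GPUs.
        tasks = group if isinstance(group, list) else [group]
        for task in tasks:
            allocation[task] = ngpus
    return allocation

# Example 2 (hybrid mode) on a 16-GPU cluster:
# allocate_gpus([["logp", "actor-train"], "ref_logp", "generate"],
#               [0.5, 0.25, 0.25], 16)
# -> {'logp': 8, 'actor-train': 8, 'ref_logp': 4, 'generate': 4}
```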

Design & Code Changes

Demonstrate the high-level design if this PR is complex, and list the specific changes.

Checklist Before Submitting

Important

Please check all the following items before requesting a review, otherwise the reviewer might deprioritize this PR for review.

@ziqi-wlb ziqi-wlb closed this Aug 29, 2025