
R3 PR: Rollout Routing Replay#1273

Merged
SumanthRH merged 36 commits into main from r3
Mar 14, 2026

Conversation

@erictang000 (Collaborator) commented Mar 4, 2026

Overview

This PR adds support for Rollout Routing Replay (R3) (see the paper).

See #815 for tracking of future tasks to fully support routing replay in all settings.

We add the following flags to enable R3:

```
cfg.generator.inference_engine.enable_return_routed_experts=True
cfg.trainer.policy.megatron_config.moe_enable_routing_replay=True
```

cfg.generator.inference_engine.enable_return_routed_experts=True is a pass-through argument to vLLM, which records the expert router indices (returning a list with dimensions (batch_size, seq_len, num_layers, top_k)).

We then pass this rollout_expert_indices list through to Megatron's native RouterReplay feature (link).

When cfg.trainer.policy.megatron_config.moe_enable_routing_replay is set to True, Megatron initializes a RouterReplay instance on each training worker rank. RouterReplay.set_replay_data(per_layer_data) sets the recorded router decisions, and RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_FORWARD) / RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_BACKWARD) switch replay on for the forward and backward passes respectively.
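As a rough sketch of how a training worker might drive this API (the RouterReplay calls are those named above; the import path, function name, and the reshaping into per_layer_data are illustrative assumptions, not the exact SkyRL implementation):

```python
import torch

# Import path is a guess - see Megatron-LM's router_replay API guide.
from megatron.core.transformer.moe.router_replay import RouterReplay, RouterReplayAction

def run_step_with_replay(model, batch, rollout_expert_indices: torch.Tensor):
    # rollout_expert_indices: (batch_size, seq_len, num_layers, top_k) from vLLM.
    # Split into one (batch_size, seq_len, top_k) tensor per MoE layer.
    num_layers = rollout_expert_indices.shape[2]
    per_layer_data = [rollout_expert_indices[:, :, layer, :] for layer in range(num_layers)]
    RouterReplay.set_replay_data(per_layer_data)

    # Replay the recorded routing decisions in the forward pass...
    RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_FORWARD)
    loss = model(batch)

    # ...and again in the backward pass.
    RouterReplay.set_global_router_replay_action(RouterReplayAction.REPLAY_BACKWARD)
    loss.backward()
```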

Results

GSM8K training on Moonlight-16B-A3B shows that R3 improves training stability - visible both in the logprob diffs and in clip_ratio, grad_norm, and loss, which otherwise explode and collapse training.
[Figures: logprob diff, clip_ratio, grad_norm, and loss curves with and without R3]

Supported Settings

Router Replay is supported for the following settings:

Generator Settings

  • use_conversation_multi_turn=True and use_conversation_multi_turn=False
  • batched=False and batched=True
  • async_engine=True and async_engine=False
  • NOT retokenize_chat_history mode - i.e., when both self.use_conversation_multi_turn and self.custom_chat_template are set
  • NOT self.generator_cfg.step_wise_trajectories - there are open questions about how to support this with step-wise training when we're not strictly appending (what should the routing look like for per-turn observations the inference engine doesn't see? do we need to disable routing overrides for those tokens?)
  • fully_async training - should technically work but is not tested in this PR. Tracked in [skyrl-train] Enable routing replay in SkyRL #815

Inference Engine Settings

Trainer Settings

  • TP, EP, and DP are all supported. CP is in progress in this PR but needs more testing; CP + PP support will be added in a follow-up PR.

Custom Generator support

  • Custom generators using SkyRL's inference engine should just plumb through the routed expert indices returned in the engine output

Tests

Adds test_router_replay.py, which includes:

  • test_logprobs - integration test that runs a training batch through vLLM and through Megatron with and without R3, verifying that logprob diffs are lower with routing replay (see the sketch after this list)
  • test_forward_backward - unit test for forward_backward that verifies that a training step can complete successfully when routing replay indices are passed in
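Schematically, the core check in test_logprobs amounts to the following (a simplified sketch, not the actual test code):

```python
import torch

def assert_replay_tightens_logprobs(vllm_lp: torch.Tensor,
                                    megatron_replay_lp: torch.Tensor,
                                    megatron_noreplay_lp: torch.Tensor):
    # Per-token logprob gap between the trainer and the rollout engine,
    # computed with and without routing replay.
    diff_with_replay = (megatron_replay_lp - vllm_lp).abs().mean()
    diff_without_replay = (megatron_noreplay_lp - vllm_lp).abs().mean()
    # R3 should reduce the train/rollout mismatch.
    assert diff_with_replay < diff_without_replay
```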

Adds test_generator_multi_turn_gsm8k_router_replay to test_skyrl_gym_generator to verify that SkyRLGymGenerator plumbs through the router indices in the expected format.

Rollout Routing Replay

[Figure: Rollout Routing Replay overview diagram]

Relevant resources:
vLLM PR: vllm-project/vllm#28284
Verl PR: verl-project/verl#4101
Mindlab blog: https://macaron.im/mindlab/research/router-replay-r3-why-it-failed-and-how-we-fixed-it
Megatron-LM API guide: https://github.com/NVIDIA/Megatron-LM/blob/main/docs/api-guide/router_replay.md



Co-authored-by: Dev Patel <dev.patel@berkeley.edu>

@SumanthRH self-assigned this Mar 4, 2026
@erictang000 (Collaborator, Author) commented:

Forward pass with router replay showing lower logprob diff!

[Figure: forward-pass logprob diff with router replay]


@erictang000 (Collaborator, Author) commented Mar 7, 2026

Current State:

For small-scale tests, routing replay seems to be working as shown above - tested only with TP=8 serving and TP=4, EP=8 training.
[Figure: small-scale routing replay logprob results]

What's not working:

  • DeepSeek-style models (like moonlight-16b-a3b) - the average logprobs both with and without routing replay are way off from what's expected from the inference engine (~8 average logprob vs. 0.2) - maybe this is specific to Moonlight plus some Megatron config? Not sure. Reproducible by running the test_logprobs tests
  • Running a real training batch hangs for all models - see [Bug]: Generation hangs until RAY_CGRAPH_get_timeout (300s) with Ray compiled DAG executor vllm-project/vllm#36237 for the relevant stack trace. This seems to happen only when router replay is enabled on the vLLM side. We could try upgrading vLLM to 0.17.0 to see if anything has landed that fixes this. Reproducible by running a large enough training batch with router replay in the test_logprobs tests

TODOs:

  • fix the above bugs
  • clean up code
  • add checks for settings where R3 isn't supported for now (async RL, batched generator, use_conversation_multi_turn=False)
  • run a full end-to-end test showing R3 minimizes drift on DAPO (with at least one model family - others just need to pass tests)

@erictang000 (Collaborator, Author) commented:

> DeepSeek-style models (like moonlight-16b-a3b) - the average logprobs both with and without routing replay are way off from what's expected from the inference engine (~8 average logprob vs. 0.2) - maybe this is specific to Moonlight plus some Megatron config? Not sure. Reproducible by running the test_logprobs tests

Solved! The issue was that in our test we were setting cfg.trainer.use_sample_packing=False and also NVTE_FUSED_ATTN=0. This caused flash-attn to be used without sample packing for the Moonlight forward pass, and flash-attn produces incorrect logprobs that differ greatly from vLLM:

```
vLLM logprobs     - mean: -2.564655
Megatron (replay) - mean: -9.231623
Megatron (no rep) - mean: -9.647593
```

After setting cfg.trainer.use_sample_packing=True and NVTE_FUSED_ATTN=1 (to allow setting transformer_config_kwargs["attention_backend"] to "fused" correctly):

```
vLLM logprobs     - mean: -0.223607, std: 0.674102
Megatron (replay) - mean: -0.223626, std: 0.674850
Megatron (no rep) - mean: -0.224379, std: 0.677036
With replay    - logprob diff mean: 0.006648, std: 0.021737
Without replay - logprob diff mean: 0.011115, std: 0.035957
```
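In other words, the working setup is roughly the following (a sketch; only the flag and kwarg names above are real, and the exact override syntax may differ):

```python
import os

# Let Transformer Engine pick its fused attention backend instead of
# flash-attn, which produced incorrect logprobs without sample packing.
os.environ["NVTE_FUSED_ATTN"] = "1"

# Hydra-style overrides on the training command:
overrides = [
    "trainer.use_sample_packing=true",
    "trainer.policy.megatron_config.transformer_config_kwargs.attention_backend=fused",
]
```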


@erictang000 (Collaborator, Author) commented:

> Running a real training batch hangs for all models - see vllm-project/vllm#36237 for the relevant stack trace. This seems to happen only when router replay is enabled on the vLLM side. We could try upgrading vLLM to 0.17.0 to see if anything has landed that fixes this. Reproducible by running a large enough training batch with router replay in the test_logprobs tests

Verified that cherry-picking the changes from #1300 to use the mp backend lets us work around the compiled graph timeout.

[Figure: training run completing with the mp backend]




```python
@pytest.mark.megatron
@pytest.mark.skip(reason="Skipping router replay test for now due to size constraints")
```
@erictang000 (Collaborator, Author) commented on the diff:

Skipping these for now - we need to test out some smaller models. Tracking as a follow-up task in #815; will do ASAP along with PP + CP support.

```python
return prompts


def _ensure_chat_template(tokenizer):
```
@erictang000 (Collaborator, Author) commented on the diff:

This is needed for "allenai/OLMoE-1B-7B-0924", which is used in the skyrl_gym_generator test. I plan to port this model over to the other router replay tests (and maybe the Megatron MoE tests in general), since it's supported in Megatron-Bridge and a 7B-total/1B-active MoE is a nice size for CI.
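A minimal sketch of what such a helper might do (the actual template used in the test may differ):

```python
def _ensure_chat_template(tokenizer):
    # Some base models, e.g. allenai/OLMoE-1B-7B-0924, ship without a chat
    # template; install a trivial one so apply_chat_template() works in tests.
    if getattr(tokenizer, "chat_template", None) is None:
        tokenizer.chat_template = (
            "{% for message in messages %}"
            "{{ message['role'] }}: {{ message['content'] }}\n"
            "{% endfor %}"
        )
    return tokenizer
```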


@erictang000 requested a review from SumanthRH on Mar 13, 2026 at 17:55
@erictang000 (Collaborator, Author) commented Mar 13, 2026

Additional curves for DAPO, for posterity (green is DAPO on Moonlight with TIS; blue is the same but without TIS and with router replay):

[Figures: DAPO training curves]

Time for fwd/train is slightly higher with R3:

[Figures: forward and train timing curves]

@SumanthRH merged commit dcd58ca into main on Mar 14, 2026
5 of 6 checks passed
erictang000 added a commit that referenced this pull request Mar 26, 2026
# Summary
Extending #1273, this PR provides support for pipeline parallelism and
context parallelism for R3. See #815 for tracking of future tasks to
fully support routing replay in all settings.

# Implementation

**Pipeline Parallelism**
For pipeline parallelism, we add a helper function
`_get_current_pp_stage_layer_range(model_config)` which maps the current
PP rank and its layers to the global layer offset across all model
layers, so that we can use this offset to select the corresponding
replay instances from `RouterReplay.global_router_replay_instances`.

First, we get the number of pipeline stages from the PP world size along
with the total number of model layers. For models containing dense
layers / unequal pipeline stages, Megatron supports setting a custom
number of layers for the first and last PP rank. We capture these values
from the model config and check that the remaining layers can be evenly
distributed across the remaining PP ranks. Finally, we return the
transformer-layer range owned by the current PP rank as $(s_p, n_p)$, where:
- $s_p$ is the global starting layer index for rank $p$
- $n_p$ is the number of transformer layers assigned to that stage

For an even partition with $L$ total layers and $P$ pipeline stages:
- next_n_pp_layers = L // P, start_index = next_n_pp_layers * pp_rank
- the offset thus spans (next_n_pp_layers * pp_rank) : (next_n_pp_layers * (pp_rank + 1))

For uneven partitioning, if the first and/or last stages are assigned
custom layer counts, we subtract those from $L$, split the remaining
layers evenly among the remaining stages, and then shift the start index
accordingly. This means we can support cases like the Moonlight-16B
models, which have 27 layers: for PP=2 we can pass
`num_layers_in_first_pipeline_stage` as 13. A condensed sketch of this
logic follows.
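Below is a condensed sketch of the partitioning logic (the standalone signature is hypothetical; the real helper reads these values from the Megatron model config):

```python
def _get_current_pp_stage_layer_range(num_layers, pp_size, pp_rank,
                                      first_stage_layers=None,
                                      last_stage_layers=None):
    # Returns (s_p, n_p): the global starting layer index and layer count
    # owned by pipeline stage pp_rank. first/last_stage_layers mirror
    # Megatron's num_layers_in_first/last_pipeline_stage options.
    middle_layers = num_layers - (first_stage_layers or 0) - (last_stage_layers or 0)
    middle_stages = (pp_size
                     - (first_stage_layers is not None)
                     - (last_stage_layers is not None))
    assert middle_layers % middle_stages == 0, "remaining layers must split evenly"
    per_stage = middle_layers // middle_stages

    if first_stage_layers is not None and pp_rank == 0:
        return 0, first_stage_layers
    if last_stage_layers is not None and pp_rank == pp_size - 1:
        return num_layers - last_stage_layers, last_stage_layers

    # Skip past the (possibly custom-sized) first stage, then step evenly.
    middle_rank = pp_rank - (1 if first_stage_layers is not None else 0)
    start = (first_stage_layers or 0) + per_stage * middle_rank
    return start, per_stage
```

For the Moonlight example above (27 layers, PP=2, `num_layers_in_first_pipeline_stage=13`), rank 0 gets (0, 13) and rank 1 gets (13, 14).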

**Context Parallelism**

When using sample packing, our Megatron worker pre-processes and
post-processes packed sequences. When CP is enabled, each packed
sequence is split into cp_size * 2 chunks, so each GPU effectively gets
2 CP chunks of half the size (see NVIDIA/TransformerEngine#1368). To
account for this extra chunking, the `setup_per_microbatch_replay_forward`
method is updated so that the effective_seq_len is aligned to cp_size * 2
(matching the alignment in preprocess_packed_seqs in megatron_utils.py)
and the per-chunk length becomes seqlen_per_cp // 2. We then index the
front and back halves of these CP chunks from the aligned indices across
the CP ranks and concatenate them. This ensures the router replay
indices see the correct tokens under Megatron's CP chunking; see the
sketch below.
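A sketch of the per-rank index selection (names are illustrative; the front/back pairing follows Transformer Engine's load-balanced causal split):

```python
import torch

def select_cp_chunks(indices: torch.Tensor, cp_size: int, cp_rank: int) -> torch.Tensor:
    # A packed sequence padded to a multiple of cp_size * 2 is split into
    # 2 * cp_size chunks; rank r takes chunk r from the front and chunk
    # (2 * cp_size - 1 - r) from the back, concatenated, matching how
    # tokens are sharded across CP ranks.
    seq_len = indices.shape[0]
    assert seq_len % (cp_size * 2) == 0, "align effective_seq_len to cp_size * 2"
    chunks = indices.reshape(2 * cp_size, seq_len // (2 * cp_size), *indices.shape[1:])
    front = chunks[cp_rank]
    back = chunks[2 * cp_size - 1 - cp_rank]
    return torch.cat([front, back], dim=0)
```

Each CP rank then hands only its own slice of the replay indices to RouterReplay.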

**Testing**

You can test with CP and/or PP configs from the test_router_replay file.

---------

Signed-off-by: SumanthRH <sumanthrh99@gmail.com>
Co-authored-by: Eric Tang <erictang000@gmail.com>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Co-authored-by: devin-ai-integration[bot] <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-authored-by: Sumanth R Hegde <39546518+SumanthRH@users.noreply.github.com>
Co-authored-by: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-authored-by: Charlie Ruan <charlieruan@berkeley.edu>
Co-authored-by: Eric Tang <46737979+erictang000@users.noreply.github.com>
