fix: unified async trainer with verl backend #493
yifannnwu wants to merge 1 commit into rllm-org:main
Conversation
Five bugs that prevent the async training path (`_fit_fully_async`) from
working with the verl backend:
1. `_generation_loop`: `batch[0]` crashes because verl's dataloader yields
dicts (with `extra_info`/`raw_prompt` keys), not indexable lists.
Fix: handle dict and DataProto batch types.
2. `VerlEngine.__init__` doesn't call `super().__init__()`, so the base
class `weight_version=0` attribute is never set. Async trainer reads
`rollout_engine.weight_version` for staleness tracking → AttributeError.
3. `_training_loop` never sets `trainer_state.episodes` before the fwd-bwd
pass. `transform_to_backend_batch` asserts episodes are present → crash.
(The existing assignment at the end of the step is for logging, too late.)
4. `_training_loop` calls `process_backend_batch` → `update_policy` but
skips `compute_advantages`. The sync path has all three steps. Without
advantages, `update_policy` fails with KeyError: 'advantages'.
5. `timing_dict["step"]` is never populated in the async path. verl's
`compute_throughout_metrics` (called by `on_batch_end`) requires it.
The sync path uses `with simple_timer("step", timing_dict)` but async
doesn't go through that code path.
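The batch-type handling in item 1 can be sketched roughly as follows. This is an illustration based on the description above, not the actual patch; the helper name `extract_first_item` and the `.batch` attribute check for `DataProto`-like objects are assumptions.

```python
def extract_first_item(batch):
    """Normalize access to the first sample of a dataloader batch.

    verl's dataloader may yield a dict of columns (e.g. with
    'extra_info'/'raw_prompt' keys) or a DataProto-like object,
    rather than an indexable list of samples.
    """
    if isinstance(batch, dict):
        # Dict of columns: take the first row of each column.
        return {key: value[0] for key, value in batch.items()}
    if hasattr(batch, "batch"):
        # DataProto-like object (assumption): tensors live under `.batch`.
        return {key: value[0] for key, value in batch.batch.items()}
    # Plain list/tuple of samples: the original `batch[0]` behavior.
    return batch[0]
```

The point is simply to branch on the batch's type before indexing, so the async generation loop no longer assumes a list.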
Tested with Qwen3.5-0.8B on verl backend: 3 async GRPO steps pass.
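Taken together, items 3, 4, and 5 amount to reordering the async training step so it mirrors the sync path. A rough sketch, using a stand-in for verl's `simple_timer` context manager and hypothetical trainer method wiring (the real methods are named in the bug list above, but their exact signatures here are assumptions):

```python
import time
from contextlib import contextmanager

@contextmanager
def simple_timer(name, timing_dict):
    # Stand-in for verl's simple_timer: records elapsed seconds under `name`.
    start = time.perf_counter()
    try:
        yield
    finally:
        timing_dict[name] = time.perf_counter() - start

def async_training_step(trainer, trainer_state, episodes, timing_dict):
    # Hypothetical step body illustrating the fixed ordering.
    with simple_timer("step", timing_dict):          # bug 5: populate timing_dict["step"]
        trainer_state.episodes = episodes            # bug 3: set before the fwd-bwd pass
        batch = trainer.transform_to_backend_batch(trainer_state)
        batch = trainer.compute_advantages(batch)    # bug 4: was skipped in async path
        batch = trainer.process_backend_batch(batch)
        trainer.update_policy(batch)
```

With this ordering the async path runs the same three backend steps as `_fit_sync` and leaves `timing_dict["step"]` populated for `compute_throughout_metrics`.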

cc @kylemontgomery1 for review — thanks for the async path from PR #481, which is super helpful! I did some testing with verl backend + Qwen3.5.
thanks for the fix @yifannnwu! wondering what script you used for e2e testing? |
@yifannnwu Thx a lot for the changes! @kylemontgomery1 -- at this point we might need to figure out a good merge queue schedule for these co-existing PRs (if I remember correctly, you also merged the #488 changes already, with some additional changes to the Verl engine).
We have some ongoing effort here: https://github.com/rllm-org/rllm/tree/verl-fully-async, but this is not ready yet. The main change will be to decouple the backend from verl's RayPPOTrainer. I'd also like to drop the backend-specific advantage computation so that advantages can be computed on the trajectory groups in the buffer, but we'd lose gae (I doubt anyone has ever used gae with rllm). Happy to keep this if there is a good reason. |
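For context, buffer-side group advantage computation of the kind described above (the GRPO-style normalization that needs no value function, and hence no GAE) could look roughly like this. This is a generic sketch of the technique, not the implementation on that branch:

```python
import statistics

def group_relative_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages for one trajectory group: normalize each
    trajectory's reward by the group's mean and std. Because this uses
    only rewards within a group, no learned value function (and thus
    no GAE) is involved."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Applied per group in the buffer, this would make advantage computation backend-agnostic, which is the trade-off the comment weighs against keeping GAE.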
… extra_info, add compute_advantages + timing_dict[step] in async path
Summary
Fixes 5 bugs that prevent the async training path (`_fit_fully_async` from PR #481) from working with the verl backend. The sync path is unaffected.

Bugs fixed

1. `_generation_loop` batch type mismatch (unified_trainer.py)
`batch[0]` crashes because verl's dataloader yields dicts (with `extra_info`/`raw_prompt` keys), not indexable lists. Fix: handle `dict` and `DataProto` batch types before falling back to `batch[0]`.

2. `VerlEngine.__init__` missing `super().__init__()` (verl_engine.py)
The experimental `VerlEngine` doesn't call `super().__init__()`, so the base class `weight_version = 0` attribute is never set. The async trainer reads `rollout_engine.weight_version` for staleness tracking → `AttributeError`.

3. `trainer_state.episodes` not set before fwd-bwd pass (unified_trainer.py)
`_training_loop` never sets `trainer_state.episodes` before calling `transform_to_backend_batch`, which asserts that episodes are present. The existing assignment after the optimizer step is for visualization/logging, which is too late for the fwd-bwd pass.

4. `compute_advantages` skipped in async path (unified_trainer.py)
`_training_loop` calls `process_backend_batch` → `update_policy` but skips `compute_advantages`. The sync path (`_fit_sync`) has all three steps. Without advantages, `update_policy` fails with `KeyError: 'advantages'`.

5. `timing_dict["step"]` not populated (unified_trainer.py)
verl's `compute_throughout_metrics` (called by `on_batch_end`) requires `timing_raw["step"]`. The sync path uses `with simple_timer("step", timing_dict)` but the async path doesn't go through that code path.

Test plan