
fix: unified async trainer with verl backend #493

Open

yifannnwu wants to merge 1 commit into rllm-org:main from yifannnwu:fix/async-verl-backend

Conversation

@yifannnwu
Contributor

@yifannnwu yifannnwu commented Apr 6, 2026

Summary

Fixes 5 bugs that prevent the async training path (_fit_fully_async from PR #481) from working with the verl backend. The sync path is unaffected.

Bugs fixed

1. _generation_loop batch type mismatch (unified_trainer.py)

batch[0] crashes because verl's dataloader yields dicts (with extra_info/raw_prompt keys), not indexable lists. Fix: handle dict and DataProto batch types before falling back to batch[0].
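A minimal sketch of the batch-type dispatch described above. The `extra_info`/`raw_prompt` keys come from the PR text; the helper below is a hypothetical illustration, not the actual unified_trainer.py code (the real fix also handles verl's DataProto type, which is omitted here to stay self-contained).

```python
def first_item(batch):
    """Return the first sample of a dataloader batch.

    verl's dataloader yields dicts mapping column names (e.g. 'extra_info',
    'raw_prompt') to per-sample lists, so a bare `batch[0]` fails there.
    Handle the dict case before falling back to plain indexing.
    """
    if isinstance(batch, dict):
        # Take the first sample from each column of the dict batch.
        return {key: values[0] for key, values in batch.items()}
    # Plain list/tuple batches remain indexable as before.
    return batch[0]
```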

2. VerlEngine.__init__ missing super().__init__() (verl_engine.py)

The experimental VerlEngine doesn't call super().__init__(), so the base class weight_version = 0 attribute is never set. The async trainer reads rollout_engine.weight_version for staleness tracking → AttributeError.
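The missing-`super().__init__()` failure mode can be illustrated with stand-in classes. The base class here is a hypothetical stand-in for rllm's engine base, not the real code; only the `weight_version` attribute follows the PR text.

```python
class BaseEngine:
    def __init__(self):
        self.weight_version = 0  # read by the async trainer for staleness tracking

class BrokenVerlEngine(BaseEngine):
    def __init__(self, config=None):
        # Bug: no super().__init__(), so weight_version is never set and the
        # async trainer's `rollout_engine.weight_version` read raises
        # AttributeError.
        self.config = config

class FixedVerlEngine(BaseEngine):
    def __init__(self, config=None):
        super().__init__()  # the one-line fix: initialize base-class attributes
        self.config = config
```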

3. trainer_state.episodes not set before fwd-bwd pass (unified_trainer.py)

_training_loop never sets trainer_state.episodes before calling transform_to_backend_batch, which asserts that episodes are present. The existing assignment after the optimizer step is for visualization/logging — too late for the fwd-bwd pass.
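A sketch of the ordering fix: assign `trainer_state.episodes` before the backend transform rather than only after the optimizer step. The names follow the PR text, but `TrainerState` and both functions below are simplified stand-ins, not the real rllm implementations.

```python
from dataclasses import dataclass, field

@dataclass
class TrainerState:
    episodes: list = field(default_factory=list)

def transform_to_backend_batch(state: TrainerState):
    # The real function asserts that episodes are present before building
    # the backend batch for the fwd-bwd pass.
    assert state.episodes, "trainer_state.episodes must be set first"
    return {"episodes": state.episodes}

def training_step(state: TrainerState, new_episodes: list):
    # Fix: set episodes *before* the fwd-bwd transform; the assignment
    # after the optimizer step only serves logging and comes too late.
    state.episodes = new_episodes
    return transform_to_backend_batch(state)
```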

4. compute_advantages skipped in async path (unified_trainer.py)

_training_loop calls process_backend_batch → update_policy but skips compute_advantages. The sync path (_fit_sync) runs all three steps. Without advantages, update_policy fails with KeyError: 'advantages'.
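A toy pipeline showing the three-step ordering the fix restores in the async path. The function bodies are placeholders, not the real rllm/verl implementations; only the step names and the 'advantages' key follow the PR text.

```python
def process_backend_batch(batch):
    batch["processed"] = True
    return batch

def compute_advantages(batch):
    # Without this step the batch lacks the 'advantages' key and
    # update_policy raises KeyError: 'advantages'.
    batch["advantages"] = [r - 0.5 for r in batch["rewards"]]
    return batch

def update_policy(batch):
    return sum(batch["advantages"])  # stand-in for the real policy update

def async_training_step(batch):
    batch = process_backend_batch(batch)
    batch = compute_advantages(batch)  # the step the async path skipped
    return update_policy(batch)
```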

5. timing_dict["step"] not populated (unified_trainer.py)

verl's compute_throughout_metrics (called by on_batch_end) requires timing_raw["step"]. The sync path uses with simple_timer("step", timing_dict) but the async path doesn't go through that code path.
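The fix can be sketched by wrapping the async step in the same timer the sync path uses, so that `timing_dict["step"]` exists when `compute_throughout_metrics` reads `timing_raw["step"]`. `simple_timer` below is a hypothetical re-implementation of that helper, shown only to make the sketch self-contained.

```python
import time
from contextlib import contextmanager

@contextmanager
def simple_timer(name, timing_dict):
    # Record the elapsed wall-clock time under `name`, even on exceptions.
    start = time.perf_counter()
    try:
        yield
    finally:
        timing_dict[name] = time.perf_counter() - start

timing_dict = {}
with simple_timer("step", timing_dict):
    pass  # the async fwd-bwd + optimizer step would run here
```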

Test plan

  • Qwen3.5-0.8B, verl backend, async GRPO: 3 training steps completed successfully

@yifannnwu
Contributor Author

yifannnwu commented Apr 6, 2026

cc @kylemontgomery1 for review — thanks for the async path from PR #481, which is super helpful! I did some testing with the verl backend + Qwen3.5.

@luyuzhe111
Collaborator

thanks for the fix @yifannnwu! wondering what script you used for e2e testing?

@listar2000
Collaborator

listar2000 commented Apr 6, 2026

@yifannnwu Thx a lot for the changes!

@kylemontgomery1 -- at this point we might need to figure out a good merge queue schedule for these co-existing PRs (if I remember correctly, you also merged the #488 changes already, with some additional changes to the Verl engine).

@kylemontgomery1
Collaborator

We have some ongoing effort here: https://github.com/rllm-org/rllm/tree/verl-fully-async, but this is not ready yet.

The main change will be to decouple the backend from verl's RayPPOTrainer.

I'd also like to drop the backend-specific advantage computation so that advantages can be computed over the trajectory groups in the buffer, but we'd lose GAE (I doubt anyone has ever used GAE with rllm). Happy to keep it if there is a good reason.

rucnyz added a commit to rucnyz/rllm that referenced this pull request May 2, 2026
… extra_info, add compute_advantages + timing_dict[step] in async path
