fix: unified async trainer with verl backend #493
yifannnwu wants to merge 1 commit into rllm-org:main
Conversation
Five bugs that prevent the async training path (`_fit_fully_async`) from
working with the verl backend:
1. `_generation_loop`: `batch[0]` crashes because verl's dataloader yields
dicts (with `extra_info`/`raw_prompt` keys), not indexable lists.
Fix: handle dict and DataProto batch types.
2. `VerlEngine.__init__` doesn't call `super().__init__()`, so the base
class `weight_version=0` attribute is never set. Async trainer reads
`rollout_engine.weight_version` for staleness tracking → AttributeError.
3. `_training_loop` never sets `trainer_state.episodes` before the fwd-bwd
pass. `transform_to_backend_batch` asserts episodes are present → crash.
(The existing assignment at the end of the step is for logging, too late.)
4. `_training_loop` calls `process_backend_batch` → `update_policy` but
skips `compute_advantages`. The sync path has all three steps. Without
advantages, `update_policy` fails with KeyError: 'advantages'.
5. `timing_dict["step"]` is never populated in the async path. verl's
`compute_throughout_metrics` (called by `on_batch_end`) requires it.
The sync path uses `with simple_timer("step", timing_dict)` but async
doesn't go through that code path.
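The batch-type handling in item 1 can be sketched roughly as follows. This is an illustration based on the description above, not the actual patch; the helper name `extract_first_item` and the `.batch` attribute check for `DataProto`-like objects are assumptions.

```python
def extract_first_item(batch):
    """Normalize access to the first sample of a dataloader batch.

    verl's dataloader may yield a dict of columns (e.g. with
    'extra_info'/'raw_prompt' keys) or a DataProto-like object,
    rather than an indexable list of samples.
    """
    if isinstance(batch, dict):
        # Dict of columns: take the first row of each column.
        return {key: value[0] for key, value in batch.items()}
    if hasattr(batch, "batch"):
        # DataProto-like object (assumption): tensors live under `.batch`.
        return {key: value[0] for key, value in batch.batch.items()}
    # Plain list/tuple of samples: the original `batch[0]` behavior.
    return batch[0]
```

The point is simply to branch on the batch's type before indexing, so the async generation loop no longer assumes a list.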
Tested with Qwen3.5-0.8B on verl backend: 3 async GRPO steps pass.
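Taken together, items 3, 4, and 5 amount to reordering the async training step so it mirrors the sync path. A rough sketch, using a stand-in for verl's `simple_timer` context manager and hypothetical trainer method wiring (the real methods are named in the bug list above, but their exact signatures here are assumptions):

```python
import time
from contextlib import contextmanager

@contextmanager
def simple_timer(name, timing_dict):
    # Stand-in for verl's simple_timer: records elapsed seconds under `name`.
    start = time.perf_counter()
    try:
        yield
    finally:
        timing_dict[name] = time.perf_counter() - start

def async_training_step(trainer, trainer_state, episodes, timing_dict):
    # Hypothetical step body illustrating the fixed ordering.
    with simple_timer("step", timing_dict):          # bug 5: populate timing_dict["step"]
        trainer_state.episodes = episodes            # bug 3: set before the fwd-bwd pass
        batch = trainer.transform_to_backend_batch(trainer_state)
        batch = trainer.compute_advantages(batch)    # bug 4: was skipped in async path
        batch = trainer.process_backend_batch(batch)
        trainer.update_policy(batch)
```

With this ordering the async path runs the same three backend steps as `_fit_sync` and leaves `timing_dict["step"]` populated for `compute_throughout_metrics`.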

cc @kylemontgomery1 for review — thanks for the async path from PR #481, which is super helpful! I did some testing with verl backend + Qwen3.5.
thanks for the fix @yifannnwu! wondering what script you used for e2e testing? |
@yifannnwu Thx a lot for the changes! @kylemontgomery1 -- at this point we might need to figure out a good merge queue schedule for these co-existing PRs (if I remember correctly, you also merged the #488 changes already, with some additional changes to the Verl engine).
We have some ongoing effort here: https://github.com/rllm-org/rllm/tree/verl-fully-async, but this is not ready yet. The main change will be to decouple the backend from verl's RayPPOTrainer. I'd also like to drop the backend-specific advantage computation so that advantages can be computed on the trajectory groups in the buffer, but we'd lose gae (I doubt anyone has ever used gae with rllm). Happy to keep this if there is a good reason. |
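For context, buffer-side group advantage computation of the kind described above (the GRPO-style normalization that needs no value function, and hence no GAE) could look roughly like this. This is a generic sketch of the technique, not the implementation on that branch:

```python
import statistics

def group_relative_advantages(group_rewards, eps=1e-6):
    """GRPO-style advantages for one trajectory group: normalize each
    trajectory's reward by the group's mean and std. Because this uses
    only rewards within a group, no learned value function (and thus
    no GAE) is involved."""
    mean = statistics.fmean(group_rewards)
    std = statistics.pstdev(group_rewards)
    return [(r - mean) / (std + eps) for r in group_rewards]
```

Applied per group in the buffer, this would make advantage computation backend-agnostic, which is the trade-off the comment weighs against keeping GAE.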
… extra_info, add compute_advantages + timing_dict[step] in async path
Summary
Fixes 5 bugs that prevent the async training path (`_fit_fully_async` from PR #481) from working with the verl backend. The sync path is unaffected.

Bugs fixed

1. `_generation_loop` batch type mismatch (unified_trainer.py)
`batch[0]` crashes because verl's dataloader yields dicts (with `extra_info`/`raw_prompt` keys), not indexable lists. Fix: handle `dict` and `DataProto` batch types before falling back to `batch[0]`.

2. `VerlEngine.__init__` missing `super().__init__()` (verl_engine.py)
The experimental `VerlEngine` doesn't call `super().__init__()`, so the base class `weight_version = 0` attribute is never set. The async trainer reads `rollout_engine.weight_version` for staleness tracking → `AttributeError`.

3. `trainer_state.episodes` not set before fwd-bwd pass (unified_trainer.py)
`_training_loop` never sets `trainer_state.episodes` before calling `transform_to_backend_batch`, which asserts that episodes are present. The existing assignment after the optimizer step is for visualization/logging, which is too late for the fwd-bwd pass.

4. `compute_advantages` skipped in async path (unified_trainer.py)
`_training_loop` calls `process_backend_batch` → `update_policy` but skips `compute_advantages`. The sync path (`_fit_sync`) has all three steps. Without advantages, `update_policy` fails with `KeyError: 'advantages'`.

5. `timing_dict["step"]` not populated (unified_trainer.py)
verl's `compute_throughout_metrics` (called by `on_batch_end`) requires `timing_raw["step"]`. The sync path uses `with simple_timer("step", timing_dict)` but the async path doesn't go through that code path.

Test plan