fix: support verl 0.7.1 EngineWorker in agent_workflow_trainer #474
Conversation
verl 0.7.1 defaults to `use_legacy_worker_impl: disable`, which uses `EngineWorker` instead of `TrainingWorker`. This changes the worker API:

- `compute_log_prob` / `compute_ref_log_prob` / `update_actor` now return `TensorDict` instead of `DataProto`
- Workers operate in no-padding format internally, so outputs must be converted back via `no_padding_2_padding`
- `ppo_loss` requires `global_batch_size`, `temperature`, etc. in the batch TensorDict; `compute_log_prob` needs `compute_loss=False` and `calculate_entropy=True`

Without this fix, training crashes with:

- `KeyError: 'temperature'` / `KeyError: 'global_batch_size'`
- `AttributeError: 'TensorDict' object has no attribute 'batch'`
- `RuntimeError: tensor size mismatch` (no-padding vs padding)

Changes:

- `compute_log_prob`: convert `DataProto` → `TensorDict` → no-padding before the call, set `compute_loss=False` + `calculate_entropy=True`, convert the output back to a padded `DataProto` with `old_log_probs`/`entropys` keys
- `compute_ref_log_prob`: same `TensorDict` handling + `no_padding_2_padding`
- `update_actor`: inject `mini_batch_size`, `epochs`, `seed`, `global_batch_size`, `temperature`, `calculate_entropy` into the batch before the call; handle the `TensorDict` return for metrics extraction
- Validation/distillation `compute_log_prob`: same pattern
- All changes are backward-compatible (`isinstance` checks for `TensorDict` vs `DataProto` returns)
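The backward-compatible dispatch described above can be sketched as follows. This is a minimal illustration, not the PR's actual code: `TensorDictLike` and `DataProtoLike` are hypothetical stand-ins for `tensordict.TensorDict` and verl's `DataProto`, and `extract_log_probs` is an invented helper name.

```python
# Minimal sketch of the isinstance-based backward compatibility described
# in this PR. The classes below are hypothetical stand-ins: the real code
# dispatches on tensordict.TensorDict vs verl's DataProto.

class TensorDictLike(dict):
    """Stand-in for the flat TensorDict returned by the new EngineWorker."""

class DataProtoLike:
    """Stand-in for the legacy DataProto, which keeps tensors under `.batch`."""
    def __init__(self, batch):
        self.batch = batch

def extract_log_probs(output):
    # New EngineWorker path: output is a flat TensorDict-style mapping.
    if isinstance(output, TensorDictLike):
        return output["log_probs"]
    # Legacy TrainingWorker path: tensors live under output.batch.
    return output.batch["old_log_probs"]

# The same call site now handles both worker implementations.
new_style = TensorDictLike(log_probs=[0.1, 0.2])
old_style = DataProtoLike(batch={"old_log_probs": [0.1, 0.2]})
assert extract_log_probs(new_style) == extract_log_probs(old_style)
```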
cc @jeffreysijuntan for review
|
Hi @yifannnwu thx for the (much needed) help on Verl support! Will take a look soon!
|
Thanks for the fix! Actually, we are thinking about deprecating the legacy path altogether as well, so this is definitely a first step we need (while ensuring backward compatibility). I will also port these changes into
Thanks for the quick merge @listar2000! Glad this aligns with your direction on deprecating the legacy path. We appreciate your support and look forward to large-scale, stable RL training.
|
@yifannnwu I'm already working on some [...]. The idea is to [...] (since [...]). Either way, feel free to also send PRs/comments/issues on other Verl-related things; this is one of our recent focus areas for improvement.
Summary
verl 0.7.1 defaults to `use_legacy_worker_impl: disable`, which activates the new `EngineWorker` class instead of the legacy `TrainingWorker`. This changes the worker API in ways that break `agent_workflow_trainer.py`:

- `compute_log_prob`, `compute_ref_log_prob`, and `update_actor` now return `TensorDict` instead of `DataProto`
- Workers operate in no-padding format internally, so outputs require `no_padding_2_padding` conversion
- `ppo_loss` reads `global_batch_size`, `temperature`, etc. from the batch TensorDict; these must be injected before calls
- `compute_log_prob` must set `compute_loss=False` to skip unnecessary loss computation (which requires `global_batch_size`)

Without this fix, any rllm user on verl 0.7.1 (default config) hits a cascade of errors:

- `KeyError: 'temperature'`
- `KeyError: 'global_batch_size'`
- `AttributeError: 'TensorDict' object has no attribute 'batch'`
- `RuntimeError: tensor size mismatch` (no-padding vs padding format)

Changes
All changes are backward-compatible via `isinstance(output, TensorDict)` checks:

- `compute_log_prob` (training path): Convert `DataProto` → `TensorDict` → no-padding before the call. Set `compute_loss=False`, `calculate_entropy=True`. Convert the output back to a padded `DataProto` with `old_log_probs`/`entropys` keys. This mirrors verl's own `_compute_old_log_prob` in `ray_trainer.py`.
- `compute_ref_log_prob`: Same `TensorDict` return handling + `no_padding_2_padding` conversion.
- `update_actor`: Inject `mini_batch_size`, `epochs`, `seed`, `dataloader_kwargs`, `global_batch_size`, `temperature`, `calculate_entropy` into batch metadata before the call. Handle the `TensorDict` return for metrics extraction. This mirrors verl's own `_update_actor` in `ray_trainer.py`.
- Validation/distillation `compute_log_prob`: Same no-padding pattern.

Test plan

- `isinstance` checks ensure the legacy `DataProto` path still works
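As a rough illustration of the padding conversion these changes revolve around, the sketch below shows a padded ↔ no-padding round trip on plain lists. This is a hypothetical simplification: `padding_to_no_padding` and `no_padding_to_padding` are invented helper names standing in for verl's real tensor-based conversions (only `no_padding_2_padding` is named in this PR), and the real workers operate on token tensors, not lists.

```python
# Hypothetical sketch of the padded <-> no-padding round trip that the new
# EngineWorker path requires. Helper names here are illustrative stand-ins
# for verl's tensor-based conversions.

PAD = 0

def padding_to_no_padding(padded_rows, lengths):
    """Drop pad tokens, keeping only each row's real tokens."""
    return [row[:n] for row, n in zip(padded_rows, lengths)]

def no_padding_to_padding(rows, max_len):
    """Re-pad variable-length rows back into a rectangular batch."""
    return [row + [PAD] * (max_len - len(row)) for row in rows]

batch = [[5, 7, PAD, PAD], [1, 2, 3, PAD]]
lengths = [2, 3]

unpadded = padding_to_no_padding(batch, lengths)  # [[5, 7], [1, 2, 3]]
repadded = no_padding_to_padding(unpadded, 4)
assert repadded == batch  # lossless round trip
```

If the re-padding step is skipped, downstream code that expects a rectangular batch sees ragged shapes, which is the `tensor size mismatch` failure mode listed above.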