
fix: support verl 0.7.1 EngineWorker in agent_workflow_trainer (#474)

Merged
listar2000 merged 1 commit into rllm-org:main from yifannnwu:fix/verl-071-new-worker-compat
Apr 2, 2026

Conversation

@yifannnwu
Contributor

Summary

verl 0.7.1 defaults to `use_legacy_worker_impl: disable`, which activates the new `EngineWorker` class instead of the legacy `TrainingWorker`. This changes the worker API in ways that break `agent_workflow_trainer.py`:

  1. `compute_log_prob`, `compute_ref_log_prob`, and `update_actor` now return `TensorDict` instead of `DataProto`
  2. Workers operate in no-padding format internally, so outputs need `no_padding_2_padding` conversion
  3. `ppo_loss` reads `global_batch_size`, `temperature`, etc. from the batch `TensorDict` — these must be injected before calls
  4. `compute_log_prob` must set `compute_loss=False` to skip unnecessary loss computation (which requires `global_batch_size`)
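To make point 3 concrete, here is a minimal sketch of the injection step, with plain dicts standing in for the real `TensorDict` batch and trainer config; the key names follow the list above, but the helper name and signature are hypothetical:

```python
# Hypothetical sketch of the "inject before call" step. In verl the batch is
# a TensorDict and the values come from the trainer config; plain dicts are
# used here as stand-ins.

# Keys that ppo_loss reads from the batch under the new EngineWorker API
# (names taken from the PR description).
REQUIRED_META = (
    "global_batch_size",
    "temperature",
    "calculate_entropy",
)

def inject_batch_metadata(batch, config):
    """Copy the values ppo_loss expects out of the config into the batch."""
    for key in REQUIRED_META:
        if key not in config:
            raise KeyError(key)  # mirrors the KeyError seen without the fix
        batch[key] = config[key]
    return batch
```

Without this step, `ppo_loss` hits exactly the `KeyError: 'temperature'` / `KeyError: 'global_batch_size'` failures listed below.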

Without this fix, any rllm user with verl 0.7.1 (default config) hits a cascade of errors:

  • `KeyError: 'temperature'`
  • `KeyError: 'global_batch_size'`
  • `AttributeError: 'TensorDict' object has no attribute 'batch'`
  • `RuntimeError`: tensor size mismatch (no-padding vs padding format)

Changes

All changes are backward-compatible via `isinstance(output, TensorDict)` checks:

  • `compute_log_prob` (training path): Convert `DataProto` → `TensorDict` → no-padding before the call. Set `compute_loss=False`, `calculate_entropy=True`. Convert the output back to a padded `DataProto` with `old_log_probs`/`entropys` keys. This mirrors verl's own `_compute_old_log_prob` in `ray_trainer.py`.

  • `compute_ref_log_prob`: Same `TensorDict` return handling + `no_padding_2_padding` conversion.

  • `update_actor`: Inject `mini_batch_size`, `epochs`, `seed`, `dataloader_kwargs`, `global_batch_size`, `temperature`, `calculate_entropy` into batch metadata before the call. Handle the `TensorDict` return for metrics extraction. This mirrors verl's own `_update_actor` in `ray_trainer.py`.

  • Validation/distillation `compute_log_prob`: Same no-padding pattern.
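The backward-compatible dispatch described in these bullets can be sketched roughly as follows. `TensorDictStub` and `DataProtoStub` are stand-ins for verl's real `TensorDict` and `DataProto` classes, and this `no_padding_2_padding` merely tags the layout instead of repacking tensors; only the branching pattern reflects the PR:

```python
# Illustrative sketch only: stub classes stand in for verl's TensorDict
# (new EngineWorker return type) and DataProto (legacy return type).

class TensorDictStub(dict):
    """Stand-in for tensordict.TensorDict."""

class DataProtoStub:
    """Stand-in for verl's DataProto, which carries tensors on `.batch`."""
    def __init__(self, batch):
        self.batch = batch

def no_padding_2_padding(td):
    # Stand-in: the real helper converts packed (no-padding) tensors back
    # into a padded layout; here we only tag the dict to mark where it runs.
    out = dict(td)
    out["_layout"] = "padded"
    return out

def handle_log_prob_output(output):
    """Accept either return type, as the PR's isinstance checks do."""
    if isinstance(output, TensorDictStub):
        # New EngineWorker path: convert no-padding output, wrap as DataProto.
        return DataProtoStub(no_padding_2_padding(output))
    return output  # legacy TrainingWorker path already returns a DataProto
```

The key design point is that the new-engine handling is purely additive: a legacy `DataProto` return passes through untouched, so verl 0.7.0 setups keep working.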

Test plan

  • Verified end-to-end GRPO training (3 steps) with Qwen3-Coder-30B-A3B on 8×H200 GPUs
  • WandB metrics logged correctly (pg_loss, entropy, rewards)
  • Backward-compatible: `isinstance` checks ensure the legacy `DataProto` path still works

@yifannnwu
Contributor Author

cc @jeffreysijuntan for review

@listar2000
Collaborator

Hi @yifannnwu, thanks for the (much needed) help on Verl support! Will take a look soon!

@listar2000
Collaborator

Thanks for the fix! Actually, we are thinking about deprecating the legacy path altogether as well, so these are definitely first steps we need (while ensuring backward-compatibility). I will also port these changes into VerlBackend.

@listar2000 listar2000 merged commit 21e198c into rllm-org:main Apr 2, 2026
@yifannnwu
Contributor Author

> Thanks for the fix! Actually, we are thinking about deprecating the legacy path altogether as well, so these are definitely first steps we need (while ensuring backward-compatibility). I will also port these changes into VerlBackend.

Thanks for the quick merge @listar2000! Glad this aligns with your direction on deprecating the legacy path. Thanks for your support; looking forward to large-scale, stable RL training.

@listar2000
Collaborator

listar2000 commented Apr 2, 2026

@yifannnwu I'm already working on some VerlBackend changes under this branch: https://github.com/rllm-org/rllm/tree/fix/verl-new-engine

The idea there is to fully embrace the new engine workers and even disregard backward-compatibility (since VerlBackend is still experimental, we can be a bit more aggressive; this is also the direction Verl is heading). There are a few other issues in our current Verl support (e.g. padding, metainfo) that are also linked to the legacy worker.

Either way, feel free to also send PRs/comments/issues on other Verl-related things -- this is one of our recent focus areas for improvement.
