refactor: move experimental rollout engines to canonical rllm/engine/rollout/ #488
listar2000 wants to merge 2 commits into main
Conversation
…rollout/

The experimental rollout engines (RolloutEngine, ModelOutput, TinkerEngine, VerlEngine, Completer, TITOCompleter, types) are now the canonical implementations in rllm/engine/rollout/. The experimental versions were strict supersets of the non-experimental ones, with additional support for TITO (token-in-token-out), weight version tracking, and training pipeline integration.

Key changes:
- Move all engine classes and types from rllm/experimental/rollout/ to rllm/engine/rollout/ (the canonical location)
- Add a backward-compat shim in rllm/experimental/rollout/__init__.py that re-exports from the canonical location with a DeprecationWarning
- Update all 24+ consumer imports from rllm.experimental.rollout to rllm.engine.rollout
- Standardize on .is_validation (not .validate) as the attribute name across all rollout engine consumers
- OpenAIEngine and FireworksEngine remain in rllm/engine/rollout/ and inherit the enhanced base RolloutEngine automatically

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
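For reference, a minimal sketch of what such a shim could look like (an illustration of the approach described above, not the actual file contents):

```python
# rllm/experimental/rollout/__init__.py -- sketch of a lazy deprecation shim.
# Forwards attribute access to the canonical package so optional backends
# (e.g. tinker) are only imported when actually requested.
import importlib
import warnings

_CANONICAL = "rllm.engine.rollout"


def __getattr__(name: str):
    warnings.warn(
        f"rllm.experimental.rollout.{name} has moved to {_CANONICAL}.{name}; "
        "please update your imports.",
        DeprecationWarning,
        stacklevel=2,
    )
    return getattr(importlib.import_module(_CANONICAL), name)
```

Making the shim lazy (via PEP 562 module __getattr__) keeps old import paths working without eagerly pulling in backends whose dependencies may not be installed.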
```python
from rllm.tools.tool_base import ToolCall as RllmToolCall

# ------------------------------------------------------------------
# Fixtures
```
Hi @listar2000, it looks like all the tool-related tests were removed. Should we add equivalent ones back with the new TinkerChatTemplateParser?
Yeah @luyuzhe111, you are spot on -- these were removed from test_tinker_engine since the original functions we tested are no longer used. These utility functions are now part of the TinkerChatTemplateParser. I will add these tests back to the corresponding test_tinker_parser.py file.
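As a rough illustration of what such a migrated test could look like (the TinkerChatTemplateParser API isn't shown in this thread, so the import path, method name, and ToolCall constructor below are hypothetical placeholders):

```python
# tests/parser/test_tinker_parser.py -- sketch only. render_tool_call and
# the ToolCall constructor arguments are assumptions, not the actual API.
import pytest


@pytest.fixture
def parser():
    # Hypothetical import path for the parser under test.
    from rllm.parser.tinker_parser import TinkerChatTemplateParser

    return TinkerChatTemplateParser()


def test_tool_call_rendering(parser):
    from rllm.tools.tool_base import ToolCall as RllmToolCall

    call = RllmToolCall(name="search", arguments='{"query": "rollout"}')  # assumed signature
    rendered = parser.render_tool_call(call)  # hypothetical method name
    assert "search" in rendered
```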
```python
max_model_length=2048,
sampling_params={"train": {"temperature": 0.0}, "val": {"temperature": 0.0}},
bypass_render_with_parser=True,
disable_thinking=True,
```
This test will fail since disable_thinking: bool = False has been removed. Why was it removed?
You mean the bypass_render_with_parser flag, or something else? I think the removal of bypass_render_with_parser is the only change here.
Oh I see the point you've made. Yeah I need to go back and check the argument list of the TinkerEngine.
```python
    return tokens


def _prepare_messages(self, messages: list[dict], tools: list[Tool | dict] | None = None) -> list[dict]:
```
It looks like these functions assume the tool format is the rLLM format instead of the OpenAI format. How does this work with agentflow & gateway, where the messages and tool calls are in OpenAI format?
@luyuzhe111 You make a good point (I think this is where this Linear issue was raised) -- after all, rLLM lacks a well-defined messages format standard (it's mostly like the OpenAI messages format, but also has its own ToolCall classes). What I will need to do here is (at least temporarily) detect both the rLLM and OpenAI formats, and then process them differently before handing off to Tinker.
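A temporary normalization step could look roughly like this (a sketch: the OpenAI tool-call dict shape is standard, while the .name / .arguments attributes on rLLM's ToolCall are assumptions):

```python
# Sketch: normalize a tool call to one internal shape before rendering it
# for Tinker. OpenAI-format tool calls are dicts with a nested "function"
# key; rLLM's own ToolCall is assumed to expose .name and .arguments.
from typing import Any


def normalize_tool_call(call: Any) -> dict:
    if isinstance(call, dict) and "function" in call:
        # OpenAI chat-completions format (as emitted by agentflow / gateway).
        fn = call["function"]
        return {"name": fn["name"], "arguments": fn.get("arguments", "{}")}
    # Otherwise assume rLLM's native ToolCall object.
    return {"name": call.name, "arguments": call.arguments}
```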
@listar2000 great work with
@luyuzhe111 Agree. But for now I think two PRs separating the purposes seems reasonable. I will try to get the migration done first -- i.e., making sure there will be no "dual" rollout engine paths. Meanwhile I can work on another PR making sure the
Force-pushed from 9f16421 to aebb38b
@luyuzhe111 This PR currently only contains the 1st part (i.e. the migration of rollout engines outside experimental/).

As for the Verl engine-related changes: the renaming has been manually checked for consistency, and the new TITO methods are not novel additions, but a breakdown of the originally super long get_model_response.

Will also be running some real tests to see how the older trainers work with this new Verl Engine.
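For readers following the thread, the shape of that breakdown is roughly the following (illustrative signatures pieced together from the method names quoted in this conversation, not the actual implementation):

```python
# Illustrative sketch of the TITO (token-in-token-out) split: the old
# monolithic get_model_response becomes a thin wrapper around a token-level
# core plus an assembly step.
class VerlEngineSketch:
    async def get_token_output_from_token_input(self, token_input: list[int], **kwargs):
        # Token-level core: prompt token ids in, output token ids (plus
        # metadata such as finish reason) out.
        return await self.server_manager.generate(prompt_ids=token_input, **kwargs)

    async def get_model_response(self, messages: list[dict], **kwargs):
        # Render messages to tokens, run the token-level core, then
        # reassemble the result into a ModelOutput.
        prompt_ids = self.tokenizer.apply_chat_template(messages, add_generation_prompt=True)
        token_output = await self.get_token_output_from_token_input(token_input=prompt_ids, **kwargs)
        return self.assemble_model_output(token_output)
```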
```python
prompt_length = len(prompt_ids)
if enforce_max_prompt_length and prompt_length > self.max_prompt_length:
    raise TerminationEvent(TerminationReason.MAX_PROMPT_LENGTH_EXCEEDED)
token_output: TokenOutput = await self.get_token_output_from_token_input(token_input=request_prompt_ids, **kwargs)
```
It looks like image_data never made it into **kwargs, so it never reached self.server_manager.generate() either? Might be an old bug in the experimental path.
@kylemontgomery1 cc'ing Kyle who might be more of an expert here. If I remember this correctly, the image_data is manually processed into list[int] and combined with the text prompt.
(See rllm/rllm/engine/rollout/verl_engine.py, lines 87 to 92 in aebb38b.)
I think the experimental Verl engine's handling here should at least match the legacy (non-experimental) one, right?
rllm/engine/rollout/verl_engine.py in main correctly passes image_data to server_manager.generate; the experimental one does not. Merging would break multimodal support for verl in the agent workflow trainer (it may have never worked in the unified trainer...).

@listar2000 I'd suggest holding off on moving the experimental verl rollout engine out of experimental/ for now; the other rollout engines seem fine to move. We have some other changes coming to the verl rollout engine to make it fully async (e.g., we will use a different server manager). Besides, the verl backend for the unified trainer is still in experimental (and rightly so imo).
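A minimal version of the missing plumbing, for concreteness (parameter names follow the snippets quoted above; this is a sketch, not the actual VerlEngine code):

```python
# Sketch: thread image_data through explicitly instead of hoping it rides
# along in **kwargs, so it actually reaches server_manager.generate().
class VerlEngineSketch:
    async def _get_model_response(self, messages: list[dict], image_data=None, **kwargs):
        prompt_ids = self._render_prompt(messages)  # hypothetical helper
        # In the experimental path, image_data was computed here but never
        # forwarded; passing it explicitly fixes the VLM regression.
        return await self.get_token_output_from_token_input(
            token_input=prompt_ids, image_data=image_data, **kwargs
        )

    async def get_token_output_from_token_input(self, token_input, image_data=None, **kwargs):
        return await self.server_manager.generate(
            prompt_ids=token_input, image_data=image_data, **kwargs
        )
```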
@kylemontgomery1 Sounds good, and thanks for the review. In this case I will hold off on this PR until the changes to the (experimental) Verl rollout engine come in (they might not land directly on the current branch in this PR -- but I can cherry-pick those changes).

I also agree that the Verl backend is still in experimental -- but considering that the (non-experimental) Verl workflow trainer isn't much more stable either, it makes sense for us to keep working to make the Verl backend more stable and then migrate it out soon.
Hi @kylemontgomery1 I think it's time to revisit this PR.
Are there any major changes to the (experimental) Verl rollout engine in your incoming fully-async PR? If so I can wait for that before migrating it out of experimental.
Also, I think this PR should be merged before your next PR on the rLLM gateway -- which, if I understand correctly, also needs to touch the rollout engines slightly.
```python
        from .rollout.openai_engine import OpenAIEngine as _OpenAIEngine

        return _OpenAIEngine
    if name == "TinkerEngine":
```
Pre-existing pattern, this PR just extends it. Two things about the __getattr__ blocks:

- The try/except Exception converts ImportError into AttributeError, which hides the real error. If someone writes from rllm.engine.rollout import TinkerEngine and tinker isn't installed, they'll see "module has no attribute 'TinkerEngine'" instead of "No module named 'tinker'". Since these are explicit imports (not speculative hasattr checks), letting the ImportError propagate is more helpful.
- The as _TinkerEngine underscore alias is unnecessary: the name is local to the __getattr__ call and invisible to importers, so the underscore convention buys nothing here.

Could simplify to:

```python
if name == "TinkerEngine":
    from .tinker_engine import TinkerEngine
    return TinkerEngine
```

Same applies to the existing VerlEngine and OpenAIEngine blocks.

Since CC tends to pattern-match a lot, I would recommend putting the right pattern in this PR and fixing the other cases soon too in a separate PR : )
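For contrast, the pattern being flagged looks roughly like this (reconstructed from the description above, not the literal code):

```python
# Anti-pattern: the try/except swaps a useful ImportError ("No module named
# 'tinker'") for a generic AttributeError, and the underscore alias adds noise.
def __getattr__(name):
    if name == "TinkerEngine":
        try:
            from .tinker_engine import TinkerEngine as _TinkerEngine

            return _TinkerEngine
        except Exception:
            pass  # falls through to AttributeError, hiding the real cause
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```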
Fixed. Will do another PR later fixing similar patterns across the codebase.
@listar2000 thanks for the hard work and splitting the PR! Regarding verl-related changes though, as I mentioned above, the VerlEngine TITO refactoring (splitting get_model_response into get_token_output_from_token_input + assemble_model_output) seems to introduce a subtle image_data regression: it's computed in _get_model_response but never forwarded to get_token_output_from_token_input, so server_manager.generate() no longer receives it for VLM prompts.

Would it make sense to keep the VerlEngine TITO refactor as a follow-up PR and just do the mechanical migration here (import path change + validate → is_validation)? That way this PR stays a clean move with no behavioral changes, and we can merge it faster without needing rigorous GPU testing -- it already cleans things up significantly as-is. The TITO refactor can then be reviewed and tested properly on its own.
@luyuzhe111 Sounds good, thanks for such a detailed review! Will work on these later today.
Drop the try/except that converted ImportError to AttributeError (hiding the real missing-dependency error) and remove the unnecessary underscore aliases for OpenAIEngine, TinkerEngine, VerlEngine.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Closing this for now since we might have a newer version of this feature.
Summary
- Move rllm/experimental/rollout/ to rllm/engine/rollout/ (RolloutEngine, VerlEngine, TinkerEngine, types, completer)
- Backward-compat shim in rllm/experimental/rollout/__init__.py with DeprecationWarning
- .validate → .is_validation across rollout engines (4 files, 8 occurrences)

Test plan

- pytest tests/engine/ passes
- pytest tests/parser/ passes
- Importing rllm.experimental.rollout still works (with deprecation warning)

🤖 Generated with Claude Code