[sglang,tool] feat: Add support for tools that generate multimodal data #2146
Merged
zhaochenyang20 merged 19 commits into verl-project:main on Jul 4, 2025
Conversation
The head branch was force-pushed from 8d52ec2 to 4212580.
SwordFaith (Collaborator) reviewed on Jun 30, 2025 and left a comment:
We can include a two-step end-to-end training test and, if necessary, a unit test for delta multi-modal inputs to safeguard this code path through CI; a sketch of such a unit test follows below. Support for additional reasoning over images in multi-turn tasks, such as Deep Eyes, can be incorporated in future PRs.
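For illustration, here is a minimal sketch of what such a delta multi-modal input unit test could look like. The `VisualTool` class and the assertion targets are assumptions made for this sketch, not the PR's actual test code:

```python
# Hypothetical sketch of a "delta multi-modal input" unit test.
# VisualTool and the assertion targets are illustrative assumptions,
# not the PR's actual test code.
import asyncio

from PIL import Image


class VisualTool:
    """Toy tool that returns an image alongside text, mimicking the new tool API."""

    async def execute(self, **kwargs):
        img = Image.new("RGB", (64, 64), color="red")
        # (response, reward, extra info), matching the execute() contract
        # described in the PR body below
        return {"image": [img], "text": "generated one image"}, 0.0, {}


def test_tool_returns_multimodal_payload():
    response, reward, info = asyncio.run(VisualTool().execute())
    # The delta for this turn should be a dict carrying exactly one new image.
    assert isinstance(response, dict)
    assert len(response["image"]) == 1
    assert response["image"][0].size == (64, 64)
```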
The head branch was force-pushed from 398f715 to 1b4eec2.
mantle2048 (Collaborator) reviewed on Jul 1, 2025 and left a comment:
Could you please resolve the conflict?
Auto-merge was automatically disabled on July 2, 2025 at 23:28: the head branch was pushed to by a user without write access.
mantle2048 (Collaborator) reviewed on Jul 4, 2025 and left a comment:
My concern is addressed.
zhaochenyang20 approved these changes on Jul 4, 2025.
SuperCB pushed a commit to SuperCB/verl that referenced this pull request on Jul 7, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request on Jul 7, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
lkc233 pushed a commit to lkc233/verl that referenced this pull request on Jul 10, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
oseyosey pushed a commit to oseyosey/verl that referenced this pull request on Jul 28, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request on Aug 7, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
whatadayG pushed a commit to whatadayG/verl that referenced this pull request on Sep 5, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request on Nov 14, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request on Dec 20, 2025:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
oseyosey pushed a commit to oseyosey/verl that referenced this pull request on Jan 20, 2026:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request on Jan 22, 2026:
[sglang,tool] feat: Add support for tools that generate multimodal data (verl-project#2146)
### What does this PR do?

This PR adds support for tools to create and return multimodal data (images and videos) during rollout. It enhances the framework to properly handle multimodal inputs that are dynamically generated by tools during multi-turn conversations.

### Key Features

- Tools can now return images and videos as part of their response
- Added support for processing multimodal inputs in the rollout system
- Introduced a new configuration option `return_multi_modal_inputs` to control how multimodal inputs are processed
- Updated documentation with examples of how to implement tools that generate multimodal data

### API and Usage Example

```python
async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
    # Process images or videos
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # Return multimodal data
    return {"image": [img1, ...], "video": [video1, ...], "text": "..."}, 0, {}
```
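To make the shape of this contract concrete, below is a minimal sketch of a complete tool that draws and returns an image. The class name, the drawing logic, and the standalone structure are illustrative assumptions; a real tool would subclass verl's tool base class and register a schema:

```python
# Illustrative sketch only: the class name and standalone structure are
# assumptions, not verl's actual tool base class.
from typing import Any, Dict, Tuple

from PIL import Image, ImageDraw

from verl.utils.dataset.vision_utils import process_image


class DrawBoxTool:
    """Toy tool that draws a rectangle and returns it as a multimodal response."""

    async def execute(self, x0: int, y0: int, x1: int, y1: int) -> Tuple[Dict[str, Any], float, dict]:
        img = Image.new("RGB", (128, 128), color="white")
        ImageDraw.Draw(img).rectangle([x0, y0, x1, y1], outline="black")
        img = process_image(img)  # normalize into the format the rollout expects
        # (response, reward, extra info), matching the signature above
        return {"image": [img], "text": f"drew a box at ({x0},{y0})-({x1},{y1})"}, 0.0, {}
```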
In your dataset config, set:

```yaml
data:
  return_multi_modal_inputs: False
```
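For orientation, here is a sketch of where this flag sits in a fuller dataset config. The surrounding keys are common verl data options included as assumptions for context, not settings introduced by this PR:

```yaml
# Surrounding keys are illustrative assumptions, not part of this PR.
data:
  train_files: ~/data/geo3k/train.parquet    # hypothetical dataset path
  max_prompt_length: 1024
  max_response_length: 2048
  # Disable precomputed per-sample multi-modal inputs so that images and
  # videos generated by tools mid-rollout are processed dynamically
  # (our reading of the option's intent):
  return_multi_modal_inputs: False
```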
### Specific Changes

- Enhanced `AsyncRolloutRequest` to handle multimodal data from tools
- Updated `add_tool_response_messages` to process multimodal content (see the sketch after this list)
- Added documentation for multimodal tool support in the RST docs
- Fixed configuration in example YAML files
- Added proper handling of multimodal inputs in the rollout system
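As a rough illustration of what processing multimodal content in `add_tool_response_messages` can involve, here is a hedged sketch that flattens a tool's dict response into chat-style content parts. The helper name and the exact message schema are assumptions, not the PR's implementation:

```python
# Hedged sketch: converting a tool's multimodal dict response into
# chat-style content parts. The schema and helper name are assumptions,
# not verl's actual implementation.
from typing import Any, Dict, List, Union


def tool_response_to_content(response: Union[str, Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Flatten {"image": [...], "video": [...], "text": "..."} into content parts."""
    if isinstance(response, str):
        return [{"type": "text", "text": response}]
    content: List[Dict[str, Any]] = []
    for img in response.get("image", []):
        content.append({"type": "image", "image": img})
    for vid in response.get("video", []):
        content.append({"type": "video", "video": vid})
    if response.get("text"):
        content.append({"type": "text", "text": response["text"]})
    return content
```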
### Checklist Before Submitting

- [X] Read the [Contribute Guide](https://github.com/volcengine/verl?tab=readme-ov-file#contribution-guide).
- [X] Apply [pre-commit checks](https://github.com/volcengine/verl?tab=readme-ov-file#code-linting-and-formatting).
- [X] Add `[BREAKING]` to the PR title `description` if it breaks any API.
- [X] Update the documentation about your changes in the [docs](https://github.com/volcengine/verl/tree/main/docs).
- [X] New CI unit test(s) are added to cover the code path.
- [X] Rely on existing unit tests on CI that cover the code path.