
[sglang,tool] feat: Add support for tools that generate multimodal data#2146

Merged
zhaochenyang20 merged 19 commits into verl-project:main from nanjiangwill:feat/support-tool-creates-multimodal on Jul 4, 2025
Conversation

@nanjiangwill
Contributor

@nanjiangwill nanjiangwill commented Jun 22, 2025

What does this PR do?

This PR adds support for tools to create and return multimodal data (images and videos) during rollout. It enhances the framework to properly handle multimodal inputs that are dynamically generated by tools during multi-turn conversations.

Key Features

  • Tools can now return images and videos as part of their response
  • Added support for processing multimodal inputs in the rollout system
  • Introduced a new configuration option `return_multi_modal_inputs` to control how multimodal inputs are processed
  • Updated documentation with examples of how to implement tools that generate multimodal data

API and Usage Example

```python
async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
    # Process images or videos
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # Return multimodal data
    return {"image": [img1, ...], "video": [video1, ...], "text": "..."}, 0, {}
```

In your dataset config, set:

```yaml
data:
  return_multi_modal_inputs: False
```
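For concreteness, a minimal tool following this `execute` contract might look like the sketch below. `DrawChartTool` and its byte placeholder are hypothetical; a real tool would normalize its output with `verl.utils.dataset.vision_utils.process_image` before returning it.

```python
import asyncio
from typing import Any, Dict, Tuple


class DrawChartTool:
    """Hypothetical tool that returns an image alongside its text response."""

    async def execute(self, query: str, **kwargs) -> Tuple[Dict[str, Any], float, dict]:
        # Stand-in for a real rendering step (a real tool would call
        # process_image on a PIL image or raw bytes here).
        chart_image = {"format": "PNG", "bytes": b"\x89PNG\r\n"}
        response = {
            "image": [chart_image],  # a list: one tool call may emit several images
            "text": f"Rendered a chart for: {query}",
        }
        return response, 0.0, {}  # (response, step reward, extra metrics)


# Example invocation outside the rollout loop:
response, reward, metrics = asyncio.run(DrawChartTool().execute("loss curve"))
print(response["text"])
```

The dict return value mirrors the example above: media goes under `"image"`/`"video"` lists, and `"text"` is what the model sees as the tool message.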

Specific Changes

  • Enhanced AsyncRolloutRequest to handle multimodal data from tools
  • Updated add_tool_response_messages to process multimodal content
  • Added documentation for multimodal tool support in the RST docs
  • Fixed configuration in example YAML files
  • Added handling of tool-generated multimodal inputs in the rollout system
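The bookkeeping behind `add_tool_response_messages` can be pictured roughly as follows. This is a simplified sketch, not verl's actual implementation: it splits a dict-valued tool response into chat text and accumulated multimodal data for the next turn.

```python
from typing import Any, Dict, List, Union


def add_tool_response_messages(
    messages: List[Dict[str, Any]],
    multi_modal_data: Dict[str, list],
    tool_response: Union[str, Dict[str, Any]],
) -> None:
    """Append a tool turn, routing any images/videos into multi_modal_data."""
    if isinstance(tool_response, dict):
        # Accumulate media so the processor can consume it on the next turn.
        multi_modal_data.setdefault("image", []).extend(tool_response.get("image", []))
        multi_modal_data.setdefault("video", []).extend(tool_response.get("video", []))
        content = tool_response.get("text", "")
    else:
        content = tool_response  # plain-string responses pass through unchanged
    messages.append({"role": "tool", "content": content})


messages: list = []
mm_data: dict = {}
add_tool_response_messages(messages, mm_data, {"image": ["<img>"], "text": "done"})
add_tool_response_messages(messages, mm_data, "plain text reply")
```

String-only responses keep the pre-existing behavior, so existing text-only tools are unaffected.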

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@nanjiangwill nanjiangwill marked this pull request as draft June 22, 2025 17:41
@CLAassistant

CLAassistant commented Jun 25, 2025

CLA assistant check
All committers have signed the CLA.

@nanjiangwill nanjiangwill force-pushed the feat/support-tool-creates-multimodal branch from 8d52ec2 to 4212580 Compare June 25, 2025 02:52
@nanjiangwill nanjiangwill changed the title from "[tool] feat: add support for tool creates multimodal data" to "[tool] feat: Add support for tools that generate multimodal data" Jun 25, 2025
@zhaochenyang20 zhaochenyang20 marked this pull request as ready for review June 26, 2025 23:52
Collaborator

@SwordFaith SwordFaith left a comment


We can include a two-step end-to-end training test and a delta multi-modal input unit test, if necessary, to safeguard our code through CI. Additional reasoning for image-based multi-turn tasks, such as Deep Eyes, can be incorporated in future PRs.

@nanjiangwill nanjiangwill force-pushed the feat/support-tool-creates-multimodal branch from 398f715 to 1b4eec2 Compare June 30, 2025 20:15
@vermouth1992
Collaborator

Could you please resolve the conflict?

@zhaochenyang20 zhaochenyang20 enabled auto-merge (squash) July 2, 2025 22:00
auto-merge was automatically disabled July 2, 2025 23:28

Head branch was pushed to by a user without write access

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment


/gemini review

@eric-haibin-lin
Collaborator

My concern is addressed.

@zhaochenyang20 zhaochenyang20 merged commit 715724c into verl-project:main Jul 4, 2025
47 of 48 checks passed
@xieck13 xieck13 mentioned this pull request Jul 5, 2025
SuperCB pushed a commit to SuperCB/verl that referenced this pull request Jul 7, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 7, 2025
@nanjiangwill nanjiangwill changed the title from "[tool] feat: Add support for tools that generate multimodal data" to "[sglang,tool] feat: Add support for tools that generate multimodal data" Jul 9, 2025
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026