
[sglang,tool] feat: Add support for tools that generate multimodal data#2146

Merged
zhaochenyang20 merged 19 commits into verl-project:main from nanjiangwill:feat/support-tool-creates-multimodal on Jul 4, 2025
Conversation

@nanjiangwill
Contributor

@nanjiangwill nanjiangwill commented Jun 22, 2025

What does this PR do?

This PR adds support for tools to create and return multimodal data (images and videos) during rollout. It enhances the framework to properly handle multimodal inputs that are dynamically generated by tools during multi-turn conversations.

Key Features

  • Tools can now return images and videos as part of their response
  • Added support for processing multimodal inputs in the rollout system
  • Introduced a new configuration option `return_multi_modal_inputs` to control how multimodal inputs are processed
  • Updated documentation with examples of how to implement tools that generate multimodal data

API and Usage Example

```python
async def execute(self, ...) -> Tuple[str | Dict[str, Any], float, dict]:
    # Process images or videos
    from verl.utils.dataset.vision_utils import process_image, process_video

    img1 = process_image(img1)
    video1 = process_video(video1)

    # Return multimodal data
    return {"image": [img1, ...], "video": [video1, ...], "text": "..."}, 0, {}
```

In your dataset config, set:

```yaml
data:
  return_multi_modal_inputs: False
```
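For concreteness, a minimal tool following this `execute` contract might look like the sketch below. `DrawChartTool` and its byte placeholder are hypothetical; a real tool would normalize its output with `verl.utils.dataset.vision_utils.process_image` before returning it.

```python
import asyncio
from typing import Any, Dict, Tuple


class DrawChartTool:
    """Hypothetical tool that returns an image alongside its text response."""

    async def execute(self, query: str, **kwargs) -> Tuple[Dict[str, Any], float, dict]:
        # Stand-in for a real rendering step (a real tool would call
        # process_image on a PIL image or raw bytes here).
        chart_image = {"format": "PNG", "bytes": b"\x89PNG\r\n"}
        response = {
            "image": [chart_image],  # a list: one tool call may emit several images
            "text": f"Rendered a chart for: {query}",
        }
        return response, 0.0, {}  # (response, step reward, extra metrics)


# Example invocation outside the rollout loop:
response, reward, metrics = asyncio.run(DrawChartTool().execute("loss curve"))
print(response["text"])
```

The dict return value mirrors the example above: media goes under `"image"`/`"video"` lists, and `"text"` is what the model sees as the tool message.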

Specific Changes

  • Enhanced AsyncRolloutRequest to handle multimodal data from tools
  • Updated add_tool_response_messages to process multimodal content
  • Added documentation for multimodal tool support in the RST docs
  • Fixed configuration in example YAML files
  • Added handling of tool-generated multimodal inputs in the rollout system
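The bookkeeping behind `add_tool_response_messages` can be pictured roughly as follows. This is a simplified sketch, not verl's actual implementation: it splits a dict-valued tool response into chat text and accumulated multimodal data for the next turn.

```python
from typing import Any, Dict, List, Union


def add_tool_response_messages(
    messages: List[Dict[str, Any]],
    multi_modal_data: Dict[str, list],
    tool_response: Union[str, Dict[str, Any]],
) -> None:
    """Append a tool turn, routing any images/videos into multi_modal_data."""
    if isinstance(tool_response, dict):
        # Accumulate media so the processor can consume it on the next turn.
        multi_modal_data.setdefault("image", []).extend(tool_response.get("image", []))
        multi_modal_data.setdefault("video", []).extend(tool_response.get("video", []))
        content = tool_response.get("text", "")
    else:
        content = tool_response  # plain-string responses pass through unchanged
    messages.append({"role": "tool", "content": content})


messages: list = []
mm_data: dict = {}
add_tool_response_messages(messages, mm_data, {"image": ["<img>"], "text": "done"})
add_tool_response_messages(messages, mm_data, "plain text reply")
```

String-only responses keep the pre-existing behavior, so existing text-only tools are unaffected.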

Checklist Before Submitting

  • Read the Contribute Guide.
  • Apply pre-commit checks.
  • Add [BREAKING] to the PR title description if it breaks any API.
  • Update the documentation about your changes in the docs.
  • New CI unit test(s) are added to cover the code path.
  • Rely on existing unit tests on CI that covers the code path.

@nanjiangwill nanjiangwill marked this pull request as draft June 22, 2025 17:41
@CLAassistant

CLAassistant commented Jun 25, 2025

CLA assistant check
All committers have signed the CLA.

@nanjiangwill nanjiangwill force-pushed the feat/support-tool-creates-multimodal branch from 8d52ec2 to 4212580 Compare June 25, 2025 02:52
@nanjiangwill nanjiangwill changed the title from "[tool] feat: add support for tool creates multimodal data" to "[tool] feat: Add support for tools that generate multimodal data" Jun 25, 2025
@zhaochenyang20 zhaochenyang20 marked this pull request as ready for review June 26, 2025 23:52
Collaborator

@SwordFaith SwordFaith left a comment


We can include a two-step end-to-end training test and a delta multi-modal input unit test, if necessary, to safeguard our code through CI. Additional reasoning for image-based multi-turn tasks, such as Deep Eyes, can be incorporated in future PRs.

@nanjiangwill nanjiangwill force-pushed the feat/support-tool-creates-multimodal branch from 398f715 to 1b4eec2 Compare June 30, 2025 20:15
@vermouth1992
Collaborator

Could you please resolve the conflict?

@zhaochenyang20 zhaochenyang20 enabled auto-merge (squash) July 2, 2025 22:00
auto-merge was automatically disabled July 2, 2025 23:28

Head branch was pushed to by a user without write access

Collaborator

@eric-haibin-lin eric-haibin-lin left a comment


/gemini review

@eric-haibin-lin
Collaborator

My concern is addressed.

@zhaochenyang20 zhaochenyang20 merged commit 715724c into verl-project:main Jul 4, 2025
47 of 48 checks passed
@xieck13 xieck13 mentioned this pull request Jul 5, 2025
SuperCB pushed a commit to SuperCB/verl that referenced this pull request Jul 7, 2025
yellowbee686 pushed a commit to yellowbee686/verl that referenced this pull request Jul 7, 2025
@nanjiangwill nanjiangwill changed the title from "[tool] feat: Add support for tools that generate multimodal data" to "[sglang,tool] feat: Add support for tools that generate multimodal data" Jul 9, 2025
lkc233 pushed a commit to lkc233/verl that referenced this pull request Jul 10, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jul 28, 2025
Juniper1021 pushed a commit to Juniper1021/verl that referenced this pull request Aug 7, 2025
whatadayG pushed a commit to whatadayG/verl that referenced this pull request Sep 5, 2025
chenjiaoAngel added a commit to chenjiaoAngel/verl that referenced this pull request Nov 14, 2025
TimurTaepov pushed a commit to giorgossideris/verl that referenced this pull request Dec 20, 2025
oseyosey pushed a commit to oseyosey/verl that referenced this pull request Jan 20, 2026
vyomakesh0728 added a commit to vyomakesh0728/verl that referenced this pull request Jan 22, 2026