Merged
41 commits
18764ba
recipe for deepeyes
Jul 7, 2025
37bd197
add deepeyes train script
Jul 7, 2025
dd3a91b
fix BaseTool return type for multi-modal tools
Jul 7, 2025
557fc2e
refactor preprocess scripts
Jul 7, 2025
d01ab68
update_image_data_format
xieck13 Jul 8, 2025
ce37285
fix fetch_image
xieck13 Jul 8, 2025
09e8f5b
add initail readme
Jul 8, 2025
b9a4716
Merge stashed changes after rebase
Jul 8, 2025
2052b8a
fix merge conflict
Jul 8, 2025
6c0bf5f
fix zoom in tool imae fetch and bbox val
Jul 9, 2025
03bd437
fix pre-commit err
Jul 9, 2025
a35b677
update readme
Jul 9, 2025
aba1e29
add LLM_AS_A_JUDGE_BASE to run script
Jul 9, 2025
bc41584
remove thinklite subset and update readme
Jul 9, 2025
b673fd9
update readme
Jul 9, 2025
ba46c79
add license
Jul 10, 2025
b5a927e
refactor deepeyes recipe
Jul 13, 2025
a449549
refactor deepeyes recipe
Jul 13, 2025
52ba622
fix system messages
xieck13 Jul 21, 2025
3f1c46c
feat: support multi_modal_data for AgentLoop
Jul 23, 2025
181dd6c
fix: position_ids error in ToolAgentLoop
Jul 24, 2025
612beab
fix: answer extract in compute score
Jul 26, 2025
02eec15
fix: AgentLoop init bug after rebase
Jul 27, 2025
16b9979
Merge branch 'main' into recipe/deepeyes
Jul 30, 2025
b121864
fix CI err
Jul 30, 2025
e9eedf3
Merge main branch into recipe/deepeyes
Aug 2, 2025
cd1d58c
fix merge bug
Aug 4, 2025
aa45acc
merge main
Aug 4, 2025
45653fa
fix(tools): update ImageZoomInTool to match BaseTool interface
Aug 5, 2025
88905f5
add performance figures to readme
Aug 6, 2025
2ee03e6
fix: image_data signature for AsyncServerBase
Aug 7, 2025
82bcb8b
fix: pydantic error in CI sgl
Aug 7, 2025
436c21e
fix(test): add ignore_reinit_error=True to prevent Ray double initial…
Aug 7, 2025
35a8298
fix: e2e_ppo_megatron CI err unexpected keyword argument 'image_data'
Aug 7, 2025
3d429f4
test: add multimodal tool test for agent loop
Aug 7, 2025
a450811
update AgentLoopOutput and change to use ToolResponse in schemas.py
Aug 7, 2025
48ea32d
fix: sgl CI
Aug 8, 2025
e7f48a5
extract multi modal agent loop test to a new file
Aug 11, 2025
b3c121d
Merge main branch with multimodal tool agent loop support
Aug 11, 2025
f3791b0
update model path in test_multi_modal.py
Aug 11, 2025
fc3a734
fix multi modal agent loop config
Aug 12, 2025
2 changes: 1 addition & 1 deletion .github/workflows/sgl.yml
@@ -136,7 +136,7 @@ jobs:
pytest -s test_sglang_async_rollout_mcp_tools.py
- name: Test the latest SGLang Rollout async with agent loop
run: |
-ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop/test_basic_agent_loop.py
+ROLLOUT_NAME=sglang pytest -svvv tests/experimental/agent_loop
# Note(haibin.lin): for any new test, please update gpu_unit_tests.yaml to avoid repeated tests
- name: Test the latest SGLang Rollout async with multimodal delta
run: |
55 changes: 55 additions & 0 deletions recipe/deepeyes/README.md
@@ -0,0 +1,55 @@
# DeepEyes: Incentivizing "Thinking with Images" via Reinforcement Learning

This directory contains the implementation for reproducing the DeepEyes paper within the verl framework, supporting multi-turn visual tool calls. This implementation is based on the original [DeepEyes paper](https://arxiv.org/abs/2505.14362) and its [official implementation](https://github.com/Visual-Agent/DeepEyes), integrated with the multi-modal and multi-turn capabilities of the verl framework.

## Reproducing the Experiment

> **Note on the 'Chart' Dataset:**
>
> The provided preprocessing script intentionally excludes `data_v0.8_visual_toolbox_v2.parquet`, which contains the 'Chart' data. This subset consists of very high-resolution images, often resembling large figures composed of multiple sub-plots, much like those found in academic papers.
>
> Consequently, even after using the zoom-in tool, the resulting cropped images remain large. This poses a significant risk of causing Out-of-Memory (OOM) errors, which can abruptly terminate the training process.
>
> **We strongly recommend against training on the 'Chart' dataset on a single node.**

> **Note on the 'thinklite' Dataset:**
> Many images in the `thinklite` dataset have a very low resolution, with either a height or width below 28 pixels. This falls short of the minimum input size required by the Qwen2.5-VL image processor and would cause errors during data loading.
>
> To mitigate this, we upscale these low-resolution images to satisfy the processor's requirements. However, please be aware that because the original resolution is low, subsequent `crop` operations by the zoom-in tool might frequently trigger exceptions, which could in turn affect the model's tool-use performance.
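The upscaling step above can be sketched as follows. This is a minimal illustration of the described mitigation, not the recipe's actual preprocessing code; the helper name and ceiling-based rounding are our own choices:

```python
import math

MIN_SIDE = 28  # minimum height/width accepted by the Qwen2.5-VL image processor

def upscaled_size(width: int, height: int, min_side: int = MIN_SIDE) -> tuple[int, int]:
    """Return a target size whose shorter side reaches `min_side`, preserving
    the aspect ratio. Images that are already large enough are unchanged."""
    shorter = min(width, height)
    if shorter >= min_side:
        return (width, height)
    scale = min_side / shorter
    # Round up so the shorter side never lands below the minimum.
    return (math.ceil(width * scale), math.ceil(height * scale))

# A 20x100 image is scaled to 28x140; a 50x60 image is left untouched.
print(upscaled_size(20, 100))  # (28, 140)
print(upscaled_size(50, 60))   # (50, 60)

# With Pillow, the actual resize would then be e.g.:
#   img = img.resize(upscaled_size(*img.size), Image.BICUBIC)
```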

First, launch an inference service to act as a judge for reward calculation. You can use the following script as a reference:

```bash
python -m sglang.launch_server --model-path /path/to/Qwen2.5-72B-Instruct \
--port 18901 \
--tp-size 8 \
--context-length 32768 \
--trust-remote-code \
--log-requests false
```
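Before starting training, it can be worth sanity-checking the judge endpoint. The sketch below builds an OpenAI-style chat-completions payload for a correctness judgment, assuming sglang's OpenAI-compatible `/v1/chat/completions` route; the function name, prompt wording, and `JUDGE_BASE` constant are illustrative (the recipe's actual judge prompt lives in its reward code):

```python
JUDGE_BASE = "http://127.0.0.1:18901/v1"  # matches --port 18901 in the launch script above

def build_judge_request(question: str, reference: str, answer: str) -> dict:
    """Build a chat-completions payload asking the judge model whether
    `answer` agrees with `reference`. Prompt wording is illustrative only."""
    prompt = (
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {answer}\n"
        "Reply with 1 if the model answer is correct, otherwise reply with 0."
    )
    return {
        "model": "Qwen2.5-72B-Instruct",
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0.0,
    }

# Usage against the running server (requires the `requests` package):
#   import requests
#   r = requests.post(f"{JUDGE_BASE}/chat/completions",
#                     json=build_judge_request("What is 2+2?", "4", "4"), timeout=30)
#   print(r.json()["choices"][0]["message"]["content"])
```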

Next, you can start the training:

```bash
bash recipe/deepeyes/run_deepeyes_grpo.sh
```

## Performance

![score](https://private-user-images.githubusercontent.com/82520804/474784419-b13f4f72-bb3a-4281-a43b-1f34a9037c0c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTQ0NTQxMTMsIm5iZiI6MTc1NDQ1MzgxMywicGF0aCI6Ii84MjUyMDgwNC80NzQ3ODQ0MTktYjEzZjRmNzItYmIzYS00MjgxLWE0M2ItMWYzNGE5MDM3YzBjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA4MDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwODA2VDA0MTY1M1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJjNGMxMjhiOGM4MTNhYTEzYTE2MTYzY2ZjYWRhNmEzMmVjNjUxOGI3MTgzOGQyM2ZmOWJlYTZlNDYzYzU0ZDkmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.qTDX-3fyLHWdeFh9o4b6nIAB57bT0XyLjKXhNV6k5nA)

![entropy](https://private-user-images.githubusercontent.com/82520804/474785253-752106a9-e25d-4b44-aef9-1ac98015d05c.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTQ0NTQxMTMsIm5iZiI6MTc1NDQ1MzgxMywicGF0aCI6Ii84MjUyMDgwNC80NzQ3ODUyNTMtNzUyMTA2YTktZTI1ZC00YjQ0LWFlZjktMWFjOTgwMTVkMDVjLnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA4MDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwODA2VDA0MTY1M1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTM4OGQ2ZGI3M2JlYWE4YTQyMzIxMWYxMzZhNDBmNmYxNzcwNDgxNThiZDRiMzQyYzUwZjc3OWE4YzdhYWEwMWUmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.PhimMTxXXEtMLPGzejPQuw-Ul0As8ey-hyy1qkeABIQ)

![num_turns](https://private-user-images.githubusercontent.com/82520804/474785462-c99c7952-14db-485a-acd2-14e5956ecc34.png?jwt=eyJhbGciOiJIUzI1NiIsInR5cCI6IkpXVCJ9.eyJpc3MiOiJnaXRodWIuY29tIiwiYXVkIjoicmF3LmdpdGh1YnVzZXJjb250ZW50LmNvbSIsImtleSI6ImtleTUiLCJleHAiOjE3NTQ0NTQxMTMsIm5iZiI6MTc1NDQ1MzgxMywicGF0aCI6Ii84MjUyMDgwNC80NzQ3ODU0NjItYzk5Yzc5NTItMTRkYi00ODVhLWFjZDItMTRlNTk1NmVjYzM0LnBuZz9YLUFtei1BbGdvcml0aG09QVdTNC1ITUFDLVNIQTI1NiZYLUFtei1DcmVkZW50aWFsPUFLSUFWQ09EWUxTQTUzUFFLNFpBJTJGMjAyNTA4MDYlMkZ1cy1lYXN0LTElMkZzMyUyRmF3czRfcmVxdWVzdCZYLUFtei1EYXRlPTIwMjUwODA2VDA0MTY1M1omWC1BbXotRXhwaXJlcz0zMDAmWC1BbXotU2lnbmF0dXJlPTJkNWYwMGVjOWM4NDVhZTkzZWI5NWMzMGVjZTcyZGM2NDExY2FmYTBlYWJmZTk5YTU5MzM3NmNkYWI4Y2U4Y2YmWC1BbXotU2lnbmVkSGVhZGVycz1ob3N0In0.Ieakk_ttMsNygVzpZZqGs1507j2GC-rqHSYH9iQQ71Q)

See [Comment](https://github.com/volcengine/verl/pull/2398#issuecomment-3157142856) for more details.

Note: AgentLoop does not record `num_tool_calls` directly; it records `num_turns`. In this scenario, the number of tool calls can be derived as `num_tool_calls = num_turns / 2 - 1`.
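The conversion can be written out directly; a minimal sketch (the helper name is ours):

```python
def num_tool_calls(num_turns: int) -> int:
    """Derive the tool-call count from AgentLoop's recorded num_turns via
    num_tool_calls = num_turns / 2 - 1 (see the note above)."""
    return num_turns // 2 - 1

# A rollout with 4 recorded turns corresponds to 1 tool call,
# and one with 2 turns made no tool calls.
print(num_tool_calls(4))  # 1
print(num_tool_calls(2))  # 0
```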

## References and Acknowledgements

- [DeepEyes Paper](https://arxiv.org/abs/2505.14362)
- [DeepEyes Official Implementation](https://github.com/Visual-Agent/DeepEyes)

---
If you need further details for reproduction or encounter any issues, feel free to open an issue or contact the maintainers.
32 changes: 32 additions & 0 deletions recipe/deepeyes/configs/deepeyes_multiturn_grpo.yaml
@@ -0,0 +1,32 @@
hydra:
searchpath:
- file://verl/trainer/config

defaults:
- ppo_trainer
- _self_

data:
max_prompt_length: 2048
max_response_length: 2048
train_batch_size: 256
return_raw_chat: True
return_multi_modal_inputs: False
custom_cls:
path: "recipe/deepeyes/deepeyes.py"
name: CustomRLHFDataset

actor_rollout_ref:
hybrid_engine: True
model:
custom_chat_template: "{% set image_count = namespace(value=0) %}{% set video_count = namespace(value=0) %}{%- if tools %}{{- '<|im_start|>system\\n' }}{%- if messages[0]['role'] == 'system' %}{%- if messages[0]['content'] is string %}{{- messages[0]['content'] }}{%- else %}{{- messages[0]['content'][0]['text'] }}{%- endif %}{%- else %}{{- 'You are a helpful assistant.' }}{%- endif %}{{- \"\\n\\n# Tools\\n\\nYou may call one or more functions to assist with the user query.\\n\\nYou are provided with function signatures within <tools></tools> XML tags:\\n<tools>\" }}{%- for tool in tools %}{{- \"\\n\" }}{{- tool | tojson }}{%- endfor %}{{- \"\\n</tools>\\n\\nFor each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:\\n<tool_call>\\n{\\\"name\\\": <function-name>, \\\"arguments\\\": <args-json-object>}\\n</tool_call><|im_end|>\\n\" }}{% for message in messages %}{% if message['role'] != 'system' or loop.first == false %}{%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{%- elif message.role == \"assistant\" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\\n' + message.content }}{%- endif 
%}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\\n<tool_call>\\n{\"name\": \"' }}{{- tool_call.name }}{{- '\", \"arguments\": ' }}{{- tool_call.arguments | tojson }}{{- '}\\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\\n' }}{%- elif message.role == \"tool\" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\\n<tool_response>\\n' }}{% if message['content'] is string %}{{ message.content }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif content['type'] == 'text' or 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}{{- '\\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}{{- '<|im_end|>\\n' }}{%- endif %}{%- endif %}{% endif %}{% endfor %}{%- else %}{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}{%- if (message.role == \"user\") or (message.role == \"system\" and not loop.first) or (message.role == \"assistant\" and not message.tool_calls) %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}<|im_end|>\n{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: 
{% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}<|im_end|>\n{% endif %}{%- elif message.role == \"assistant\" %}{{- '<|im_start|>' + message.role }}{%- if message.content %}{{- '\\n' + message.content }}{%- endif %}{%- for tool_call in message.tool_calls %}{%- if tool_call.function is defined %}{%- set tool_call = tool_call.function %}{%- endif %}{{- '\\n<tool_call>\\n{\"name\": \"' }}{{- tool_call.name }}{{- '\", \"arguments\": ' }}{{- tool_call.arguments | tojson }}{{- '}\\n</tool_call>' }}{%- endfor %}{{- '<|im_end|>\\n' }}{%- elif message.role == \"tool\" %}{%- if (loop.index0 == 0) or (messages[loop.index0 - 1].role != \"tool\") %}{{- '<|im_start|>user' }}{%- endif %}{{- '\\n<tool_response>\\n' }}{% if message['content'] is string %}{{ message.content }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}{% set image_count.value = image_count.value + 1 %}{% if add_vision_id %}Picture {{ image_count.value }}: {% endif %}<|vision_start|><|image_pad|><|vision_end|>{% elif content['type'] == 'video' or 'video' in content %}{% set video_count.value = video_count.value + 1 %}{% if add_vision_id %}Video {{ video_count.value }}: {% endif %}<|vision_start|><|video_pad|><|vision_end|>{% elif content['type'] == 'text' or 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}{{- '\\n</tool_response>' }}{%- if loop.last or (messages[loop.index0 + 1].role != \"tool\") %}{{- '<|im_end|>\\n' }}{%- endif %}{%- endif %}{% endfor %}{%- endif %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
rollout:
name: sglang
multi_turn:
enable: True
max_assistant_turns: 5
tool_config_path: "recipe/deepeyes/configs/image_zoom_in_tool_config.yaml"

custom_reward_function:
path: "recipe/deepeyes/deepeyes.py"
name: compute_score
26 changes: 26 additions & 0 deletions recipe/deepeyes/configs/image_zoom_in_tool_config.yaml
@@ -0,0 +1,26 @@
tools:
- class_name: "verl.tools.image_zoom_in_tool.ImageZoomInTool"
config:
num_workers: 256
rate_limit: 256
timeout: 60
type: native
tool_schema:
type: "function"
function:
name: "image_zoom_in_tool"
description: "Zoom in on a specific region of an image by cropping it based on a bounding box (bbox) and an optional object label."
parameters:
type: "object"
properties:
bbox_2d:
type: "array"
items:
type: "number"
minItems: 4
maxItems: 4
description: "The bounding box of the region to zoom in, as [x1, y1, x2, y2], where (x1, y1) is the top-left corner and (x2, y2) is the bottom-right corner."
label:
type: "string"
description: "The name or label of the object in the specified bounding box (optional)."
required: ["bbox_2d"]
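A call conforming to this schema passes a four-number `bbox_2d`. The sketch below shows an illustrative validation-and-clamp step in the spirit of the bbox checks mentioned in the commit history; it is not `ImageZoomInTool`'s actual implementation:

```python
def clamp_bbox(bbox, width, height):
    """Validate an [x1, y1, x2, y2] bbox against the schema above and clamp
    it to the image bounds. Raises ValueError on malformed input."""
    if len(bbox) != 4:
        raise ValueError("bbox_2d must contain exactly 4 numbers")
    x1, y1, x2, y2 = bbox
    x1, x2 = max(0, min(x1, width)), max(0, min(x2, width))
    y1, y2 = max(0, min(y1, height)), max(0, min(y2, height))
    if x1 >= x2 or y1 >= y2:
        raise ValueError("bbox_2d has no area after clamping")
    return [x1, y1, x2, y2]

# Usage: the clamped box could then be handed to PIL's Image.crop, e.g.
#   region = img.crop(tuple(clamp_bbox([10, 10, 50, 50], *img.size)))
print(clamp_bbox([-5, 0, 200, 50], 100, 100))  # [0, 0, 100, 50]
```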