[VLM] Optimize async mm data process mechanism #12066
hnyls2002 merged 7 commits into sgl-project:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness and responsiveness of the multimodal data processing pipeline. By introducing an asynchronous wrapper with built-in back-pressure and timeout mechanisms, it addresses issues where multimodal data preprocessing could previously stall the entire request-handling path, especially under high parallel load. The change keeps the tokenizer manager responsive, preventing monitoring alarms and improving overall throughput and stability.
Code Review
This pull request introduces an asynchronous wrapper for multimodal data processing, AsyncMMDataProcessor, to add concurrency control and timeouts. This is a valuable addition for improving system robustness and performance under load. The implementation is solid and is accompanied by a comprehensive test suite. I've identified one critical performance issue regarding the repeated instantiation of the processor, which should be addressed. I also have a couple of medium-severity suggestions for improving logging practices and test correctness. Overall, this is a great enhancement.
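The instantiation issue the review flags (repeatedly constructing the processor per request) is commonly fixed by creating the wrapper once and reusing it. A minimal sketch, with hypothetical names not taken from the PR:

```python
class TokenizerManagerSketch:
    """Hypothetical sketch: cache the mm processor instead of
    rebuilding it on every request (illustrative names only)."""

    def __init__(self, processor_factory):
        self._processor_factory = processor_factory
        self._mm_processor = None  # created lazily, exactly once

    def get_mm_processor(self):
        if self._mm_processor is None:
            # Constructing the processor can be expensive (tokenizer/model
            # setup), so pay that cost once rather than per request.
            self._mm_processor = self._processor_factory()
        return self._mm_processor
```

The same object then serves all requests, so the concurrency semaphore inside it actually limits total in-flight work instead of being reset each time.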
Also fix lint.
Done.
Motivation
The inference time in a VLM can be divided into several parts; taking a 2 MB video as an example, the time breakdown is shown in the chart below. Currently the TokenizerManager (a.k.a. TM) handles mm data preprocessing in a coroutine, depending on the backend. There is no timeout guard for a single task, so any stall in a task can block the TokenizerManager event loop and, in turn, the whole request pipeline. Moreover, there is no back-pressure mechanism in the TokenizerManager preprocessing, so under high parallel load the TM stalls on new incoming requests; in the worst case, the health_check or metrics commands can be stalled for 10s, which triggers alarms in the monitoring system.
This PR introduces back-pressure and timeout mechanisms for mm data processing, ensuring the TM request handler is always responsive, making the system more robust, and ultimately improving overall throughput and reducing TTFT.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist