
[VLM] Optimize async mm data process mechanism #12066

Merged
hnyls2002 merged 7 commits into sgl-project:main from antgroup:async_preprocess_vl
Oct 31, 2025

Conversation

@yuan-luo
Collaborator

@yuan-luo yuan-luo commented Oct 24, 2025

Motivation

The inference time in VLM serving can be divided into several parts; taking a 2MB video as an example, the time breakdown is shown in the table below. Currently the TokenizerManager (a.k.a. TM) handles mm data preprocessing in a coroutine, depending on the backend. There is no timeout guard for a single task, so any stall in a task can block the TokenizerManager event loop and stall the whole request pipeline. Moreover, there is no back-pressure mechanism in TokenizerManager preprocessing, so under high parallel load the TM stalls on new incoming requests; in the worst case, a health_check or metrics call can be delayed for 10s, which triggers alarms in the monitoring system.

This PR introduces a back-pressure mechanism and a timeout mechanism for mm data processing, ensuring that the TM request handler is always responsive. This makes the system more robust and ultimately improves overall throughput and reduces TTFT.
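The guard described above can be sketched with a semaphore for back-pressure and `asyncio.wait_for` for the per-task timeout. All names here (`guarded_process`, `MAX_CONCURRENT`, `TIMEOUT_S`) are illustrative, not the identifiers used in the PR:

```python
import asyncio

# Minimal sketch of the back-pressure + timeout guard described above.
MAX_CONCURRENT = 8    # back-pressure: cap on in-flight preprocessing tasks
TIMEOUT_S = 0.2       # per-task timeout so one stall cannot block the loop

sem = asyncio.Semaphore(MAX_CONCURRENT)

async def guarded_process(coro_fn, *args):
    async with sem:                              # back-pressure
        return await asyncio.wait_for(           # timeout guard
            coro_fn(*args), timeout=TIMEOUT_S
        )

async def main():
    async def fast(x):
        await asyncio.sleep(0.01)
        return x * 2

    async def stalled():
        await asyncio.sleep(60)                  # simulates a hung backend

    results = await asyncio.gather(
        *(guarded_process(fast, i) for i in range(4))
    )
    timed_out = False
    try:
        await guarded_process(stalled)
    except asyncio.TimeoutError:
        timed_out = True                         # the event loop stays free
    return results, timed_out

results, timed_out = asyncio.run(main())
print(results, timed_out)  # [0, 2, 4, 6] True
```

With this shape, a hung preprocessing task is cancelled after `TIMEOUT_S` instead of wedging the event loop, and at most `MAX_CONCURRENT` tasks are in flight at once.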

$ time curl http://127.0.0.1:30000/metrics

# HELP sglang:per_stage_req_latency_seconds The latency of each stage of requests.
# TYPE sglang:per_stage_req_latency_seconds histogram
sglang:per_stage_req_latency_seconds_sum{engine_type="unified",model_name="/home/admin/Qwen2.5-VL-72B-Instruct",pp_rank="0",stage="prefill_waiting",tp_rank="7"} 26.068482145667076
sglang:per_stage_req_latency_seconds_sum{engine_type="unified",model_name="/home/admin/Qwen2.5-VL-72B-Instruct",pp_rank="0",stage="prefill_waiting",tp_rank="3"} 26.218725708313286
......
sglang:e2e_request_latency_seconds_bucket{le="1200.0",model_name="/home/admin/Qwen2.5-VL-72B-Instruct"} 47.0
sglang:e2e_request_latency_seconds_bucket{le="1800.0",model_name="/home/admin/Qwen2.5-VL-72B-Instruct"} 47.0
sglang:e2e_request_latency_seconds_bucket{le="2400.0",model_name="/home/admin/Qwen2.5-VL-72B-Instruct"} 47.0
sglang:e2e_request_latency_seconds_bucket{le="+Inf",model_name="/home/admin/Qwen2.5-VL-72B-Instruct"} 47.0
sglang:e2e_request_latency_seconds_count{model_name="/home/admin/Qwen2.5-VL-72B-Instruct"} 47.0

real    0m10.013s
user    0m0.001s
sys     0m0.004s
Module            Submodule                           time cost (ms)
TokenizerManager  load_time                           8.90
TokenizerManager  preprocess_time                     450.18
TokenizerManager  process_time                        281.28
TokenizerManager  send_pyobj_time (serialize)         307.34
Scheduler         from_dict materialize/deserialize   506.36
Scheduler         waiting in queue for scheduling     2.0 - 440.0
Scheduler         forward                             1018.08 - 1500.0
➜  sglang_dev2 git:(async_preprocess_vl) ✗ python ./test/srt/test_async_mm_data_processor.py
=============================================================================================================== test session starts ===============================================================================================================
platform linux -- Python 3.10.12, pytest-8.4.1, pluggy-1.6.0
rootdir: /sgl-workspace/sglang_dev2/test
configfile: pytest.ini
plugins: asyncio-1.2.0, typeguard-4.4.4, anyio-4.9.0
asyncio: mode=auto, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 17 items

test/srt/test_async_mm_data_processor.py .................                                                                                                                                                                                  [100%]

=============================================================================================================== 17 passed in 0.20s ================================================================================================================
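Among the scenarios the new tests exercise is error propagation through the async wrapper. A hypothetical minimal check (plain asyncio, with illustrative names; not the actual pytest file) would assert that an exception raised inside the wrapped processor surfaces unchanged:

```python
import asyncio

# Hypothetical check that an exception raised inside the wrapped processor
# propagates through the timeout guard unchanged (names are illustrative).
async def guarded(coro_fn, timeout_s=1.0):
    return await asyncio.wait_for(coro_fn(), timeout=timeout_s)

async def failing_processor():
    raise ValueError("bad mm data")

async def main():
    try:
        await guarded(failing_processor)
        return None
    except ValueError as exc:
        return str(exc)

caught = asyncio.run(main())
print(caught)  # bad mm data
```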

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

@gemini-code-assist
Contributor

Summary of Changes

Hello @yuan-luo, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the robustness and responsiveness of the system's multimodal data processing pipeline. By introducing an asynchronous wrapper with built-in back-pressure and timeout mechanisms, it addresses critical issues where multimodal data preprocessing could previously stall the entire request handling, especially under high parallel load. This change ensures that the tokenizer manager remains responsive, preventing system alarms and improving overall throughput and stability.

Highlights

  • Introduced Asynchronous Multimodal Data Processor: A new AsyncMMDataProcessor class has been added to wrap existing multimodal processors, enabling them to handle both asynchronous and synchronous data processing methods efficiently. It automatically detects if the underlying processor has an async def process_mm_data_async method; otherwise, it falls back to running a synchronous process_mm_data method in a thread pool.
  • Implemented Back-Pressure and Timeout Mechanisms: The AsyncMMDataProcessor now supports optional concurrency limits via asyncio.Semaphore and per-call timeouts using asyncio.wait_for. This prevents the TokenizerManager from stalling under heavy load and ensures responsiveness.
  • Integrated into TokenizerManager: The TokenizerManager has been updated to utilize the new AsyncMMDataProcessor for all multimodal data preprocessing, configuring it with max_concurrent_calls and timeout_s parameters from server_args to control resource usage and prevent bottlenecks.
  • Enhanced Performance Logging: Performance logging has been added to the QwenVLProcessor to track the time spent in different stages of multimodal data processing (load, preprocess, process), providing better insights into potential performance bottlenecks.
  • Comprehensive Unit Tests: Extensive unit tests have been added for the AsyncMMDataProcessor to verify its functionality across various scenarios, including async/sync paths, concurrency limits, timeout handling, error propagation, and argument passthrough.
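The async-detection and thread-pool fallback described in the first highlight can be illustrated as follows. The class and method names (`AsyncMMDataProcessor`, `process_mm_data_async`, `process_mm_data`, `max_concurrent_calls`, `timeout_s`) follow the PR description, but the body is a reconstruction, not the merged implementation:

```python
import asyncio
import inspect
from concurrent.futures import ThreadPoolExecutor

# Illustrative reconstruction: prefer a native coroutine method if the wrapped
# processor exposes one, otherwise run the blocking method in a thread pool.
class AsyncMMDataProcessor:
    def __init__(self, processor, max_concurrent_calls=8, timeout_s=30.0):
        self._processor = processor
        self._sem = asyncio.Semaphore(max_concurrent_calls)   # back-pressure
        self._timeout_s = timeout_s                           # per-call timeout
        self._pool = ThreadPoolExecutor(max_workers=max_concurrent_calls)
        self._is_async = inspect.iscoroutinefunction(
            getattr(processor, "process_mm_data_async", None)
        )

    async def process(self, *args, **kwargs):
        async with self._sem:
            if self._is_async:
                coro = self._processor.process_mm_data_async(*args, **kwargs)
            else:
                # Sync fallback: hand the blocking call to a worker thread so
                # the event loop stays responsive.
                loop = asyncio.get_running_loop()
                coro = loop.run_in_executor(
                    self._pool,
                    lambda: self._processor.process_mm_data(*args, **kwargs),
                )
            return await asyncio.wait_for(coro, timeout=self._timeout_s)

# Two toy processors to exercise both paths.
class SyncProcessor:
    def process_mm_data(self, x):
        return x + 1

class NativeAsyncProcessor:
    async def process_mm_data_async(self, x):
        await asyncio.sleep(0.01)
        return x * 2

async def main():
    sync_out = await AsyncMMDataProcessor(SyncProcessor()).process(41)
    async_out = await AsyncMMDataProcessor(NativeAsyncProcessor()).process(10)
    return sync_out, async_out

sync_out, async_out = asyncio.run(main())
print(sync_out, async_out)  # 42 20
```

The key design point is that detection happens once at construction time, so the per-call path only pays for the semaphore and `wait_for` wrapping.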

@JustinTong0323 JustinTong0323 self-assigned this Oct 24, 2025
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces an asynchronous wrapper for multimodal data processing, AsyncMMDataProcessor, to add concurrency control and timeouts. This is a valuable addition for improving system robustness and performance under load. The implementation is solid and is accompanied by a comprehensive test suite. I've identified one critical performance issue regarding the repeated instantiation of the processor, which should be addressed. I also have a couple of medium-severity suggestions for improving logging practices and test correctness. Overall, this is a great enhancement.

@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch 2 times, most recently from 7b80546 to 53131c6 Compare October 27, 2025 06:40
@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch from 8bc8d91 to 505f3c8 Compare October 28, 2025 06:17
@yuan-luo yuan-luo changed the title Support async mm data process mechanism [WIP] Support async mm data process mechanism Oct 29, 2025
@yuan-luo yuan-luo changed the title [WIP] Support async mm data process mechanism Support async mm data process mechanism Oct 29, 2025
@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch from 4656b86 to 52e830f Compare October 29, 2025 04:07
@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch 2 times, most recently from 194ba53 to d655354 Compare October 29, 2025 07:16
@JustinTong0323
Collaborator

Also fix lint

@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch from d655354 to 11fd531 Compare October 29, 2025 07:44
@yuan-luo yuan-luo force-pushed the async_preprocess_vl branch from 11fd531 to 14878a7 Compare October 29, 2025 07:50
@yuan-luo
Collaborator Author

Also fix lint

Done.

@yuan-luo yuan-luo changed the title Support async mm data process mechanism [VLM] Optimize async mm data process mechanism Oct 30, 2025
@yuan-luo yuan-luo added the vlm, performance, and Multi-modal (multi-modal language model) labels Oct 30, 2025
@hnyls2002 hnyls2002 merged commit c30ebb9 into sgl-project:main Oct 31, 2025
129 of 286 checks passed
@yuan-luo yuan-luo deleted the async_preprocess_vl branch November 2, 2025 12:11