[VLM] Optimize async mm data process mechanism #12066
hnyls2002 merged 7 commits into sgl-project:main from
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the robustness and responsiveness of the multimodal data processing pipeline. By introducing an asynchronous wrapper with built-in back-pressure and timeout mechanisms, it addresses issues where multimodal data preprocessing could previously stall the entire request-handling path, especially under high parallel load. The change keeps the tokenizer manager responsive, preventing monitoring alarms and improving overall throughput and stability.
Code Review
This pull request introduces an asynchronous wrapper for multimodal data processing, AsyncMMDataProcessor, to add concurrency control and timeouts. This is a valuable addition for improving system robustness and performance under load. The implementation is solid and is accompanied by a comprehensive test suite. I've identified one critical performance issue regarding the repeated instantiation of the processor, which should be addressed. I also have a couple of medium-severity suggestions for improving logging practices and test correctness. Overall, this is a great enhancement.
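The instantiation issue the review flags (repeatedly constructing the processor per request) is commonly fixed by creating the wrapper once and reusing it. A minimal sketch, with hypothetical names not taken from the PR:

```python
class TokenizerManagerSketch:
    """Hypothetical sketch: cache the mm processor instead of
    rebuilding it on every request (illustrative names only)."""

    def __init__(self, processor_factory):
        self._processor_factory = processor_factory
        self._mm_processor = None  # created lazily, exactly once

    def get_mm_processor(self):
        if self._mm_processor is None:
            # Constructing the processor can be expensive (tokenizer/model
            # setup), so pay that cost once rather than per request.
            self._mm_processor = self._processor_factory()
        return self._mm_processor
```

The same object then serves all requests, so the concurrency semaphore inside it actually limits total in-flight work instead of being reset each time.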
Also fix lint.
Done.
Motivation
The inference time in a VLM can be divided into several parts; taking a 2 MB video as an example, the time breakdown is shown in the chart below. Currently the TokenizerManager (a.k.a. TM) handles mm data preprocessing in a coroutine, depending on the backend. There is no timeout guard for a single task, so any stall in a task can block the TokenizerManager event loop and, in turn, the whole request pipeline. Moreover, there is no back-pressure mechanism in the TokenizerManager preprocessing, so under high parallel load the TM stalls on new incoming requests; in the worst case, the health_check or metrics commands can be stalled for 10s, which triggers alarms in the monitoring system.
This PR introduces back-pressure and timeout mechanisms for mm data processing, ensuring the TM request handler is always responsive, making the system more robust, and ultimately improving overall throughput and reducing TTFT.
Modifications
Accuracy Tests
Benchmarking and Profiling
Checklist