vlm: remove redundant d2h movement of mm feature tensors #9987
JustinTong0323 merged 10 commits into sgl-project:main from AlienKevin:fast-hash-feature
Conversation
Summary of Changes
Hello @AlienKevin, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request optimizes the processing of image feature tensors by re-enabling GPU-based hashing. This was made possible by a recent fix for a memory leak that previously necessitated a workaround involving CPU-based hashing. The change eliminates an unnecessary data transfer step, resulting in improved overall system throughput for multimodal benchmarks.
Highlights
- Performance Improvement: Re-enabled the optimized GPU tensor hashing for image requests, which was previously bypassed, leading to significant throughput gains.
- Code Optimization: Removed redundant device-to-host (GPU to CPU) movement of image feature tensors, streamlining the data processing pipeline.
- Impact: Achieved a 7.5% boost in throughput on MMMU (Math) and a 5.2% boost on fixed ISL1000/OSL1 benchmarks by leveraging faster GPU operations.
Code Review
This pull request removes a redundant device-to-host data transfer for image feature tensors within the process_mm_data method. The change is well-justified, as it eliminates a workaround previously implemented to mitigate a memory leak that has since been resolved. By ensuring feature tensors remain on the GPU, this modification successfully re-enables a more performant GPU-based hashing function, which is validated by the significant throughput improvements shown in the benchmarks. The code is cleaner and more efficient as a result. This is an excellent and well-documented improvement.
@JustinTong0323 Verified no memory leak with 7,000 requests under max-concurrency of 256. VRAM usage stays consistently around 92% throughout.
Server cmd:
Client cmd:
AlienKevin Hi! Very cool PR! Could you please also test with --enable-mixed-chunk?
LGTM, and cc @yhyang201 as the original author of this code
I'm curious: if TP is enabled, how are the various feature tensors transferred from GPU0 to the other GPUs?
@yhyang201 Thanks for bringing this up. With TP=2, rank 0 memory usage increased from 92% to 96% and stabilized. The run finished successfully without issue.
With TP=4, an OOM occurred (OOM trace). Reducing max_concurrency from 256 to 128 prevented the OOM, and memory usage stays at 93%:
@Swipe4057 I found there's a dynamo issue when running with --enable-mixed-chunk: dynamo error
AlienKevin Got it, thank you very much! I think this flag used to work without errors with VLMs.
If we remove the GPU-to-CPU logic, it might reintroduce the issue described in #5732. For now, I haven't found a way to address this issue without potentially impacting performance.
@yhyang201 I'm not sure if there's any memory leak like in #5732; it seems that when
I think adding the |
@AlienKevin Kevin, could you help raise an issue for that? Thanks!
python/sglang/srt/server_args.py (Outdated)
    disable_shared_experts_fusion: bool = False
    disable_chunked_prefix_cache: bool = False
    disable_fast_image_processor: bool = False
    offload_mm_feature: bool = False
As it would easily cause OOM, could we default it to true? Or just use a flag like load_mm_feature_to_gpu?
Ok, I'll update the flag
… False to conserve device memory
@JustinTong0323 Renamed offload_mm_feature to keep_mm_feature_on_device, which defaults to False as requested.
Issue filed on --enable-mixed-chunk: #10179
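The effect of the renamed flag can be sketched as a tiny placement helper. This is a hypothetical illustration, not the actual sglang code: `place_mm_feature` is an assumed name, and it assumes the flag simply gates whether the feature tensor is offloaded to host memory.

```python
import torch


def place_mm_feature(feature: torch.Tensor,
                     keep_mm_feature_on_device: bool = False) -> torch.Tensor:
    """Hypothetical helper showing what keep_mm_feature_on_device gates."""
    if keep_mm_feature_on_device:
        # Opt-in: leave the feature on its current device (e.g. the GPU),
        # enabling the fast device-side hash at the cost of extra VRAM.
        return feature
    # Default (False): offload to host memory to conserve device memory,
    # the safer default requested in review.
    return feature.cpu()
```

With the default, users trade the hashing speedup for OOM safety; passing the flag opts back into the GPU-resident path benchmarked in this PR.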
@JustinTong0323 Just following up, is this PR ready to be merged?
Better to make this arg the default in the future.
…#9987) Co-authored-by: Xiang (Kevin) Li <lik@nvidia.com>




Motivation
I made a surprising discovery: the optimized gpu_tensor_hash is not triggered at all, and the slow SHA256 data_hash is still used for hashing image requests. The reason was a recent change from a month ago that moved pixel_values from the GPU to the CPU to prevent a memory leak. Now that the memory leak is identified and fixed, we can remove this redundant movement and use the fast GPU image hashing again!
Benchmarking and Profiling
After this PR, 7.5% boost in throughput on MMMU (Math) and 5.2% boost on fixed ISL1000/OSL1 benchmarks.
Details:
Hashing on CPU with SHA256 (Current behavior):
ISL1000/OSL1, max-concurrency=256 (best of 3 trials): 13483 tok/s
MMMU (Math): 10403 tok/s
Hashing on GPU with custom kernel:
ISL1000/OSL1, max-concurrency=128: 14525 tok/s
ISL1000/OSL1, max-concurrency=256 (best of 3 trials): 14190 tok/s
ISL1000/OSL1, max-concurrency=512: 13123 tok/s
ISL1000/OSL1, max-concurrency=768: OOM
MMMU (Math): 11184 tok/s
Measured on my bench_serving enhancement PR (synced with latest main):
Server command:
Client command: