[VLM] Support ViT Piecewise CUDA Graph for Qwen3-VL #15320
BBuf merged 1 commit into sgl-project:main
Conversation
Summary of Changes (Gemini Code Assist): This pull request enhances the performance of Vision Transformer (ViT) operations within the SGLang framework by integrating Piecewise CUDA Graph support for the Qwen3-VL model.
Code Review
This pull request enables ViT Piecewise CUDA Graph for Qwen3-VL by generalizing the ViTCudaGraphRunner. The changes look good overall, but I've identified a critical issue with missing optional type hints that would cause a TypeError, a high-severity issue with a missing tensor-parallelism check, and a few medium-severity suggestions to improve code clarity and maintainability. Addressing these points will make the implementation more robust and easier to maintain.
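The review flags missing `Optional` type hints as a critical issue. As a generic illustration (the function name and signature below are hypothetical, not the actual reviewed code), a parameter that defaults to `None` but is annotated as a plain type invites a `TypeError` at the call site unless the `None` case is guarded:

```python
from typing import Optional

# Hypothetical sketch of the failure mode: `scale` defaults to None, so it
# must be typed Optional[float] and guarded before use. Without the guard,
# `x * scale` raises TypeError when scale is None.
def scaled(x: float, scale: Optional[float] = None) -> float:
    if scale is None:  # explicit guard makes the None default safe
        scale = 1.0
    return x * scale

print(scaled(2.0))       # 2.0 (None default handled by the guard)
print(scaled(2.0, 3.0))  # 6.0
```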
Good job!
Force-pushed from 3106018 to 1e2b1d3
@yhyang201 Thanks, will do it soon.
Force-pushed from 1e2b1d3 to 9bfbd0f
After rebasing on main and resolving conflicts: the return value of Qwen3-VL Qwen3VLMoeVisionModel's rot_pos_emb() function was changed in #15205, so its result is broken. Fixing in progress; setting WIP for the moment.
Force-pushed from 9bfbd0f to 9d46a2f
Refactored ViTCudaGraphRunner to support the new interface. Removing [WIP].
/tag-and-rerun-ci
@yhyang201 Benchmark has been updated. |
/rerun-failed-ci
3 similar comments
Force-pushed from 9d46a2f to 5b4d6b2
/rerun-failed-ci
4 similar comments
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
Motivation
This PR enables ViT Piecewise CUDA Graph for Qwen3-VL.
It builds on ViTCudaGraphRunner to support both Qwen2.5-VL and Qwen3-VL.
TP>1 is supported.
Benchmarks on 8xH20 with Qwen3-VL-8B-Instruct, TP=4:
TTFT 1384.53ms --> 1120.68ms
This PR also fixes a bug where torch-symm-mem was not enabled for out-of-place allreduce, which gains about 4% end-to-end improvement over NCCL. (Since custom all-reduce does not yet support CUDA Graph, we have to pass disable-custom-all-reduce; torch-symm-mem then gives an extra 4% speedup over legacy NCCL TP. The TTFT comparison in this PR is torch-symm-mem vs torch-symm-mem; the only difference is enabling/disabling ViT Piecewise CUDA Graph.)
The sweet spot for this feature is when each rank's compute is relatively small (e.g. TP=4, image sizes are modest, and prefill is not compute-bound), so kernel launches occupy a large share of the time. Running the ViT in a piecewise CUDA graph saves that launch overhead.
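The savings come from replaying a pre-captured graph instead of launching each kernel individually, which requires fixed input shapes. A minimal, framework-free sketch of the bucketing idea behind such a runner (the bucket sizes and function names are illustrative assumptions, not the actual ViTCudaGraphRunner code):

```python
# Hypothetical sketch: graphs are captured once for a fixed set of token
# counts, and each real input is padded up to the nearest captured size
# before graph replay; inputs larger than every bucket fall back to eager.
CAPTURE_SIZES = [256, 512, 1024, 2048]  # assumed capture buckets

def pick_bucket(num_tokens: int) -> int:
    """Return the smallest captured size that fits, or -1 for eager fallback."""
    for size in CAPTURE_SIZES:
        if num_tokens <= size:
            return size
    return -1  # input too large: run eagerly instead of replaying a graph

print(pick_bucket(300))   # 512: 300 ViT patch tokens pad up to this bucket
print(pick_bucket(4096))  # -1: falls back to eager execution
```

Padding wastes a little compute on the unused tail of the bucket, but that cost is small compared to the per-kernel launch overhead it removes when each rank's work is light.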
Detailed Design
Accuracy Tests
The updated test case in test/manual/nightly/test_vlms_vit_cuda_graph.py covers accuracy.
Benchmarking and Profiling
8xH20, Qwen3-VL-8B-Instruct, TP=4:
TTFT 1384.53ms --> 1120.68ms
PR:
Server:
Client:
Benchmark Result:
Baseline:
Server:
Client:
Same as above.
Result:
Checklist