
Optimizations for Qwen3VL models #18559

Closed

wili-65535 wants to merge 1 commit into sgl-project:main from wili-65535:wili/qwen3vl-optimization

Conversation

@wili-65535
Contributor

@wili-65535 wili-65535 commented Feb 10, 2026

Discuss in issue #18784

@github-actions github-actions bot added Multi-modal multi-modal language model deterministic Issues on deterministic inference/kernels labels Feb 10, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @wili-65535, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the Qwen3VL models by integrating VisionFly, a library designed to accelerate vision-language models. The changes include modifications to attention mechanisms, linear layers, and CPU offloading strategies. Additionally, the PR incorporates several debugging enhancements and performance tweaks to improve the overall efficiency and stability of the models.

Highlights

  • VisionFly Integration: This PR integrates VisionFly to enhance the performance of Qwen3VL models, enabling optimizations such as attention and linear layer acceleration.
  • Code Modifications for Debugging: Several changes were made to facilitate debugging, including increased timeout values and options to skip warmup requests.
  • Performance Optimization: The PR introduces changes aimed at improving inference speed, such as disabling image cache for benchmarking and adjusting DeepGEMM dimension requirements.


Changelog
  • python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py
    • Updated minimum DeepGEMM dimension requirements for Qwen3VL models.
  • python/sglang/srt/entrypoints/http_server.py
    • Increased timeout for model info requests for debugging purposes.
    • Added option to skip warmup requests for debugging.
  • python/sglang/srt/layers/attention/vision.py
    • Added os import for enabling vfly; reuses the TP group while keeping TP size as 1.
  • python/sglang/srt/managers/mm_utils.py
    • Added nvtx import for profiling.
    • Disabled image cache for benchmarking.
    • Added nvtx annotation for language model.
  • python/sglang/srt/managers/scheduler.py
    • Added nvtx import for profiling.
    • Added nvtx annotations for multimodal inputs.
  • python/sglang/srt/model_loader/loader.py
    • Added logic to enable vision fly and copy Conv3d weights to Linear.
  • python/sglang/srt/models/common/__init__.py
    • Added common utilities for VisionFly examples.
  • python/sglang/srt/models/common/base_args.py
    • Added common argument parser for VisionFly examples.
  • python/sglang/srt/models/common/utils.py
    • Added common utility functions for VisionFly examples.
  • python/sglang/srt/models/qwen3_vl.py
    • Added nvtx and os imports.
    • Modified tp_size and tp_rank for vfly.
    • Replaced original Qwen3VLVisionPatchEmbed with improved version.
    • Added enable_vision_fly function to enable vfly.
  • python/sglang/srt/multimodal/processors/base_processor.py
    • Added nvtx import for profiling.
    • Added nvtx annotations for process_mm_data and load_image.
    • Added a check to see if the image is already a tensor before converting it.
  • python/sglang/srt/utils/common.py
    • Added torchvision import for decoding jpeg.
    • Added logic to load jpeg images from base64 strings.
  • python/sglang/srt/utils/cuda_ipc_transport_utils.py
    • Added option to disable recycle for debugging.
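The nvtx annotations listed above follow a standard pattern: wrap a stage in a named range so it shows up in an Nsight Systems timeline. A minimal sketch (the function name `embed_mm_inputs` is hypothetical, not taken from the PR), assuming the `nvtx` PyPI package, with a no-op fallback when it is absent:

```python
from contextlib import contextmanager

try:
    import nvtx  # pip install nvtx; ranges appear in Nsight Systems timelines
    annotate = nvtx.annotate
except ImportError:
    @contextmanager
    def annotate(message=None, color=None):
        # No-op stand-in so the code still runs without the profiler package
        yield

def embed_mm_inputs(batch):
    # Hypothetical stage; the PR wraps multimodal embedding steps like this
    with annotate("embed_mm_inputs", color="green"):
        return [x * 2 for x in batch]

print(embed_mm_inputs([1, 2, 3]))  # → [2, 4, 6]
```

Because the ranges are no-ops outside a profiling session, annotations like these can stay in production code paths.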
Activity
  • The pull request introduces several changes to optimize Qwen3VL models using VisionFly.
  • Debugging enhancements have been added, such as increased timeout values and options to skip warmup requests.
  • Performance tweaks, including disabling image cache and adjusting DeepGEMM dimensions, aim to improve inference speed.
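The loader change that copies Conv3d weights into a Linear layer relies on a standard equivalence: a patch-embed Conv3d whose kernel_size equals its stride computes the same dot product as a Linear layer applied to flattened patches. A minimal sketch of that weight copy (dimensions are illustrative, not Qwen3VL's actual configuration), assuming PyTorch:

```python
import torch
from torch import nn

# A patch-embed Conv3d with kernel_size == stride sees each patch exactly
# once, so its weight tensor (out, in, kT, kH, kW) can be flattened into a
# Linear weight (out, in*kT*kH*kW) with identical output.
t, p, cin, dim = 2, 14, 3, 32
conv = nn.Conv3d(cin, dim, kernel_size=(t, p, p), stride=(t, p, p), bias=False)
linear = nn.Linear(cin * t * p * p, dim, bias=False)
linear.weight.data.copy_(conv.weight.data.view(dim, -1))  # the weight copy

x = torch.randn(1, cin, t, p, p)           # a single video patch
out_conv = conv(x).flatten()               # Conv3d path
out_lin = linear(x.flatten())              # Linear path on the flat patch
print(torch.allclose(out_conv, out_lin, atol=1e-5))  # → True
```

The Linear form lets patch embedding use the same optimized GEMM kernels as the rest of the model instead of a convolution kernel.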

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from e0b57dd to 5af998e Compare February 12, 2026 03:02
@yuan-luo
Collaborator

It seems the core technology lies in vfly.utils.parallel, such as dit_sp_gather and dit_sp_split. I once attempted to support ViT SP, but the road forward was rough. I'm curious about the vfly lib. Could you please share more details?

@wili-65535
Contributor Author

It seems the core technology lies in vfly.utils.parallel, such as dit_sp_gather and dit_sp_split. I once attempted to support ViT SP, but the road forward was rough. I'm curious about the vfly lib. Could you please share more details?

Thank you for your attention!
You can refer to here for the implementation of vfly (the name changed from "vfly" to "visual_gen", but the logic is the same).

@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from 5af998e to 0d56d56 Compare February 12, 2026 04:04
@wili-65535 wili-65535 changed the title Optimization for Qwen3VL models Optimizations for Qwen3VL models Feb 13, 2026
@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch 4 times, most recently from a36c311 to 919d4be Compare February 26, 2026 14:54

return out

def fast_pos_embed_interpolate_v3(
Collaborator


This function has likewise been optimized in main.

): # wili, for jpeg base64 on NVIDIA GPU
image_bytes = pybase64.b64decode(image_file, validate=True)
image = torch.frombuffer(image_bytes, dtype=torch.uint8)
image = decode_jpeg(image, device="cuda")
Collaborator

@yuan-luo yuan-luo Mar 2, 2026


May need to consider not breaking other devices.

Contributor Author


Thank you! We filed a separate PR for this optimization (#19749).
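The CUDA-only decode path quoted above can be made device-aware, addressing the review concern. A minimal sketch (the helper names `pick_decode_device` and `decode_base64_jpeg` are hypothetical, not from the PR; stdlib base64 stands in for pybase64), assuming torchvision's `decode_jpeg` accepts a `device` argument:

```python
import base64

def pick_decode_device():
    # Fall back to CPU when CUDA is unavailable, so the fast JPEG path
    # does not break non-NVIDIA devices.
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

def decode_base64_jpeg(image_file: str):
    # The PR uses pybase64.b64decode; stdlib base64 keeps this self-contained.
    image_bytes = base64.b64decode(image_file, validate=True)
    try:
        import torch
        from torchvision.io import decode_jpeg
        # bytearray avoids torch's warning about non-writable buffers
        data = torch.frombuffer(bytearray(image_bytes), dtype=torch.uint8)
        return decode_jpeg(data, device=pick_decode_device())
    except ImportError:
        return image_bytes  # caller can fall back to a PIL/CPU path

print(pick_decode_device())
```

Selecting the device at runtime keeps the GPU-accelerated decode as an opportunistic optimization rather than a hard requirement.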

v0.2: remove vfly related code temporarily
v0.5: remove nvtx
v0.6: fix back weight names in qwen3_vl.py
@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from 469b4f4 to 59b1d22 Compare March 6, 2026 08:40
@wili-65535 wili-65535 closed this Mar 30, 2026
@wili-65535 wili-65535 deleted the wili/qwen3vl-optimization branch March 30, 2026 02:24