
Optimizations for Qwen3VL models #18559

Closed

wili-65535 wants to merge 1 commit into sgl-project:main from wili-65535:wili/qwen3vl-optimization

Conversation

@wili-65535
Contributor

@wili-65535 wili-65535 commented Feb 10, 2026

Discuss in issue #18784

@github-actions github-actions bot added Multi-modal multi-modal language model deterministic Issues on deterministic inference/kernels labels Feb 10, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello @wili-65535, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request focuses on optimizing the Qwen3VL models by integrating VisionFly, a library designed to accelerate vision-language models. The changes include modifications to attention mechanisms, linear layers, and CPU offloading strategies. Additionally, the PR incorporates several debugging enhancements and performance tweaks to improve the overall efficiency and stability of the models.

Highlights

  • VisionFly Integration: This PR integrates VisionFly to enhance the performance of Qwen3VL models, enabling optimizations such as attention and linear layer acceleration.
  • Code Modifications for Debugging: Several changes were made to facilitate debugging, including increased timeout values and options to skip warmup requests.
  • Performance Optimization: The PR introduces changes aimed at improving inference speed, such as disabling image cache for benchmarking and adjusting DeepGEMM dimension requirements.


Changelog
  • python/sglang/srt/batch_invariant_ops/batch_invariant_ops.py
    • Updated minimum DeepGEMM dimension requirements for Qwen3VL models.
  • python/sglang/srt/entrypoints/http_server.py
    • Increased timeout for model info requests for debugging purposes.
    • Added option to skip warmup requests for debugging.
  • python/sglang/srt/layers/attention/vision.py
    • Added os import for enabling vfly; reuses the TP group while keeping TP size as 1.
  • python/sglang/srt/managers/mm_utils.py
    • Added nvtx import for profiling.
    • Disabled image cache for benchmarking.
    • Added nvtx annotation for language model.
  • python/sglang/srt/managers/scheduler.py
    • Added nvtx import for profiling.
    • Added nvtx annotations for multimodal inputs.
  • python/sglang/srt/model_loader/loader.py
    • Added logic to enable vision fly and copy Conv3d weights to Linear.
  • python/sglang/srt/models/common/__init__.py
    • Added common utilities for VisionFly examples.
  • python/sglang/srt/models/common/base_args.py
    • Added common argument parser for VisionFly examples.
  • python/sglang/srt/models/common/utils.py
    • Added common utility functions for VisionFly examples.
  • python/sglang/srt/models/qwen3_vl.py
    • Added nvtx and os imports.
    • Modified tp_size and tp_rank for vfly.
    • Replaced original Qwen3VLVisionPatchEmbed with improved version.
    • Added enable_vision_fly function to enable vfly.
  • python/sglang/srt/multimodal/processors/base_processor.py
    • Added nvtx import for profiling.
    • Added nvtx annotations for process_mm_data and load_image.
    • Added a check to see if the image is already a tensor before converting it.
  • python/sglang/srt/utils/common.py
    • Added torchvision import for decoding jpeg.
    • Added logic to load jpeg images from base64 strings.
  • python/sglang/srt/utils/cuda_ipc_transport_utils.py
    • Added option to disable recycle for debugging.
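The nvtx annotations listed above follow a standard pattern: wrap a stage in a named range so it shows up in an Nsight Systems timeline. A minimal sketch (the function name `embed_mm_inputs` is hypothetical, not taken from the PR), assuming the `nvtx` PyPI package, with a no-op fallback when it is absent:

```python
from contextlib import contextmanager

try:
    import nvtx  # pip install nvtx; ranges appear in Nsight Systems timelines
    annotate = nvtx.annotate
except ImportError:
    @contextmanager
    def annotate(message=None, color=None):
        # No-op stand-in so the code still runs without the profiler package
        yield

def embed_mm_inputs(batch):
    # Hypothetical stage; the PR wraps multimodal embedding steps like this
    with annotate("embed_mm_inputs", color="green"):
        return [x * 2 for x in batch]

print(embed_mm_inputs([1, 2, 3]))  # → [2, 4, 6]
```

Because the ranges are no-ops outside a profiling session, annotations like these can stay in production code paths.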
Activity
  • The pull request introduces several changes to optimize Qwen3VL models using VisionFly.
  • Debugging enhancements have been added, such as increased timeout values and options to skip warmup requests.
  • Performance tweaks, including disabling image cache and adjusting DeepGEMM dimensions, aim to improve inference speed.
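The loader change that copies Conv3d weights into a Linear layer relies on a standard equivalence: a patch-embed Conv3d whose kernel_size equals its stride computes the same dot product as a Linear layer applied to flattened patches. A minimal sketch of that weight copy (dimensions are illustrative, not Qwen3VL's actual configuration), assuming PyTorch:

```python
import torch
from torch import nn

# A patch-embed Conv3d with kernel_size == stride sees each patch exactly
# once, so its weight tensor (out, in, kT, kH, kW) can be flattened into a
# Linear weight (out, in*kT*kH*kW) with identical output.
t, p, cin, dim = 2, 14, 3, 32
conv = nn.Conv3d(cin, dim, kernel_size=(t, p, p), stride=(t, p, p), bias=False)
linear = nn.Linear(cin * t * p * p, dim, bias=False)
linear.weight.data.copy_(conv.weight.data.view(dim, -1))  # the weight copy

x = torch.randn(1, cin, t, p, p)           # a single video patch
out_conv = conv(x).flatten()               # Conv3d path
out_lin = linear(x.flatten())              # Linear path on the flat patch
print(torch.allclose(out_conv, out_lin, atol=1e-5))  # → True
```

The Linear form lets patch embedding use the same optimized GEMM kernels as the rest of the model instead of a convolution kernel.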

@gemini-code-assist
Contributor

Warning

Gemini encountered an error creating the review. You can try again by commenting /gemini review.

@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from e0b57dd to 5af998e Compare February 12, 2026 03:02
@yuan-luo
Collaborator

It seems the core technology lies in vfly.utils.parallel, such as dit_sp_gather and dit_sp_split. I once attempted to support ViT SP, but the road forward was rough. I'm curious about the vfly lib. Could you please share more details?

@wili-65535
Contributor Author

It seems the core technology lies in vfly.utils.parallel, such as dit_sp_gather and dit_sp_split. I once attempted to support ViT SP, but the road forward was rough. I'm curious about the vfly lib. Could you please share more details?

Thank you for your attention!
You can refer to here for the implementation of vfly (the name changed from "vfly" to "visual_gen", but the logic is the same).

@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from 5af998e to 0d56d56 Compare February 12, 2026 04:04
@wili-65535 wili-65535 changed the title Optimization for Qwen3VL models Optimizations for Qwen3VL models Feb 13, 2026
@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch 4 times, most recently from a36c311 to 919d4be Compare February 26, 2026 14:54

return out

def fast_pos_embed_interpolate_v3(
Collaborator


This function has likewise been optimized in main.

): # wili, for jpeg base64 on NVIDIA GPU
image_bytes = pybase64.b64decode(image_file, validate=True)
image = torch.frombuffer(image_bytes, dtype=torch.uint8)
image = decode_jpeg(image, device="cuda")
Collaborator

@yuan-luo yuan-luo Mar 2, 2026


May need to consider not breaking other devices.

Contributor Author


Thank you! We filed a separate PR for this optimization (#19749).
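The CUDA-only decode path quoted above can be made device-aware, addressing the review concern. A minimal sketch (the helper names `pick_decode_device` and `decode_base64_jpeg` are hypothetical, not from the PR; stdlib base64 stands in for pybase64), assuming torchvision's `decode_jpeg` accepts a `device` argument:

```python
import base64

def pick_decode_device():
    # Fall back to CPU when CUDA is unavailable, so the fast JPEG path
    # does not break non-NVIDIA devices.
    try:
        import torch
        if torch.cuda.is_available():
            return "cuda"
    except ImportError:
        pass
    return "cpu"

def decode_base64_jpeg(image_file: str):
    # The PR uses pybase64.b64decode; stdlib base64 keeps this self-contained.
    image_bytes = base64.b64decode(image_file, validate=True)
    try:
        import torch
        from torchvision.io import decode_jpeg
        # bytearray avoids torch's warning about non-writable buffers
        data = torch.frombuffer(bytearray(image_bytes), dtype=torch.uint8)
        return decode_jpeg(data, device=pick_decode_device())
    except ImportError:
        return image_bytes  # caller can fall back to a PIL/CPU path

print(pick_decode_device())
```

Selecting the device at runtime keeps the GPU-accelerated decode as an opportunistic optimization rather than a hard requirement.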

v0.2: remove vfly related code temporarily
v0.5: remove nvtx
v0.6: fix back weight names in qwen3_vl.py
@wili-65535 wili-65535 force-pushed the wili/qwen3vl-optimization branch from 469b4f4 to 59b1d22 Compare March 6, 2026 08:40
@wili-65535 wili-65535 closed this Mar 30, 2026
@wili-65535 wili-65535 deleted the wili/qwen3vl-optimization branch March 30, 2026 02:24