[VLM] Replace conv3d proj with linear for GLM4V by yuan-luo · Pull Request #20033 · sgl-project/sglang

yuan-luo · 2026-03-06T12:18:23Z

Motivation

Inspired by #19788, with some optimizations.

lmms_eval shows no accuracy drop; main and this PR get the same mmmu_val score:

➜  python git:(main) ✗ python -m sglang.launch_server --model-path zai-org/GLM-4.5V --mm-attention-backend fa3 --port 30000 --tp-size 4

2026-03-06 14:40:15 | INFO     | __main__:cli_evaluate:476 - Verbosity set to INFO
2026-03-06 14:40:18 | INFO     | __main__:cli_evaluate_single:565 - Evaluation tracker args: {}
2026-03-06 14:40:18 | INFO     | __main__:cli_evaluate_single:649 - Selected Tasks: ['mmmu_val']
2026-03-06 14:40:18 | INFO     | lmms_eval.evaluator:simple_evaluate:170 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2026-03-06 14:40:19 | INFO     | lmms_eval.evaluator:evaluate:515 - Running on rank 0 (local rank 0)
2026-03-06 14:40:19 | INFO     | lmms_eval.api.task:build_all_requests:428 - Building contexts for mmmu_val on rank 0...
100%|██████████| 900/900 [00:00<00:00, 14021.62it/s]
2026-03-06 14:40:19 | INFO     | lmms_eval.evaluator:evaluate:609 - Running generate_until requests
Model Responding:  98%|█████████▊| 880/900 [02:50<00:03,  5.77it/s]
2026-03-06 14:43:10 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:136 - Metric summary - Total elapsed time: 4206.687s, Total gen tokens: 114995, Avg speed: 27.3 tokens/s
Model Responding: 100%|██████████| 900/900 [02:50<00:00,  5.27it/s]
Postprocessing: 100%|██████████| 900/900 [00:00<00:00, 8781.68it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.50833}, 'Art': {'num': 30, 'acc': 0.63333}, 'Art_Theory': {'num': 30, 'acc': 0.5}, 'Design': {'num': 30, 'acc': 0.6}, 'Music': {'num': 30, 'acc': 0.3}, 'Overall-Business': {'num': 150, 'acc': 0.29333}, 'Accounting': {'num': 30, 'acc': 0.36667}, 'Economics': {'num': 30, 'acc': 0.36667}, 'Finance': {'num': 30, 'acc': 0.16667}, 'Manage': {'num': 30, 'acc': 0.3}, 'Marketing': {'num': 30, 'acc': 0.26667}, 'Overall-Science': {'num': 150, 'acc': 0.24}, 'Biology': {'num': 30, 'acc': 0.3}, 'Chemistry': {'num': 30, 'acc': 0.16667}, 'Geography': {'num': 30, 'acc': 0.33333}, 'Math': {'num': 30, 'acc': 0.23333}, 'Physics': {'num': 30, 'acc': 0.16667}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.38667}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.6}, 'Clinical_Medicine': {'num': 30, 'acc': 0.36667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.13333}, 'Pharmacy': {'num': 30, 'acc': 0.36667}, 'Public_Health': {'num': 30, 'acc': 0.46667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.325}, 'History': {'num': 30, 'acc': 0.33333}, 'Literature': {'num': 30, 'acc': 0.46667}, 'Sociology': {'num': 30, 'acc': 0.26667}, 'Psychology': {'num': 30, 'acc': 0.23333}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.29048}, 'Agriculture': {'num': 30, 'acc': 0.3}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.2}, 'Computer_Science': {'num': 30, 'acc': 0.36667}, 'Electronics': {'num': 30, 'acc': 0.2}, 'Energy_and_Power': {'num': 30, 'acc': 0.46667}, 'Materials': {'num': 30, 'acc': 0.23333}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.26667}, 'Overall': {'num': 900, 'acc': 0.33222}}
fatal: not a git repository (or any of the parent directories): .git
2026-03-06 14:43:10 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:238 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=zai-org/GLM-4.5V), gen_kwargs: (), limit: None, offset: 0, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.3322|±  |N/A   |
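As a sanity check on the log above, the overall score can be recomputed from the six top-level groups. This is a small sketch, not part of lmms_eval; it assumes each group's accuracy is the fraction of correct answers among its `num` samples, and recovers the correct-answer counts by rounding.

```python
# (num, acc) for each top-level group, copied from the results dict above.
groups = {
    "Art and Design": (120, 0.50833),
    "Business": (150, 0.29333),
    "Science": (150, 0.24),
    "Health and Medicine": (150, 0.38667),
    "Humanities and Social Science": (120, 0.325),
    "Tech and Engineering": (210, 0.29048),
}

# Recover integer correct-answer counts (assumption: acc = correct / num).
correct = {name: round(num * acc) for name, (num, acc) in groups.items()}

total = sum(num for num, _ in groups.values())
overall = sum(correct.values()) / total

print(total, sum(correct.values()), round(overall, 5))
```

The recomputed value agrees with the `Overall` entry (0.33222) and with the 0.3322 shown in the table.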

More performance tests will follow soon.

Server:

python -m sglang.launch_server --model-path zai-org/GLM-4.1V-9B-Thinking --mm-attention-backend fa3 --port 30000

Client:

➜  bench_script bash test_image.sh
{"id":"3139aaab39334c4ba7d79dedbb06c3ad","object":"chat.completion","created":1772799452,"model":"nvidia/Eagle2.5-8B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>用户现在需要描述图上的内容。首先看画面主体：一个人戴着灰色针织帽，穿牛仔外套，双手举着佳能相机（Canon相机）拍照，背景是户外开阔地，光线柔和，可能是傍晚或清晨，背景有模糊的自然景观（树木、开阔地）。要分解元素：人物装扮（帽子、外套）、动作（拍照）、相机品牌、环境氛围等。\n\n所以组织语言：图中展示了一位戴着灰色针织帽、穿着牛仔外套的人，正双手举着佳能相机进行拍摄。背景是一片开阔的户外场景，光线柔和，远处有模糊的树木和开阔地，整体氛围给人一种自然、宁静的感觉，仿佛是在户外的自然环境中记录美好瞬间。需要确认细节：相机是佳能（能看到Canon字样），人物的动作是举着相机拍照，穿着牛仔外套和针织帽，背景是户外开阔地，光线偏暖，可能是黄昏或清晨时分。\n\n总结描述：图中是一位戴着灰色针织帽、身穿牛仔外套的人，正双手握着佳能相机进行拍摄，背景是开阔的户外自然场景（有树木和开阔地），整体画面呈现出一种在户外记录瞬间的氛围，光线柔和，给人宁静自然的感觉。</think><answer>图中展示了一位戴着灰色针织帽、穿着牛仔外套的人，正双手举着佳能相机进行拍摄。背景是一片开阔的户外场景，远处有模糊的树木与开阔地，光线柔和，整体氛围给人一种自然、宁静的感觉，仿佛是在户外的自然环境中记录美好瞬间。","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151348}],"usage":{"prompt_tokens":1028,"total_tokens":1341,"completion_tokens":313,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
real	0m2.314s
user	0m0.002s
sys	0m0.002s
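The contents of `test_image.sh` are not shown in this thread. As a rough sketch of what such a client might send, the snippet below builds an OpenAI-style chat completion body with one image and one text part. The image URL, prompt, model name, and `max_tokens` value are placeholders, not values from the script.

```python
import json


def build_image_chat_request(image_url, prompt, model="default", max_tokens=512):
    """Build an OpenAI-compatible chat completion body with one image and one text part."""
    return {
        "model": model,  # placeholder model name
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": max_tokens,
    }


# Hypothetical inputs; the real script's image and prompt are unknown.
payload = build_image_chat_request(
    "https://example.com/photo.jpg",
    "Describe the content of this image.",
)
body = json.dumps(payload)
# The body would be POSTed to http://localhost:30000/v1/chat/completions
# with Content-Type: application/json (port taken from the server command above).
```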

gemini-code-assist · 2026-03-06T12:18:45Z

Summary of Changes

This pull request refactors the visual patch embedding mechanism within the GLM4V model to optimize memory usage and streamline the model loading process. It introduces a dynamic conversion from a Conv3d layer to an nn.Linear layer, ensuring that the Conv3d layer is only present during initial weight loading and is subsequently replaced and freed, leading to a more efficient memory footprint.

Highlights

- Memory Optimization: Introduced a mechanism to convert the Conv3d projection layer to an nn.Linear layer after checkpoint loading, which allows the Conv3d layer to be deleted, reducing memory usage.
- Dynamic Layer Conversion: Implemented a copy_conv3d_weight_to_linear method to transfer weights from Conv3d to nn.Linear and then remove the Conv3d instance, streamlining the model's structure post-initialization.
- Model Initialization Flow: Modified the load_weights process to explicitly invoke the Conv3d to nn.Linear conversion, ensuring that the optimized layer structure is active during model operation.

Changelog

python/sglang/srt/models/glm4v.py
- Added a new copy_conv3d_weight_to_linear method to the PatchEmbed class, responsible for converting Conv3d weights to an nn.Linear layer and then deleting the Conv3d instance.
- Modified the __init__ method of PatchEmbed to include an nn.Linear layer, which will be populated and used after the conversion.
- Updated the forward method of PatchEmbed to utilize the nn.Linear layer directly.
- Adjusted the dtype and device properties to correctly access the type and device of the active projection layer (either Conv3d or nn.Linear).
- Integrated the copy_conv3d_weight_to_linear call into the load_weights method of Glm4vModel to ensure the conversion happens after checkpoint loading.
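The diff itself is not reproduced in this thread, so the following is only an illustration of why the conversion listed above is valid. When a Conv3d kernel has the same size as its stride and each input is exactly one patch, the convolution reduces to a dot product between the flattened kernel and the flattened patch, which is what a linear layer computes. The sketch uses plain Python lists instead of tensors; all function names and sizes are made up for the example.

```python
import random


def conv3d_single_patch(patch, weight, bias):
    """Conv3d with kernel == patch size, applied to one patch.

    patch:  nested lists shaped [C][T][H][W]
    weight: nested lists shaped [O][C][T][H][W]
    bias:   list of O values
    """
    out = []
    for o, w_o in enumerate(weight):
        acc = bias[o]
        for c, w_c in enumerate(w_o):
            for t, w_t in enumerate(w_c):
                for h, w_h in enumerate(w_t):
                    for x, w in enumerate(w_h):
                        acc += w * patch[c][t][h][x]
        out.append(acc)
    return out


def flatten(nested):
    """Row-major flatten, analogous to reshaping a contiguous tensor to 1-D."""
    if not isinstance(nested, list):
        return [nested]
    flat = []
    for item in nested:
        flat.extend(flatten(item))
    return flat


def linear(x, weight_2d, bias):
    """y = W x + b with W shaped [O][K]."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight_2d, bias)]


def rand_tensor(rng, *shape):
    """Nested lists of uniform random floats with the given shape."""
    if len(shape) == 1:
        return [rng.uniform(-1, 1) for _ in range(shape[0])]
    return [rand_tensor(rng, *shape[1:]) for _ in range(shape[0])]


rng = random.Random(0)
O, C, T, H, W = 4, 3, 2, 2, 2  # illustrative sizes, not GLM4V's
conv_weight = rand_tensor(rng, O, C, T, H, W)
bias = rand_tensor(rng, O)
patch = rand_tensor(rng, C, T, H, W)

# The "conversion": one flattened row per output channel, shape [O][C*T*H*W].
linear_weight = [flatten(w_o) for w_o in conv_weight]

conv_out = conv3d_single_patch(patch, conv_weight, bias)
linear_out = linear(flatten(patch), linear_weight, bias)
max_abs_diff = max(abs(a - b) for a, b in zip(conv_out, linear_out))
print(max_abs_diff)
```

The two outputs differ only by floating-point summation order, which is why a matching test with a tolerance (such as the `test_patch_embed_linear_matches_conv3d` case reported later in the thread) is a natural check.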

Activity

The author reported "lmms_evals no drops," indicating initial accuracy validation.
The author mentioned "More performance test will be done soon," indicating ongoing benchmarking efforts.

yuan-luo · 2026-03-06T12:19:15Z

/tag-and-rerun-ci

gemini-code-assist

Code Review

This pull request aims to optimize memory usage by replacing a Conv3d layer with a Linear layer after loading model weights. The overall strategy is sound and the changes to enable this conversion are mostly correct. However, there is a critical issue where a Linear layer is redundantly initialized in the __init__ method, which contradicts the stated goal of the PR and unnecessarily increases memory consumption. I have provided a suggestion to fix this. Additionally, there's a minor point of feedback to improve code quality by avoiding the use of .data.

python/sglang/srt/models/glm4v.py

yuan-luo · 2026-03-07T01:39:01Z

/rerun-failed-ci

yuan-luo · 2026-03-07T03:33:49Z

/rerun-failed-ci

BBuf

Good job.

BBuf · 2026-03-07T04:55:57Z

Please report some benchmark results.

yuan-luo · 2026-03-07T05:27:30Z

> Please report some benchmark results.

Sure, will update.

yuan-luo · 2026-03-07T06:35:24Z

/rerun-failed-ci

yuan-luo · 2026-03-07T07:28:49Z

The B200 CI job failed because it ran out of disk space.

yuan-luo · 2026-03-07T08:16:28Z

> Please report some benchmark results.

Added a benchmark test: the linear projection is about 24.8x faster than conv3d (0.0256 ms vs 0.6340 ms).

➜  sglang_dev2 git:(optimize_glm4v_proj) ✗ python -m pytest test_patch_embed_perf.py -vvvv -s
=============================================================================================================== test session starts ===============================================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /sgl-workspace/sglang_dev2
plugins: asyncio-1.3.0, anyio-4.12.1, hydra-core-1.3.2, typeguard-4.4.4
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items

test_glm4v.py::test_patch_embed_linear_matches_conv3d PASSED
test_glm4v.py::test_patch_embed_linear_conv3d
[patch_embed perf] conv3d=0.6340 ms | linear=0.0256 ms | speedup=24.805x
PASSED

================================================================================================================ warnings summary =================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

<frozen importlib._bootstrap_external>:1297
  <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.

<frozen importlib._bootstrap_external>:1297
  <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== 2 passed, 4 warnings in 9.09s ==========================================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
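The benchmark test file is not included in the thread. A generic timing helper along the following lines could produce a comparison like the one above. This is a CPU-only sketch with hypothetical function names. On a GPU the device would also need to be synchronized before each clock read, which is omitted here.

```python
import statistics
import time


def time_ms(fn, warmup=3, iters=20):
    """Median wall-clock time of fn() in milliseconds, after a few warmup calls."""
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1e3)
    return statistics.median(samples)


def speedup(baseline_ms, candidate_ms):
    """How many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


# Rounded figures copied from the "[patch_embed perf]" log line.
reported = speedup(0.6340, 0.0256)
print(f"{reported:.3f}x")
```

The rounded timings give roughly 24.77x. The log prints 24.805x, presumably because it was computed from the unrounded measurements.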

yuan-luo · 2026-03-07T11:07:38Z

/rerun-failed-ci

yuan-luo · 2026-03-07T14:40:27Z

/rerun-failed-ci

yuan-luo · 2026-03-07T14:55:15Z

/rerun-failed-ci

yuan-luo · 2026-03-07T15:16:23Z

/rerun-failed-ci

yuan-luo · 2026-03-07T23:23:16Z

/rerun-failed-ci

yuan-luo · 2026-03-08T01:21:07Z

/rerun-failed-ci

yuan-luo · 2026-03-08T05:11:49Z

/rerun-failed-ci

PR sgl-project#20033 replaced Conv3d with Linear in Glm4vVisionPatchEmbed and added copy_conv3d_weight_to_linear() to glm4v.py's load_weights, but missed adding it to glm4v_moe.py and glm_ocr.py. This left the linear layer with random weights, causing the vision encoder to produce garbage embeddings — the model outputs text unrelated to the image. Fixes sgl-project#20462
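The failure mode described above is silent: a forward pass through an unconverted linear layer still runs and just produces meaningless values. One defensive option, sketched here with a hypothetical stand-in class rather than the real `Glm4vVisionPatchEmbed`, is to leave the linear weight unset until the conversion runs, so that a forgotten call fails loudly. This is not what the follow-up fix does; per its description, that fix adds the missing conversion call to the other two model files.

```python
class PatchEmbedSketch:
    """Hypothetical stand-in for a patch-embed module with a post-load conversion step."""

    def __init__(self):
        self.conv_weight = None    # filled by the checkpoint loader
        self.linear_weight = None  # filled only by the conversion step

    def load_conv_weight(self, weight):
        """Simulates the checkpoint loader writing the conv weight (already 2-D here)."""
        self.conv_weight = weight

    def copy_conv3d_weight_to_linear(self):
        """Copy the loaded conv weight into the linear weight and drop the original."""
        if self.conv_weight is None:
            raise RuntimeError("conv weight has not been loaded yet")
        self.linear_weight = [list(row) for row in self.conv_weight]
        self.conv_weight = None

    def forward(self, x):
        if self.linear_weight is None:
            # Fail loudly instead of computing with uninitialized weights.
            raise RuntimeError(
                "copy_conv3d_weight_to_linear() was not called after loading weights"
            )
        return [sum(w * v for w, v in zip(row, x)) for row in self.linear_weight]


module = PatchEmbedSketch()
module.load_conv_weight([[1.0, 2.0], [0.5, 0.0]])

try:
    module.forward([1.0, 1.0])  # conversion forgotten: raises instead of returning garbage
except RuntimeError as exc:
    print("caught:", exc)

module.copy_conv3d_weight_to_linear()
print(module.forward([1.0, 1.0]))
```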

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>

yuan-luo requested review from JustinTong0323, mickqian and yhyang201 March 6, 2026 12:19

github-actions bot added the run-ci label Mar 6, 2026

gemini-code-assist bot reviewed Mar 6, 2026

yuan-luo force-pushed the optimize_glm4v_proj branch from 313b951 to 06185f0 Compare March 6, 2026 13:43

yuan-luo mentioned this pull request Mar 6, 2026

[Feature] Optimizations for class Qwen3VLMoeVisionModel (Conv3d to Linear) in Qwen3VL #19788

Open

5 tasks

yuan-luo force-pushed the optimize_glm4v_proj branch from 06185f0 to 4af41de Compare March 6, 2026 14:33

yuan-luo requested a review from BBuf March 7, 2026 01:38

BBuf approved these changes Mar 7, 2026

yuan-luo force-pushed the optimize_glm4v_proj branch from b5b7508 to 1e97895 Compare March 7, 2026 11:23

luoyuan.luo added 2 commits March 8, 2026 09:20

Optimize glm4v conv3d proj to linear

ce1ca9d

Add benchmark test

ff7f7f5

yuan-luo force-pushed the optimize_glm4v_proj branch from 1e97895 to ff7f7f5 Compare March 8, 2026 01:22

Kangyan-Zhou merged commit 97a2a9b into sgl-project:main Mar 8, 2026
204 of 220 checks passed

yuan-luo deleted the optimize_glm4v_proj branch March 8, 2026 08:11

JustinTong0323 mentioned this pull request Mar 12, 2026

[Bug] GLM-4.6V vision regression: model ignores image content after PR #20033 #20462

Closed

JustinTong0323 mentioned this pull request Mar 12, 2026

[Bugfix] Fix GLM-4.6V vision regression in glm4v_moe and glm_ocr #20463

Merged

Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026

[VLM] Replace conv3d proj with linear for GLM4V (sgl-project#20033)

5498c15

Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>
