
[VLM] Replace conv3d proj with linear for GLM4V #20033

Merged
Kangyan-Zhou merged 2 commits into sgl-project:main from antgroup:optimize_glm4v_proj
Mar 8, 2026

@yuan-luo yuan-luo commented Mar 6, 2026

Motivation

Inspired by #19788, with some optimizations.

No accuracy drop on lmms_eval.

Main and this PR get the same score:

➜  python git:(main) ✗ python -m sglang.launch_server --model-path zai-org/GLM-4.5V --mm-attention-backend fa3 --port 30000 --tp-size 4
2026-03-06 14:40:15 | INFO     | __main__:cli_evaluate:476 - Verbosity set to INFO
2026-03-06 14:40:18 | INFO     | __main__:cli_evaluate_single:565 - Evaluation tracker args: {}
2026-03-06 14:40:18 | INFO     | __main__:cli_evaluate_single:649 - Selected Tasks: ['mmmu_val']
2026-03-06 14:40:18 | INFO     | lmms_eval.evaluator:simple_evaluate:170 - Setting random seed to 0 | Setting numpy seed to 1234 | Setting torch manual seed to 1234
2026-03-06 14:40:19 | INFO     | lmms_eval.evaluator:evaluate:515 - Running on rank 0 (local rank 0)
2026-03-06 14:40:19 | INFO     | lmms_eval.api.task:build_all_requests:428 - Building contexts for mmmu_val on rank 0...
100%|██████████| 900/900 [00:00<00:00, 14021.62it/s]
2026-03-06 14:40:19 | INFO     | lmms_eval.evaluator:evaluate:609 - Running generate_until requests
Model Responding:  98%|█████████▊| 880/900 [02:50<00:03,  5.77it/s]
2026-03-06 14:43:10 | INFO     | lmms_eval.models.model_utils.gen_metrics:log_metrics:136 - Metric summary - Total elapsed time: 4206.687s, Total gen tokens: 114995, Avg speed: 27.3 tokens/s
Model Responding: 100%|██████████| 900/900 [02:50<00:00,  5.27it/s]
Postprocessing: 100%|██████████| 900/900 [00:00<00:00, 8781.68it/s]
{'Overall-Art and Design': {'num': 120, 'acc': 0.50833}, 'Art': {'num': 30, 'acc': 0.63333}, 'Art_Theory': {'num': 30, 'acc': 0.5}, 'Design': {'num': 30, 'acc': 0.6}, 'Music': {'num': 30, 'acc': 0.3}, 'Overall-Business': {'num': 150, 'acc': 0.29333}, 'Accounting': {'num': 30, 'acc': 0.36667}, 'Economics': {'num': 30, 'acc': 0.36667}, 'Finance': {'num': 30, 'acc': 0.16667}, 'Manage': {'num': 30, 'acc': 0.3}, 'Marketing': {'num': 30, 'acc': 0.26667}, 'Overall-Science': {'num': 150, 'acc': 0.24}, 'Biology': {'num': 30, 'acc': 0.3}, 'Chemistry': {'num': 30, 'acc': 0.16667}, 'Geography': {'num': 30, 'acc': 0.33333}, 'Math': {'num': 30, 'acc': 0.23333}, 'Physics': {'num': 30, 'acc': 0.16667}, 'Overall-Health and Medicine': {'num': 150, 'acc': 0.38667}, 'Basic_Medical_Science': {'num': 30, 'acc': 0.6}, 'Clinical_Medicine': {'num': 30, 'acc': 0.36667}, 'Diagnostics_and_Laboratory_Medicine': {'num': 30, 'acc': 0.13333}, 'Pharmacy': {'num': 30, 'acc': 0.36667}, 'Public_Health': {'num': 30, 'acc': 0.46667}, 'Overall-Humanities and Social Science': {'num': 120, 'acc': 0.325}, 'History': {'num': 30, 'acc': 0.33333}, 'Literature': {'num': 30, 'acc': 0.46667}, 'Sociology': {'num': 30, 'acc': 0.26667}, 'Psychology': {'num': 30, 'acc': 0.23333}, 'Overall-Tech and Engineering': {'num': 210, 'acc': 0.29048}, 'Agriculture': {'num': 30, 'acc': 0.3}, 'Architecture_and_Engineering': {'num': 30, 'acc': 0.2}, 'Computer_Science': {'num': 30, 'acc': 0.36667}, 'Electronics': {'num': 30, 'acc': 0.2}, 'Energy_and_Power': {'num': 30, 'acc': 0.46667}, 'Materials': {'num': 30, 'acc': 0.23333}, 'Mechanical_Engineering': {'num': 30, 'acc': 0.26667}, 'Overall': {'num': 900, 'acc': 0.33222}}
fatal: not a git repository (or any of the parent directories): .git
2026-03-06 14:43:10 | INFO     | lmms_eval.loggers.evaluation_tracker:save_results_aggregated:238 - Output path not provided, skipping saving results aggregated
openai_compatible (model_version=zai-org/GLM-4.5V), gen_kwargs: (), limit: None, offset: 0, num_fewshot: None, batch_size: 16
| Tasks  |Version|Filter|n-shot| Metric |   |Value |   |Stderr|
|--------|------:|------|-----:|--------|---|-----:|---|------|
|mmmu_val|      0|none  |     0|mmmu_acc|↑  |0.3322|±  |N/A   |

More performance tests will be done soon.

Server:

python -m sglang.launch_server --model-path zai-org/GLM-4.1V-9B-Thinking --mm-attention-backend fa3 --port 30000

Client:

➜  bench_script bash test_image.sh
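{"id":"3139aaab39334c4ba7d79dedbb06c3ad","object":"chat.completion","created":1772799452,"model":"nvidia/Eagle2.5-8B","choices":[{"index":0,"message":{"role":"assistant","content":"<think>The user now needs a description of what is in the image. First, look at the main subject: a person wearing a gray knit hat and a denim jacket, holding up a Canon camera with both hands to take a photo; the background is open outdoor ground with soft light, possibly dusk or early morning, and there is blurred natural scenery (trees, open ground) in the background. Break down the elements: the person's outfit (hat, jacket), the action (taking a photo), the camera brand, the ambient atmosphere, and so on.\n\nSo, organizing the wording: The image shows a person wearing a gray knit hat and a denim jacket, holding up a Canon camera with both hands to shoot. The background is an open outdoor scene with soft light, with blurred trees and open ground in the distance; the overall atmosphere feels natural and tranquil, as if capturing a beautiful moment in an outdoor natural setting. Details to confirm: the camera is a Canon (the Canon lettering is visible), the person's action is holding up the camera to take a photo, they are wearing a denim jacket and a knit hat, the background is open outdoor ground, and the light is warm, possibly dusk or early morning.\n\nSummary description: The image shows a person wearing a gray knit hat and a denim jacket, holding a Canon camera with both hands to shoot; the background is an open outdoor natural scene (with trees and open ground); the overall picture conveys the atmosphere of capturing a moment outdoors, with soft light giving a tranquil, natural feel.</think><answer>The image shows a person wearing a gray knit hat and a denim jacket, holding up a Canon camera with both hands to shoot. The background is an open outdoor scene with blurred trees and open ground in the distance and soft light; the overall atmosphere feels natural and tranquil, as if capturing a beautiful moment in an outdoor natural setting.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":151348}],"usage":{"prompt_tokens":1028,"total_tokens":1341,"completion_tokens":313,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}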
real	0m2.314s
user	0m0.002s
sys	0m0.002s

Modifications

Accuracy Tests

Benchmarking and Profiling

Checklist

Review Process

  1. Ping Merge Oncalls to start the PR flow. See the PR Merge Process.
  2. Get approvals from CODEOWNERS and other reviewers.
  3. Trigger CI tests with comments or contact authorized users to do so.
    • /tag-run-ci-label, /rerun-failed-ci, /tag-and-rerun-ci
  4. After green CI and required approvals, ask Merge Oncalls to merge.

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the visual patch embedding mechanism within the GLM4V model to optimize memory usage and streamline the model loading process. It introduces a dynamic conversion from a Conv3d layer to an nn.Linear layer, ensuring that the Conv3d layer is only present during initial weight loading and is subsequently replaced and freed, leading to a more efficient memory footprint.

Highlights

  • Memory Optimization: Introduced a mechanism to convert the Conv3d projection layer to an nn.Linear layer after checkpoint loading, which allows the Conv3d layer to be deleted, reducing memory usage.
  • Dynamic Layer Conversion: Implemented a copy_conv3d_weight_to_linear method to transfer weights from Conv3d to nn.Linear and then remove the Conv3d instance, streamlining the model's structure post-initialization.
  • Model Initialization Flow: Modified the load_weights process to explicitly invoke the Conv3d to nn.Linear conversion, ensuring that the optimized layer structure is active during model operation.


Changelog
  • python/sglang/srt/models/glm4v.py
    • Added a new copy_conv3d_weight_to_linear method to the PatchEmbed class, responsible for converting Conv3d weights to an nn.Linear layer and then deleting the Conv3d instance.
    • Modified the __init__ method of PatchEmbed to include an nn.Linear layer, which will be populated and used after the conversion.
    • Updated the forward method of PatchEmbed to utilize the nn.Linear layer directly.
    • Adjusted the dtype and device properties to correctly access the type and device of the active projection layer (either Conv3d or nn.Linear).
    • Integrated the copy_conv3d_weight_to_linear call into the load_weights method of Glm4vModel to ensure the conversion happens after checkpoint loading.
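The conversion described in this changelog works because a Conv3d whose stride equals its kernel size computes the same thing as a linear layer applied to each flattened patch. Below is a minimal sketch of that equivalence, assuming PyTorch and small made-up dimensions. The variable names and sizes are illustrative and are not taken from glm4v.py; the actual `copy_conv3d_weight_to_linear` implementation is not shown in this thread.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical small sizes, chosen only for illustration.
in_channels, temporal_patch, patch, embed_dim = 3, 2, 4, 16
kernel = (temporal_patch, patch, patch)

# Conv3d with stride == kernel size: every patch is visited exactly once.
conv = nn.Conv3d(in_channels, embed_dim, kernel_size=kernel, stride=kernel, bias=True)

# Linear layer over a flattened patch (C * T * H * W input features).
linear = nn.Linear(in_channels * temporal_patch * patch * patch, embed_dim, bias=True)

with torch.no_grad():
    # Conv weight has shape (out, in, kT, kH, kW). Flattening the trailing
    # dims matches the row-major flattening of a (C, T, H, W) patch.
    linear.weight.copy_(conv.weight.reshape(embed_dim, -1))
    linear.bias.copy_(conv.bias)

num_patches = 8
x = torch.randn(num_patches, in_channels, *kernel)

with torch.no_grad():
    # Conv output is (N, D, 1, 1, 1) because each input is a single patch.
    out_conv = conv(x).reshape(num_patches, embed_dim)
    out_linear = linear(x.reshape(num_patches, -1))

max_abs_diff = (out_conv - out_linear).abs().max().item()
outputs_match = torch.allclose(out_conv, out_linear, atol=1e-4)
print(outputs_match, max_abs_diff)
```

The sketch keeps the (C, T, H, W) order on both the weight and the input side; a different flattening order on either side would break the match.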
Activity
  • The author reported "lmms_evals no drops," indicating initial accuracy validation.
  • The author mentioned "More performance test will be done soon," indicating ongoing benchmarking efforts.

yuan-luo commented Mar 6, 2026

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Mar 6, 2026
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request aims to optimize memory usage by replacing a Conv3d layer with a Linear layer after loading model weights. The overall strategy is sound and the changes to enable this conversion are mostly correct. However, there is a critical issue where a Linear layer is redundantly initialized in the __init__ method, which contradicts the stated goal of the PR and unnecessarily increases memory consumption. I have provided a suggestion to fix this. Additionally, there's a minor point of feedback to improve code quality by avoiding the use of .data.

yuan-luo commented Mar 7, 2026

/rerun-failed-ci

@BBuf BBuf left a comment

Good job.

BBuf commented Mar 7, 2026

Please report some benchmark results.

yuan-luo commented Mar 7, 2026

> Please report some benchmark results.

Sure, will update.

yuan-luo commented Mar 7, 2026

B200 CI failed due to no disk space.

yuan-luo commented Mar 7, 2026

> Please report some benchmark results.

Added a benchmark test: linear vs. conv3d shows a roughly 24x speedup.

➜  sglang_dev2 git:(optimize_glm4v_proj) ✗ python -m pytest test_patch_embed_perf.py -vvvv -s
=============================================================================================================== test session starts ===============================================================================================================
platform linux -- Python 3.12.3, pytest-9.0.2, pluggy-1.6.0 -- /usr/bin/python
cachedir: .pytest_cache
rootdir: /sgl-workspace/sglang_dev2
plugins: asyncio-1.3.0, anyio-4.12.1, hydra-core-1.3.2, typeguard-4.4.4
asyncio: mode=Mode.STRICT, debug=False, asyncio_default_fixture_loop_scope=None, asyncio_default_test_loop_scope=function
collected 2 items

test_glm4v.py::test_patch_embed_linear_matches_conv3d PASSED
test_glm4v.py::test_patch_embed_linear_conv3d
[patch_embed perf] conv3d=0.6340 ms | linear=0.0256 ms | speedup=24.805x
PASSED

================================================================================================================ warnings summary =================================================================================================================
<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyPacked has no __module__ attribute

<frozen importlib._bootstrap>:488
  <frozen importlib._bootstrap>:488: DeprecationWarning: builtin type SwigPyObject has no __module__ attribute

<frozen importlib._bootstrap_external>:1297
  <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.cudart module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.runtime module instead.

<frozen importlib._bootstrap_external>:1297
  <frozen importlib._bootstrap_external>:1297: FutureWarning: The cuda.nvrtc module is deprecated and will be removed in a future release, please switch to use the cuda.bindings.nvrtc module instead.

-- Docs: https://docs.pytest.org/en/stable/how-to/capture-warnings.html
========================================================================================================== 2 passed, 4 warnings in 9.09s ==========================================================================================================
sys:1: DeprecationWarning: builtin type swigvarlink has no __module__ attribute
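For readers who want to run a comparable measurement locally, here is a rough timing sketch. It is not the author's benchmark test, whose source is not shown in this thread. The `time_ms` helper, the layer sizes, and the iteration counts are all assumptions, and the resulting numbers depend heavily on hardware and input shape.

```python
import time

import torch
import torch.nn as nn


def time_ms(fn, warmup=3, iters=20):
    """Return the average wall-clock time of fn() in milliseconds."""
    for _ in range(warmup):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()  # make sure queued GPU work has finished
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    if torch.cuda.is_available():
        torch.cuda.synchronize()
    return (time.perf_counter() - start) * 1000.0 / iters


# Hypothetical sizes; the real model's dimensions may differ.
device = "cuda" if torch.cuda.is_available() else "cpu"
kernel = (2, 14, 14)
embed_dim, num_patches = 64, 256

conv = nn.Conv3d(3, embed_dim, kernel_size=kernel, stride=kernel).to(device).eval()
linear = nn.Linear(3 * 2 * 14 * 14, embed_dim).to(device).eval()
with torch.no_grad():
    # Give both layers the same parameters so they compute the same function.
    linear.weight.copy_(conv.weight.reshape(embed_dim, -1))
    linear.bias.copy_(conv.bias)

x = torch.randn(num_patches, 3, *kernel, device=device)
x_flat = x.reshape(num_patches, -1)

with torch.no_grad():
    conv_ms = time_ms(lambda: conv(x))
    linear_ms = time_ms(lambda: linear(x_flat))

print(
    f"[patch_embed perf] conv3d={conv_ms:.4f} ms | "
    f"linear={linear_ms:.4f} ms | speedup={conv_ms / linear_ms:.3f}x"
)
```

The explicit `torch.cuda.synchronize()` calls matter on GPU, where kernel launches are asynchronous and wall-clock timing would otherwise measure only launch overhead.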

@yuan-luo yuan-luo force-pushed the optimize_glm4v_proj branch from b5b7508 to 1e97895 Compare March 7, 2026 11:23
@yuan-luo yuan-luo force-pushed the optimize_glm4v_proj branch from 1e97895 to ff7f7f5 Compare March 8, 2026 01:22
@Kangyan-Zhou Kangyan-Zhou merged commit 97a2a9b into sgl-project:main Mar 8, 2026
204 of 220 checks passed
@yuan-luo yuan-luo deleted the optimize_glm4v_proj branch March 8, 2026 08:11
JustinTong0323 added a commit to JustinTong0323/sglang that referenced this pull request Mar 12, 2026
PR sgl-project#20033 replaced Conv3d with Linear in Glm4vVisionPatchEmbed and
added copy_conv3d_weight_to_linear() to glm4v.py's load_weights, but
missed adding it to glm4v_moe.py and glm_ocr.py. This left the linear
layer with random weights, causing the vision encoder to produce
garbage embeddings — the model outputs text unrelated to the image.

Fixes sgl-project#20462
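
One way to make a missed conversion fail loudly, instead of silently producing garbage embeddings, is a guard in the forward path. The following is a hypothetical sketch using plain-Python stand-ins: `PatchEmbedStub` and both loader functions are invented for illustration, and only the method name `copy_conv3d_weight_to_linear` comes from this PR. It is not what the referencing commit does; judging by its message, that commit addresses the missing call in glm4v_moe.py and glm_ocr.py.

```python
class PatchEmbedStub:
    """Hypothetical stand-in for a patch-embed module with a post-load conversion step."""

    def __init__(self):
        self.converted = False

    def copy_conv3d_weight_to_linear(self):
        # A real module would copy the Conv3d weights into the linear layer here.
        self.converted = True

    def forward(self, x):
        # Guard: refuse to run with an unconverted (randomly initialized) linear layer.
        if not self.converted:
            raise RuntimeError(
                "copy_conv3d_weight_to_linear() was not called after loading weights"
            )
        return x  # placeholder for the real projection


def load_weights_with_conversion(embed):
    # ... checkpoint tensors would be loaded into the Conv3d here ...
    embed.copy_conv3d_weight_to_linear()


def load_weights_missing_conversion(embed):
    # Same loading logic, but the conversion call was forgotten.
    pass


good = PatchEmbedStub()
load_weights_with_conversion(good)
good_result = good.forward([1.0, 2.0])

bad = PatchEmbedStub()
load_weights_missing_conversion(bad)
try:
    bad.forward([1.0, 2.0])
    bad_error = None
except RuntimeError as exc:
    bad_error = str(exc)

print(good_result, bad_error)
```

With a guard like this, a model file that forgets the conversion call fails on the first forward pass rather than returning text unrelated to the image.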
Wangzheee pushed a commit to Wangzheee/sglang that referenced this pull request Mar 21, 2026
Co-authored-by: luoyuan.luo <luoyuan.luo@antgroup.com>