Conversation

@b8zhong b8zhong commented Nov 1, 2025

Fix #12208

I didn't run into the original error myself, but the following traceback was reported:

               ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1960, in load_model
    return super().load_model(
           ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 614, in load_weights_and_postprocess
    quant_method.process_weights_after_loading(module)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 1453, in process_weights_after_loading
    weight_scale.shape[2] % 16 == 0
AssertionError: Expected w2_weight_scale.dim(2) to be divisible by 16
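
For context, the failing check is pure shape arithmetic: NVFP4 stores one block scale per group_size elements along K, so the scale tensor's last dimension is K' = K / group_size, which need not be a multiple of 16. An illustrative calculation (the numbers below are hypothetical, not taken from the GLM-4.5 config):

    group_size = 16
    K = 1408                    # hypothetical per-expert contraction dim
    k_prime = K // group_size   # 88 scale entries along the last dim

    print(k_prime % 16 == 0)    # False -> the old assertion fires
    print(k_prime % 4 == 0)     # True  -> the relaxed check passes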

Since we already apply swizzle padding, the strict divisibility check may be unnecessary.
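
A minimal sketch of the idea (this helper is my paraphrase of the change, not the exact sglang code; the function name and the direct group_size argument are assumptions):

    import logging

    import torch
    import torch.nn.functional as F

    logger = logging.getLogger(__name__)

    def relax_scale_check(
        name: str, weight_scale: torch.Tensor, group_size: int
    ) -> torch.Tensor:
        """Warn instead of asserting, then pad the last dim before swizzling."""
        k_prime = weight_scale.shape[2]
        if k_prime % 4 != 0:
            logger.warning(
                "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
                name,
                tuple(weight_scale.shape),
                group_size,
            )
        # The swizzled scale layout is padded to tile boundaries anyway,
        # so pad K' up to the next multiple of 16 rather than aborting.
        pad = (-k_prime) % 16
        if pad:
            weight_scale = F.pad(weight_scale, (0, pad))
        return weight_scale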

python3 -m sglang.launch_server \
  --model-path iAzure/GLM-4.5-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 4 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:42<00:00, 30.73it/s]
Accuracy: 0.707
Invalid: 0.002
Latency: 43.290 s
Output throughput: 3781.237 token/s

I'm not sure what the GSM8K baseline was before this change. @Azure-Tang, could you let me know if this looks reasonable?

@gemini-code-assist

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a model loading issue related to NVFP4 quantization by modifying the validation logic for weight scales. It changes a rigid assertion about dimension divisibility into a more flexible warning, acknowledging that certain padding schemes might make the stricter check unnecessary. This change aims to prevent crashes during model loading for specific quantized models, improving robustness without compromising the core functionality.

Highlights

  • Relaxed Padding Requirement: The strict padding requirement for NVFP4 weight loading, specifically that weight_scale.shape[2] must be divisible by 16, has been relaxed. The new check only requires divisibility by 4.
  • Assertion to Warning Conversion: A critical assertion that would previously cause model loading to fail if the padding condition was not met has been replaced with a non-blocking warning. This allows models to load even if the weight_scale dimension is not perfectly divisible by 4, while still alerting developers to potential non-optimal configurations.
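
In code terms, the two highlights amount to roughly this before/after (paraphrased from the summary above and the error in the PR description, not the verbatim diff):

    # Before: a hard assert aborted model loading.
    assert (
        weight_scale.shape[2] % 16 == 0
    ), f"Expected {name}_weight_scale.dim(2) to be divisible by 16"

    # After: a relaxed, non-blocking check.
    if weight_scale.shape[2] % 4 != 0:
        logger.warning(
            "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
            name,
            tuple(weight_scale.shape),
            getattr(self.quant_config, "group_size", None),
        )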

@b8zhong b8zhong requested a review from Copilot November 1, 2025 17:15
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an issue where an overly strict assertion on weight scale padding for FP4 quantization was causing model loading to fail. The change correctly relaxes this assertion to a warning, which is appropriate as the subsequent code handles the necessary padding. This is a good fix that improves model compatibility. I've added one suggestion to improve the clarity of the new warning message.

Comment on lines +1442 to +1447
                logger.warning(
                    "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
                    name,
                    tuple(weight_scale.shape),
                    getattr(self.quant_config, "group_size", None),
                )

Severity: medium

The warning message is a bit cryptic with K'. It would be more helpful for developers if the message explicitly stated which dimension is being referred to and what the consequence is (i.e., that padding will be applied).

                logger.warning(
                    "NVFP4 %s_weight_scale last dim not a multiple of 4 (shape=%s, "
                    "group_size=%s). Padding will be applied.",
                    name,
                    tuple(weight_scale.shape),
                    getattr(self.quant_config, "group_size", None),
                )

Copilot AI left a comment

Pull Request Overview

This PR relaxes the validation constraint for NVFP4 weight scale dimensions in MoE (Mixture of Experts) processing. The change replaces a hard assertion requiring the K' dimension (shape[2]) to be divisible by 16 with a warning when it's not divisible by 4, allowing models with different group sizes to load successfully.

  • Replaces assertion with conditional warning for weight scale shape validation
  • Changes divisibility requirement from 16 to 4 for the K' dimension
  • Adds diagnostic information (shape and group_size) to the warning message


@b8zhong b8zhong added the run-ci label Nov 2, 2025


Successfully merging this pull request may close these issues.

[Bug] GLM-4.5 NVFP4 quant output stuck looping.
