Conversation

@b8zhong b8zhong commented Nov 1, 2025

Fix #12208

I didn't run into the original error myself, but the following traceback was reported:

               ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/__init__.py", line 28, in get_model
    return loader.load_model(
           ^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 1960, in load_model
    return super().load_model(
           ^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 595, in load_model
    self.load_weights_and_postprocess(
  File "/sgl-workspace/sglang/python/sglang/srt/model_loader/loader.py", line 614, in load_weights_and_postprocess
    quant_method.process_weights_after_loading(module)
  File "/sgl-workspace/sglang/python/sglang/srt/layers/quantization/modelopt_quant.py", line 1453, in process_weights_after_loading
    weight_scale.shape[2] % 16 == 0
AssertionError: Expected w2_weight_scale.dim(2) to be divisible by 16
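
For context, the failing check is pure shape arithmetic: NVFP4 stores one block scale per group_size elements along K, so the scale tensor's last dimension is K' = K / group_size, which need not be a multiple of 16. An illustrative calculation (the numbers below are hypothetical, not taken from the GLM-4.5 config):

    group_size = 16
    K = 1408                    # hypothetical per-expert contraction dim
    k_prime = K // group_size   # 88 scale entries along the last dim

    print(k_prime % 16 == 0)    # False -> the old assertion fires
    print(k_prime % 4 == 0)     # True  -> the relaxed check passes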

Since we already apply swizzle padding, the strict divisibility check may be unnecessary.
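
A minimal sketch of the idea (this helper is my paraphrase of the change, not the exact sglang code; the function name and the direct group_size argument are assumptions):

    import logging

    import torch
    import torch.nn.functional as F

    logger = logging.getLogger(__name__)

    def relax_scale_check(
        name: str, weight_scale: torch.Tensor, group_size: int
    ) -> torch.Tensor:
        """Warn instead of asserting, then pad the last dim before swizzling."""
        k_prime = weight_scale.shape[2]
        if k_prime % 4 != 0:
            logger.warning(
                "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
                name,
                tuple(weight_scale.shape),
                group_size,
            )
        # The swizzled scale layout is padded to tile boundaries anyway,
        # so pad K' up to the next multiple of 16 rather than aborting.
        pad = (-k_prime) % 16
        if pad:
            weight_scale = F.pad(weight_scale, (0, pad))
        return weight_scale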

python3 -m sglang.launch_server \
  --model-path iAzure/GLM-4.5-NVFP4 \
  --quantization modelopt_fp4 \
  --tp 4 \
  --model-loader-extra-config '{"enable_multithread_load": true, "num_threads": 8}'
python3 benchmark/gsm8k/bench_sglang.py --num-shots 8 --num-questions 1319 --parallel 1319
/usr/local/lib/python3.12/dist-packages/torch/cuda/__init__.py:63: FutureWarning: The pynvml package is deprecated. Please install nvidia-ml-py instead. If you did not install pynvml directly, please report this to the maintainers of the package that installed pynvml for you.
  import pynvml  # type: ignore[import]
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1319/1319 [00:42<00:00, 30.73it/s]
Accuracy: 0.707
Invalid: 0.002
Latency: 43.290 s
Output throughput: 3781.237 token/s

I'm not sure what the GSM8K baseline was before this change. @Azure-Tang, could you let me know if this looks reasonable?

@gemini-code-assist

Summary of Changes

Hello @b8zhong, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a model loading issue related to NVFP4 quantization by modifying the validation logic for weight scales. It changes a rigid assertion about dimension divisibility into a more flexible warning, acknowledging that certain padding schemes might make the stricter check unnecessary. This change aims to prevent crashes during model loading for specific quantized models, improving robustness without compromising the core functionality.

Highlights

  • Relaxed Padding Requirement: The strict padding requirement for NVFP4 weight loading, specifically that weight_scale.shape[2] must be divisible by 16, has been relaxed. The new check only requires divisibility by 4.
  • Assertion to Warning Conversion: A critical assertion that would previously cause model loading to fail if the padding condition was not met has been replaced with a non-blocking warning. This allows models to load even if the weight_scale dimension is not perfectly divisible by 4, while still alerting developers to potential non-optimal configurations.
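
In code terms, the two highlights amount to roughly this before/after (paraphrased from the summary above and the error in the PR description, not the verbatim diff):

    # Before: a hard assert aborted model loading.
    assert (
        weight_scale.shape[2] % 16 == 0
    ), f"Expected {name}_weight_scale.dim(2) to be divisible by 16"

    # After: a relaxed, non-blocking check.
    if weight_scale.shape[2] % 4 != 0:
        logger.warning(
            "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
            name,
            tuple(weight_scale.shape),
            getattr(self.quant_config, "group_size", None),
        )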

@b8zhong b8zhong requested a review from Copilot November 1, 2025 17:15
@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request addresses an issue where an overly strict assertion on weight scale padding for FP4 quantization was causing model loading to fail. The change correctly relaxes this assertion to a warning, which is appropriate as the subsequent code handles the necessary padding. This is a good fix that improves model compatibility. I've added one suggestion to improve the clarity of the new warning message.

Comment on lines +1442 to +1447
                logger.warning(
                    "NVFP4 %s_weight_scale K' not multiple of 4: shape=%s, group_size=%s",
                    name,
                    tuple(weight_scale.shape),
                    getattr(self.quant_config, "group_size", None),
                )

Severity: medium

The warning message is a bit cryptic with K'. It would be more helpful for developers if the message explicitly stated which dimension is being referred to and what the consequence is (i.e., that padding will be applied).

                logger.warning(
                    "NVFP4 %s_weight_scale last dim not a multiple of 4 (shape=%s, "
                    "group_size=%s). Padding will be applied.",
                    name,
                    tuple(weight_scale.shape),
                    getattr(self.quant_config, "group_size", None),
                )

Copilot AI left a comment

Pull Request Overview

This PR relaxes the validation constraint for NVFP4 weight scale dimensions in MoE (Mixture of Experts) processing. The change replaces a hard assertion requiring the K' dimension (shape[2]) to be divisible by 16 with a warning when it's not divisible by 4, allowing models with different group sizes to load successfully.

  • Replaces assertion with conditional warning for weight scale shape validation
  • Changes divisibility requirement from 16 to 4 for the K' dimension
  • Adds diagnostic information (shape and group_size) to the warning message


@b8zhong b8zhong added the run-ci label Nov 2, 2025


Successfully merging this pull request may close these issues.

[Bug] GLM-4.5 NVFP4 quant output stuck looping.
