[MXFP4] Support for linear layers + compressed-tensors integration #41664

Merged
mgoin merged 11 commits into vllm-project:main from neuralmagic:ct_mxfp4 on May 12, 2026

Conversation

@dsikka
Contributor

@dsikka dsikka commented May 4, 2026

Purpose

Test Plan

  • Added basic smoke tests
  • LM Eval validation

LM-Eval

lm_eval \
  --model vllm \
  --model_args pretrained="nm-testing/Meta-Llama-3-8B-Instruct-MXFP4-GPTQ",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto

flashinfer

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.6892|±  |0.0127|
|               |       |strict-match    |     8|exact_match|↑  |0.6846|±  |0.0128|

marlin (VLLM_MXFP4_USE_MARLIN=1)

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.7604|±  |0.0118|
|               |       |strict-match    |     8|exact_match|↑  |0.7551|±  |0.0118|

dense (meta-llama/Meta-Llama-3-8B-Instruct)

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.7998|±  | 0.011|
|               |       |strict-match    |     8|exact_match|↑  |0.7991|±  | 0.011|
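
The three tables are easier to compare as a ratio against the dense baseline. The following is a minimal sketch of that arithmetic, using only the scores reported above; "recovery" (quantized score as a percentage of the dense score) is a common convention for this kind of comparison and is not a metric stated in this PR, and the function names are hypothetical.

```python
# Scores copied from the gsm8k_cot_llama tables above (exact_match, 8-shot).
DENSE = {"flexible-extract": 0.7998, "strict-match": 0.7991}
QUANTIZED = {
    "flashinfer": {"flexible-extract": 0.6892, "strict-match": 0.6846},
    "marlin": {"flexible-extract": 0.7604, "strict-match": 0.7551},
}


def recovery_percent(quantized: float, dense: float) -> float:
    """Return the quantized score as a percentage of the dense score."""
    return 100.0 * quantized / dense


def recovery_table() -> dict[str, dict[str, float]]:
    """Compute recovery for every backend and every filter."""
    return {
        backend: {
            name: recovery_percent(score, DENSE[name])
            for name, score in scores.items()
        }
        for backend, scores in QUANTIZED.items()
    }


if __name__ == "__main__":
    for backend, row in recovery_table().items():
        for name, pct in row.items():
            print(f"{backend:<10} {name:<16} {pct:6.2f}%")
```

On these numbers the marlin run recovers roughly 95% of the dense score and the flashinfer run roughly 86%, on both filters.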

dsikka added 2 commits May 4, 2026 20:30
Signed-off-by: Dipika <dipikasikka1@gmail.com>
Signed-off-by: Dipika <dipikasikka1@gmail.com>
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request renames the MXFP4 quantization scheme to CompressedTensorsW4A4Mxfp4 and introduces support for true W4A4 quantization using FlashInfer on SM100+ devices. The changes include new utility functions for FlashInfer FP4 operations and logic to handle activation quantization. A critical issue was identified in the weight scale processing where swizzle_mxfp4_scales might cause a RuntimeError during reshaping if the output feature size is not a multiple of 128 due to internal padding.

N, scale_K = layer.weight_scale.shape
K = scale_K * self.group_size
layer.weight_scale = Parameter(
    swizzle_mxfp4_scales(layer.weight_scale.data, N, K).reshape(N, -1),
Contributor

high

The swizzle_mxfp4_scales function pads the N dimension to the nearest multiple of 128. If N (the output feature size) is not a multiple of 128, the total number of elements in the swizzled tensor will be padded_N * padded_scale_cols, which is not necessarily divisible by N. This will cause a RuntimeError during the .reshape(N, -1) call. Even if it were divisible, the resulting 2D tensor would have misaligned scale data because of the padding introduced during swizzling. You should ensure that the scale tensor's shape is compatible with what the FlashInfer kernel expects, which likely involves keeping the padded dimensions or ensuring the kernel handles the original N correctly with the swizzled layout.
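
The divisibility problem described in this comment can be checked with plain integer arithmetic. This is a sketch under stated assumptions, not the vLLM or FlashInfer implementation: the 128-row padding comes from the comment, while the 4-column padding of the scale dimension and all function names here are assumptions made for illustration.

```python
def round_up(value: int, multiple: int) -> int:
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def swizzled_shape(n: int, scale_k: int,
                   row_align: int = 128,
                   col_align: int = 4) -> tuple[int, int, int]:
    """Return (padded_n, padded_cols, numel) of a padded scale tensor.

    row_align=128 follows the review comment; col_align=4 is an assumption.
    """
    padded_n = round_up(n, row_align)
    padded_cols = round_up(scale_k, col_align)
    return padded_n, padded_cols, padded_n * padded_cols


def reshape_rows_ok(numel: int, rows: int) -> bool:
    """A reshape(rows, -1) is only possible if rows divides numel."""
    return numel % rows == 0


if __name__ == "__main__":
    # N = 1000 output features, 128 scale columns (e.g. K = 4096 with 32-wide groups).
    padded_n, padded_cols, numel = swizzled_shape(1000, 128)
    print(padded_n, padded_cols, numel)
    print("reshape(N, -1) possible:       ", reshape_rows_ok(numel, 1000))
    print("reshape(padded_N, -1) possible:", reshape_rows_ok(numel, padded_n))
```

With N = 1000 the padded tensor holds 1024 * 128 = 131072 elements, which 1000 does not divide, so reshape(N, -1) cannot succeed, while reshaping with the padded row count always can.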

Signed-off-by: Dipika <dipikasikka1@gmail.com>
dsikka added 2 commits May 5, 2026 15:28
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
@dsikka dsikka marked this pull request as ready for review May 5, 2026 19:32

@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@dsikka
Contributor Author

dsikka commented May 5, 2026

@yewentao256 @mgoin

dsikka added a commit to vllm-project/llm-compressor that referenced this pull request May 5, 2026
SUMMARY:
- Move out of experimental as supported in vLLM as of:
vllm-project/vllm#41664

---------

Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
Member

@yewentao256 yewentao256 left a comment

Thanks for the work!

Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py
@dsikka dsikka changed the title from "[MXFP4] Support for compressed-tensors linear layers" to "[MXFP4] Support for linear layers + compressed-tensors integration" May 6, 2026
Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Contributor

@kylesayrs kylesayrs left a comment

Support looks good, was able to verify locally

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>
Member

@mgoin mgoin left a comment

LGTM, thanks!

@github-project-automation github-project-automation Bot moved this to Ready in NVIDIA May 8, 2026
@mgoin mgoin added the ready and quantization labels May 8, 2026
@mergify
Contributor

mergify Bot commented May 8, 2026

Hi @dsikka, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?
mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:
# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

dsikka added 2 commits May 8, 2026 21:46
Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>
@dsikka dsikka requested a review from zyongye as a code owner May 11, 2026 14:58
Member

@yewentao256 yewentao256 left a comment

Sorry to block for a while, please take a look at my previous comment

@github-project-automation github-project-automation Bot moved this from Ready to In review in NVIDIA May 11, 2026
@dsikka
Contributor Author

dsikka commented May 11, 2026

Sorry to block for a while, please take a look at my previous comment

@yewentao256 Please take a look at the latest commits. This has been addressed: the reshape now uses padded_N.

@dsikka dsikka requested a review from yewentao256 May 11, 2026 15:35
@yewentao256 yewentao256 dismissed their stale review May 11, 2026 17:09

Dismiss request change as already solved

Member

@yewentao256 yewentao256 left a comment

Thanks for the work! A small update

Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py Outdated
Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py Outdated
dsikka and others added 2 commits May 11, 2026 18:19
Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com>
Signed-off-by: Dipika Sikka <ds3822@columbia.edu>
@dsikka dsikka requested a review from yewentao256 May 12, 2026 10:09
@mgoin mgoin merged commit a7b801e into vllm-project:main May 12, 2026
78 checks passed
@github-project-automation github-project-automation Bot moved this from In review to Done in NVIDIA May 12, 2026
Labels

nvidia, quantization, ready

Projects

Status: Done

4 participants