[MXFP4] Support for linear layers + compressed-tensors integration by dsikka · Pull Request #41664 · vllm-project/vllm

dsikka · 2026-05-04T21:12:40Z

Purpose

Rename the scheme from a16 to a4 so that the name is agnostic to activation quantization.
Extend flashinfer_mm_fp4 and flashinfer_scaled_fp4_mm to accept a configurable block_size and a boolean use_nvfp4 flag, which together enable the mxfp4 linear forward pass. This is based on https://github.com/flashinfer-ai/flashinfer/blob/393e83ea8497ff9fb9ad61e170b89797a6b682a3/flashinfer/gemm/gemm_base.py#L5511
Follow the existing linear-kernel pattern so that either marlin or flashinfer cutlass is selected based on env overrides and the platform.
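
The kernel selection described above could look roughly like the sketch below. It is not the code from this PR: `Mxfp4Backend` and `select_mxfp4_backend` are hypothetical names, and the priority order (env override first, then FlashInfer on SM100+, Marlin otherwise) is an assumption pieced together from the `VLLM_MXFP4_USE_MARLIN` flag used in the test plan and the SM100+ requirement mentioned in review.

```python
from enum import Enum


class Mxfp4Backend(Enum):
    """Hypothetical backend enum, for illustration only."""

    MARLIN = "marlin"
    FLASHINFER_CUTLASS = "flashinfer_cutlass"


def select_mxfp4_backend(env, sm_version):
    """Pick a backend from an env mapping and a compute capability (100 == SM100).

    Assumed priority: the env override wins, then FlashInfer on SM100+,
    with Marlin as the fallback on older devices.
    """
    if env.get("VLLM_MXFP4_USE_MARLIN") == "1":
        return Mxfp4Backend.MARLIN
    if sm_version >= 100:
        return Mxfp4Backend.FLASHINFER_CUTLASS
    return Mxfp4Backend.MARLIN
```

Passing the environment as a plain mapping (for example `os.environ`) keeps the selection logic easy to test without touching the real process environment.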

Test Plan

Added basic smoke tests
LM Eval validation

LM-Eval

lm_eval \
  --model vllm \
  --model_args pretrained="nm-testing/Meta-Llama-3-8B-Instruct-MXFP4-GPTQ",dtype=auto,add_bos_token=True,max_model_len=4096,max_gen_toks=1024,tensor_parallel_size=1 \
  --tasks gsm8k_cot_llama \
  --fewshot_as_multiturn \
  --apply_chat_template \
  --num_fewshot 8 \
  --batch_size auto

flashinfer

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.6892|±  |0.0127|
|               |       |strict-match    |     8|exact_match|↑  |0.6846|±  |0.0128|

marlin (VLLM_MXFP4_USE_MARLIN=1)

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.7604|±  |0.0118|
|               |       |strict-match    |     8|exact_match|↑  |0.7551|±  |0.0118|

dense (meta-llama/Meta-Llama-3-8B-Instruct)

|     Tasks     |Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|---------------|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k_cot_llama|      4|flexible-extract|     8|exact_match|↑  |0.7998|±  | 0.011|
|               |       |strict-match    |     8|exact_match|↑  |0.7991|±  | 0.011|
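
To compare the three tables at a glance, the flexible-extract scores can be expressed as a fraction of the dense baseline. The snippet below is only arithmetic on the numbers reported above; `recovery` is a hypothetical helper name, and the reported stderr is not taken into account.

```python
# Flexible-extract exact_match scores copied from the tables above.
DENSE = 0.7998
RESULTS = {"flashinfer": 0.6892, "marlin": 0.7604}


def recovery(quantized, dense):
    """Fraction of the dense accuracy retained by a quantized run."""
    return quantized / dense


recoveries = {name: recovery(score, DENSE) for name, score in RESULTS.items()}

for name, value in recoveries.items():
    print(f"{name}: {value:.1%} of dense accuracy")
```

On these numbers the flashinfer path retains roughly 86% of the dense score and the marlin path roughly 95%.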

Signed-off-by: Dipika <dipikasikka1@gmail.com>

gemini-code-assist

Code Review

This pull request renames the MXFP4 quantization scheme to CompressedTensorsW4A4Mxfp4 and introduces support for true W4A4 quantization using FlashInfer on SM100+ devices. The changes include new utility functions for FlashInfer FP4 operations and logic to handle activation quantization. A critical issue was identified in the weight scale processing where swizzle_mxfp4_scales might cause a RuntimeError during reshaping if the output feature size is not a multiple of 128 due to internal padding.

gemini-code-assist · 2026-05-04T21:14:30Z

+            N, scale_K = layer.weight_scale.shape
+            K = scale_K * self.group_size
+            layer.weight_scale = Parameter(
+                swizzle_mxfp4_scales(layer.weight_scale.data, N, K).reshape(N, -1),


The swizzle_mxfp4_scales function pads the N dimension to the nearest multiple of 128. If N (the output feature size) is not a multiple of 128, the total number of elements in the swizzled tensor will be padded_N * padded_scale_cols, which is not necessarily divisible by N. This will cause a RuntimeError during the .reshape(N, -1) call. Even if it were divisible, the resulting 2D tensor would have misaligned scale data because of the padding introduced during swizzling. You should ensure that the scale tensor's shape is compatible with what the FlashInfer kernel expects, which likely involves keeping the padded dimensions or ensuring the kernel handles the original N correctly with the swizzled layout.
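
The element-count argument above can be checked with plain arithmetic. The sketch below does not call `swizzle_mxfp4_scales`; it only models the padding, assuming rows are padded to a multiple of 128 (as stated in the comment) and scale columns to a multiple of 4 (an assumption, since the comment does not give the column multiple). All helper names are hypothetical.

```python
def round_up(value, multiple):
    """Round value up to the nearest multiple."""
    return ((value + multiple - 1) // multiple) * multiple


def swizzled_numel(n, scale_k, row_multiple=128, col_multiple=4):
    """Element count after padding. col_multiple=4 is an assumed value."""
    return round_up(n, row_multiple) * round_up(scale_k, col_multiple)


def reshape_cols(numel, rows):
    """Column count that reshape(rows, -1) would infer, or None if it would fail."""
    return numel // rows if numel % rows == 0 else None


group_size = 32
K = 4096
scale_K = K // group_size  # 128 scale columns, matching K = scale_K * group_size

# N = 192 is not a multiple of 128: reshape(N, -1) cannot infer a column count.
bad = reshape_cols(swizzled_numel(192, scale_K), 192)

# N = 64 happens to divide the padded size, but the inferred width is
# 256 columns instead of 128, so rows no longer line up with the scales.
misaligned = reshape_cols(swizzled_numel(64, scale_K), 64)

# Reshaping with the padded row count always yields a consistent width.
fixed = reshape_cols(swizzled_numel(192, scale_K), round_up(192, 128))
```

This is consistent with the fix described later in the thread, where the reshape uses padded_N rather than N.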

Signed-off-by: Dipika <dipikasikka1@gmail.com>

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

claude

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

dsikka · 2026-05-05T19:32:10Z

@yewentao256 @mgoin

SUMMARY: - Move out of experimental as supported in vLLM as of: vllm-project/vllm#41664 --------- Signed-off-by: Dipika Sikka <ds3822@columbia.edu> Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

yewentao256

Thanks for the work!

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs

Support looks good, was able to verify locally

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

mgoin

LGTM, thanks!

mergify · 2026-05-08T18:19:58Z

Hi @dsikka, the pre-commit checks have failed. Please run:

uv pip install "pre-commit>=4.5.1"
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.

Tip

Is mypy failing?

mypy is run differently in CI. If the failure is related to this check, please use the following command to run it locally:

# For mypy (substitute "3.10" with the failing version if needed)
pre-commit run --hook-stage manual mypy-3.10

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

yewentao256

Sorry to block for a while, please take a look at my previous comment

dsikka · 2026-05-11T15:19:04Z

> Sorry to block for a while, please take a look at my previous comment

@yewentao256 Please take a look at the latest commits. This has been addressed to use the padded_N for the reshape

Dismissing the change request, as the issue has already been resolved.

yewentao256

Thanks for the work! A small update

Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Dipika Sikka <ds3822@columbia.edu>

dsikka added 2 commits May 4, 2026 20:30

update

baafb30

Signed-off-by: Dipika <dipikasikka1@gmail.com>

update

8e28f85

Signed-off-by: Dipika <dipikasikka1@gmail.com>

mergify Bot added the nvidia label May 4, 2026

github-project-automation Bot added this to NVIDIA May 4, 2026

gemini-code-assist Bot reviewed May 4, 2026

use linear kernel abstraction

f935818

Signed-off-by: Dipika <dipikasikka1@gmail.com>

dsikka mentioned this pull request May 5, 2026

[MXFP4] Move out of experimental folder vllm-project/llm-compressor#2685

Merged

dsikka added 2 commits May 5, 2026 15:28

add test models

fadc701

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

Merge branch 'main' into ct_mxfp4

aade555

dsikka marked this pull request as ready for review May 5, 2026 19:32

dsikka requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth and yewentao256 as code owners May 5, 2026 19:32

claude Bot reviewed May 5, 2026

yewentao256 reviewed May 5, 2026

Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py

dsikka changed the title ~~[MXFP4] Support for compressed-tensors linear layers~~ [MXFP4] Support for linear layers + compressed-tensors integration May 6, 2026

fix padded mx linear

e50f724

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

kylesayrs approved these changes May 7, 2026

update dummy shape

eda7aa3

Signed-off-by: Kyle Sayers <kylesayrs@gmail.com>

mgoin approved these changes May 8, 2026

github-project-automation Bot moved this to Ready in NVIDIA May 8, 2026

mgoin added the ready and quantization labels May 8, 2026

dsikka added 2 commits May 8, 2026 21:46

Merge branch 'main' into ct_mxfp4

8efe81c

Merge branch 'main' into ct_mxfp4

13d2a97

Signed-off-by: Dipika Sikka <dipikasikka1@gmail.com>

dsikka requested a review from zyongye as a code owner May 11, 2026 14:58

yewentao256 previously requested changes May 11, 2026

github-project-automation Bot moved this from Ready to In review in NVIDIA May 11, 2026

dsikka requested a review from yewentao256 May 11, 2026 15:35

yewentao256 reviewed May 11, 2026

Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py Outdated

Comment thread vllm/model_executor/kernels/linear/mxfp4/flashinfer.py Outdated

dsikka and others added 2 commits May 11, 2026 18:19

Apply suggestions from code review

fca2eed

Co-authored-by: Wentao Ye <44945378+yewentao256@users.noreply.github.com> Signed-off-by: Dipika Sikka <ds3822@columbia.edu>

Merge branch 'main' into ct_mxfp4

ece7377

dsikka requested a review from yewentao256 May 12, 2026 10:09

dsikka mentioned this pull request May 12, 2026

Q2 2026 Roadmap vllm-project/llm-compressor#2624

Open

20 tasks

mgoin merged commit a7b801e into vllm-project:main May 12, 2026
78 checks passed

github-project-automation Bot moved this from In review to Done in NVIDIA May 12, 2026

vllm-agent mentioned this pull request May 13, 2026

Revert "[MXFP4] Support for linear layers + compressed-tensors integration" (#41664) #42473

Closed

weifang231 pushed a commit to weifang231/eb-vllm that referenced this pull request May 13, 2026

[MXFP4] Support for linear layers + compressed-tensors integration (v…

c6009e8

…llm-project#41664)

mfylcek pushed a commit to mfylcek/vllm that referenced this pull request May 19, 2026

[MXFP4] Support for linear layers + compressed-tensors integration (v…

3ca3dcf

…llm-project#41664)

jhu960213 pushed a commit to jhu960213/vllm that referenced this pull request May 20, 2026

[MXFP4] Support for linear layers + compressed-tensors integration (v…

ceb2b08

…llm-project#41664)

h1t35h pushed a commit to h1t35h/vllm that referenced this pull request May 21, 2026

[MXFP4] Support for linear layers + compressed-tensors integration (v…

2b1f466

…llm-project#41664)
