
Conversation

@LopezCastroRoberto (Contributor) commented Sep 8, 2025

Purpose

This pull request brings in the QuTLASS library: https://github.com/iST-DASLab/qutlass

QuTLASS is a high-performance library designed for low-precision kernel support in deep learning quantization, built on top of NVIDIA CUTLASS.

QuTLASS v0.1.0 introduces 4-bit microscaling routines tailored for Large Language Model (LLM) inference on NVIDIA Blackwell GPUs.

  • Online rotations:
    • Fused transform + quantization + scale computation.
      • Rotation matrices loaded at runtime, allowing any transformation to be applied.
    • Support for both NVFP4 and MXFP4 microscaling formats.
    • Multiple rotation sizes (16/32/64/128).
  • MXFP4 matmul kernel support powered by CUTLASS (a usage sketch follows this list).
    • QuTLASS is compatible with any matmul backend supporting microscaling formats (e.g., CUTLASS, FlashInfer).
  • Multiple quantization schemes:
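
Below is a minimal sketch of how the fused quantize + matmul path added here might be called from Python. The op names (`fusedQuantizeMx`, `matmul_mxf4_bf16_tn`) come from this PR's benchmarks and tests; the shapes, the identity stand-in for the rotation matrix, the `method` value, and the omission of the scale-layout rearrangement are illustrative assumptions, not the exact benchmark code.

```python
import torch
from vllm._custom_ops import fusedQuantizeMx, matmul_mxf4_bf16_tn

M, N, K, group = 16, 4096, 4096, 32  # assumed problem size and MXFP4 group size
a = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")   # activations
b = torch.randn(N, K, dtype=torch.bfloat16, device="cuda")   # weights (transposed-B matmul)
had = torch.eye(group, dtype=torch.bfloat16, device="cuda")  # identity stand-in for a Hadamard rotation

# Fused rotation + quantization + scale computation (E2M1 values, E8M0 scales)
a_e2m1, a_e8m0 = fusedQuantizeMx(a, had, method="quest")
b_e2m1, b_e8m0 = fusedQuantizeMx(b, had, method="quest")

# MXFP4 x MXFP4 -> BF16 matmul. The benchmarks additionally rearrange the
# scales with to_blocked(...) before this call; that step is omitted here.
alpha = torch.tensor([1.0], device="cuda")
out = matmul_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)
```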

Microbenchmarking

  • benchmarks/kernels/bench_mxfp4_qutlass.py
  • benchmarks/kernels/bench_nvfp4_qutlass.py

(Plots: MXFP4:MXFP4 and NVFP4:NVFP4 microbenchmarks.) QuTLASS performance on a single Qwen3-32B layer with an NVIDIA RTX 5090 GPU.

(Plots: MXFP4:MXFP4 and NVFP4:NVFP4 microbenchmarks.) QuTLASS performance on a single Llama-3.1-70B layer with an NVIDIA B200 GPU.

[WIP] End-to-end

  • python benchmarks/benchmark_latency.py
    • daslab-testing/Llama-3.3-70B-Instruct-FPQuant-GPTQ-MXFP4-hadamard
    • meta-llama/Llama-3.3-70B-Instruct

(Plot: vLLM end-to-end results on an NVIDIA B200.)

FP16

| Quantization | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery % |
|---|---|---|---|---|---|---|
| N/A | 0.866 | 0.951 | 0.862 | 0.849 | 0.882 | - |

MXFP4

| Quantization | MMLU-CoT | GSM8k | Hellaswag | Winogrande | Average | Recovery % |
|---|---|---|---|---|---|---|
| RTN | 0.834 | 0.927 | 0.839 | 0.815 | 0.854 | 96.8 |
| RTN + HAD (GS32) | 0.839 | 0.936 | 0.841 | 0.836 | 0.863 | 97.8 |
| GPTQ | 0.838 | 0.945 | 0.844 | 0.826 | 0.863 | 97.9 |
| GPTQ + HAD (GS32) | 0.848 | 0.945 | 0.847 | 0.831 | 0.868 | 98.4 |

Testing

  • tests/kernels/quantization/test_mxfp4_qutlass.py
  • tests/kernels/quantization/test_nvfp4_qutlass.py

@gemini-code-assist (bot) left a comment

Code Review

This pull request integrates the QuTLASS library to add support for 4-bit quantization kernels, including new custom ops, benchmarks, and tests. The changes are well-structured. I have two high-severity suggestions: one to improve build reproducibility by pinning the QuTLASS dependency to a specific version, and another to fix a bug in a new test file to prevent future issues.

FetchContent_Declare(
qutlass
GIT_REPOSITORY https://github.com/IST-DASLab/qutlass.git
GIT_TAG main
@gemini-code-assist (bot), severity: high

Using main as the GIT_TAG can lead to non-reproducible builds and may break the build if there are incompatible changes in the QuTLASS repository's main branch. It is highly recommended to pin this to a specific commit hash or a release tag (like v0.1.0 as mentioned in the PR description) to ensure build stability and reproducibility.

    GIT_TAG v0.1.0

b_dq, *_ = _dq_fp4(b_e2m1, b_e8m0[:n, :k], alpha=1.)
out_ref = a_dq @ b_dq.transpose(-2, -1)

out = qutlass.matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)
@gemini-code-assist (bot), severity: high

The run_problem_ada function attempts to call qutlass.matmul_ada_mxf4_bf16_tn, but qutlass is not defined or imported. This will result in a NameError. Although this function is not currently called, it's best to fix it to prevent future issues.

To fix this, you should add matmul_ada_mxf4_bf16_tn to your imports at the top of the file:

from vllm._custom_ops import matmul_mxf4_bf16_tn, fusedQuantizeMx, matmul_ada_mxf4_bf16_tn

And then update this line accordingly.

Suggested change
out = qutlass.matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)
out = matmul_ada_mxf4_bf16_tn(a_e2m1, b_e2m1, a_e8m0, b_e8m0, alpha)

@voipmonitor

@LopezCastroRoberto does this PR support gpt-oss on sm120? How exactly can we test some MXFP4 models with this PR? Would love to test an RTX 6000 Pro with this.

@jeejeelee jeejeelee requested a review from mgoin September 8, 2025 15:33
Comment on lines 37 to 45
return torch.tensor(
hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device
)
Contributor:

Can you use our hadamard utility for consistency?

from compressed_tensors.transform.utils.hadamard import deterministic_hadamard_matrix
Suggested change
return torch.tensor(
hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device
)
return deterministic_hadamard_matrix(group_size, dtype=dtype, device=device) * group_size**-0.5


def build_mxfp4_runner(cfg, a, b, forward_hadamard_matrix, dtype, device):
weight_hf_e2m1, weight_hf_scale_block = _quant_weight_mxfp4(b, forward_hadamard_matrix, device)
alpha = torch.Tensor([1.]).to("cuda")
Contributor:
Suggested change
alpha = torch.Tensor([1.]).to("cuda")
alpha = torch.tensor([1.], device="cuda")


def get_hadamard_matrix(group_size: int, dtype: torch.dtype, device: torch.device):
return torch.tensor(
hadamard(group_size) * group_size**-0.5, dtype=dtype, device=device
@kylesayrs (Contributor) commented Sep 8, 2025:

Same here, use our util

'Llama-3.1-70B': [(8192, 8192), (8192, 57344), (28672, 8192)]
}

for model, layers in MODELS.items():
Contributor:

Please wrap this in `if __name__ == "__main__":`

Contributor:

Consider adding some user arguments

'Llama-3.1-70B': [(8192, 8192), (8192, 57344), (28672, 8192)]
}

for model, layers in MODELS.items():
Contributor:

Please wrap this in `if __name__ == "__main__":`

Contributor:

Consider allowing users to specify arguments; that way you don't have to keep commented-out code.
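
For reference, a minimal sketch of the structure being requested here (main guard plus user arguments); the argument names and defaults are hypothetical:

```python
import argparse

def run_benchmarks(models: list[str], batch_sizes: list[int]) -> None:
    for model in models:
        for bs in batch_sizes:
            ...  # run the MXFP4/NVFP4 microbenchmark for (model, bs)

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="QuTLASS microbenchmark")
    parser.add_argument("--models", nargs="+", default=["Llama-3.1-70B"])
    parser.add_argument("--batch-sizes", nargs="+", type=int, default=[1, 16, 64])
    args = parser.parse_args()
    run_benchmarks(args.models, args.batch_sizes)
```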


def fusedQuantizeMx(a: torch.Tensor,
b: torch.Tensor,
*,
@kylesayrs (Contributor) commented Sep 8, 2025:

What's the point of this *?

Contributor (Author):

It means all arguments that come after the * must be passed by keyword, not by position. The intent is to make the API clearer and less error-prone.
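
For reference, a minimal illustration of the keyword-only marker (the names here are illustrative, not the PR's API):

```python
def fused_quantize(a, b, *, method: str = "quest"):
    # Everything after * can only be passed by keyword.
    return a, b, method

fused_quantize(1, 2, method="quest")   # OK: keyword argument
# fused_quantize(1, 2, "quest")        # TypeError: method is keyword-only
```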

Contributor:

That's fair!

xh_e8m0 = torch.empty(padded_rows, padded_cols, dtype=torch.float8_e8m0fnu, device=a.device)

if method=="quest":
return torch.ops._qutlass_C.fusedQuantizeMxQuest(a, b, xh_e2m1, xh_e8m0)
@kylesayrs (Contributor) commented Sep 8, 2025:

Because these functions have a return value, you'll want to register a fake function so torch.compile works correctly:

if hasattr(torch.ops._C, "_qutlass_C"):
    @register_fake("_C::_qutlass_C::fusedQuantizeMxQuest")
    def fake_qutlass_mx_quest(a: torch.Tensor, b: torch.Tensor, xh_e2m1: torch.Tensor, xh_e8m0: torch.Tensor) -> tuple[torch.Tensor, torch.Tensor]:
        return (torch.empty(...), torch.empty(...))

output_block_stride,
BLOCK_ROWS: tl.constexpr,
BLOCK_COLS: tl.constexpr,
):
Contributor:

Is this over-indented? I think we should standardize on a 4-space indent.

return (a + b - 1) // b


def to_blocked(input_matrix, use_triton_kernel: bool = False) -> Tensor:
Contributor:

Just as a style thing, consider calling triton_mx_block_rearrange in cases where you want to use the triton kernel and to_blocked otherwise

Contributor (Author):

How about keeping one to_blocked but making the backend explicit (e.g. backend="torch" | "triton" | "auto")?

Contributor:

Both good!
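
A sketch of the explicit-backend variant proposed above; the helper names follow the thread (`to_blocked`, `triton_mx_block_rearrange`), but the helper bodies here are placeholders and the "auto" dispatch rule is an assumption:

```python
import torch

def _to_blocked_torch(x: torch.Tensor) -> torch.Tensor:
    return x  # placeholder for the existing pure-torch rearrangement

def triton_mx_block_rearrange(x: torch.Tensor) -> torch.Tensor:
    return x  # placeholder for the Triton kernel

def to_blocked(input_matrix: torch.Tensor, backend: str = "auto") -> torch.Tensor:
    """Single entry point with an explicit backend selector."""
    if backend == "auto":
        backend = "triton" if input_matrix.is_cuda else "torch"
    if backend == "triton":
        return triton_mx_block_rearrange(input_matrix)
    if backend == "torch":
        return _to_blocked_torch(input_matrix)
    raise ValueError(f"unknown backend: {backend!r}")
```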

# Quantize activation on-the-fly
def run():
input_hf_e2m1, input_hf_e8m0 = fusedQuantizeNv(a, forward_hadamard_matrix, global_scale)
input_hf_scale_block = to_blocked(input_hf_e8m0, True).view(-1,K//16)
Contributor:

Will the Triton JIT affect the benchmarked runtime? I.e., does the first-time compile cause the first call to take longer than normal?

Contributor (Author):

Yes, the very first call is slower, but after that the compiled kernel is cached.
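
A generic sketch of how the one-time Triton compile can be kept out of the measurement (this is not the benchmark script's actual timing harness):

```python
import torch

def bench_ms(fn, warmup: int = 10, iters: int = 100) -> float:
    """Mean milliseconds per call, excluding first-call JIT/compile cost."""
    for _ in range(warmup):      # triggers and caches the Triton JIT compile
        fn()
    torch.cuda.synchronize()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters
```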

@LopezCastroRoberto LopezCastroRoberto marked this pull request as draft September 9, 2025 10:24
@LopezCastroRoberto (Contributor, Author) commented:

@voipmonitor This PR supports dense models only, and it's perfectly fine to use an RTX 6000 Pro. We will add usage examples to this PR soon.

We’re actively working on MoE support in QuTLASS—stay tuned :)

@mergify (bot) added labels: documentation, deepseek, frontend, llama, multi-modality, new-model, qwen, rocm, structured-output (Sep 11, 2025)
@mgoin added the quantization and kernel labels and removed the tool-calling, llama, qwen, and deepseek labels (Sep 22, 2025)
@mgoin removed this from the Tool Calling project (Sep 22, 2025)
@mergify (bot) added the performance label (Sep 22, 2025)
Member:

Please convert these to use pytest like the other tests and add a skipif based on compute capability. You can add these tests to the Blackwell test runner:

- label: Blackwell Test # 38 min
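
A sketch of the requested pytest structure with a capability-based skip; the capability threshold (SM100+) and the test body are assumptions:

```python
import pytest
import torch

requires_blackwell = pytest.mark.skipif(
    not (torch.cuda.is_available() and torch.cuda.get_device_capability()[0] >= 10),
    reason="QuTLASS MXFP4/NVFP4 kernels need Blackwell-class (SM100+) GPUs",
)

@requires_blackwell
@pytest.mark.parametrize("shape", [(128, 256, 512)])
def test_mxfp4_qutlass_matches_reference(shape):
    ...  # quantize, run the kernel, compare against a dequantized reference
```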

Member:

Does this require some minimum CUDA version?

@BlackSamorez commented Sep 30, 2025:

Fixed the register_fake registrations. They had the wrong namespace (_C::_qutlass_C instead of just _qutlass_C), and quite a few kernels (matmul_mxf4_bf16_tn, matmul_ada_mxf4_bf16_tn, fused_quantize_nv) had no fake impls at all.
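
For context, a sketch of a fake (meta) registration with the corrected namespace, using torch.library.register_fake; the op name comes from the earlier review comment, while the guard and output construction here are placeholders:

```python
import torch
from torch.library import register_fake

if hasattr(torch.ops._qutlass_C, "fusedQuantizeMxQuest"):
    @register_fake("_qutlass_C::fusedQuantizeMxQuest")
    def _fused_quantize_mx_quest_fake(a, b, xh_e2m1, xh_e8m0):
        # Meta implementation: only shapes/dtypes matter, no computation runs.
        return torch.empty_like(xh_e2m1), torch.empty_like(xh_e8m0)
```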

BlackSamorez and others added 2 commits October 1, 2025 20:53
Signed-off-by: Andrei Panferov <[email protected]>
Signed-off-by: LopezCastroRoberto <[email protected]>
@mgoin (Member) commented Oct 2, 2025:

@LopezCastroRoberto it looks like the Blackwell tests are broken at the moment: `ERROR: Arch conditional MMA instruction used without targeting appropriate compute capability. Aborting.`

LopezCastroRoberto and others added 8 commits October 2, 2025 09:10
Signed-off-by: LopezCastroRoberto <[email protected]>
Signed-off-by: LopezCastroRoberto <[email protected]>
Signed-off-by: LopezCastroRoberto <[email protected]>
Signed-off-by: LopezCastroRoberto <[email protected]>
Signed-off-by: Andrei Panferov <[email protected]>

minor fixes

eager works

eager tests

custom ops fake fix

eager works

eager tests

removed extra op

style

Signed-off-by: Andrei Panferov <[email protected]>
mergify bot commented Oct 6, 2025:

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @LopezCastroRoberto.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Oct 6, 2025
LopezCastroRoberto and others added 2 commits October 6, 2025 07:21
Signed-off-by: LopezCastroRoberto <[email protected]>
Signed-off-by: Roberto L. Castro <[email protected]>
@mergify mergify bot removed the needs-rebase label Oct 7, 2025
Labels: ci/build, kernel, performance, quantization, ready
Projects: None yet
5 participants