[Feature][Quantization] Add SVDQuant W4A4 (nunchaku backend) by ultism · Pull Request #43471 · vllm-project/vllm

ultism · 2026-05-23T07:27:27Z

Summary

Implements RFC #37908 — adds SVDQuantConfig + SVDQuantLinearMethod that wrap the nunchaku W4A4 SVDQuant kernels, so SVDQuant-quantized checkpoints (mainly used for diffusion transformers today) can load directly through vLLM's standard QuantizationConfig flow instead of being patched in via downstream glue.

SVDQuant (https://arxiv.org/abs/2411.05007) is 4-bit weights + 4-bit activations with a low-rank SVD residual that absorbs quantization error. It is currently the dominant practical quantization for diffusion transformers, delivering 2x+ speedup vs BF16 with minimal quality loss.
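
To make the "low-rank SVD residual" idea concrete, here is a minimal NumPy sketch. It is an illustration under simplifying assumptions, not the nunchaku kernel: it keeps the top singular directions of the weight in full precision, fake-quantizes only the residual to signed 4-bit levels, and skips activation quantization and smoothing entirely. The helper names, `rank=32`, and `group_size=64` are hypothetical choices for the example.

```python
import numpy as np


def quantize_int4_symmetric(x: np.ndarray, group_size: int = 64) -> np.ndarray:
    """Fake-quantize x to signed 4-bit levels per group; returns dequantized floats."""
    rows, cols = x.shape
    groups = x.reshape(rows, cols // group_size, group_size)
    # One scale per group maps the largest magnitude onto the int4 limit 7.
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)
    q = np.clip(np.round(groups / scale), -8, 7)
    return (q * scale).reshape(rows, cols)


def svd_lowrank_split(w: np.ndarray, rank: int):
    """Split w into a rank-`rank` product up @ down plus a residual."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    up = u[:, :rank] * s[:rank]   # (out_features, rank)
    down = vt[:rank, :]           # (rank, in_features)
    residual = w - up @ down      # the part that gets quantized
    return up, down, residual


rng = np.random.default_rng(0)
w = rng.standard_normal((128, 256)).astype(np.float32)   # (out, in)
x = rng.standard_normal((4, 256)).astype(np.float32)     # (batch, in)

up, down, residual = svd_lowrank_split(w, rank=32)
residual_q = quantize_int4_symmetric(residual)

# y = x @ W^T is approximated by a 4-bit residual branch plus a low-rank branch.
y_ref = x @ w.T
y_svdq = x @ residual_q.T + (x @ down.T) @ up.T
rel_err = np.linalg.norm(y_svdq - y_ref) / np.linalg.norm(y_ref)
print(f"relative output error: {rel_err:.4f}")
```

The `up` and `down` factors play roughly the role of the `proj_up` / `proj_down` tensors named later in this PR, although their exact shapes and dtypes in the checkpoint are not shown here.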

Why this is not duplicating an existing PR

gh pr list --repo vllm-project/vllm --state open --search "svdquant" → 0 hits
gh pr list --repo vllm-project/vllm --state all --search "svdquant in:title" → 0 hits
RFC #37908, "[RFC]: Add Nunchaku SVDQuant W4A4 quantization backend" (open), is the design discussion for this; no other implementation exists

What's in this PR

| File | Purpose |
| --- | --- |
| `vllm/model_executor/layers/quantization/svdquant.py` | `SVDQuantConfig` (rank, precision, act_unsigned, modules_to_not_convert) + `SVDQuantLinearMethod` (create_weights / apply / process_weights_after_loading) |
| `vllm/model_executor/layers/quantization/utils/svdquant_dispatch.py` | `assert_svdquant_supported()`: hardware gate (rejects Hopper SM_90, rejects datacenter Blackwell SM_100/103 with FlashInfer-planned note, rejects NVFP4 on pre-Blackwell, raises ImportError if nunchaku missing) |
| `vllm/model_executor/layers/quantization/utils/svdquant_nvfp4_layout.py` | Bit-preserving pack/unpack between canonical row-major NVFP4 (on-disk) and nunchaku's PTX-MMA fragment layout (load-time repack in `process_weights_after_loading`) |
| `vllm/utils/nunchaku.py` | Lazy import wrappers for nunchaku, so non-CUDA / non-consumer hosts don't pull it in at module load |
| `tests/quantization/test_svdquant.py` | Registry wiring, config from-dict, hardware gate (CUDA / Hopper / datacenter-Blackwell / NVFP4-on-Ampere), `create_weights` parameter-layout smoke tests, `modules_to_not_convert` skip path |
| `vllm/model_executor/layers/quantization/__init__.py` | Registry entry |
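
The lazy-import wrapper in `vllm/utils/nunchaku.py` is not shown on this page, so the following is only a sketch of the general pattern it describes: check whether a package is importable without importing it, and defer the real import until a code path needs it. The function names `has_module` and `lazy_import` are hypothetical.

```python
import importlib
import importlib.util
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def has_module(name: str) -> bool:
    """Return True if a top-level package is importable, without importing it."""
    return importlib.util.find_spec(name) is not None


def lazy_import(name: str, install_hint: str = "") -> ModuleType:
    """Import `name` on first use; raise a descriptive ImportError if it is absent."""
    if not has_module(name):
        raise ImportError(f"{name} is required for this code path. {install_hint}".strip())
    return importlib.import_module(name)


# Nothing optional is imported at module load; the import happens here, on demand.
json_module = lazy_import("json")
print(json_module.dumps({"ok": True}))
```

Because the check runs inside the function rather than at module scope, importing the wrapper module itself stays cheap on hosts where the optional dependency is not installed.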

Hardware support

| Arch | Compute capability | Precision | Backend |
| --- | --- | --- | --- |
| Turing | SM_75 | int4 | nunchaku ✅ |
| Ampere | SM_80/86 | int4 | nunchaku ✅ |
| Ada | SM_89 | int4 | nunchaku ✅ |
| Hopper | SM_90 | n/a | unsupported (no SVDQuant kernel for SM_90 tensor unit shape) |
| Datacenter Blackwell | SM_100/103 | nvfp4 | out of scope here (FlashInfer-planned) |
| Consumer Blackwell | SM_120 | int4 / nvfp4 | nunchaku ✅ |
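
The support matrix above can be expressed as a small lookup plus explicit rejections. This is a sketch of that logic only, assuming the capability is passed in as a `(major, minor)` tuple; the real `assert_svdquant_supported()` is not shown here and, per the file table, also raises `ImportError` when nunchaku is missing. The name `check_svdquant_supported` and the error messages are hypothetical.

```python
# Precisions accepted per compute capability, transcribed from the table above.
SUPPORTED: dict[tuple[int, int], frozenset[str]] = {
    (7, 5): frozenset({"int4"}),            # Turing
    (8, 0): frozenset({"int4"}),            # Ampere
    (8, 6): frozenset({"int4"}),            # Ampere
    (8, 9): frozenset({"int4"}),            # Ada
    (12, 0): frozenset({"int4", "nvfp4"}),  # consumer Blackwell
}


def check_svdquant_supported(capability: tuple[int, int], precision: str) -> None:
    """Raise ValueError for any (capability, precision) pair outside the matrix."""
    major, minor = capability
    if (major, minor) == (9, 0):
        raise ValueError("SVDQuant: Hopper SM_90 is unsupported")
    if major == 10:
        raise ValueError("SVDQuant: datacenter Blackwell SM_100/103 is planned via FlashInfer")
    allowed = SUPPORTED.get(capability)
    if allowed is None:
        raise ValueError(f"SVDQuant: unsupported compute capability SM_{major}{minor}")
    if precision not in allowed:
        raise ValueError(f"SVDQuant: precision {precision!r} is not available on SM_{major}{minor}")


check_svdquant_supported((8, 9), "int4")    # Ada, int4: accepted
check_svdquant_supported((12, 0), "nvfp4")  # consumer Blackwell, NVFP4: accepted
```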

Design notes

Canonical row-major NVFP4 on disk. The vLLM-loadable checkpoint stores qweight / wscales / proj_up / proj_down in straightforward row-major layout. The nunchaku PTX-MMA fragment permutation is applied once in process_weights_after_loading. This decouples the on-disk format from the kernel backend, so when the FlashInfer datacenter Blackwell path lands later, the same checkpoint will work — no second copy needed.
Hardware gate is single-source-of-truth. assert_svdquant_supported() runs in SVDQuantLinearMethod.__init__, so unsupported arches raise before any weights are allocated.
No diffusion-specific code here. Diffusion-pipeline glue (per-component config wiring, offline converter from nunchaku's merged safetensors → vLLM-loadable diffusers folder) lives in a companion vllm-omni PR (link to be added once that PR is opened).
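
As a rough illustration of what "bit-preserving" means for the load-time repack, the sketch below packs 4-bit codes two per byte in row-major order, applies a fixed byte permutation, and then inverts it to recover the original bytes exactly. The permutation here is an arbitrary stand-in: nunchaku's actual PTX-MMA fragment layout is not described on this page and is not reproduced. All function names are hypothetical.

```python
def pack_nibbles(codes: list[int]) -> bytes:
    """Pack 4-bit codes (0..15) two per byte, low nibble first."""
    if len(codes) % 2:
        raise ValueError("need an even number of 4-bit codes")
    if any(not 0 <= c <= 15 for c in codes):
        raise ValueError("codes must fit in 4 bits")
    return bytes(lo | (hi << 4) for lo, hi in zip(codes[0::2], codes[1::2]))


def unpack_nibbles(packed: bytes) -> list[int]:
    """Inverse of pack_nibbles."""
    out: list[int] = []
    for byte in packed:
        out.extend((byte & 0x0F, byte >> 4))
    return out


def permute(data: bytes, perm: list[int]) -> bytes:
    """Output position i takes the byte at input position perm[i]."""
    return bytes(data[i] for i in perm)


def invert(perm: list[int]) -> list[int]:
    """Build the permutation that undoes `perm`."""
    inv = [0] * len(perm)
    for dst, src in enumerate(perm):
        inv[src] = dst
    return inv


codes = [1, 15, 0, 7, 8, 3, 12, 5]
on_disk = pack_nibbles(codes)              # canonical row-major bytes
perm = [2, 0, 3, 1]                        # stand-in for a kernel-specific layout
kernel_layout = permute(on_disk, perm)     # one-time repack at load
restored = permute(kernel_layout, invert(perm))

print(on_disk.hex())        # f170385c
print(kernel_layout.hex())  # 38f15c70
```

Since the repack only moves bytes around, the same on-disk file can feed a different backend later by swapping in a different permutation.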

Test plan

# Unit / smoke tests (most are CPU-friendly; CUDA-gated tests skip themselves)
.venv/bin/python -m pytest tests/quantization/test_svdquant.py -v

End-to-end NVFP4 validation on consumer Blackwell (RTX 5090, SM_120) via Z-Image-Turbo:

1024×1024, 20 steps, seed=42
BF16: 11.07s, 24.26 GiB peak
SVDQuant W4A4 NVFP4 (nunchaku): 4.94s, 17.14 GiB peak
2.24× speedup, -29% peak VRAM, -34% weights (20.87 → 13.74 GiB)
LPIPS (8-prompt mean, alex backbone): 0.232
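
The derived figures follow directly from the raw measurements; a quick check in Python, using only the numbers quoted above:

```python
bf16_seconds, svdq_seconds = 11.07, 4.94
bf16_peak_gib, svdq_peak_gib = 24.26, 17.14
bf16_weights_gib, svdq_weights_gib = 20.87, 13.74

speedup = bf16_seconds / svdq_seconds
peak_reduction = 1 - svdq_peak_gib / bf16_peak_gib
weight_reduction = 1 - svdq_weights_gib / bf16_weights_gib

print(f"{speedup:.2f}x, -{peak_reduction:.0%} peak VRAM, -{weight_reduction:.0%} weights")
# prints: 2.24x, -29% peak VRAM, -34% weights
```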

Full quantitative tables + visual gallery + 8-prompt side-by-side images are in the companion vllm-omni PR.

Follow-ups (not in this PR)

Datacenter Blackwell SM_100/103 via FlashInfer. The native W4A4 CuTe DSL kernel will be hosted in FlashInfer so SGLang and vLLM can share the same primitive. Once that lands, assert_svdquant_supported gets a third arm that routes SM_100/103 to a flashinfer.svdquant.* callable — no on-disk format change required.
Companion vllm-omni PR: diffusion-side glue + offline converter (link to be added).

Closes / Refs

Implements RFC #37908, "[RFC]: Add Nunchaku SVDQuant W4A4 quantization backend" (open)
Replaces the design from the closed vllm-omni#1986, "[Feature] Integrate Nunchaku SVDQuant W4A4 for diffusion models" (which tried to do this entirely on the omni side)

AI assistance: this PR's commits and PR description were produced with Claude Code assistance. Every change was reviewed and validated end-to-end on RTX 5090 by the human submitter.

gemini-code-assist

Code Review

This pull request introduces SVDQuant W4A4 quantization support, leveraging the nunchaku backend for efficient inference on NVIDIA GPUs. It includes the necessary configuration, linear layer methods, hardware dispatching, and layout conversion utilities. Feedback identifies unused parameters and variables, specifically smooth_factor_orig, is_row_parallel, and is_col_parallel, which should be removed to optimize VRAM usage and improve code maintainability.

gemini-code-assist · 2026-05-23T07:29:30Z

+        smooth_factor_orig = Parameter(
+            torch.empty(input_size_per_partition, dtype=lora_dtype),
+            requires_grad=False,
+        )
+        _set_attrs(
+            smooth_factor_orig,
+            input_dim=0,
+            weight_loader=default_weight_loader,
+        )


The smooth_factor_orig parameter is created and registered but never used in the apply method or the process_weights_after_loading hook. Allocating and loading this parameter wastes VRAM (proportional to the model's input dimension and number of layers). If it is not required for inference by the nunchaku kernels, it should be removed from the linear method to optimize memory usage.

gemini-code-assist · 2026-05-23T07:29:30Z

+        layer.register_parameter("proj_down", proj_down)
+        layer.register_parameter("proj_up", proj_up)
+        layer.register_parameter("smooth_factor", smooth_factor)
+        layer.register_parameter("smooth_factor_orig", smooth_factor_orig)


This line registers the unused smooth_factor_orig parameter. As noted above, this should be removed if the parameter is not used during inference.

gemini-code-assist · 2026-05-23T07:29:30Z

+        is_row_parallel = input_size_per_partition != input_size
+        is_col_parallel = output_size_per_partition != output_size


These variables are defined but never used in the method, and are explicitly deleted later on line 317. To improve code clarity and avoid unnecessary operations, they should be removed entirely.

Add SVDQuant (https://arxiv.org/abs/2411.05007), the practical 4-bit-weight + 4-bit-activation quantization with low-rank SVD residual that drives most modern diffusion-transformer quantization.

Layout: canonical row-major NVFP4 on disk. The nunchaku consumer-GPU kernel (Turing through consumer Blackwell SM_120) repacks once at load time into its PTX-MMA fragment layout in `SVDQuantLinearMethod.process_weights_after_loading`. Pack/unpack pair is bit-exact, verified against nunchaku.ops.gemm.svdq_gemm_w4a4_cuda.

Apply path: svdq_quantize_w4a4_act_fuse_lora_cuda → svdq_gemm_w4a4_cuda; scalar alpha and act_unsigned are plumbed through.

Hardware gate in utils/svdquant_dispatch.py::assert_svdquant_supported. Hopper SM_90 is unsupported by design: nunchaku targets older PTX-MMA shapes that the SM_90 tensor unit does not implement. Datacenter Blackwell SM_100/103 (B200/GB300) is out of scope here; that path is planned in FlashInfer so SGLang can share the primitive.

vllm.utils.nunchaku provides lazy import wrappers so non-CUDA / non-consumer hosts never pull in the nunchaku package at module load.

Co-authored-by: Claude (Anthropic)
Signed-off-by: ultism <www913363043@gmail.com>

Isotr0py

IMO, if this quantization method is just for diffusion models, we shouldn't upstream it to vLLM, which only hosts AR models.

Actually, vLLM-Omni's quantization can be independent of vLLM, because there is a lot of diffusion-specific quantization research that is not designed for AR models.

I strongly feel that we should put this implementation at vLLM-Omni side directly. You can consolidate this implementation with vllm-project/vllm-omni#3830.

…d into vllm-omni

Migrate the SVDQuant W4A4 (Nunchaku family) quantization plumbing from the vllm in-tree proposal (vllm-project/vllm#43471, now closed) into vllm-omni. Per reviewer feedback on that PR (Isotr0py): "the linear method here looks pretty much like a special vLLM-omni quant method because it uses some custom 3rd party operator". The caller side is willing to take on wrapper and dispatch work, so SVDQuant lands here next to the other Diffusion*Config siblings (Int8, MXFP4/8, GGUF, INC).

Structure (mirrors existing per-config files, but split for FlashInfer forward-compat):

svdquant_config.py: DiffusionSVDQuantConfig + LinearMethod; backend-agnostic. `_backend` is selected at __init__ via select_backend(); apply() and process_weights_after_loading delegate to it.

svdquant_dispatch.py: select_backend(precision) -> module, assert_svdquant_supported() hardware gate. Only this file knows the SM-to-backend mapping. To add FlashInfer for SM_100/103 later: drop a new svdquant_flashinfer.py exposing (supports, prepare_weights, apply), and prepend it in _candidate_backends().

svdquant_nunchaku.py: Nunchaku backend: has_nunchaku() capability detection, lazy importlib wrappers around svdq_gemm_w4a4 / svdq_quantize_w4a4_act_fuse_lora (PyPI 'nunchaku' is a different project; the install hint points to the GitHub releases), plus prepare_weights() that repacks canonical row-major NVFP4 into the PTX-MMA fragment layout the kernel expects.

tools/svdquant_nvfp4_layout.py: Bit-preserving fragment ↔ row-major helpers for qweight / wscales. Previously a shim re-exporting from vllm; now the real impl.

Factory registers a new "svdquant" entry in _OVERRIDES.

The converter is updated to (1) import the layout helpers from the new vllm-omni local path, and (2) drop nunchaku's `smooth_factor_orig` suffix at group time. Upstream nunchaku itself marks it "Unused" (nunchaku/models/linear.py:54), it's never consumed in either int4 or nvfp4 path, and keeping it triggers a load-time KeyError because the LinearMethod does not register a destination parameter.

Verified end-to-end on RTX 5060 Ti (SM_120, 16 GiB) with --enable-cpu-offload: 512x512 / 8 steps generates an image in 5.9 s, peak VRAM 8.5 GiB. select_backend() correctly picks vllm_omni.quantization.svdquant_nunchaku.

12/12 tests in tests/diffusion/quantization/test_svdquant_config.py pass (registry, factory routing, hardware gate for Hopper / datacenter Blackwell / pre-Blackwell NVFP4, create_weights parameter layout, skip-list).

Signed-off-by: ultism <www913363043@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: ultism <www913363043@gmail.com>

ultism · 2026-05-24T06:44:54Z

Closing in favor of vllm-project/vllm-omni#3830.

Per @Isotr0py's review feedback ("the linear method here looks pretty much like a special vLLM-omni quant method because it uses some custom 3rd party operator"), the consensus is that SVDQuant, being a Python wrapper around nunchaku's W4A4 CUDA kernels, fits the vllm-omni "diffusion-only quantization config" pattern (DiffusionInt8Config, DiffusionMXFP4Config, DiffusionMXFP8Config, etc.) rather than vllm proper. Hosted here, it would have been the only quant method in vllm besides bitsandbytes to depend on a third-party CUDA Python library, and it would have had a much narrower hardware envelope (consumer GPUs only).

The full migration is in vllm-project/vllm-omni#3830, structured for forward compatibility with the planned FlashInfer SM_100/103 path:

DiffusionSVDQuantConfig + LinearMethod — backend-agnostic
svdquant_dispatch.py — hardware gate + select_backend(precision)
svdquant_nunchaku.py — nunchaku capability detection, lazy importlib wrappers, prepare_weights() / apply()

When the FlashInfer SVDQuant kernel ships for datacenter Blackwell, it drops in as a sibling svdquant_flashinfer.py exposing the same three-function interface, with no changes to the LinearMethod or any caller.
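
A minimal sketch of how such a three-function backend interface and first-match dispatch could look. This is an assumption-laden illustration: the description above says backends are modules and that `select_backend(precision)` returns a module, whereas this sketch uses a small dataclass and passes the compute capability explicitly so that it runs without a GPU. The stub backends and their `supports` rules are hypothetical and only mirror the hardware table from the original PR.

```python
from dataclasses import dataclass
from typing import Any, Callable, Sequence

Capability = tuple[int, int]


@dataclass(frozen=True)
class Backend:
    """The three-function interface: supports / prepare_weights / apply."""
    name: str
    supports: Callable[[Capability, str], bool]
    prepare_weights: Callable[[Any], Any]
    apply: Callable[[Any, Any], Any]


def select_backend(candidates: Sequence[Backend], capability: Capability, precision: str) -> Backend:
    """Return the first candidate that accepts this hardware/precision pair."""
    for backend in candidates:
        if backend.supports(capability, precision):
            return backend
    raise ValueError(f"no SVDQuant backend for SM_{capability[0]}{capability[1]} / {precision}")


def _nunchaku_supports(cap: Capability, precision: str) -> bool:
    consumer = {(7, 5), (8, 0), (8, 6), (8, 9), (12, 0)}
    return cap in consumer and (precision == "int4" or (precision == "nvfp4" and cap == (12, 0)))


def _flashinfer_supports(cap: Capability, precision: str) -> bool:
    return cap[0] == 10 and precision == "nvfp4"


# Stub backends: prepare_weights and apply are placeholders, not real kernels.
NUNCHAKU = Backend("nunchaku", _nunchaku_supports, lambda layer: layer, lambda layer, x: x)
FLASHINFER = Backend("flashinfer", _flashinfer_supports, lambda layer: layer, lambda layer, x: x)

# Adding the future backend means prepending it to the candidate list.
candidates = [FLASHINFER, NUNCHAKU]
print(select_backend(candidates, (12, 0), "nvfp4").name)  # nunchaku
print(select_backend(candidates, (10, 0), "nvfp4").name)  # flashinfer
```

With this shape, the linear method only ever calls `prepare_weights` and `apply` on whatever `select_backend` returned, which is why a new sibling backend needs no changes in the caller.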

Thanks @Isotr0py + everyone who weighed in.

ultism requested review from mgoin, pavanimajety, robertgshaw2-redhat, tlrmchlsmth, yewentao256 and zyongye as code owners May 23, 2026 07:27

ultism mentioned this pull request May 23, 2026

[Diffusion][Quantization] SVDQuant W4A4 (Nunchaku) for Z-Image-Turbo vllm-project/vllm-omni#3830

Open

gemini-code-assist Bot reviewed May 23, 2026


depthfirst-app Bot reviewed May 23, 2026


Comment thread vllm/utils/nunchaku.py Outdated

Comment thread vllm/model_executor/layers/quantization/utils/svdquant_dispatch.py Outdated

ultism force-pushed the svdquant-pr branch from 83152cb to 9936ea8 Compare May 23, 2026 09:12

This was referenced May 23, 2026

[RFC]: Add Nunchaku SVDQuant W4A4 quantization backend #37908

Closed

[RFC] Add SVDQuant W4A4 (NVFP4 + low-rank) GEMM and activation quantize for SM_100 / SM_103 flashinfer-ai/flashinfer#3380

Open

Isotr0py requested changes May 23, 2026


ultism closed this May 24, 2026

ultism wants to merge 1 commit into vllm-project:main from ultism:svdquant-pr


2 participants
