
[Feature][Quantization] Add SVDQuant W4A4 (nunchaku backend) #43471

Closed
ultism wants to merge 1 commit into vllm-project:main from ultism:svdquant-pr

ultism commented May 23, 2026

Summary

Implements RFC #37908 — adds SVDQuantConfig + SVDQuantLinearMethod that wrap the nunchaku W4A4 SVDQuant kernels, so SVDQuant-quantized checkpoints (mainly used for diffusion transformers today) can load directly through vLLM's standard QuantizationConfig flow instead of being patched in via downstream glue.

SVDQuant (https://arxiv.org/abs/2411.05007) is 4-bit weights + 4-bit activations with a low-rank SVD residual that absorbs quantization error. It is currently the dominant practical quantization for diffusion transformers, delivering 2x+ speedup vs BF16 with minimal quality loss.

Why this is not duplicating an existing PR

What's in this PR

| File | Purpose |
| --- | --- |
| vllm/model_executor/layers/quantization/svdquant.py | SVDQuantConfig (rank, precision, act_unsigned, modules_to_not_convert) + SVDQuantLinearMethod (create_weights / apply / process_weights_after_loading) |
| vllm/model_executor/layers/quantization/utils/svdquant_dispatch.py | assert_svdquant_supported(): hardware gate (rejects Hopper SM_90, rejects datacenter Blackwell SM_100/103 with FlashInfer-planned note, rejects NVFP4 on pre-Blackwell, raises ImportError if nunchaku missing) |
| vllm/model_executor/layers/quantization/utils/svdquant_nvfp4_layout.py | Bit-preserving pack/unpack between canonical row-major NVFP4 (on-disk) and nunchaku's PTX-MMA fragment layout (load-time repack in process_weights_after_loading) |
| vllm/utils/nunchaku.py | Lazy import wrappers for nunchaku, so non-CUDA / non-consumer hosts don't pull it in at module load |
| tests/quantization/test_svdquant.py | Registry wiring, config from-dict, hardware gate (CUDA / Hopper / datacenter-Blackwell / NVFP4-on-Ampere), create_weights parameter-layout smoke tests, modules_to_not_convert skip path |
| vllm/model_executor/layers/quantization/__init__.py | Registry entry |
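
The lazy-import idea behind vllm/utils/nunchaku.py can be illustrated with a minimal sketch. The helper names `has_module` and `lazy_import` and the `install_hint` parameter are hypothetical and are not taken from the PR; the sketch only shows the general pattern of deferring an optional dependency until the code path that needs it actually runs.

```python
import importlib
import importlib.util
from functools import lru_cache
from types import ModuleType


@lru_cache(maxsize=None)
def has_module(name: str) -> bool:
    """Return True if `name` can be found, without importing it."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises for dotted names whose parent package is missing.
        return False


def lazy_import(name: str, install_hint: str = "") -> ModuleType:
    """Import `name` on first use; raise ImportError with a hint if absent."""
    if not has_module(name):
        message = f"{name!r} is required for this code path. {install_hint}"
        raise ImportError(message.strip())
    return importlib.import_module(name)
```

With this pattern, a module that wraps an optional package can itself be imported on any host; the ImportError only surfaces when a caller actually requests the optional package.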

Hardware support

| Arch | Cap | Precision | Backend |
| --- | --- | --- | --- |
| Turing | SM_75 | int4 | nunchaku ✅ |
| Ampere | SM_80/86 | int4 | nunchaku ✅ |
| Ada | SM_89 | int4 | nunchaku ✅ |
| Hopper | SM_90 | | unsupported (no SVDQuant kernel for SM_90 tensor unit shape) |
| Datacenter Blackwell | SM_100/103 | nvfp4 | out of scope here (FlashInfer-planned) |
| Consumer Blackwell | SM_120 | int4 / nvfp4 | nunchaku ✅ |
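
The support matrix can be read as a small gate function. The following is a sketch only: `check_svdquant_support`, its parameters, and the use of ValueError for hardware rejections are assumptions made for illustration, and the PR's own assert_svdquant_supported() may be structured differently. Only the accepted and rejected combinations are taken from the matrix and the file description.

```python
from typing import Tuple

Capability = Tuple[int, int]  # (major, minor) CUDA compute capability

# Transcribed from the support matrix in this PR description.
NUNCHAKU_INT4 = {(7, 5), (8, 0), (8, 6), (8, 9), (12, 0)}
NUNCHAKU_NVFP4 = {(12, 0)}
HOPPER = {(9, 0)}
DATACENTER_BLACKWELL = {(10, 0), (10, 3)}


def check_svdquant_support(
    capability: Capability,
    precision: str,
    nunchaku_installed: bool = True,
) -> None:
    """Hypothetical gate: raise if the (arch, precision) pair is unsupported."""
    if precision not in ("int4", "nvfp4"):
        raise ValueError(f"unknown precision {precision!r}")
    if capability in HOPPER:
        raise ValueError("SVDQuant has no kernel for Hopper (SM_90)")
    if capability in DATACENTER_BLACKWELL:
        raise ValueError("SM_100/103 is out of scope (FlashInfer-planned)")
    if precision == "nvfp4" and capability not in NUNCHAKU_NVFP4:
        raise ValueError("NVFP4 is rejected on pre-Blackwell GPUs")
    if precision == "int4" and capability not in NUNCHAKU_INT4:
        raise ValueError(f"int4 is not supported on capability {capability}")
    if not nunchaku_installed:
        raise ImportError("nunchaku is not installed")
```

Passing the capability and the package check in as arguments, rather than querying the GPU inside the function, keeps the sketch testable on a CPU-only host.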

Design notes

  • Canonical row-major NVFP4 on disk. The vLLM-loadable checkpoint stores qweight / wscales / proj_up / proj_down in straightforward row-major layout. The nunchaku PTX-MMA fragment permutation is applied once in process_weights_after_loading. This decouples the on-disk format from the kernel backend, so when the FlashInfer datacenter Blackwell path lands later, the same checkpoint will work — no second copy needed.
  • Hardware gate is single-source-of-truth. assert_svdquant_supported() runs in SVDQuantLinearMethod.__init__, so unsupported arches raise before any weights are allocated.
  • No diffusion-specific code here. Diffusion-pipeline glue (per-component config wiring, offline converter from nunchaku's merged safetensors → vLLM-loadable diffusers folder) lives in a companion vllm-omni PR (link to be added once that PR is opened).
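
To make "bit-preserving repack" concrete, here is a toy sketch of the idea: 4-bit values are packed two per byte, reordered by a permutation, and recovered exactly by the inverse permutation. The permutation used here (a plain transpose of a flat buffer viewed as rows of length `tile`) is a made-up stand-in and is not nunchaku's PTX-MMA fragment layout; all function names are hypothetical.

```python
from typing import List


def pack_nibbles(values: List[int]) -> bytes:
    """Pack 4-bit values (0..15) two per byte, low nibble first."""
    if len(values) % 2:
        raise ValueError("need an even number of 4-bit values")
    out = bytearray()
    for lo, hi in zip(values[0::2], values[1::2]):
        if not (0 <= lo < 16 and 0 <= hi < 16):
            raise ValueError("values must fit in 4 bits")
        out.append(lo | (hi << 4))
    return bytes(out)


def unpack_nibbles(packed: bytes) -> List[int]:
    """Inverse of pack_nibbles."""
    values: List[int] = []
    for byte in packed:
        values.append(byte & 0xF)
        values.append(byte >> 4)
    return values


def to_toy_fragment_order(values: List[int], tile: int) -> List[int]:
    """Toy permutation: transpose a (len // tile, tile) row-major view."""
    if len(values) % tile:
        raise ValueError("length must be a multiple of tile")
    rows = len(values) // tile
    return [values[i * tile + j] for j in range(tile) for i in range(rows)]


def from_toy_fragment_order(values: List[int], tile: int) -> List[int]:
    """Inverse permutation: transposing the transposed view restores it."""
    return to_toy_fragment_order(values, len(values) // tile)


row_major = [i % 16 for i in range(32)]
repacked = pack_nibbles(to_toy_fragment_order(row_major, tile=8))
restored = from_toy_fragment_order(unpack_nibbles(repacked), tile=8)
assert restored == row_major  # the round trip loses no bits
```

Because the repack is a pure permutation of nibbles, the on-disk format stays independent of whichever kernel layout is applied at load time.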

Test plan

# Unit / smoke tests (most are CPU-friendly; CUDA-gated tests skip themselves)
.venv/bin/python -m pytest tests/quantization/test_svdquant.py -v

End-to-end NVFP4 validation on consumer Blackwell (RTX 5090, SM_120) via Z-Image-Turbo:

  • 1024×1024, 20 steps, seed=42
  • BF16: 11.07s, 24.26 GiB peak
  • SVDQuant W4A4 NVFP4 (nunchaku): 4.94s, 17.14 GiB peak
  • 2.24× speedup, -29% peak VRAM, -34% weights (20.87 → 13.74 GiB)
  • LPIPS (8-prompt mean, alex backbone): 0.232
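
The derived figures follow directly from the raw measurements listed above; a short check of the arithmetic (variable names are illustrative only):

```python
# Raw measurements quoted in the test plan.
bf16_seconds, svdq_seconds = 11.07, 4.94
bf16_peak_gib, svdq_peak_gib = 24.26, 17.14
bf16_weights_gib, svdq_weights_gib = 20.87, 13.74

# Speedup is the ratio of wall-clock times.
speedup = bf16_seconds / svdq_seconds

# Relative change in percent, negative meaning a reduction.
peak_change_pct = 100 * (svdq_peak_gib - bf16_peak_gib) / bf16_peak_gib
weights_change_pct = 100 * (svdq_weights_gib - bf16_weights_gib) / bf16_weights_gib

print(f"{speedup:.2f}x speedup")
print(f"{peak_change_pct:.0f}% peak VRAM")
print(f"{weights_change_pct:.0f}% weights")
```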

Full quantitative tables + visual gallery + 8-prompt side-by-side images are in the companion vllm-omni PR.

Follow-ups (not in this PR)

  • Datacenter Blackwell SM_100/103 via FlashInfer. The native W4A4 CuTe DSL kernel will be hosted in FlashInfer so SGLang and vLLM can share the same primitive. Once that lands, assert_svdquant_supported gets a third arm that routes SM_100/103 to a flashinfer.svdquant.* callable — no on-disk format change required.
  • Companion vllm-omni PR: diffusion-side glue + offline converter (link to be added).

Closes / Refs


AI assistance: this PR's commits and PR description were produced with Claude Code assistance. Every change was reviewed and validated end-to-end on RTX 5090 by the human submitter.


gemini-code-assist (bot) left a comment


Code Review

This pull request introduces SVDQuant W4A4 quantization support, leveraging the nunchaku backend for efficient inference on NVIDIA GPUs. It includes the necessary configuration, linear layer methods, hardware dispatching, and layout conversion utilities. Feedback identifies unused parameters and variables, specifically smooth_factor_orig, is_row_parallel, and is_col_parallel, which should be removed to optimize VRAM usage and improve code maintainability.

Comment on lines +265 to +273
smooth_factor_orig = Parameter(
    torch.empty(input_size_per_partition, dtype=lora_dtype),
    requires_grad=False,
)
_set_attrs(
    smooth_factor_orig,
    input_dim=0,
    weight_loader=default_weight_loader,
)

high

The smooth_factor_orig parameter is created and registered but never used in the apply method or the process_weights_after_loading hook. Allocating and loading this parameter wastes VRAM (proportional to the model's input dimension and number of layers). If it is not required for inference by the nunchaku kernels, it should be removed from the linear method to optimize memory usage.

layer.register_parameter("proj_down", proj_down)
layer.register_parameter("proj_up", proj_up)
layer.register_parameter("smooth_factor", smooth_factor)
layer.register_parameter("smooth_factor_orig", smooth_factor_orig)

high

This line registers the unused smooth_factor_orig parameter. As noted above, this should be removed if the parameter is not used during inference.

Comment on lines +179 to +180
is_row_parallel = input_size_per_partition != input_size
is_col_parallel = output_size_per_partition != output_size

high

These variables are defined but never used in the method, and are explicitly deleted later on line 317. To improve code clarity and avoid unnecessary operations, they should be removed entirely.

Comment thread vllm/utils/nunchaku.py Outdated
Comment thread vllm/model_executor/layers/quantization/utils/svdquant_dispatch.py Outdated
Add SVDQuant (https://arxiv.org/abs/2411.05007), the practical
4-bit-weight + 4-bit-activation quantization with low-rank SVD
residual that drives most modern diffusion-transformer quantization.

Layout: canonical row-major NVFP4 on disk. The nunchaku consumer-GPU
kernel (Turing through consumer Blackwell SM_120) repacks once at
load time into its PTX-MMA fragment layout in
`SVDQuantLinearMethod.process_weights_after_loading`. Pack/unpack
pair is bit-exact, verified against nunchaku.ops.gemm.svdq_gemm_w4a4_cuda.

Apply path: svdq_quantize_w4a4_act_fuse_lora_cuda → svdq_gemm_w4a4_cuda;
scalar alpha and act_unsigned are plumbed through.

Hardware gate in utils/svdquant_dispatch.py::assert_svdquant_supported.
Hopper SM_90 is unsupported by design — nunchaku targets older
PTX-MMA shapes that the SM_90 tensor unit does not implement.
Datacenter Blackwell SM_100/103 (B200/GB300) is out of scope here;
that path is planned in FlashInfer so SGLang can share the primitive.

vllm.utils.nunchaku provides lazy import wrappers so non-CUDA /
non-consumer hosts never pull in the nunchaku package at module
load.

Co-authored-by: Claude (Anthropic)
Signed-off-by: ultism <www913363043@gmail.com>

Isotr0py (Member) left a comment


IMO, if this quantization method is just for diffusion models, we shouldn't upstream it to vLLM, which only hosts AR models.

Actually, vLLM-Omni's quantization can be independent of vLLM, because there is a lot of diffusion-specific quantization research that is not designed for AR models.

I strongly feel that we should put this implementation on the vLLM-Omni side directly. You can consolidate this implementation with vllm-project/vllm-omni#3830.

ultism added a commit to ultism/vllm-omni that referenced this pull request May 24, 2026
…d into vllm-omni

Migrate the SVDQuant W4A4 (Nunchaku family) quantization plumbing from
the vllm in-tree proposal (vllm-project/vllm#43471, now closed) into
vllm-omni. Per reviewer feedback on that PR (Isotr0py): "the linear
method here looks pretty much like a special vLLM-omni quant method
because it uses some custom 3rd party operator" — the caller side is
willing to take on wrapper and dispatch work, so SVDQuant lands here
next to the other Diffusion*Config siblings (Int8, MXFP4/8, GGUF, INC).

Structure (mirrors existing per-config files, but split for FlashInfer
forward-compat):

  svdquant_config.py        DiffusionSVDQuantConfig + LinearMethod;
                            backend-agnostic. `_backend` is selected at
                            __init__ via select_backend(); apply() and
                            process_weights_after_loading delegate to it.

  svdquant_dispatch.py      select_backend(precision) -> module,
                            assert_svdquant_supported() hardware gate.
                            Only this file knows the SM-to-backend
                            mapping. To add FlashInfer for SM_100/103
                            later: drop a new svdquant_flashinfer.py
                            exposing (supports, prepare_weights, apply),
                            and prepend it in _candidate_backends().

  svdquant_nunchaku.py      Nunchaku backend: has_nunchaku() capability
                            detection, lazy importlib wrappers around
                            svdq_gemm_w4a4 / svdq_quantize_w4a4_act_fuse_lora
                            (PyPI 'nunchaku' is a different project; the
                            install hint points to the GitHub releases),
                            plus prepare_weights() that repacks
                            canonical row-major NVFP4 into the
                            PTX-MMA fragment layout the kernel expects.

  tools/svdquant_nvfp4_layout.py
                            Bit-preserving fragment ↔ row-major helpers
                            for qweight / wscales. Previously a shim
                            re-exporting from vllm; now the real impl.

Factory registers a new "svdquant" entry in _OVERRIDES. The converter
is updated to (1) import the layout helpers from the new vllm-omni
local path, and (2) drop nunchaku's `smooth_factor_orig` suffix at
group time — upstream nunchaku itself marks it "Unused"
(nunchaku/models/linear.py:54), it's never consumed in either int4
or nvfp4 path, and keeping it triggers a load-time KeyError because
the LinearMethod does not register a destination parameter.

Verified end-to-end on RTX 5060 Ti (SM_120, 16 GiB) with
--enable-cpu-offload: 512x512 / 8 steps generates an image in 5.9 s,
peak VRAM 8.5 GiB. select_backend() correctly picks
vllm_omni.quantization.svdquant_nunchaku. 12/12 tests in
tests/diffusion/quantization/test_svdquant_config.py pass (registry,
factory routing, hardware gate for Hopper / datacenter Blackwell /
pre-Blackwell NVFP4, create_weights parameter layout, skip-list).

Signed-off-by: ultism <www913363043@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
Signed-off-by: ultism <www913363043@gmail.com>

ultism commented May 24, 2026

Closing in favor of vllm-project/vllm-omni#3830.

Per @Isotr0py's review feedback ("the linear method here looks pretty much like a special vLLM-omni quant method because it uses some custom 3rd party operator"), the consensus is that SVDQuant, being a Python wrapper around nunchaku's W4A4 CUDA kernels, fits the vllm-omni "diffusion-only quantization config" pattern (DiffusionInt8Config, DiffusionMXFP4Config, DiffusionMXFP8Config, etc.) rather than vllm proper. Hosting it here would have made it the only quant method in vllm, apart from bitsandbytes, that depends on a third-party CUDA Python library, and one with a much narrower hardware envelope (consumer GPUs only).

The full migration is in vllm-project/vllm-omni#3830, structured for forward compatibility with the planned FlashInfer SM_100/103 path:

  • DiffusionSVDQuantConfig + LinearMethod — backend-agnostic
  • svdquant_dispatch.py — hardware gate + select_backend(precision)
  • svdquant_nunchaku.py — nunchaku capability detection, lazy importlib wrappers, prepare_weights() / apply()

When the FlashInfer SVDQuant kernel ships for datacenter Blackwell, it drops in as a sibling svdquant_flashinfer.py exposing the same three-function interface, with no changes to the LinearMethod or any caller.
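
The drop-in design described above can be sketched as an ordered list of candidate backends that share one three-function interface. This is an illustration under assumptions: the `Backend` dataclass, the toy weight handling, and the extra `capability` argument to `select_backend` are hypothetical (the comment above describes `select_backend(precision)`), and the FlashInfer-like entry stands in for a kernel that does not exist yet.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Capability = Tuple[int, int]


@dataclass(frozen=True)
class Backend:
    """Hypothetical three-function backend interface."""
    name: str
    supports: Callable[[Capability, str], bool]
    prepare_weights: Callable[[List[float]], List[float]]
    apply: Callable[[List[float], List[float]], float]


def _dot(x: List[float], w: List[float]) -> float:
    # Stand-in for a real GEMM kernel call.
    return sum(a * b for a, b in zip(x, w))


def _nunchaku_supports(cap: Capability, precision: str) -> bool:
    if precision == "nvfp4":
        return cap == (12, 0)
    return cap in {(7, 5), (8, 0), (8, 6), (8, 9), (12, 0)}


def _flashinfer_supports(cap: Capability, precision: str) -> bool:
    return precision == "nvfp4" and cap in {(10, 0), (10, 3)}


NUNCHAKU_LIKE = Backend("nunchaku-like", _nunchaku_supports, list, _dot)
FLASHINFER_LIKE = Backend("flashinfer-like", _flashinfer_supports, list, _dot)


def candidate_backends() -> List[Backend]:
    # A new backend is added by prepending it here; callers stay unchanged.
    return [FLASHINFER_LIKE, NUNCHAKU_LIKE]


def select_backend(precision: str, capability: Capability) -> Backend:
    """Return the first candidate that supports the (arch, precision) pair."""
    for backend in candidate_backends():
        if backend.supports(capability, precision):
            return backend
    raise ValueError(f"no SVDQuant backend for {capability} / {precision}")
```

A caller would hold the selected backend and delegate to its `prepare_weights` and `apply`, so adding a backend only touches the candidate list.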

Thanks @Isotr0py + everyone who weighed in.

ultism closed this May 24, 2026