[Feature][Quantization] Add SVDQuant W4A4 (nunchaku backend) #43471
ultism wants to merge 1 commit into
Code Review
This pull request introduces SVDQuant W4A4 quantization support, leveraging the nunchaku backend for efficient inference on NVIDIA GPUs. It includes the necessary configuration, linear layer methods, hardware dispatching, and layout conversion utilities. Feedback identifies unused parameters and variables, specifically smooth_factor_orig, is_row_parallel, and is_col_parallel, which should be removed to optimize VRAM usage and improve code maintainability.

    smooth_factor_orig = Parameter(
        torch.empty(input_size_per_partition, dtype=lora_dtype),
        requires_grad=False,
    )
    _set_attrs(
        smooth_factor_orig,
        input_dim=0,
        weight_loader=default_weight_loader,
    )

The smooth_factor_orig parameter is created and registered but never used in the apply method or the process_weights_after_loading hook. Allocating and loading this parameter wastes VRAM (proportional to the model's input dimension and number of layers). If it is not required for inference by the nunchaku kernels, it should be removed from the linear method to optimize memory usage.

    layer.register_parameter("proj_down", proj_down)
    layer.register_parameter("proj_up", proj_up)
    layer.register_parameter("smooth_factor", smooth_factor)
    layer.register_parameter("smooth_factor_orig", smooth_factor_orig)

    is_row_parallel = input_size_per_partition != input_size
    is_col_parallel = output_size_per_partition != output_size

Add SVDQuant (https://arxiv.org/abs/2411.05007), the practical 4-bit-weight + 4-bit-activation quantization with low-rank SVD residual that drives most modern diffusion-transformer quantization.

Layout: canonical row-major NVFP4 on disk. The nunchaku consumer-GPU kernel (Turing through consumer Blackwell SM_120) repacks once at load time into its PTX-MMA fragment layout in `SVDQuantLinearMethod.process_weights_after_loading`. The pack/unpack pair is bit-exact, verified against nunchaku.ops.gemm.svdq_gemm_w4a4_cuda.

Apply path: svdq_quantize_w4a4_act_fuse_lora_cuda → svdq_gemm_w4a4_cuda; scalar alpha and act_unsigned are plumbed through.

Hardware gate in utils/svdquant_dispatch.py::assert_svdquant_supported. Hopper SM_90 is unsupported by design: nunchaku targets older PTX-MMA shapes that the SM_90 tensor unit does not implement. Datacenter Blackwell SM_100/103 (B200/GB300) is out of scope here; that path is planned in FlashInfer so SGLang can share the primitive.

vllm.utils.nunchaku provides lazy import wrappers so non-CUDA / non-consumer hosts never pull in the nunchaku package at module load.

Co-authored-by: Claude (Anthropic)
Signed-off-by: ultism <www913363043@gmail.com>
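For readers unfamiliar with the lazy-import pattern the commit message mentions, here is a minimal sketch using only the standard library. The helper names `has_module` and `lazy_import` are hypothetical and are not taken from `vllm.utils.nunchaku`; the sketch only illustrates how a capability check can avoid importing a heavy package at module load.

```python
import importlib
import importlib.util
from functools import lru_cache
from types import ModuleType


def has_module(name: str) -> bool:
    """Return True if `name` can be found, without importing it.

    Hypothetical helper; the real has_nunchaku() may do more checks.
    """
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # find_spec raises for a missing parent package or a broken __spec__.
        return False


@lru_cache(maxsize=None)
def lazy_import(name: str) -> ModuleType:
    """Import `name` on first call and cache the module object."""
    if not has_module(name):
        raise ImportError(f"{name!r} is required for this code path but is not installed")
    return importlib.import_module(name)


# Nothing is imported until a caller actually needs the module.
json_mod = lazy_import("json")
print(json_mod.dumps({"ok": True}))
```

A wrapper built this way lets the module that defines the quantization config be imported on any host; the `ImportError` surfaces only when the kernel path is actually requested.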
Isotr0py left a comment
IMO, if this quantization method is just for diffusion models, we shouldn't upstream it to vLLM, which only hosts AR models.
Actually, vLLM-Omni's quantization can be independent of vLLM, because there is a lot of diffusion-specific quantization research that is not designed for AR models.
I strongly feel that we should put this implementation on the vLLM-Omni side directly. You can consolidate this implementation with vllm-project/vllm-omni#3830.
…d into vllm-omni

Migrate the SVDQuant W4A4 (Nunchaku family) quantization plumbing from the vllm in-tree proposal (vllm-project/vllm#43471, now closed) into vllm-omni. Per reviewer feedback on that PR (Isotr0py): "the linear method here looks pretty much like a special vLLM-omni quant method because it uses some custom 3rd party operator". The caller side is willing to take on wrapper and dispatch work, so SVDQuant lands here next to the other Diffusion*Config siblings (Int8, MXFP4/8, GGUF, INC).

Structure (mirrors existing per-config files, but split for FlashInfer forward-compat):

svdquant_config.py
  DiffusionSVDQuantConfig + LinearMethod; backend-agnostic. `_backend` is selected at __init__ via select_backend(); apply() and process_weights_after_loading delegate to it.

svdquant_dispatch.py
  select_backend(precision) -> module, assert_svdquant_supported() hardware gate. Only this file knows the SM-to-backend mapping. To add FlashInfer for SM_100/103 later: drop a new svdquant_flashinfer.py exposing (supports, prepare_weights, apply), and prepend it in _candidate_backends().

svdquant_nunchaku.py
  Nunchaku backend: has_nunchaku() capability detection, lazy importlib wrappers around svdq_gemm_w4a4 / svdq_quantize_w4a4_act_fuse_lora (PyPI 'nunchaku' is a different project; the install hint points to the GitHub releases), plus prepare_weights() that repacks canonical row-major NVFP4 into the PTX-MMA fragment layout the kernel expects.

tools/svdquant_nvfp4_layout.py
  Bit-preserving fragment ↔ row-major helpers for qweight / wscales. Previously a shim re-exporting from vllm; now the real impl.

Factory registers a new "svdquant" entry in _OVERRIDES.
The converter is updated to (1) import the layout helpers from the new vllm-omni local path, and (2) drop nunchaku's `smooth_factor_orig` suffix at group time. Upstream nunchaku itself marks it "Unused" (nunchaku/models/linear.py:54), it is never consumed in either the int4 or the nvfp4 path, and keeping it triggers a load-time KeyError because the LinearMethod does not register a destination parameter.

Verified end-to-end on RTX 5060 Ti (SM_120, 16 GiB) with --enable-cpu-offload: 512x512 / 8 steps generates an image in 5.9 s, peak VRAM 8.5 GiB. select_backend() correctly picks vllm_omni.quantization.svdquant_nunchaku.

12/12 tests in tests/diffusion/quantization/test_svdquant_config.py pass (registry, factory routing, hardware gate for Hopper / datacenter Blackwell / pre-Blackwell NVFP4, create_weights parameter layout, skip-list).

Signed-off-by: ultism <www913363043@gmail.com>
Co-authored-by: Claude <noreply@anthropic.com>
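As a rough illustration of the converter change described in this commit message (dropping `smooth_factor_orig` tensors before loading), the following is a minimal sketch. The function name `drop_unused_tensors` and the example keys are hypothetical; the actual converter groups tensors differently and is not reproduced here.

```python
from typing import Any, Dict


def drop_unused_tensors(
    state_dict: Dict[str, Any], suffix: str = "smooth_factor_orig"
) -> Dict[str, Any]:
    """Return a copy of `state_dict` without entries whose last name component is `suffix`.

    Hypothetical helper: it only shows the filtering idea, so that the loader
    never sees a key with no registered destination parameter.
    """
    return {
        key: value
        for key, value in state_dict.items()
        if key != suffix and not key.endswith("." + suffix)
    }


# Hypothetical checkpoint keys, with placeholder values instead of tensors.
checkpoint = {
    "blocks.0.attn.qkv.qweight": 0,
    "blocks.0.attn.qkv.smooth_factor": 1,
    "blocks.0.attn.qkv.smooth_factor_orig": 2,
}
filtered = drop_unused_tensors(checkpoint)
print(sorted(filtered))
```

Matching on the full last component (rather than a plain substring) keeps `smooth_factor` itself, which the linear method does register.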
Closing in favor of vllm-project/vllm-omni#3830.

Per @Isotr0py's review feedback ("the linear method here looks pretty much like a special vLLM-omni quant method because it uses some custom 3rd party operator"), the consensus is that SVDQuant, being a Python wrapper around nunchaku's W4A4 CUDA kernels, fits the vllm-omni "diffusion-only quantization config" pattern.

The full migration is in vllm-project/vllm-omni#3830, structured for forward compatibility with the planned FlashInfer SM_100/103 path.

When the FlashInfer SVDQuant kernel ships for datacenter Blackwell, it drops in as a sibling […]

Thanks @Isotr0py + everyone who weighed in.
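To make the "drops in as a sibling" idea concrete, here is a hedged sketch of a first-match backend dispatch. Everything in it is an assumption for illustration: the `Backend` dataclass, the capability predicates, and the ordering are not copied from vllm-omni, and the FlashInfer entry stands in for a backend that does not exist yet.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

Capability = Tuple[int, int]  # (major, minor) CUDA compute capability


@dataclass(frozen=True)
class Backend:
    name: str
    supports: Callable[[Capability, str], bool]


# Assumed predicate: Turing (7, 5) up to but excluding Hopper, plus consumer Blackwell (12, 0).
NUNCHAKU = Backend(
    "nunchaku", lambda cap, precision: cap == (12, 0) or (7, 5) <= cap < (9, 0)
)
# Hypothetical future backend for datacenter Blackwell (SM_100/103).
FLASHINFER = Backend("flashinfer", lambda cap, precision: cap[0] == 10)


def select_backend(
    cap: Capability, precision: str, candidates: Sequence[Backend]
) -> Backend:
    """Return the first candidate whose `supports` predicate accepts the device."""
    for backend in candidates:
        if backend.supports(cap, precision):
            return backend
    raise RuntimeError(f"no SVDQuant backend supports SM_{cap[0]}{cap[1]} ({precision})")


# Adding a backend is a one-line change to the candidate list.
print(select_backend((12, 0), "nvfp4", [FLASHINFER, NUNCHAKU]).name)
```

Keeping the SM-to-backend mapping in one list means the config and linear method never need to know which kernels exist.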
Summary
Implements RFC #37908: adds `SVDQuantConfig` + `SVDQuantLinearMethod` that wrap the nunchaku W4A4 SVDQuant kernels, so SVDQuant-quantized checkpoints (mainly used for diffusion transformers today) can load directly through vLLM's standard `QuantizationConfig` flow instead of being patched in via downstream glue.

SVDQuant (https://arxiv.org/abs/2411.05007) is 4-bit weights + 4-bit activations with a low-rank SVD residual that absorbs quantization error. It is currently the dominant practical quantization for diffusion transformers, delivering 2x+ speedup vs BF16 with minimal quality loss.
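The effect of the low-rank branch can be seen in a small NumPy sketch. This is a simplified illustration, not the nunchaku implementation: it uses symmetric per-tensor INT4 fake quantization instead of NVFP4 or group scales, skips the smoothing step and activation quantization, and runs on a synthetic weight matrix built to have one dominant direction.

```python
import numpy as np


def quantize_int4(x: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor 4-bit fake quantization: round to integers in [-7, 7], then dequantize."""
    scale = np.abs(x).max() / 7.0
    if scale == 0:
        return x.copy()
    return np.clip(np.round(x / scale), -7, 7) * scale


def svd_lowrank_quantize(w: np.ndarray, rank: int) -> np.ndarray:
    """Keep the top-`rank` SVD component in full precision and quantize only the residual."""
    u, s, vt = np.linalg.svd(w, full_matrices=False)
    low_rank = (u[:, :rank] * s[:rank]) @ vt[:rank]
    residual = w - low_rank
    return low_rank + quantize_int4(residual)


rng = np.random.default_rng(0)
# Synthetic weight: a strong rank-1 component plus small dense noise.
w = 10.0 * np.outer(rng.standard_normal(64), rng.standard_normal(64))
w += 0.1 * rng.standard_normal((64, 64))

err_plain = np.linalg.norm(w - quantize_int4(w))
err_svd = np.linalg.norm(w - svd_lowrank_quantize(w, rank=1))
print(f"plain 4-bit error: {err_plain:.3f}, with rank-1 branch: {err_svd:.3f}")
```

Because the dominant component is removed before quantization, the residual has a much smaller dynamic range, so the 4-bit grid is spent on the part of the matrix that actually needs it.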
Why this is not duplicating an existing PR
- `gh pr list --repo vllm-project/vllm --state open --search "svdquant"` → 0 hits
- `gh pr list --repo vllm-project/vllm --state all --search "svdquant in:title"` → 0 hits
What's in this PR
- `vllm/model_executor/layers/quantization/svdquant.py`: `SVDQuantConfig` (rank, precision, act_unsigned, modules_to_not_convert) + `SVDQuantLinearMethod` (create_weights / apply / process_weights_after_loading)
- `vllm/model_executor/layers/quantization/utils/svdquant_dispatch.py`: `assert_svdquant_supported()`, the hardware gate (rejects Hopper SM_90, rejects datacenter Blackwell SM_100/103 with FlashInfer-planned note, rejects NVFP4 on pre-Blackwell, raises ImportError if nunchaku missing)
- `vllm/model_executor/layers/quantization/utils/svdquant_nvfp4_layout.py`: […] `process_weights_after_loading`)
- `vllm/utils/nunchaku.py`
- `tests/quantization/test_svdquant.py`: […] `create_weights` parameter-layout smoke tests, `modules_to_not_convert` skip path
- `vllm/model_executor/layers/quantization/__init__.py`
Hardware support
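The rules that `assert_svdquant_supported()` is described as enforcing can be sketched as a pure function over the compute capability. This is an assumption-laden illustration: the function name `check_svdquant_arch`, the precision strings `"int4"` / `"nvfp4"`, the error messages, and the pre-Turing rejection are not taken from the PR, and the nunchaku ImportError check is omitted.

```python
def check_svdquant_arch(major: int, minor: int, precision: str) -> None:
    """Raise RuntimeError if the (major, minor) compute capability cannot run SVDQuant W4A4.

    Hypothetical sketch of the hardware gate; returns None when supported.
    """
    cap = (major, minor)
    if cap == (9, 0):
        raise RuntimeError("SVDQuant: Hopper SM_90 is not supported by the nunchaku kernels")
    if major == 10:
        raise RuntimeError(
            "SVDQuant: datacenter Blackwell SM_100/103 is out of scope here "
            "(a FlashInfer path is planned)"
        )
    if cap < (7, 5):
        # Assumption: anything older than Turing is rejected.
        raise RuntimeError("SVDQuant: GPUs older than Turing (SM_75) are not supported")
    if precision == "nvfp4" and major < 12:
        raise RuntimeError("SVDQuant: NVFP4 requires Blackwell")


# Consumer Blackwell with NVFP4 and Ada with INT4 both pass the gate.
check_svdquant_arch(12, 0, "nvfp4")
check_svdquant_arch(8, 9, "int4")
print("supported")
```

Taking the capability as plain integers keeps the rules testable on CPU-only hosts; a real gate would presumably read it from the CUDA runtime.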
Design notes
- […] `process_weights_after_loading`. This decouples the on-disk format from the kernel backend, so when the FlashInfer datacenter Blackwell path lands later, the same checkpoint will work with no second copy needed.
- `assert_svdquant_supported()` runs in `SVDQuantLinearMethod.__init__`, so unsupported arches raise before any weights are allocated.
Test plan

    # Unit / smoke tests (most are CPU-friendly; CUDA-gated tests skip themselves)
    .venv/bin/python -m pytest tests/quantization/test_svdquant.py -v

End-to-end NVFP4 validation on consumer Blackwell (RTX 5090, SM_120) via Z-Image-Turbo:
Full quantitative tables + visual gallery + 8-prompt side-by-side images are in the companion vllm-omni PR.
Follow-ups (not in this PR)
- `assert_svdquant_supported` gets a third arm that routes SM_100/103 to a `flashinfer.svdquant.*` callable; no on-disk format change is required.
Closes / Refs
AI assistance: this PR's commits and PR description were produced with Claude Code assistance. Every change was reviewed and validated end-to-end on RTX 5090 by the human submitter.