
[Quantization] Add TurboQuant online weight compression (Linear-only)#39970

Open
varjoranta wants to merge 9 commits into vllm-project:main from varjoranta:feat/turboquant-online-weight-quant

Conversation

@varjoranta

@varjoranta varjoranta commented Apr 16, 2026

Purpose

Adds an online weight-only quantization scheme for Linear layers. Follows the pattern established in #38138 — a new enum value on OnlineQuantScheme, a matching LinearMethodBase implementation, wired into OnlineQuantizationConfig.get_quant_method.

The technique implemented is the scalar case of HIGGS (Malinovskii et al., Pushing the Limits of Large Language Model Quantization via the Linearity Theorem, NAACL 2025; preprint arXiv:2411.17525): Random Hadamard Transform + MSE-optimal Lloyd–Max grid + per-group normalization. ~4x compression at 3 bits with zero calibration data. A reference implementation also exists in HF transformers.

  1. Walsh–Hadamard randomized rotation projects weight groups so each coordinate is approximately N(0, 1/d)
  2. Lloyd–Max optimal scalar quantization at 2/3/4 bits into a small shared codebook
  3. Per-group shape-gain decomposition (classical VQ technique, Gray 1984) — store original_norm / reconstruction_norm rather than raw L2; halves error at 3 bits in practice
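For intuition, the three steps can be sketched in pure Python. This is an illustrative toy, not the vLLM implementation: the real path runs Triton kernels and the shared centroid module, and the 3-bit grid below lists approximate Lloyd-Max centroids for N(0, 1).

```python
import math, random

def fwht(v):
    """Normalized fast Walsh-Hadamard transform, O(d log d); len(v) must be a
    power of two. With the 1/sqrt(d) scaling the transform is its own inverse."""
    d = len(v)
    h = 1
    while h < d:
        for i in range(0, d, 2 * h):
            for j in range(i, i + h):
                x, y = v[j], v[j + h]
                v[j], v[j + h] = x + y, x - y
        h *= 2
    s = 1.0 / math.sqrt(d)
    return [x * s for x in v]

# Illustrative 3-bit grid: approximate Lloyd-Max centroids for N(0, 1).
GRID3 = [-2.152, -1.344, -0.756, -0.245, 0.245, 0.756, 1.344, 2.152]

def dequantize_group(idx, gain, signs, grid=GRID3):
    back = fwht([grid[i] for i in idx])          # WHT is involutory: same op inverts
    return [gain * b * s for b, s in zip(back, signs)]

def quantize_group(group, signs, grid=GRID3):
    rotated = fwht([g * s for g, s in zip(group, signs)])     # 1. randomized rotation
    idx = [min(range(len(grid)), key=lambda c: (r - grid[c]) ** 2)
           for r in rotated]                                  # 2. nearest centroid
    recon = dequantize_group(idx, 1.0, signs, grid)
    onorm = math.sqrt(sum(g * g for g in group))
    rnorm = math.sqrt(sum(r * r for r in recon)) or 1.0
    return idx, onorm / rnorm                                 # 3. shape-gain ratio
```

Because the normalized WHT is its own inverse and the sign flips undo themselves, dequantization reuses `fwht` directly.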

The implementation was originally based on TurboQuant (Zandieh et al., ICLR 2026, arXiv:2504.19874), which is actually an online vector quantizer for KV-cache and ANN vector search, not weights. Engineering simplifications (scalar over vector, WHT over general random rotations, Lloyd-Max over learned grids) converged the weight path onto the HIGGS scalar algorithm. The KV-cache application of TurboQuant is implemented separately in #38479 by @vibhavagarwal5. The turboquant API name is kept here for plugin-package compatibility; HIGGS is the correct primary citation for this PR. Thanks to @dalistarh for the attribution catches.

Scope: Linear layers only. MoE dispatch explicitly falls back to UnquantizedFusedMoEMethod with a comment — MoE compression needs per-expert scratch pool management and is deferred to a follow-up. The OnlineQuantScheme.TURBOQUANT enum value makes the extension point obvious.

Usage:

vllm serve <model> --quantization turboquant

Or via the Python API:

from vllm import LLM
llm = LLM(model="...", quantization="turboquant")

Key implementation points:

  • Meta-device init — bf16 weights never materialize on GPU; compression runs in process_weights_after_loading per-layer immediately after load.
  • Two Triton kernels registered as torch.library.custom_op with register_fake for fullgraph / torch.compile compat:
    • Fused dequant-GEMM for small output dims (loads rotation matrix into shared memory, fuses unpack+rotate+matmul)
    • FWHT-on-input GEMM for larger output dims (rotates activation once on host, avoids N inverse rotations)
  • BF16 tensor cores — main GEMM casts inputs to the output dtype before tl.dot; accumulator stays FP32 for precision.
  • Robustness — handles zero-token batches (M=0 chunked prefill), non-aligned hidden dims (auto-padding), shared memory caps (Ada/Hopper safe), PyTorch fallback when Triton unavailable.
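As an illustration of the auto-padding constraint (hypothetical helpers, not the actual vLLM code): the Walsh-Hadamard transform needs a power-of-two dimension, so non-aligned groups are zero-padded up to the next power of two.

```python
def next_pow2(d: int) -> int:
    """Smallest power of two >= d; hypothetical helper mirroring the auto-padding."""
    p = 1
    while p < d:
        p *= 2
    return p

def pad_group(group):
    """Zero-pad a weight group so the Walsh-Hadamard transform applies cleanly."""
    target = next_pow2(len(group))
    return group + [0.0] * (target - len(group))
```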

Test Plan

pytest tests/quantization/test_turboquant_online.py -v

33 CPU-only tests covering the known hard parts:

  • 3-bit packing cross-byte boundaries (positions 2 and 5 span bytes)
  • WHT rotation invariants (orthogonality, involution)
  • Padding logic for irregular tensor dimensions
  • PolarQuant norm correction round-trip error bounds
  • process_weights_after_loading idempotency (double-call guard)
  • Weight kept as empty(0) for MLA compatibility
  • M=0 zero-token batch early exit
  • Cosine similarity validation (>0.85 for TQ3, >0.92 for TQ4)
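The cross-byte case the first test targets can be sketched in pure Python (an illustrative little-endian layout; the actual kernel bit order may differ): 8 indices at 3 bits each fill exactly 3 bytes, and the indices at positions 2 and 5 straddle byte boundaries.

```python
def pack3(indices):
    """Pack 8 3-bit indices into 3 bytes (little-endian bit order)."""
    assert len(indices) == 8 and all(0 <= v < 8 for v in indices)
    bits = 0
    for k, v in enumerate(indices):
        bits |= v << (3 * k)       # positions 2 and 5 span byte boundaries
    return bits.to_bytes(3, "little")

def unpack3(packed):
    bits = int.from_bytes(packed, "little")
    return [(bits >> (3 * k)) & 0b111 for k in range(8)]
```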

Integration validated end-to-end on RTX 6000 Ada 48GB (sm_89) with Qwen2.5-0.5B:

vllm serve Qwen/Qwen2.5-0.5B --quantization turboquant --enforce-eager

— model loads, compresses, and serves coherent output via /v1/chat/completions.

Both Triton kernels verified against a PyTorch dequant-then-matmul reference:

| Kernel | Dim | cosine_sim vs reference |
|---|---|---|
| Fused dequant-GEMM | 64 | 1.0000 |
| FWHT-on-input GEMM | 4096 | 1.0000 |

Test Result

============================= test session starts ==============================
platform darwin -- Python 3.13.7, pytest-9.0.2
collected 33 items

tests/quantization/test_turboquant_online.py ..............................
============================== 33 passed in 8.94s ==============================

Hardware coverage: Ampere (A100), Ada (L40S / RTX 6000 Ada), Hopper (H100/H200). Minimum compute capability 8.0 — BF16 Tensor Cores were introduced in Ampere; Turing's 2nd-gen Tensor Cores only support FP16, so Turing is not supported for BF16 models. Documented in docs/features/quantization/turboquant.md and the hardware support table.

Related

Follow-ups (explicitly out of scope)

  • MoE compression (per-expert packed format, shared scratch pool) — tracked separately
  • Parametrized integration test in tests/quantization/test_online.py
  • Autotune configs for the fused dequant-GEMM kernel (currently heuristic on group_size)
  • Per-scheme get_min_capability on OnlineQuantizationConfig (currently all schemes share the 75 floor; TurboQuant actually needs 80)
  • Shared host-side turboquant math module — pack/unpack, fast_wht_batch, and PolarQuant pipeline currently live here; converge with the KV-cache turboquant module once bit layouts harmonize (see thread w/ @vibhavagarwal5)

Essential Elements of an Effective PR Description Checklist
  • The purpose of the PR
  • The test plan
  • The test results
  • Documentation updated (docs/features/quantization/turboquant.md + hardware table)
  • (Optional) Release notes — will update if reviewers request

@github-actions

👋 Hi! Thank you for contributing to the vLLM project.

💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels.

PRs do not trigger a full CI run by default. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can either: Add ready label to the PR or enable auto-merge.

If you have any questions, please reach out to us on Slack at https://slack.vllm.ai.

Agent Guidelines

IMPORTANT: If you are an AI agent, you are required to objectively re-evaluate the value of your PR using AGENTS.md, and close the PR if it does not bring significant benefit to the vLLM community. Failure to do so may result in an immediate ban.

🚀

@mergify
Contributor

mergify bot commented Apr 16, 2026

Documentation preview: https://vllm--39970.org.readthedocs.build/en/39970/

@mergify mergify bot added the documentation Improvements or additions to documentation label Apr 16, 2026
@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from 1c47a44 to 06bd3ad Compare April 16, 2026 05:59
Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces TurboQuant, an online weight-only quantization scheme that compresses weights to 3 or 4 bits at model load time using Walsh-Hadamard randomized rotation and Lloyd-Max optimal scalar quantization. The implementation includes specialized Triton kernels for efficient dequantization and GEMM operations, along with comprehensive unit tests and documentation. Feedback identifies a critical dependency on scipy for centroid calculations, which should be replaced with native math or torch implementations to maintain vLLM's standard dependency profile.

Comment thread vllm/model_executor/layers/quantization/online/turboquant.py Outdated
Comment thread vllm/model_executor/layers/quantization/online/turboquant.py Outdated
@vibhavagarwal5
Contributor

Let's reuse the basic turboquant functions instead of rewriting them, or modify them if there's some difference.

@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from 2948814 to d477fa0 Compare April 16, 2026 06:20
@varjoranta
Author

@vibhavagarwal5 @gemini-code-assist Addressed in d477fa0. Dropped scipy entirely and now reuse get_centroids from vllm.model_executor.layers.quantization.turboquant.centroids (the module added in #38479). Net −44 lines of duplicated math, −38 lines of redundant centroid tests (already covered by tests/quantization/test_turboquant.py). Thanks for the pointers!

@gemini-code-assist
Contributor

Thanks for the update, @varjoranta! Reusing get_centroids from the existing turboquant module is definitely the right approach to avoid duplication and ensure consistency. The cleanup looks great. I'll take a look at the updated implementation now.

@vibhavagarwal5
Contributor

@varjoranta what about the Lloyd codebook etc.? Can we reuse that code as well?

@varjoranta
Author

Good catch — pushed a067f9a. Now using generate_wht_signs(d, seed=seed) and generate_wht_signs(d, seed=seed+1) so signs1/signs2 stay uncorrelated.

All the basic math in vllm/.../turboquant/ is now reused:

  • get_centroids (d477fa0, pulls in solve_lloyd_max transitively)
  • generate_wht_signs (a067f9a)
  • TurboQuantConfig / TQ_PRESETS — KV-specific, not applicable here

The remaining helpers in this PR (_fast_wht_batch, _pack_indices/_unpack_indices, _PolarQuant.quantize/dequantize) are host-side for offline weight compression. Your KV-cache path does the equivalents inside Triton store/decode kernels, so the split makes sense — host-side pipeline for weights, device-side pipeline for hot KV. If we want to converge these later (e.g., share the bit-packing bitwise logic), happy to do it in a follow-up.
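The seed-offset trick can be sketched with a hypothetical stand-in for `generate_wht_signs` (the real helper lives in the shared turboquant module and may differ in signature and RNG):

```python
import random

def gen_signs(d, seed):
    """Hypothetical stand-in for generate_wht_signs: deterministic +/-1 vector."""
    rng = random.Random(seed)
    return [rng.choice((-1, 1)) for _ in range(d)]

signs1 = gen_signs(128, seed=0)
signs2 = gen_signs(128, seed=1)   # seed+1 keeps the two sign vectors uncorrelated
```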

@dalistarh
Contributor

Hi @varjoranta,

Thank you for your work!

The technique being implemented in steps 1 and 2 of your PR (Walsh-Hadamard + learned grid) is not TurboQuant, it's HIGGS in the scalar quantization case: https://arxiv.org/pdf/2411.17525
Please have a look at Section 4 of the paper and let me know if I'm missing anything.

All the best,
Dan

@varjoranta
Author

Thanks @dalistarh — you're right. Re-reading Section 4 of HIGGS, the scalar case is exactly what's implemented here: Random Hadamard Transform + per-group normalization + Gaussian MSE-optimal grid.

The attribution slip-up was honest: the implementation started from TurboQuant's more general vector/online framework, then took a series of practical simplifications during development — scalar over vector quantization (faster kernels, simpler bit-packing), WHT over general random rotations (O(n log n) vs O(n²)), Lloyd–Max over learned grids — and converged precisely on the HIGGS scalar algorithm. The path from "implementing TurboQuant" to "implementing HIGGS scalar" was gradual enough that the citation didn't get updated. Apologies.

Pushed b054aa6 with citations updated:

  • online/turboquant.py module docstring now cites HIGGS as the algorithm source, with TurboQuant noted as the framework the implementation came from
  • docs/features/quantization/turboquant.md has a new "Background and naming" section explaining the convergence
  • PR description updated

The --quantization turboquant API name and OnlineQuantScheme.TURBOQUANT enum stay (the plugin is named turboquant-vllm and renaming the user-facing flag mid-PR seems disruptive), but HIGGS is now the primary algorithm citation everywhere it matters. Appreciate the careful read.

@dalistarh
Contributor

Thanks @varjoranta !

The correct citation is this: https://aclanthology.org/2025.naacl-long.543/ (the NeurIPS 24 paper is a different result).

In case it's helpful to have access to another implementation, weight HIGGS has already been supported in HF transformers for a while:
https://github.com/huggingface/transformers/blob/main/docs/source/en/quantization/higgs.md

Cheers,
Dan

@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from b054aa6 to 868a33c Compare April 16, 2026 08:28
@varjoranta
Author

Thanks again — fixed in 868a33c8a. NAACL 2025 is now the primary HIGGS citation across the docstring, docs page, and PR body, with the HF transformers HIGGS implementation noted as a reference.

@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from 868a33c to e7952d1 Compare April 16, 2026 09:57
varjoranta added a commit to varjoranta/turboquant-vllm that referenced this pull request Apr 16, 2026
Validates the Linear-only `--quantization turboquant` path end-to-end on
real hardware. Two phases: direct Triton kernel vs PyTorch-reference
cosine similarity check (both fused dequant-GEMM and FWHT-on-input),
followed by a full vllm LLM.generate on Qwen2.5-0.5B.

Runtime ~2 min on any GPU with ≥4 GB VRAM; used to validate the PR at
vllm-project/vllm#39970 on RTX 6000 Ada (sm_89).

Requires vLLM built from the PR branch:
    pip install git+https://github.com/varjoranta/vllm-1.git@feat/turboquant-online-weight-quant
@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from e7952d1 to 6772613 Compare April 16, 2026 11:49
@gaby

gaby commented Apr 16, 2026

I think having MoE support is critical; almost all the top-tier open-source models are MoE.

@gaby

gaby commented Apr 16, 2026

@varjoranta Would it be possible to have:

  • --quantization turboquant
  • --quantization turboquant_tq4

Basically a way to enable 4-bit packing instead of 3-bit.

@varjoranta
Author

Thanks @gaby — both make sense.

On bits: the bits arg is already on TurboQuantOnlineLinearMethod; the config wiring just doesn't pass it through. Plan: route via --quantization-config '{"bits": 4}' rather than a per-bit-width flag, so 2-bit also works without a new enum. Small fix, will add to this PR.

On MoE: agreed, it's the critical gap for real-world use. I have a working MoE path (3D Triton kernels + per-forward scratch-pool choreography) in our plugin repo that served GLM-5.1 and Gemma 4 MoE on-GPU. Porting it here is real work though — roughly 500–1000 net lines and a new class of review surface (CUDA graph compat, expert-dispatch correctness, routing). My instinct is to land this Linear-only PR first and follow up with MoE as a second PR so each stays reviewable. But I'll defer to you and the maintainers on scope — would you rather see MoE in this PR, or as an immediate follow-up once this one lands?

@gaby

gaby commented Apr 16, 2026

@varjoranta The quantization-config approach is a great solution.

Regarding MoE, if doing it in a separate PR makes this one easier to merge, then it makes sense to keep them separate.

3-4 bit weight compression via WHT rotation + Lloyd-Max codebook.
Compress any BF16 checkpoint at startup with zero calibration data.

- PolarQuant quantizer with norm correction
- 2/3/4-bit packing into uint8
- Two Triton GEMM kernels (FWHT-on-input + fused dequant) registered
  as torch.library.custom_op for fullgraph compatibility
- BF16 tensor core GEMM with FP32 accumulator
- Input padding for non-aligned hidden dimensions
- M=0 early exit for chunked prefill
- Shared memory cap for Ada/Hopper compatibility
- PyTorch fallback for non-Triton environments
- Inherits LinearMethodBase, meta-device init, online processing

Tested on RTX 6000 Ada 48GB: Triton kernels cos_sim=1.0, vLLM generate OK.

Usage: vllm serve <model> --quantization turboquant
Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
CPU-only tests covering known hard parts:
- 3-bit cross-byte packing (positions 2 and 5 split across bytes)
- Norm correction ratio (original_norm / recon_norm)
- Non-power-of-2 dim padding in PolarQuant and apply()
- process_weights_after_loading idempotency (double-call guard)
- Weight kept as empty(0) for MLA compatibility
- Zero-token batch (M=0) early exit
- Full compress→matmul quality check (cosine similarity)
- PyTorch fallback path with bias and 3D input

All tensor creations explicitly pin device="cpu" so the fallback path
is exercised even when the default device resolves to MPS or CUDA.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Adds docs/features/quantization/turboquant.md describing usage and scope
(Linear-only, MoE deferred). Links from the quantization README and adds
a row to the hardware support table (Ampere+ via BF16 tensor cores).

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Addresses reviewer feedback from @vibhavagarwal5 and gemini-code-assist:

- Import get_centroids() from vllm.model_executor.layers.quantization.turboquant.centroids
  (added by the KV-cache TurboQuant PR vllm-project#38479 and already merged).
- Remove local _gaussian_cond_expect / _lloyd_max_centroids / _optimal_centroids.
- Remove scipy imports (vLLM keeps its core dependency surface minimal).
- Remove duplicated codebook unit tests (centroid correctness is already
  covered by tests/quantization/test_turboquant.py).

Net -44 lines in the implementation, -38 lines in tests. The serving
behavior is unchanged: both modules compute the same Lloyd-Max centroids
for N(0, 1/d), just without scipy. Tested locally; the bitwise-identical
test subset still passes (pack/unpack, WHT, PolarQuant roundtrip,
idempotency, M=0, 3D input, fallback path).

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Addresses follow-up from @vibhavagarwal5. Switches our inline
two-sign-vector generator to two generate_wht_signs() calls with seeds
(seed, seed+1) so signs1 and signs2 remain uncorrelated. Aligns all
basic turboquant math (centroids, signs) on the shared module added
in vllm-project#38479.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Per @dalistarh's review: the technique implemented (WHT rotation +
Lloyd-Max scalar grid + per-group normalization) is the scalar case of
HIGGS — Malinovskii, Panferov, Ilin, Guo, Richtárik, Alistarh,
"Pushing the Limits of LLM Quantization via the Linearity Theorem",
NAACL 2025 (preprint arXiv:2411.17525). Reference implementation
also exists in HuggingFace transformers (HiggsConfig).

Also corrects the TurboQuant arXiv ID: was 2503.19878 (CausalRAG, an
unrelated RAG paper), should be 2504.19874 — the real TurboQuant
(Zandieh et al., ICLR 2026) is an online vector quantizer for KV-cache
and ANN vector search, not weight quantization. The KV-cache
application is implemented in @vibhavagarwal5's vllm-project#38479.

Updates citations in:
- online/turboquant.py module docstring
- docs/features/quantization/turboquant.md
- (PR description updated separately)

API name (--quantization turboquant, OnlineQuantScheme.TURBOQUANT) is
kept for plugin-package compatibility; HIGGS is the primary algorithm
citation for this weight-compression path.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Global Walsh-Hadamard rotation conflicts with per-block scaled FP4
formats: a global rotation spreads outlier mass across block
boundaries and pollutes the per-block scales. Block-aligned rotation
for MXFP4/NVFP4 is a separate PR.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
Adds optional ``bits`` and ``group_size`` fields to
``OnlineQuantizationConfigArgs``. When turboquant is selected they
flow into ``TurboQuantOnlineLinearMethod`` so

    vllm serve <model> --quantization turboquant \
      --quantization-config '{"bits": 4}'

now picks 4-bit instead of the 3-bit default. The existing defaults
(bits=3, group_size=128) are preserved when these fields are unset.

Adds constructor-side validation that bits is in {2, 3, 4} and
group_size is a positive multiple of 8, so a bad config fails at
model load with a clear error rather than deep inside a Triton
kernel launch.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
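A minimal sketch of that fail-fast validation (hypothetical function name; the real check lives in the config constructor):

```python
def validate_turboquant_args(bits: int, group_size: int) -> None:
    """Fail fast at model load rather than deep inside a Triton kernel launch."""
    if bits not in (2, 3, 4):
        raise ValueError(f"turboquant: bits must be one of 2, 3, 4 (got {bits})")
    if group_size <= 0 or group_size % 8 != 0:
        raise ValueError(
            f"turboquant: group_size must be a positive multiple of 8 "
            f"(got {group_size})")
```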
@varjoranta varjoranta force-pushed the feat/turboquant-online-weight-quant branch from 1ea2fdd to 11a29d9 Compare April 17, 2026 04:17
@varjoranta
Author

Rebased on latest upstream (resolved merge conflicts with the new INT8 online quant scheme). --quantization-config '{"bits": 4}' plumbing is in the latest push per the discussion above.

Side note — while working on the plugin's Apple Silicon port, I validated the same compression math (HIGGS scalar codebook + shape-gain norms + WHT rotation) end-to-end on Qwen3.5-35B-A3B (256 experts, 40 layers). Coherent generation from a 70 GB checkpoint compressed to 15 GB. Wrote up the journey: varjosoft.com/70gb-on-48gb-mac.html

For the MoE follow-up: the key architectural lesson is that dequanting all experts per forward is a memory bomb (120 GB of int32 intermediates for 256 experts). The working design gathers only the top-k active experts' packed weights and dequants on the fly — same math, ~32× less memory. Will structure the CUDA MoE PR around that pattern.

The prior 3-bit decode did two 2D scatter loads per thread pair with a
non-unit-stride index pattern (bi0[k]=[0,0,0,1,1,1,2,2,...]) that Triton
could not vectorize, forcing each byte to a separate transaction.

Replace with a single coalesced bulk load of all 48 packed bytes per row
(padded to 64 for Triton's power-of-two tile constraint), then two
in-register tl.gather lookups to select the per-k bytes.

Measured 5x bs=1 latency improvement on Qwen3-8B (1x A100 80GB). Decoded
bit layout, weight memory footprint, and kernel correctness contract are
unchanged. CPU bit-equivalence verified across 100 random + edge cases.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
@varjoranta
Author

varjoranta commented Apr 17, 2026

Quick follow-up on this PR with a small kernel fix and a transparent look at where the bs=1 latency currently stands.

Change in 216b63bfe

The 3-bit decode in both _polar_fused_gemm_kernel and _tq_fused_gemm_kernel was doing two 2D scatter loads per thread pair with an index pattern bi0[k] = [0,0,0,1,1,1,2,2,3,3,3,...] — Triton couldn't vectorize that stride and it fell back to per-byte transactions.

Replaced with one coalesced 64-byte bulk load per row (padded to pow-of-2, tail masked) + two in-register tl.gathers to pick the per-k bytes. Same bit math, same decoded layout. 46-line diff.

Measured on Qwen3-8B, 1× A100 80GB:

  • bs=1 decode: 2 tok/s → 8.35 tok/s (≈ 5× kernel-path speedup; end-to-end lm-eval loglikelihood went from 18 s/batch to 0.4 s/batch at concurrency 8)
  • Weight memory unchanged: 4.96 GiB TQ3 vs 15.27 GiB BF16 (3.08× compression)
  • Quality unchanged: coherent Qwen3-8B output, CPU bit-equivalence verified across 100 random + edge cases

Where bs=1 latency still stands

Honest numbers vs BF16 after the fix:

  • BF16 Qwen3-8B bs=1: 90.1 tok/s
  • TQ3 Qwen3-8B bs=1: 8.35 tok/s — still ~10× slower than BF16

torch.profiler shows _polar_fused_gemm_kernel at 96.56% of total GPU time over the generation window. Per-call latency (A100, shape-weighted over Qwen3-8B's real layer mix): ~250–700 μs vs a ~6 μs memory-bound floor and ~23 μs for BF16 torch.nn.functional.linear on the same hardware.

Ablations ruled out the usual suspects:

| Suspect | Outcome |
|---|---|
| `_rotate_input` Python eager overhead | 0.8% of GPU time per profile. Noise. |
| `try`/`except` in `apply()` graph-breaking Dynamo | Ablation: +19% when running graph-mode; eager is a clean floor. Minor. |
| CUDA graph capture broken | Not broken; also not helping much at the current kernel cost. |
| BLOCK_M=1 tensor-core underutilization | Specialization (elementwise mul + `tl.sum`) yielded no measurable win. |
| `tl.gather` naive codegen on sm_80 | Real, but replacing with static-unpack + `tl.join`/`tl.trans` gave no relative improvement — same ALU pressure in different form. |
Tried swapping tl.gather for fully-static unpack via tl.join + tl.trans + tl.reshape: matched tl.gather on H100 per-call within measurement error. Reverted — the Triton primitives we'd need to escape the ALU bottleneck all carry comparable PTX-level costs on sm_80. Full diagnosis + per-call measurements in a writeup here: when-triton-stops.

Roadmap

The shipped diff above is the clean, self-contained Triton win. Closing the remaining ~10× gap to BF16 at bs=1 requires a dedicated hand-written CUDA GEMV kernel — same class as AWQ/Marlin/FLUTE/QuIP#. Scaffold + design notes started in a separate branch so the follow-up stays distinct: 1-D grid over N, warp-shuffle reductions (no tensor cores at M=1), 10-values-per-int32 pack format for clean static decode, lop3.b32 for bit combines, async weight pipeline.

Targeting bs=1 ≈ 1.3× BF16 latency (~70 tok/s on A100 for Qwen3-8B), matching AWQ class. Will land as a second PR once it's benchmarked.

For this PR: memory is the solid win today (3.08×, zero-calibration, correct), and the bs=1 latency gap has a specific, tracked path to close.

varjoranta added a commit to varjoranta/vllm-1 that referenced this pull request Apr 17, 2026
Adds a README + skeleton kernel.cu for a hand-written CUDA GEMV kernel
targeting batch size 1 decode. Companion to PR vllm-project#39970 — separate branch
so the shipped Triton-only PR stays small and reviewable.

The existing Triton kernel lands ~10x slower than BF16 at bs=1 on
Qwen3-8B (measured 8.35 tok/s TQ3 vs 90.1 tok/s BF16 on A100). The gap
is structural: Triton's 2D-tile abstractions saturate ALUs at M=1
regardless of the decode strategy. Full diagnosis in the research
notes. Raw CUDA with a 1D grid and warp-shuffle reductions is the
established path past this ceiling (Marlin, AWQ, FLUTE, QuIP#).

This commit only adds the scaffolding: planned architecture, reference
implementations to port from, work items, file layout. No compilable
kernel yet.

Signed-off-by: Hannu Varjoranta <hannu@varjosoft.com>
@varjoranta
Author

Quick update with measured numbers for the Phase 3 follow-up I mentioned:

Branch: feat/turboquant-gemv-bs1-cuda

On Qwen3-8B, single-request, A100 bs=1 (same harness as the bench table):

| Variant | tok/s |
|---|---|
| BF16 baseline | 87.6 |
| TQ3 with this PR (Phase 0 Triton) | 8.1 |
| TQ3 with Phase 3 kernel, compiled | 17.2 |

2.12× decode speedup over the Triton path at bs=1, captured into the size-1 CUDA graph.

One implementation note worth flagging for reviewers: vLLM's Dynamo traces the model once on profile_run (batch ≫ 1), so a Python-level if x.shape[0] == 1: cuda_path branch in apply() specializes against the traced shape and compiles out. The Phase 3 branch pushes the M-check inside a torch.library.custom_op which Dynamo treats as opaque; each size-specific CUDA-graph capture then re-runs the internal branch with the actual shape. Without that, the CUDA kernel is dormant in compiled mode.

Will open the follow-up PR once this one lands or if reviewers prefer seeing it sooner.

@dalistarh
Contributor

Note that the FLUTE kernel https://github.com/hanguo97/flute aims to solve the performance issue in this PR (fast online decode over a custom scalar grid). Maybe this helps. I think this even used to be supported in vLLM at some point:

https://github.com/hanguo97/flute#flute--vllm

Best,
Dan


@varjoranta
Author

Thanks @dalistarh — you're right to point to FLUTE. It was on the shortlist when designing the Phase 3 kernel (flagged above as the reference class alongside AWQ/Marlin/QuIP#). The shipped kernel chose warp-per-output-channel with fp32 accumulation and no tensor cores at M=1 (AWQ-style at bs=1), which got us to 17.2 tok/s but clearly leaves a sizable gap to what FLUTE's tensor-core + cp.async design achieves on the HIGGS scalar case.

Given that FLUTE is already the HF HIGGS integration's fast path (pip install flute-kernel), the right move is to evaluate adoption rather than keep iterating on the custom kernel: (a) whether the --quantization turboquant flag here can plug into FLUTE's qmap format with our Lloyd-Max codebook as input, and (b) what state the prior vLLM FLUTE wiring is in today. I'll report back on this PR.

Appreciate the continuing careful reads.
