Closed
52 commits
4ea5bb9
[Quantization] Add TurboQuant KV cache quantization
lishunyang12 Mar 26, 2026
f3fd27d
[Quantization] TurboQuant KV cache quantization with outlier-aware al…
lishunyang12 Mar 28, 2026
137b3a0
Separate TurboQuant backend + fused Triton encode/decode kernels
lishunyang12 Mar 29, 2026
608862c
Fix fused kernels: avoid bitcast, use scratch memory reinterpretation
lishunyang12 Mar 29, 2026
85ccf00
Rewrite fused kernels: avoid Triton bitcast/pointer casting
lishunyang12 Mar 29, 2026
5f5a2f0
Add debug script for fused kernel issues
lishunyang12 Mar 29, 2026
7839ff8
Fix: use same BLOCK_D for all tl.arange to avoid thread-mapping issues
lishunyang12 Mar 29, 2026
4af4dbc
Fix decode: direct arithmetic unpack, no scratch interleave
lishunyang12 Mar 29, 2026
0449ed6
Fix round-trip test: compare fused vs reference quality, not absolute
lishunyang12 Mar 29, 2026
f79a272
Fix unfused encode/decode: add missing 4-bit pack/unpack cases
lishunyang12 Mar 29, 2026
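The 4-bit pack/unpack this commit refers to follows a standard nibble-packing pattern: two 4-bit codes per byte, low nibble first. The sketch below is illustrative only (it is not the PR's kernel code) and assumes codes are uint8 values in [0, 15] with an even last dimension:

```python
import torch

def pack4(codes: torch.Tensor) -> torch.Tensor:
    # Pack pairs of 4-bit codes into one byte: even positions go to the
    # low nibble, odd positions to the high nibble.
    lo, hi = codes[..., 0::2], codes[..., 1::2]
    return (lo | (hi << 4)).to(torch.uint8)

def unpack4(packed: torch.Tensor) -> torch.Tensor:
    # Recover both nibbles and re-interleave them into the original order.
    lo = packed & 0x0F
    hi = (packed >> 4) & 0x0F
    return torch.stack((lo, hi), dim=-1).flatten(-2)
```

Round-tripping `unpack4(pack4(x))` should reproduce `x` exactly, which is the property the unfused encode/decode paths need to agree on.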
2b4ebc6
Add TQ_UNFUSED env var to bypass fused kernels for debugging
lishunyang12 Mar 29, 2026
41d965b
Add e2e debug script and TQ_UNFUSED env var
lishunyang12 Mar 29, 2026
cab66b9
Fix debug script: handle empty kv_cache during profiling
lishunyang12 Mar 29, 2026
85f64b2
Add inline debug prints in forward() for e2e diagnosis
lishunyang12 Mar 29, 2026
910961d
Increase debug print limit to capture real request
lishunyang12 Mar 29, 2026
5d77dbf
Fix: filter padding tokens in calibration and encode, remove debug files
lishunyang12 Mar 29, 2026
96cf682
Fix: skip calibration during profiling by checking forward context
lishunyang12 Mar 29, 2026
ef0cb7b
Add quality comparison: emulation vs packed cache path
lishunyang12 Mar 29, 2026
4d59393
Add attention integration test: bf16 vs TurboQuant cache
lishunyang12 Mar 29, 2026
3e50f03
Fix attention test: separate K/V caches
lishunyang12 Mar 29, 2026
1b3d6e5
Add mild RoPE mode and value cosine check to attention test
lishunyang12 Mar 29, 2026
f9a3f3a
Add test: 109 vs 128 Hadamard coefficients quality comparison
lishunyang12 Mar 29, 2026
41114c8
Fix Hadamard butterfly: add barriers between iterations for inter-war…
lishunyang12 Mar 29, 2026
5ac95bd
Fix: use num_warps=1 for Hadamard butterfly to eliminate inter-warp r…
lishunyang12 Mar 29, 2026
bad9e31
Add Hadamard round-trip test: 109 vs 128 coefficients with Triton
lishunyang12 Mar 29, 2026
2f3e6e5
Add comprehensive Hadamard round-trip quality test
lishunyang12 Mar 29, 2026
665f97b
Fix roundtrip test: use 4-bit codebook only
lishunyang12 Mar 29, 2026
23b4574
Fix codebook scale: use 1/sqrt(hadamard_d) not 1/sqrt(normal_size)
lishunyang12 Mar 29, 2026
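The codebook-scale fix above hinges on the Hadamard transform being orthonormal only when scaled by 1/sqrt(d) of the transform length actually used. As a minimal reference (a plain PyTorch butterfly, not the PR's Triton kernel), the fast Walsh-Hadamard transform and its self-inverse property look like this:

```python
import torch

def fwht(x: torch.Tensor) -> torch.Tensor:
    """Orthonormal fast Walsh-Hadamard transform along the last dim."""
    d = x.shape[-1]
    assert d & (d - 1) == 0, "length must be a power of two"
    y = x.reshape(-1, d).clone()
    h = 1
    while h < d:
        # One butterfly stage: combine pairs of length-h sub-blocks.
        y = y.view(-1, d // (2 * h), 2, h)
        a, b = y[:, :, 0], y[:, :, 1]
        y = torch.stack((a + b, a - b), dim=2).reshape(-1, d)
        h *= 2
    # Scaling by 1/sqrt(d) makes H orthonormal, so fwht(fwht(x)) == x.
    return (y / d**0.5).reshape(x.shape)
```

Using the wrong normalizer (e.g. 1/sqrt of a padded size rather than the Hadamard dimension) silently rescales every coefficient, which is exactly the kind of quality bug the round-trip tests in this PR are checking for.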
366b5e8
Remove debug scripts after root cause fix
lishunyang12 Mar 29, 2026
f2725f1
Fix CUDA graph capture: replace valid.any() with clamp(min=0)
lishunyang12 Mar 29, 2026
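The `valid.any()` to `clamp(min=0)` swap is a common CUDA-graph-compatibility pattern: `.any()` followed by a Python branch forces a device-to-host sync, which is illegal during graph capture, while clamping keeps the launch sequence static. A hedged sketch of the idea (hypothetical helper, not the PR's code):

```python
import torch

def gather_rows(cache: torch.Tensor, slots: torch.Tensor) -> torch.Tensor:
    # Branching on `(slots >= 0).any()` would require reading a device
    # tensor on the host, which breaks CUDA graph capture. Instead, clamp
    # negative padding slots to 0: the gather stays data-independent, and
    # rows fetched for padded slots are simply ignored downstream.
    safe_slots = slots.clamp(min=0)
    return cache[safe_slots]
```

The cost is a few redundant reads for padded entries; the benefit is that the whole decode step can be captured and replayed as a CUDA graph.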
b338f19
Add TurboQuant vs baseline benchmark
lishunyang12 Mar 29, 2026
910afbd
Restore num_warps=4 with barriers (codebook fix should improve quality)
lishunyang12 Mar 29, 2026
1c9a822
Clean up benchmark script
lishunyang12 Mar 29, 2026
81483c9
Optimize decode: trim block_table to needed blocks only
lishunyang12 Mar 29, 2026
4c4f1fc
Re-add benchmark script
lishunyang12 Mar 29, 2026
4c2cc42
Fix benchmark: add __main__ guard for spawn multiprocessing
lishunyang12 Mar 29, 2026
93c34a1
Fully fused decode kernel: slot bytes → decoded head in one launch
lishunyang12 Mar 29, 2026
cd0b13e
Optimize decode: move norm reading into kernel, cache block_table
lishunyang12 Mar 29, 2026
1e0507d
Revert norm reading to Python: Triton pointer casting unreliable
lishunyang12 Mar 29, 2026
117d46d
Add fused_paged_decode: cleaner decode API for backend
lishunyang12 Mar 29, 2026
203691a
Add TQ_LITE mode and asymmetric K/V bit allocation
lishunyang12 Mar 29, 2026
ddff724
Fix 3-bit double-padding bug in unfused encode
lishunyang12 Mar 29, 2026
763dc68
Add CUDA warp-per-head decode kernel (Phase 3)
lishunyang12 Mar 29, 2026
b1b5a7b
Add TQ_QJL and TQ_BITS env vars for QJL residual correction
lishunyang12 Mar 29, 2026
a52433d
Add QJL 1-bit residual correction to packed cache (Phase 5.1)
lishunyang12 Mar 29, 2026
4354afe
Fix QJL: generate S matrix on CPU to avoid device mismatch
lishunyang12 Mar 29, 2026
bfcb724
Fix QJL dtype: cast S matrix to float32 for matmul
lishunyang12 Mar 29, 2026
f0afe1b
Fix QJL: use mse_bits for packing, not bit_width
lishunyang12 Mar 29, 2026
bd79bee
Clean up: remove dead code, broken QJL, unused fields (-392 lines)
lishunyang12 Mar 30, 2026
38cba45
Add comprehensive TurboQuant test suite (55 tests, ~145 parametrized …
lishunyang12 Mar 31, 2026
b927e80
Fix unused variable lint error in benchmark
lishunyang12 Mar 31, 2026
aa6e58e
Replace torch.cuda with torch.accelerator in benchmarks
lishunyang12 Mar 31, 2026
tests/kernels/bench_turboquant.py (44 additions, 0 deletions)
@@ -0,0 +1,44 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project
"""Benchmark TurboQuant vs baseline throughput."""

import time

import torch

from vllm import LLM, SamplingParams

MODEL = "Qwen/Qwen2.5-7B-Instruct"
PROMPT = "Explain the theory of general relativity in detail."


def benchmark(kv_cache_dtype="auto", label="baseline"):
    print(f"\n{'=' * 50}")
    print(f" {label}")
    print(f"{'=' * 50}")
    llm = LLM(
        MODEL,
        kv_cache_dtype=kv_cache_dtype,
        max_model_len=4096,
        gpu_memory_utilization=0.5,
    )
    params = SamplingParams(max_tokens=128, temperature=0.0)
    llm.generate([PROMPT], params)  # warmup
    torch.accelerator.synchronize()
    t0 = time.perf_counter()
    out = llm.generate([PROMPT] * 8, params)
    torch.accelerator.synchronize()
    t1 = time.perf_counter()
    toks = sum(len(o.outputs[0].token_ids) for o in out)
    tp = toks / (t1 - t0)
    print(f" {toks} tokens in {t1 - t0:.2f}s = {tp:.1f} tok/s")
    print(f" Sample: {out[0].outputs[0].text[:80]}...")
    del llm
    torch.accelerator.empty_cache()
    return tp


if __name__ == "__main__":
    t1 = benchmark("auto", "BASELINE (bf16)")
    t2 = benchmark("turboquant", "TURBOQUANT (4-bit)")
    print(f"\nRatio: {t2 / t1:.2f}x")