
Add llmcompressor fp8 kv-cache quant (per-tensor and per-attn_head)#30141

Merged
LucasWilkinson merged 25 commits into vllm-project:main from
eldarkurtic:expand-static-scaled-fp8-quant
Jan 22, 2026
Conversation


@eldarkurtic eldarkurtic commented Dec 5, 2025

TLDR: this PR adds support for loading and running llm-compressor models with FP8 KV-cache and attention quantization. In addition to the standard "per-tensor" quantization, it adds support for "per-attention-head" quantization.

Summary

  1. Enables the existing "per-tensor" KV-cache (and query) FP8 quantization pathway to use scales calibrated through llm-compressor.
  2. The Flash Attention v3 backend supports finer-grained scales, i.e. one scale per attention head. This was previously unsupported; this PR enables it in the following way:

2.1 for query quantization:

  • expands QuantFP8 to support per-channel static quantization (queries have shape num_tokens x hidden_size, so the per-attention-head scales are expanded to accommodate per-channel scaling)
  • this is backed by extending the static_scaled_fp8_quant kernel to work with an array of scales

2.2 for kv-quantization:

  • expands the reshape_and_cache_flash kernel by adding support for an array of k_scale/v_scale values
  • covers both cache layouts (NHD and HND)

  3. Reorganizes the KV-cache handling in vllm/attention/layer.py, which had so far been hardcoded for the calculate_kv_scales pathway only.
  4. Expands tests to cover all kernel-related updates.
  5. Updates the documentation on the newly supported KV-cache and attention quantization technique in vLLM.
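The per-channel expansion for query quantization (2.1) can be sketched in plain Python/NumPy. This is an illustrative emulation, not the actual static_scaled_fp8_quant CUDA kernel: the cast to float8_e4m3 is approximated here by clipping to the FP8 range, and the function name mirrors the kernel only for readability.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite value representable in float8_e4m3

def static_scaled_fp8_quant_per_channel(x, head_scales, head_dim):
    """Quantize a [num_tokens, num_heads * head_dim] query tensor using one
    static scale per attention head, expanded to one scale per channel.
    The actual cast to float8_e4m3 is emulated by clipping only."""
    num_heads = x.shape[1] // head_dim
    assert head_scales.shape == (num_heads,)
    # Expand per-attention-head scales to per-channel scales (one per column),
    # so the existing per-channel static-quantization path can be reused.
    channel_scales = np.repeat(head_scales, head_dim)
    q = np.clip(x / channel_scales, -FP8_E4M3_MAX, FP8_E4M3_MAX)
    return q, channel_scales

x = np.random.randn(4, 8).astype(np.float32)  # 4 tokens, 2 heads, head_dim=4
head_scales = np.array([0.01, 0.02], dtype=np.float32)
q, channel_scales = static_scaled_fp8_quant_per_channel(x, head_scales, head_dim=4)
```

The real kernel receives the already-expanded array of scales; the `np.repeat` step corresponds to the scale expansion done on the Python side before the kernel call.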

Tests

To confirm that the existing pathway with calculate_kv_scales=True/False isn't affected, I ran GSM8k evals on the following models: Llama-2-7b-chat-hf (MHA), Llama-3.1-8B-Instruct (GQA), and Qwen/Qwen3-8B (a different model family), and got the same results before and after this PR. They are as follows:

  • Model = meta-llama/Llama-2-7b-chat-hf
unquantized baseline
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2509|±  |0.0119|
|     |       |strict-match    |     5|exact_match|↑  |0.1933|±  |0.0109|

calculate_kv_scales=False
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2563|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.1911|±  |0.0108|

calculate_kv_scales=True
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.0053|±  |0.0020|
|     |       |strict-match    |     5|exact_match|↑  |0.0000|±  |0.0000|

Note: The score of 0 in calculate_kv_scales=True is already present on main; this PR does not introduce it.

  • Model = meta-llama/Llama-3.1-8B-Instruct
unquantized baseline
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8522|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8249|±  |0.0105|


calculate_kv_scales=False
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8302|±  |0.0103|
|     |       |strict-match    |     5|exact_match|↑  |0.8036|±  |0.0109|


calculate_kv_scales=True
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8324|±  |0.0103|
|     |       |strict-match    |     5|exact_match|↑  |0.8127|±  |0.0107|
  • Model = Qwen/Qwen3-8B
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8810|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8764|±  |0.0091|

calculate_kv_scales=True
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8787|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8734|±  |0.0092|


calculate_kv_scales=False
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8741|±  |0.0091|
|     |       |strict-match    |     5|exact_match|↑  |0.8688|±  |0.0093|

And to confirm that the new support for llm-compressor models with both per-tensor and per-attention-head scales works correctly, I ran the same models as above with both configurations and observed the expected results:

  • Model = meta-llama/Llama-2-7b-chat-hf
unquantized baseline
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2509|±  |0.0119|
|     |       |strict-match    |     5|exact_match|↑  |0.1933|±  |0.0109|

per-tensor
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2555|±  |0.0120|
|     |       |strict-match    |     5|exact_match|↑  |0.1835|±  |0.0107|

per-attn-head
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.2441|±  |0.0118|
|     |       |strict-match    |     5|exact_match|↑  |0.1736|±  |0.0104|
  • Model = meta-llama/Llama-3.1-8B-Instruct
unquantized baseline
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8522|±  |0.0098|
|     |       |strict-match    |     5|exact_match|↑  |0.8249|±  |0.0105|


per-tensor
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8362|±  |0.0102|
|     |       |strict-match    |     5|exact_match|↑  |0.8021|±  |0.0110|

per-attn-head
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8393|±  |0.0101|
|     |       |strict-match    |     5|exact_match|↑  |0.8112|±  |0.0108|
  • Model = Qwen/Qwen3-8B
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8810|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8764|±  |0.0091|

per-tensor
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8810|±  |0.0089|
|     |       |strict-match    |     5|exact_match|↑  |0.8757|±  |0.0091|

per-attn-head
|Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
|-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
|gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8787|±  |0.0090|
|     |       |strict-match    |     5|exact_match|↑  |0.8757|±  |0.0091|

The llm-compressor code used to produce these models is available in the diffs of docs/features/quantization/quantized_kvcache.md. I haven't tuned the calibration parameters for these tests, just ran the defaults, so better results are expected with tuning.
I've also verified that the changes work with both the LLM class and vllm serve.

Note: some model implementations guard the remapping of scales with a check like if "scale" in name, so I had to expand it to if "scale" in name or "zero_point" in name: to support loading the zero_points present in llm-compressor checkpoints.


mergify bot commented Dec 5, 2025

Documentation preview: https://vllm--30141.org.readthedocs.build/en/30141/

@mergify mergify bot added documentation Improvements or additions to documentation llama Related to Llama models speculative-decoding v1 labels Dec 5, 2025

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.



mergify bot commented Dec 5, 2025

Hi @eldarkurtic, the pre-commit checks have failed. Please run:

uv pip install pre-commit
pre-commit install
pre-commit run --all-files

Then, commit the changes and push to your branch.

For future commits, pre-commit will run automatically on changed files before each commit.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces support for per-attention-head FP8 KV cache quantization, primarily for use with Flash Attention and llm-compressor. The changes involve modifying CUDA kernels to handle both per-tensor and per-attention-head scaling factors, adding a new kernel for per-channel static FP8 quantization, and updating the reshape_and_cache_flash_kernel to use a kv_scale_stride for flexible scale access.

The Python _custom_ops are updated to support per-channel scales, and the attention layer initialization logic is refactored to correctly set KV cache quantization attributes and query quantization group shapes based on the loaded llm-compressor configuration. Model weight loading utilities are extended to remap q_scale and zero_point parameters, and a new CompressedTensorsConfig method is added to process llm-compressor loaded scales, including reducing q_scale for Flash Attention and repeating it for QuantFP8 operations.

Comprehensive unit tests were added to cover per-attention-head scaling in reshape_and_cache_flash and per-channel static FP8 quantization. The documentation for FP8 KV Cache has been significantly updated to detail per-attention-head quantization and the various scale calibration approaches, and provides an example using llm-compressor.

A review comment highlighted brittle logic in CompressedTensorsConfig.from_config for filtering attention quantization config groups, suggesting a more robust check for empty target lists and iterating through all targets.



mergify bot commented Dec 5, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eldarkurtic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify mergify bot added the needs-rebase label Dec 5, 2025
@mgoin mgoin added the ready ONLY add when PR is ready to merge/full CI is needed label Dec 11, 2025
@eldarkurtic eldarkurtic force-pushed the expand-static-scaled-fp8-quant branch from 09ff5ce to 0dbf3be Compare December 15, 2025 07:53
@mergify mergify bot removed the needs-rebase label Dec 15, 2025
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
@eldarkurtic eldarkurtic force-pushed the expand-static-scaled-fp8-quant branch from cabdcc9 to cf5c848 Compare January 22, 2026 15:56
@mergify mergify bot removed the needs-rebase label Jan 22, 2026
@LucasWilkinson LucasWilkinson merged commit 44f08af into vllm-project:main Jan 22, 2026
95 of 96 checks passed
cwazai pushed a commit to cwazai/vllm that referenced this pull request Jan 25, 2026
…llm-project#30141)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: 陈建华 <1647430658@qq.com>
lapy pushed a commit to lapy/vllm that referenced this pull request Jan 27, 2026
…llm-project#30141)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
zhandaz pushed a commit to CentML/vllm that referenced this pull request Jan 30, 2026
…llm-project#30141)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
xuechendi pushed a commit to vllm-project/vllm-gaudi that referenced this pull request Feb 5, 2026
Adapt the update in vllm-project/vllm#30141 

```python
        # llm-compressor models need to set cache_dtype to "fp8" manually.
        if getattr(quant_config, "kv_cache_scheme", None) is not None:
            kv_cache_dtype = "fp8"
            calculate_kv_scales = False
            if cache_config is not None:
                cache_config.cache_dtype = "fp8"
                cache_config.calculate_kv_scales = False

        self.kv_cache_torch_dtype = kv_cache_dtype_str_to_dtype(
            kv_cache_dtype, vllm_config.model_config
        )
        self.kv_cache_dtype = kv_cache_dtype
```


cc @hshen14 @thuang6 @lkk12014402

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
Signed-off-by: slokesha <slokeshappa@habana.ai>
zhandaz pushed a commit to CentML/vllm that referenced this pull request Feb 11, 2026
…llm-project#30141)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
ItzDEXX pushed a commit to ItzDEXX/vllm that referenced this pull request Feb 19, 2026
…llm-project#30141)

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: eldarkurtic <8884008+eldarkurtic@users.noreply.github.com>
