Conversation


@kylesayrs commented Aug 26, 2025

Purpose

  • Support fully-expressive attention and kv cache quantization
  • Support running kv cache quantization evals with hf transformers

Prerequisites

Changes

New Classes

  • Add hookable attention and kv cache implementations, which are registered on the attention module as submodules
    • QuantizedAttentionImpl injects itself into the model by registering a new attention implementation called ct_hooked_attention and overriding model.config._attn_implementation to the new implementation name
    • QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention and wrapping the functionality of the original cache
    • Calibration and transform hooks can be added to these modules via the register_query_hook, register_key_hook, and register_value_hook functions (see the sketch after this list)
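
To make the hook surface concrete, here is a minimal, self-contained sketch of the hook mechanism in plain PyTorch. Only the register_query_hook / register_key_hook / register_value_hook names come from this PR; the toy class, its attributes, and the hook signature below are illustrative assumptions rather than the real implementation.

# Illustrative toy only -- the real hooked classes are QuantizedAttentionImpl and
# QuantizedKVCache from this PR; attribute names and hook signature are assumptions.
import torch


class ToyHookedKVCache(torch.nn.Module):
    """Stand-in for a hooked kv cache: registered hooks fire on every update."""

    def __init__(self):
        super().__init__()
        self._key_hooks, self._value_hooks = [], []

    def register_key_hook(self, fn):
        self._key_hooks.append(fn)

    def register_value_hook(self, fn):
        self._value_hooks.append(fn)

    def update(self, key_states, value_states):
        # calibration/transform hooks may just observe the states, or return
        # transformed versions that replace them
        for fn in self._key_hooks:
            out = fn(self, key_states)
            key_states = out if out is not None else key_states
        for fn in self._value_hooks:
            out = fn(self, value_states)
            value_states = out if out is not None else value_states
        return key_states, value_states


cache = ToyHookedKVCache()
cache.register_key_hook(lambda mod, k: print("observed keys:", tuple(k.shape)))
cache.update(torch.randn(1, 8, 4, 64), torch.randn(1, 8, 4, 64))

In the PR itself, the calibration workflow uses these hook points to attach observer hooks (see the review discussion further down).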

Quantization Lifecycle Changes

  • Apply
    • The kv_cache_scheme field of the quantization config is now used to call initialize_hooked_kv_cache
    • Attention modules can now be targeted explicitly; when they are, initialize_hooked_attention is called (see is_narrow_match, and the recipe sketch after this list)
    • Removed the logic for "merging" kv cache schemes (this never made much sense, and it is unclear why it was included)
  • Initialize
    • Hooked kv cache and attention modules have their quantization parameters initialized by initialize_module_for_quantization
    • The presence of attention or kv cache submodules determines whether attention quantization or kv-cache-only quantization is being applied
  • Serialization
    • QuantizationConfig.from_pretrained was cleaned up with additional comments
    • The kv_cache_scheme field is added to the serialized config if any attention modules have a quantization_scheme attached
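
For a concrete example of explicitly targeting attention modules (rather than only setting kv_cache_scheme), a recipe along the following lines can be used. This is a sketch that mirrors the commented-out config_groups block in compress.py further down; the LlamaAttention target name is model-specific.

# Sketch of an attention-targeted recipe; mirrors the commented-out config_groups
# block in compress.py below. Target names are model-specific.
from compressed_tensors.quantization import QuantizationArgs, QuantizationScheme
from llmcompressor.modifiers.quantization import QuantizationModifier

attn_args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    config_groups={
        "attention": QuantizationScheme(
            targets=["LlamaAttention"],   # narrow match on the attention class itself
            input_activations=attn_args,  # quantizes q/k/v activations; implies kv cache quantization
        )
    }
)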

Helpers

  • is_narrow_match is used to check that attention modules are specifically targeted (rather than targeting all modules in a layer); a toy sketch follows below
  • get_num_attn_heads, get_num_kv_heads, and get_head_dim read attention shape values (head counts and head dimension) from the model config
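
To illustrate the intent of is_narrow_match (not its actual implementation or signature, which may differ), a toy version might look like this:

# Toy illustration of the "narrow match" idea: the attention module itself is
# targeted, but its parent decoder layer is not (i.e. this is not a blanket
# "quantize everything in this layer" target). The real helper may differ.
import re


def toy_is_narrow_match(targets, module_name: str, parent_name: str) -> bool:
    def matches(name: str) -> bool:
        return any(re.search(pattern, name) for pattern in targets)

    return matches(module_name) and not matches(parent_name)


# the attention module is targeted specifically -> narrow match
print(toy_is_narrow_match([r"self_attn$"], "model.layers.0.self_attn", "model.layers.0"))  # True
# the whole decoder layer is targeted -> not a narrow match
print(toy_is_narrow_match([r"layers\.0"], "model.layers.0.self_attn", "model.layers.0"))   # False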

Testing

  • Added tests for is_narrow_match
  • Added tests for the new attention and kv cache classes
  • Quantized models
    • kylesayrs/Llama-3.2-1B-Instruct-attention-fp8-head
    • kylesayrs/Llama-3.2-1B-Instruct-attention-nvfp4-head

Evaluation

eval.py
import sys
import lm_eval

model_id = sys.argv[1]

print(model_id)
results = lm_eval.simple_evaluate(
    # evaluate the HF-serialized (compressed) model
    model="hf",
    model_args={
        "pretrained": model_id,
        "add_bos_token": False,
        "dtype": "auto",
        "device_map": "cuda",
        #"max_length": 128000,
    },
    device="cuda",

    #tasks=["gsm8k_platinum", "mmlu_llama", "longbench2_single"],
    tasks=["gsm8k_platinum"],
    batch_size=64,
    apply_chat_template=True,
    fewshot_as_multiturn=True,
)
print(model_id)
print(lm_eval.utils.make_table(results))
compress.py
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
from llmcompressor.utils import dispatch_for_generation
from compressed_tensors.quantization import QuantizationScheme, QuantizationArgs

# Select model and load it.
#model_id = "Qwen/Qwen2.5-14B-Instruct-1M"
model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Select calibration dataset.
DATASET_ID = "ultrachat_200k"
DATASET_SPLIT = "train_sft"

# Select number of samples. 512 samples is a good place to start.
# Increasing the number of samples can improve accuracy.
NUM_CALIBRATION_SAMPLES = 512
MAX_SEQUENCE_LENGTH = 2048

# Configure the quantization algorithm to run.
args = QuantizationArgs(
    num_bits=8,
    type="float",
    strategy="attn_head",
    symmetric=True,
    observer="static_minmax",
)
recipe = QuantizationModifier(
    # config_groups={
    #     "attention": QuantizationScheme(
    #         #targets=["Qwen2Attention"],
    #         targets=["LlamaAttention"],
    #         input_activations=args,
    #     )
    # }
    kv_cache_scheme=args,
)

# Apply algorithms.
oneshot(
    model=model,
    dataset=DATASET_ID,
    splits={"calibration": f"{DATASET_SPLIT}[:{NUM_CALIBRATION_SAMPLES}]"},
    recipe=recipe,
    max_seq_length=MAX_SEQUENCE_LENGTH,
    num_calibration_samples=NUM_CALIBRATION_SAMPLES,
)

# Confirm generations of the quantized model look sane.
print("\n\n")
print("========== SAMPLE GENERATION ==============")
dispatch_for_generation(model)
sample = tokenizer("Hello my name is", return_tensors="pt")
sample = {key: value.to(model.device) for key, value in sample.items()}
output = model.generate(**sample, max_new_tokens=100)
print(tokenizer.decode(output[0]))
print("==========================================\n\n")

# Save to disk compressed.
SAVE_DIR = model_id.rstrip("/").split("/")[-1] + f"-KV-FP8-{args.strategy}-{args.observer}"
model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)

Model                                   GSM8K
Llama-3.1-8B-Instruct                   0.8337
Llama-3.1-8B-Instruct-KV-FP8-Tensor     0.8271
Llama-3.1-8B-Instruct-KV-FP8-Head       0.8354
Llama-3.1-8B-Instruct-QKV-FP8-Tensor    0.8321
Llama-3.1-8B-Instruct-QKV-FP8-Head      0.8238

@brian-dellabetta left a comment

This looks good, though I have a number of questions and minor suggestions.

@dsikka left a comment

If the goal is to use this generally for kv_cache and attention quantization, can we move initialize_hooked_attention and initialize_hooked_kv_cache to initialize.py?

I understand we haven't hooked them in yet for those workflows, but I think these belong there.

@dsikka previously approved these changes Sep 2, 2025
@dsikka left a comment

Do a pass through on any missing docstrings; otherwise LGTM. Nice work.

Base automatically changed from kylesayrs/transform-simplify-key to main September 8, 2025 18:46
@dsikka dismissed stale reviews from brian-dellabetta and themself September 8, 2025 18:46

The base branch was changed.

@kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from e224a5d to 05ec17e on October 8, 2025 19:20
@kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 8, 2025 19:20
@brian-dellabetta left a comment

Following for the most part. A few clarifications, but this makes sense to me

@kylesayrs marked this pull request as draft October 8, 2025 21:06
@kylesayrs force-pushed the kylesayrs/add-attn-head-strat branch from d084c5e to e3f24d4 on October 9, 2025 14:19
@kylesayrs changed the base branch from kylesayrs/add-attn-head-strat to main October 9, 2025 18:14
@kylesayrs dismissed brian-dellabetta’s stale review October 9, 2025 18:14

The base branch was changed.

@kylesayrs changed the base branch from main to kylesayrs/add-attn-head-strat October 9, 2025 18:15
Base automatically changed from kylesayrs/add-attn-head-strat to main October 9, 2025 20:11

@kylesayrs marked this pull request as ready for review October 13, 2025 20:41
@kylesayrs commented

Last nightly worked, but e2e failed due to model storage issues
https://github.com/neuralmagic/llm-compressor-testing/actions/runs/18483826999

@kylesayrs force-pushed the kylesayrs/r3-only branch 2 times, most recently from 4cc5ace to 9ead292 on October 14, 2025 04:21
@brian-dellabetta left a comment

We can resolve the global var thread. I have another new comment we might want to consider in a follow-up, but I'm marking this as approved. Cool stuff! Excited to see it in action.

@dsikka left a comment

Just some questions. Otherwise, LGTM

if scheme.weights is not None:
    raise ValueError(
        "Cannot apply weight quantization to attention. "
        "Instead, target (q|k|v)_proj"

Reviewer:
This error doesn't make a lot of sense; it took me a while to realize you're saying that if you want to do weight quantization, you should target the linear layers in the attention block, not attention itself.

@kylesayrs (author):
Is this clearer?

raise ValueError(
  "Cannot apply weight quantization to attention. "
  "Instead, target the (q|k|v)_proj submodule layers of attention"

"""
if not hasattr(module, KV_CACHE_ATTR):
module.register_module(KV_CACHE_ATTR, QuantizedKVCache(model.config, module))
module.register_forward_pre_hook(_kv_cache_attention_hook, with_kwargs=True)

Reviewer:
If I'm reading this correctly, _kv_cache_attention_hook is called before every forward pass? So we're replacing the kv_cache before every forward pass with the new quantized cache?

@kylesayrs (author):
Yes, that's exactly correct. I've buffed up the docstrings to make this clearer.

@kylesayrs (author), quoting the PR description:
QuantizedKVCache injects itself into the model by overriding the past_key_values input kwarg to attention, and wrapping the functionality of the original cache
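
For readers following along, here is a self-contained illustration in plain PyTorch of the mechanism discussed in this thread: a forward pre-hook intercepts the past_key_values kwarg on every call, stashes the original cache on the quantized wrapper, and passes the wrapper to attention instead. The class and hook names below are stand-ins, not the PR's actual code.

# Illustration only -- ToyAttention, ToyQuantizedKVCache, and the hook below are
# stand-ins for the PR's attention module, QuantizedKVCache, and
# _kv_cache_attention_hook, respectively.
import torch


class ToyAttention(torch.nn.Module):
    def forward(self, hidden_states, past_key_values=None):
        print("attention received cache of type:", type(past_key_values).__name__)
        return hidden_states


class ToyQuantizedKVCache(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.past_key_values = None  # refreshed by the pre-hook on every forward


def toy_kv_cache_attention_hook(module, args, kwargs):
    # stash the original cache on the wrapper, then hand the wrapper to attention
    module.kv_cache.past_key_values = kwargs.get("past_key_values")
    kwargs["past_key_values"] = module.kv_cache
    return args, kwargs


attn = ToyAttention()
attn.register_module("kv_cache", ToyQuantizedKVCache())
attn.register_forward_pre_hook(toy_kv_cache_attention_hook, with_kwargs=True)

attn(torch.randn(1, 4, 8), past_key_values="original cache object")
# prints: attention received cache of type: ToyQuantizedKVCache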

# ----- hooks ----- #


def register_key_hook(

Reviewer:
I can't seem to find where the key / value hooks get registered

@kylesayrs (author):
These hooks are used to attach observer hooks (and any other hooks we might want to add in the future); see here.

  # infer format
  if format is None:
-     if quantization_status == QuantizationStatus.COMPRESSED:
+     if model_status == QuantizationStatus.COMPRESSED:

Reviewer:
I know this is unrelated, but defaulting to int doesn't make a lot of sense either.

@kylesayrs (author), Oct 15, 2025:
I agree. This was the original behavior of this logic.

quantization_status = None
ignore = {}
quantization_type_names = set()
from compressed_tensors.quantization.lifecycle.initialize import (

Reviewer:
Thanks for cleaning this up. It doesn't seem like we're adding anything here, apart from how we're fetching the kv_cache scheme?

I still find our ignore logic very confusing

@kylesayrs (author):
I entirely agree; I've created an issue to track potential removal: #494.

This PR does not change behavior; it only makes the existing logic easier to read and adds this line to infer the kv cache scheme:

# attention quantization implies kv cache quantization
if is_attention_module(submodule):
    kv_cache_scheme = submodule.quantization_scheme.input_activations

@dsikka left a comment

For the sake of completeness, do you mind adding your kv_cache and attn quantized sample models to this PR description?

    )
else:
    ret = (key_states, value_states)
self.past_key_values = None

Reviewer:
Why do we set this to None?

@kylesayrs (author), Oct 16, 2025:
Ensures that the cache is only used once. This should theoretically never be a problem, since the self.past_key_values attribute is always written to by the _kv_cache_attention_hook, but this is done just for peace of mind and to avoid dangling references, even if they are weak.


@brian-dellabetta left a comment

Impressive work!
