[CT] Fix CT Config to honor fp8_inc KV cache dtype #929

Merged
xuechendi merged 3 commits into vllm-project:main from yiliu30:fix-llmc-kv on Feb 5, 2026

Conversation

yiliu30 (Contributor) commented Feb 4, 2026

Adapt the update in vllm-project/vllm#30141

```python
        # llm-compressor models need to set cache_dtype to "fp8" manually.
        if getattr(quant_config, "kv_cache_scheme", None) is not None:
            kv_cache_dtype = "fp8"
            calculate_kv_scales = False
            if cache_config is not None:
                cache_config.cache_dtype = "fp8"
                cache_config.calculate_kv_scales = False

        self.kv_cache_torch_dtype = kv_cache_dtype_str_to_dtype(
            kv_cache_dtype, vllm_config.model_config
        )
        self.kv_cache_dtype = kv_cache_dtype
```

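For readers skimming the thread, here is a minimal, self-contained sketch of the pattern this PR applies (per the review summary below): let the parent compressed-tensors config run its fp8 defaulting, then restore the fp8_inc request afterwards. The `CacheConfig` and `BaseCTConfig` classes below are simplified stand-ins for illustration only, not the real vLLM types.

```python
# Minimal sketch of "override KV cache settings after parent initialization".
# CacheConfig and BaseCTConfig are simplified stand-ins used only to
# illustrate the pattern; the real classes live in vLLM / vllm-gaudi.
from dataclasses import dataclass


@dataclass
class CacheConfig:
    cache_dtype: str = "auto"
    calculate_kv_scales: bool = False


class BaseCTConfig:
    """Mimics the upstream behavior quoted above: pins the KV cache to fp8."""

    def __init__(self, cache_config: CacheConfig):
        # Upstream: llm-compressor models get cache_dtype forced to "fp8".
        cache_config.cache_dtype = "fp8"
        cache_config.calculate_kv_scales = False
        self.cache_config = cache_config
        self.kv_cache_dtype = "fp8"


class HPUCTConfig(BaseCTConfig):
    """HPU-side override: honor an explicit fp8_inc request."""

    def __init__(self, cache_config: CacheConfig, requested_kv_dtype: str):
        super().__init__(cache_config)
        if requested_kv_dtype == "fp8_inc":
            # Restore the user's choice after the parent pinned "fp8".
            self.kv_cache_dtype = "fp8_inc"
            self.cache_config.cache_dtype = "fp8_inc"


cfg = HPUCTConfig(CacheConfig(), requested_kv_dtype="fp8_inc")
assert cfg.cache_config.cache_dtype == "fp8_inc"
```

Deferring to the parent initializer first means the HPU plugin only has to undo the single field the upstream change clobbers, rather than re-implementing the whole config.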
cc @hshen14 @thuang6 @lkk12014402

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Copilot AI left a comment

Pull request overview

This PR fixes a configuration issue in the Compressed Tensors implementation for HPU (Habana Processing Unit) so that a requested fp8_inc KV cache dtype is honored instead of being overwritten with the default fp8 format.

Changes:

  • Added a custom __init__ method to HPUCompressedTensorsConfig that overrides KV cache settings after parent initialization

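To make the user-facing effect concrete, here is an illustrative offline-inference snippet that requests the fp8_inc KV cache on Gaudi; the checkpoint path is a placeholder and this is a usage sketch, not code from the PR.

```python
# Usage sketch: request the INC fp8 KV cache explicitly. With this fix, the
# compressed-tensors config on HPU keeps "fp8_inc" instead of silently
# falling back to "fp8". The checkpoint path below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/llm-compressor-fp8-checkpoint",  # placeholder path
    dtype="bfloat16",
    kv_cache_dtype="fp8_inc",  # the dtype this PR makes the CT config honor
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```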
Review comment threads on vllm_gaudi/ops/hpu_compressed_tensors.py (one marked outdated)

github-actions bot commented Feb 5, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
17b17c068453e6dc6af79240bb94857ae175cc51

xuechendi merged commit 175572b into vllm-project:main on Feb 5, 2026
55 checks passed
adobrzyn pushed a commit that referenced this pull request Feb 6, 2026
Depends on #929

- Local test
```bash
vllm ({'pretrained': '/mnt/disk5/hf_models/Qwen3-8B-FP8_STATIC-FP8-Attn-LLMC-Test-Only/', 'tensor_parallel_size': 8, 'max_model_len': 4096, 'max_num_seqs': 64, 'gpu_memory_utilization': 0.85, 'dtype': 'bfloat16', 'max_gen_toks': 2048, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'kv_cache_dtype': 'fp8_inc'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8999|±  |0.0083|
# |     |       |strict-match    |     5|exact_match|↑  |0.8999|±  |0.0083|
```
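For reference, roughly the same check can be driven through lm-eval-harness's Python API; the arguments below simply mirror the log above, and this is a sketch of how one might reproduce it rather than the exact command that produced these numbers.

```python
# Sketch: reproduce the gsm8k check above via lm-eval-harness's Python API.
# The model path and arguments mirror the log; adjust to your environment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/mnt/disk5/hf_models/Qwen3-8B-FP8_STATIC-FP8-Attn-LLMC-Test-Only/,"
        "tensor_parallel_size=8,max_model_len=4096,dtype=bfloat16,"
        "gpu_memory_utilization=0.85,kv_cache_dtype=fp8_inc"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=128,
)
print(results["results"]["gsm8k"])
```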
cc @hshen14 @thuang6

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026