[CT] Fix CT Config to honor fp8_inc KV cache dtype #929

Merged
xuechendi merged 3 commits into vllm-project:main from yiliu30:fix-llmc-kv on Feb 5, 2026

Conversation

yiliu30 (Contributor) commented Feb 4, 2026

Adapt the update in vllm-project/vllm#30141

```python
        # llm-compressor models need to set cache_dtype to "fp8" manually.
        if getattr(quant_config, "kv_cache_scheme", None) is not None:
            kv_cache_dtype = "fp8"
            calculate_kv_scales = False
            if cache_config is not None:
                cache_config.cache_dtype = "fp8"
                cache_config.calculate_kv_scales = False

        self.kv_cache_torch_dtype = kv_cache_dtype_str_to_dtype(
            kv_cache_dtype, vllm_config.model_config
        )
        self.kv_cache_dtype = kv_cache_dtype
```

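For readers skimming the thread, here is a minimal, self-contained sketch of the pattern this PR applies (per the review summary below): let the parent compressed-tensors config run its fp8 defaulting, then restore the fp8_inc request afterwards. The `CacheConfig` and `BaseCTConfig` classes below are simplified stand-ins for illustration only, not the real vLLM types.

```python
# Minimal sketch of "override KV cache settings after parent initialization".
# CacheConfig and BaseCTConfig are simplified stand-ins used only to
# illustrate the pattern; the real classes live in vLLM / vllm-gaudi.
from dataclasses import dataclass


@dataclass
class CacheConfig:
    cache_dtype: str = "auto"
    calculate_kv_scales: bool = False


class BaseCTConfig:
    """Mimics the upstream behavior quoted above: pins the KV cache to fp8."""

    def __init__(self, cache_config: CacheConfig):
        # Upstream: llm-compressor models get cache_dtype forced to "fp8".
        cache_config.cache_dtype = "fp8"
        cache_config.calculate_kv_scales = False
        self.cache_config = cache_config
        self.kv_cache_dtype = "fp8"


class HPUCTConfig(BaseCTConfig):
    """HPU-side override: honor an explicit fp8_inc request."""

    def __init__(self, cache_config: CacheConfig, requested_kv_dtype: str):
        super().__init__(cache_config)
        if requested_kv_dtype == "fp8_inc":
            # Restore the user's choice after the parent pinned "fp8".
            self.kv_cache_dtype = "fp8_inc"
            self.cache_config.cache_dtype = "fp8_inc"


cfg = HPUCTConfig(CacheConfig(), requested_kv_dtype="fp8_inc")
assert cfg.cache_config.cache_dtype == "fp8_inc"
```

Deferring to the parent initializer first means the HPU plugin only has to undo the single field the upstream change clobbers, rather than re-implementing the whole config.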
cc @hshen14 @thuang6 @lkk12014402

Signed-off-by: yiliu30 <yi4.liu@intel.com>

Copilot AI left a comment

Pull request overview

This PR fixes a configuration issue in the Compressed Tensors implementation for HPU (Habana Processing Unit) so that a requested fp8_inc KV cache dtype is honored instead of being overwritten with the default fp8 format.

Changes:

  • Added a custom __init__ method to HPUCompressedTensorsConfig that overrides KV cache settings after parent initialization

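To make the user-facing effect concrete, here is an illustrative offline-inference snippet that requests the fp8_inc KV cache on Gaudi; the checkpoint path is a placeholder and this is a usage sketch, not code from the PR.

```python
# Usage sketch: request the INC fp8 KV cache explicitly. With this fix, the
# compressed-tensors config on HPU keeps "fp8_inc" instead of silently
# falling back to "fp8". The checkpoint path below is a placeholder.
from vllm import LLM, SamplingParams

llm = LLM(
    model="path/to/llm-compressor-fp8-checkpoint",  # placeholder path
    dtype="bfloat16",
    kv_cache_dtype="fp8_inc",  # the dtype this PR makes the CT config honor
)
outputs = llm.generate(["The capital of France is"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)
```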
Review comment threads on vllm_gaudi/ops/hpu_compressed_tensors.py (one marked outdated)

github-actions bot commented Feb 5, 2026

✅ CI Passed

All checks passed successfully against the following vllm commit:
17b17c068453e6dc6af79240bb94857ae175cc51

xuechendi merged commit 175572b into vllm-project:main on Feb 5, 2026
55 checks passed
adobrzyn pushed a commit that referenced this pull request Feb 6, 2026
Depends on #929

- Local test
```bash
vllm ({'pretrained': '/mnt/disk5/hf_models/Qwen3-8B-FP8_STATIC-FP8-Attn-LLMC-Test-Only/', 'tensor_parallel_size': 8, 'max_model_len': 4096, 'max_num_seqs': 64, 'gpu_memory_utilization': 0.85, 'dtype': 'bfloat16', 'max_gen_toks': 2048, 'enable_prefix_caching': False, 'max_num_batched_tokens': 32768, 'kv_cache_dtype': 'fp8_inc'}), gen_kwargs: ({}), limit: None, num_fewshot: None, batch_size: 128
# |Tasks|Version|     Filter     |n-shot|  Metric   |   |Value |   |Stderr|
# |-----|------:|----------------|-----:|-----------|---|-----:|---|-----:|
# |gsm8k|      3|flexible-extract|     5|exact_match|↑  |0.8999|±  |0.0083|
# |     |       |strict-match    |     5|exact_match|↑  |0.8999|±  |0.0083|
```
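For reference, roughly the same check can be driven through lm-eval-harness's Python API; the arguments below simply mirror the log above, and this is a sketch of how one might reproduce it rather than the exact command that produced these numbers.

```python
# Sketch: reproduce the gsm8k check above via lm-eval-harness's Python API.
# The model path and arguments mirror the log; adjust to your environment.
import lm_eval

results = lm_eval.simple_evaluate(
    model="vllm",
    model_args=(
        "pretrained=/mnt/disk5/hf_models/Qwen3-8B-FP8_STATIC-FP8-Attn-LLMC-Test-Only/,"
        "tensor_parallel_size=8,max_model_len=4096,dtype=bfloat16,"
        "gpu_memory_utilization=0.85,kv_cache_dtype=fp8_inc"
    ),
    tasks=["gsm8k"],
    num_fewshot=5,
    batch_size=128,
)
print(results["results"]["gsm8k"])
```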
cc @hshen14 @thuang6

---------

Signed-off-by: yiliu30 <yi4.liu@intel.com>
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
slokesha pushed a commit to libinta/vllm-gaudi that referenced this pull request Feb 9, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026
adobrzyn pushed a commit that referenced this pull request Mar 31, 2026