When using an FP8 quantized model with FP8 KV cache scales, I get this warning in newer versions of vLLM:
WARNING 03-25 22:34:30 [kv_cache.py:82] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
I am indeed using the Flash Attention 3 backend.
How can I provide the Q scaling factor to vLLM?
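For context, here is how I read the fallback the warning describes, written out as a small sketch (my own illustration, not vLLM's actual code): when the checkpoint ships `k_scale`/`v_scale` but no `q_scale`, the Q scale is simply copied from `k_scale`.

```python
# Sketch of the fallback the warning describes; resolve_attn_scales is a
# hypothetical helper used only for illustration, not part of vLLM.
def resolve_attn_scales(checkpoint_scales: dict) -> dict:
    k_scale = checkpoint_scales["k_scale"]
    v_scale = checkpoint_scales["v_scale"]
    # No q_scale in the checkpoint -> fall back to k_scale, which is what the warning reports.
    q_scale = checkpoint_scales.get("q_scale", k_scale)
    return {"q_scale": q_scale, "k_scale": k_scale, "v_scale": v_scale}

print(resolve_attn_scales({"k_scale": 0.031, "v_scale": 0.027}))
# -> {'q_scale': 0.031, 'k_scale': 0.031, 'v_scale': 0.027}
```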
I use this recipe:
```python
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["re:.*lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""
```