When using an FP8 quantized model with FP8 KV cache scales, I get this warning in newer versions of vLLM:
WARNING 03-25 22:34:30 [kv_cache.py:82] Checkpoint does not provide a q scaling factor. Setting it to k_scale. This only matters for the flash-attn backend.
I am indeed using the Flash Attention 3 backend.
How can I provide the Q scaling factor to vLLM?
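For context, here is how I read the fallback the warning describes, written out as a small sketch (my own illustration, not vLLM's actual code): when the checkpoint ships `k_scale`/`v_scale` but no `q_scale`, the Q scale is simply copied from `k_scale`.

```python
# Sketch of the fallback the warning describes; resolve_attn_scales is a
# hypothetical helper used only for illustration, not part of vLLM.
def resolve_attn_scales(checkpoint_scales: dict) -> dict:
    k_scale = checkpoint_scales["k_scale"]
    v_scale = checkpoint_scales["v_scale"]
    # No q_scale in the checkpoint -> fall back to k_scale, which is what the warning reports.
    q_scale = checkpoint_scales.get("q_scale", k_scale)
    return {"q_scale": q_scale, "k_scale": k_scale, "v_scale": v_scale}

print(resolve_attn_scales({"k_scale": 0.031, "v_scale": 0.027}))
# -> {'q_scale': 0.031, 'k_scale': 0.031, 'v_scale': 0.027}
```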
I use this recipe:
```python
recipe = """
quant_stage:
    quant_modifiers:
        QuantizationModifier:
            ignore: ["re:.*lm_head"]
            config_groups:
                group_0:
                    weights:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    input_activations:
                        num_bits: 8
                        type: float
                        strategy: tensor
                        dynamic: false
                        symmetric: true
                    targets: ["Linear"]
            kv_cache_scheme:
                num_bits: 8
                type: float
                strategy: tensor
                dynamic: false
                symmetric: true
"""
```