Add KV-Cache int8 quant support #10354
YanyunDuanIEI wants to merge 19 commits into vllm-project:main from
Conversation
Signed-off-by: Yanyun Duan <duanyanyun@inspur.com>
👋 Hi! Thank you for contributing to the vLLM project. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can do one of these:
This pull request has merge conflicts that must be resolved before it can be merged.
Would it be viable to hasten the review process?

This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI Hello, is the KV-cache int8 quantization in this PR online or offline? The code requires calibration sets such as 'c4'. Could you document the model conversion steps, as was done in PR #1507?
It is offline, and the demo is located in the
Thank you for your answer. |
@YanyunDuanIEI Hello, is there a download path for the calibration set files "ceval_val_cmcc.jsonl" and "mapping.json" used for "ceval_val_cmcc" and "ceval"?
Hi, if you are still interested in getting this in, please fix the merge conflict, thank you!
Most of the datasets are in LLaMA-Factory, located in the
@YanyunDuanIEI This doesn't seem to support models from the qwen2 series. Does it?
This examples directory adds a lot of lines, especially due to the scales in the work_dir. If you want to keep this example, please try to:
- Rename the dir to int8_kv_cache
- Write a README describing how to use it
- Clean up/consolidate these scripts if possible
- Possibly remove the work_dir? I think it is reasonable to keep one set of scales as a demonstration, but I don't see a reason to keep so many.

I think once this support lands, we can easily update llmcompressor with examples to produce calibrated int8 KV cache scales, similar to what we have for FP8 now: https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_kv_cache
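As a rough illustration of what such calibration produces, a per-layer symmetric int8 KV scale can be derived from the abs-max of keys/values observed on a calibration set. This is a minimal sketch of the math only, not this PR's calibration script:

```python
def int8_kv_scale(absmax: float) -> float:
    # symmetric int8 quantization: map [-absmax, absmax] onto [-127, 127]
    return absmax / 127.0 if absmax > 0 else 1.0

# pretend these are per-layer abs-max statistics gathered on a calibration set
k_absmax_per_layer = [2.54, 1.27, 0.635]
k_scales = [int8_kv_scale(m) for m in k_absmax_per_layer]
```

A group-level variant would simply track one abs-max per `quant_group`-sized slice of `head_size` instead of one per layer.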
```cpp
float k_scale = 0;
float v_scale = 0;
if constexpr (KV_DTYPE == Fp8KVCacheDataType::kInt8Group128) {
  int64_t tgt_kvs_idx = floor((kv_head_idx * HEAD_SIZE) / quant_group);
```
I think there is no need to keep quant_group as an argument, since we have the KVCacheDataType as a template parameter. In the kInt8Group128 case we know quant_group will be 128, so I think we can remove this parameter completely.
```cpp
// printf("\n dequant scale= %f, zero_point= %f \n", scale, zero_point);
// if(abs(res+1.268555)<=0.01)
//   printf("\nI am here int8_to_float, x = %d, a= %d, res=%f, scale=%f, zero_point=%f \n",
//          x, a, res, scale, zero_point);
```

```cpp
// printf("\n quant scale= %f \n", scale);
// if(abs(x+1.268555)<=0.00001)
//   printf("\nI am here float_to_int8, x = %f, fx= %d, res=%d, scale=%f, zero_point=%f, (x-zero_point) / scale)=%f \n",
//          x, fx, res, scale, zero_point, (x-zero_point) / scale);
```
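For reference, the asymmetric int8 quant/dequant these debug prints trace corresponds to roughly the following. This is a Python sketch of the math, assuming the kernel's `(x - zero_point) / scale` convention; it is not the kernel code itself:

```python
def float_to_int8(x: float, scale: float, zero_point: float) -> int:
    # quantize: q = round((x - zero_point) / scale), clamped to the int8 range
    q = round((x - zero_point) / scale)
    return max(-128, min(127, q))

def int8_to_float(q: int, scale: float, zero_point: float) -> float:
    # dequantize: x ~= q * scale + zero_point
    return q * scale + zero_point

# round-trip error is bounded by half a quantization step (scale / 2)
q = float_to_int8(0.51, scale=0.01, zero_point=0.0)
assert abs(int8_to_float(q, 0.01, 0.0) - 0.51) <= 0.005
```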
```cpp
template <typename Tout, typename Tin>
__inline__ __device__ Tout scaled_vec_conversion_int8(const Tin& x,
                                                      const float scale,
                                                      const float zero_point) {
  return x;
}
```
This does not seem right; what is the purpose of this definition?
```python
k_scales: torch.Tensor,
v_scales: torch.Tensor,
```
nit: we tend to just use scale rather than scales, even in the case of using tensors; see these kernels as an example: vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py, lines 204 to 212 in c2d1b07
```python
k_scale=k_scale,
v_scale=v_scale,
quant_group,
k_scales,
v_scales,
```
Please use named assignment of args here
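Concretely, positional arguments after keyword arguments, as in the quoted hunk, are a Python syntax error, so the trailing args need names anyway. A minimal sketch; the function name and signature here are hypothetical, only the parameter names come from the PR:

```python
def write_kv_cache(*, k_scale, v_scale, quant_group, k_scales, v_scales):
    # hypothetical stand-in for the PR's call target
    return (k_scale, v_scale, quant_group, k_scales, v_scales)

# all arguments passed by name, as the review asks
result = write_kv_cache(
    k_scale=1.0,
    v_scale=1.0,
    quant_group=128,
    k_scales=None,
    v_scales=None,
)
```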
```python
k_scales_lists = v_scales_lists = [1.0]
# k_scales_lists = [0.16]
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
if cache_config.cache_dtype.startswith("int8"):
    if cache_config.kv_quant_params_path is not None:
        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
        if self._quant_group != 0:
            self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
            self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
Suggested change (drop the commented-out sample scales in favor of a `default_scale` constant):

```python
default_scale = [1.0]
self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
if cache_config.cache_dtype.startswith("int8"):
    if cache_config.kv_quant_params_path is not None:
        k_scales_lists = cache_config.kv_quant_params[0].pop(0)
        v_scales_lists = cache_config.kv_quant_params[1].pop(0)
        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
        if self._quant_group != 0:
            self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
            self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size // self._quant_group))
```
```python
# v_scales_lists = [0.005]
self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
self._quant_group = cache_config.kv_quant_group
```
We can deduce kv_quant_group from the cache_config.cache_dtype, as mentioned in the kernels
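A sketch of that deduction; the helper name and the exact dtype strings are assumptions, since the PR only shows cache_dtype values that start with "int8":

```python
def kv_quant_group_from_dtype(cache_dtype: str) -> int:
    # hypothetical helper: derive the group size from the cache dtype name,
    # e.g. "int8_group128" -> 128; plain "int8" -> 0 (per-layer scales)
    prefix = "int8_group"
    if cache_dtype.startswith(prefix):
        return int(cache_dtype[len(prefix):])
    return 0
```

With a mapping like this, both the kernels (via the template parameter) and the config layer could agree on the group size without threading an extra argument through.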
This pull request has merge conflicts that must be resolved before it can be merged.
@YanyunDuanIEI May I ask: on ROCm, when performing accuracy verification, the variable quantization_param_path needs to specify a file path; is it the same as the variable kv_quant_params_path? Or can we specify the generated JSON files kv_cache_scales_layer_level.json and kv_cache_scales_quant_group128.json separately?
@YanyunDuanIEI do you plan to complete this PR? I ask as it has been almost 2 months since it was last updated. If not, I would like to close it as stale.

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

Stale
Add KV-Cache int8 quant support
Support `layer_level` and `group_level` KV-Cache int8 quant:
- `layer_level` uses a common scale factor for each layer.
- `group_level` splits head_size into groups of group_size; within each group, the key/value scaling factors share the same value.

KV-Cache int8 quant
Get the scaling factor by calibration
Supports calibrating the KV-cache on datasets:
- `examples/int8/calibrate.py` calibrates and saves the results to a `.pth` file.
- `export_kv_params.py` saves the scaling factors to JSON.

Using KV-Cache int8
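The calibrate-then-export flow described above could look roughly like the following; the JSON layout and key names here are illustrative assumptions, not the PR's exact format (the file name kv_cache_scales_layer_level.json is mentioned earlier in the thread):

```python
import json

# pretend calibration produced one (k, v) scale pair per layer
calibrated = {"k_scales": [0.02, 0.01], "v_scales": [0.004, 0.006]}

# serialize the scaling factors the way export_kv_params.py might,
# so vLLM can load them back via kv_quant_params_path
payload = json.dumps(calibrated, indent=2)

# a loader would then read the scales back per layer
loaded = json.loads(payload)
assert loaded["k_scales"] == [0.02, 0.01]
```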