Add KV-Cache int8 quant support by YanyunDuanIEI · Pull Request #10354 · vllm-project/vllm

YanyunDuanIEI · 2024-11-15T06:00:52Z

Add KV-Cache int8 quant support

Support [layer_level] and [group_level] KV-Cache int8 quant.

[layer_level] use common scale factors for each layer.
[group_level] group the head_size according to group_size, with each group_size, the scaling factor of key/value corresponding to the same value.

KV-Cache int8 quant (Click to Expand)

Get the scaling factor by calibration

Support to calibrate the KV-cache by datasets:

[examples/int8/calibrate.py] calibrate and save to pth.
[export_kv_params.py] save scaling factors to json.

Using KV-Cache int8

kv_cache_dtype="int8"
kv_quant_params_path=kv_quant_params_path
kv_quant_group=kv_quant_group

Signed-off-by: Yanyun Duan <duanyanyun@inspur.com>

github-actions · 2024-11-15T06:01:04Z

👋 Hi! Thank you for contributing to the vLLM project.
Just a reminder: PRs would not trigger full CI run by default. Instead, it would only run fastcheck CI which starts running only a small and essential subset of CI tests to quickly catch errors. You can run other CI tests on top of those by going to your fastcheck build on Buildkite UI (linked in the PR checks section) and unblock them. If you do not have permission to unblock, ping simon-mo or khluu to add you in our Buildkite org.

Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging.

To run CI, PR reviewers can do one of these:

Add ready label to the PR
Enable auto-merge.

🚀

mergify · 2024-11-17T02:04:05Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @YanyunDuanIEI.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

csrc/attention/attention_kernels.cuh

…tensors

YanyunDuanIEI · 2024-12-11T05:59:15Z

Would it be viable to hasten the review process?

mergify · 2024-12-17T06:14:45Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @YanyunDuanIEI.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kahakuka · 2025-01-20T06:29:15Z

@YanyunDuanIEI Hello, is the kv cache int8 quantization in this PR online or offline? The code requires calibration sets such as' c4 '. Can it be like this PR（ #1507 ）Write down the operation of model transformation?

YanyunDuanIEI · 2025-01-20T08:27:38Z

@YanyunDuanIEI Hello, is the kv cache int8 quantization in this PR online or offline? The code requires calibration sets such as' c4 '. Can it be like this PR（ #1507 ）Write down the operation of model transformation?

It is offline, and the demo is located in the examples/int8/ directory. There is an execution demo named examples/int8/run_calibrate.sh.

kahakuka · 2025-01-20T08:30:15Z

@YanyunDuanIEI Hello, is the kv cache int8 quantization in this PR online or offline? The code requires calibration sets such as' c4 '. Can it be like this PR（ #1507 ）Write down the operation of model transformation?

It is offline, and the demo is located in the examples/int8/ directory. There is an execution demo named examples/int8/run_calibrate.sh.

Thank you for your answer.

kahakuka · 2025-01-20T12:57:34Z

@YanyunDuanIEI Hello, May I ask if there is a download path for the calibration set files "ceval_val_cmcc.jsonl" and "mapping. json" for "ceval_val_cmcc" and "ceval"?

simon-mo · 2025-01-21T17:45:27Z

Hi, if you are still interested in getting this in, please fix the merge conflict, thank you!

YanyunDuanIEI · 2025-01-22T08:31:30Z

@YanyunDuanIEI Hello, May I ask if there is a download path for the calibration set files "ceval_val_cmcc.jsonl" and "mapping. json" for "ceval_val_cmcc" and "ceval"?

Most of the datasets are in LLaMA-Factory, located in the LLaMA-Factory/evaluation/.

kahakuka · 2025-01-23T07:04:36Z

@YanyunDuanIEI This doesn't seem to support models from the qwen2 series. Is it?

mgoin · 2024-12-21T15:58:34Z

examples/int8/calib_dataloader.py

This examples directory has a lot of lines, especially due to the scales in the work_dir. If you want to keep this example, please try to:

Rename dir to int8_kv_cache

Write a README describing how to use

Cleanup/consolidate these scripts if possible

Possibly remove the work_dir? I think it is reasonable to keep one set of scales as demonstration, but I don't see a reason to keep so many

I think once this support lands, we can easily update llmcompressor with examples to produce calibrated int8 kv cache scales - similar to like we have for FP8 now https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_kv_cache

mgoin · 2024-12-21T16:03:05Z

csrc/attention/attention_kernels.cuh

+  float k_scale = 0;
+  float v_scale = 0;
+  if constexpr (KV_DTYPE == Fp8KVCacheDataType::kInt8Group128) {
+    int64_t tgt_kvs_idx = floor((kv_head_idx*HEAD_SIZE)/quant_group);


I think there is no need to keep quant_group as an argument since we have the KVCacheDataType as template parameter. We know in the kInt8Group128 that the quant_group will be 128, so I think we can remove this parameter completely.

mgoin · 2024-12-21T16:04:13Z

csrc/quantization/int8_kvcache/quant_utils.cuh

+  // printf("\n dequant scale= %f, zero_point= %f \n", scale, zero_point);
+  // if(abs(res+1.268555)<=0.01)
+  //   printf("\nI am here int8_to_float, x = %d, a= %d, res=%f, scale=%f, zero_point=%f \n",
+  //           x, a, res, scale, zero_point);


Leftover cruft

mgoin · 2024-12-21T16:04:18Z

csrc/quantization/int8_kvcache/quant_utils.cuh

+  // printf("\n quant scale= %f \n", scale);
+  // if(abs(x+1.268555)<=0.00001)
+  //   printf("\nI am here float_to_int8, x = %f, fx= %d, res=%d, scale=%f, zero_point=%f, (x-zero_point) / scale)=%f \n",
+  //           x, fx, res, scale, zero_point, (x-zero_point) / scale);


mgoin · 2024-12-21T16:04:56Z

csrc/quantization/int8_kvcache/quant_utils.cuh

+template <typename Tout, typename Tin>
+__inline__ __device__ Tout scaled_vec_conversion_int8(const Tin& x,
+                                                      const float scale, const float zero_point) {
+  return x;
+}


This does not seem right, what is the purpose of this definition?

mgoin · 2024-12-21T16:27:39Z

vllm/_custom_ops.py

+    k_scales: torch.Tensor,
+    v_scales: torch.Tensor,


nit: we tend to just use scale rather than scales even in the case of using tensors, see these kernels as example

vllm/vllm/model_executor/layers/quantization/utils/w8a8_utils.py

Lines 204 to 212 in c2d1b07

def apply_int8_linear(

input: torch.Tensor,

weight: torch.Tensor,

weight_scale: torch.Tensor,

input_scale: Optional[torch.Tensor] = None,

input_zero_point: Optional[torch.Tensor] = None,

azp_adj: Optional[torch.Tensor] = None,

bias: Optional[torch.Tensor] = None,

):

mgoin · 2024-12-21T16:30:19Z

vllm/attention/backends/flashinfer.py

-                    k_scale=k_scale,
-                    v_scale=v_scale,
+                    quant_group,
+                    k_scales,
+                    v_scales,


Please use named assignment of args here

mgoin · 2024-12-21T16:30:52Z

vllm/attention/backends/flashinfer.py

-                k_scale=k_scale,
-                v_scale=v_scale,
+                quant_group,
+                k_scales,
+                v_scales,


Please use named assignment of args here

mgoin · 2024-12-21T16:34:08Z

vllm/attention/layer.py

+        k_scales_lists = v_scales_lists = [1.0]
+        # k_scales_lists = [0.16]
+        # v_scales_lists = [0.005]
+        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
+        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
+        self._quant_group = cache_config.kv_quant_group
+        if cache_config.cache_dtype.startswith("int8"):
+            if cache_config.kv_quant_params_path is not None:
+                k_scales_lists = cache_config.kv_quant_params[0].pop(0)
+                v_scales_lists = cache_config.kv_quant_params[1].pop(0)
+                self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
+                self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
+                if self._quant_group !=0:
+                    self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))
+                    self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))


Suggested change

k_scales_lists = v_scales_lists = [1.0]

# k_scales_lists = [0.16]

# v_scales_lists = [0.005]

self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")

self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")

self._quant_group = cache_config.kv_quant_group

if cache_config.cache_dtype.startswith("int8"):

if cache_config.kv_quant_params_path is not None:

k_scales_lists = cache_config.kv_quant_params[0].pop(0)

v_scales_lists = cache_config.kv_quant_params[1].pop(0)

self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")

self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")

if self._quant_group !=0:

self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))

self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))

default_scale = [1.0]

self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")

self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")

self._quant_group = cache_config.kv_quant_group

if cache_config.cache_dtype.startswith("int8"):

if cache_config.kv_quant_params_path is not None:

k_scales_lists = cache_config.kv_quant_params[0].pop(0)

v_scales_lists = cache_config.kv_quant_params[1].pop(0)

self._k_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")

self._v_scales = torch.Tensor(default_scale).type(torch.float32).to("cuda")

if self._quant_group !=0:

self._k_scales = self._k_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))

self._v_scales = self._v_scales.reshape((-1, num_kv_heads, head_size//self._quant_group))

mgoin · 2024-12-21T16:35:37Z

vllm/attention/layer.py

+        # v_scales_lists = [0.005]
+        self._k_scales = torch.Tensor(k_scales_lists).type(torch.float32).to("cuda")
+        self._v_scales = torch.Tensor(v_scales_lists).type(torch.float32).to("cuda")
+        self._quant_group = cache_config.kv_quant_group


We can deduce kv_quant_group from the cache_config.cache_dtype, as mentioned in the kernels

mergify · 2025-01-28T03:21:30Z

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @YanyunDuanIEI.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

kahakuka · 2025-02-08T09:34:06Z

@YanyunDuanIEI May I ask, in rocm, when performing accuracy verification, the variable 'quantization_param_path' needs to specify a file path, is it the same as the variable 'kv_quant_params_path'? Or, can we specify the generated JSON files' kv_cache_scales_layer_level.json 'and'kv_cache_scales_quant_group128.json 'separately?

hmellor · 2025-03-11T17:29:12Z

@YanyunDuanIEI do you plan to complete this PR?

I ask as it has been almost 2 months since it was last updated. If not, I would like to close it as stale.

github-actions · 2025-06-10T02:15:22Z

This pull request has been automatically marked as stale because it has not had any activity within 90 days. It will be automatically closed if no further activity occurs within 30 days. Leave a comment if you feel this pull request should remain open. Thank you!

hmellor · 2025-07-24T12:09:07Z

Stale

int8 kv-cache support

b8e7779

Signed-off-by: Yanyun Duan <duanyanyun@inspur.com>

YanyunDuanIEI requested review from WoosukKwon, alexm-redhat, comaniac, njhill, youkaichao and zhuohan123 as code owners November 15, 2024 06:00

mergify bot added the needs-rebase label Nov 17, 2024

Merge branch 'main' into int8-kv-cache

6309f86

YanyunDuanIEI requested review from robertgshaw2-redhat and ywang96 as code owners November 18, 2024 01:59

mergify bot removed the needs-rebase label Nov 18, 2024

mgoin reviewed Nov 19, 2024

View reviewed changes

csrc/attention/attention_kernels.cuh Show resolved Hide resolved

csrc/attention/attention_kernels.cuh Outdated Show resolved Hide resolved

YanyunDuanIEI added 13 commits November 20, 2024 10:58

Update dtype_fp8.cuh

3b3d784

Update quant_utils.cuh

f09c559

Update quant_utils.cuh

03476e5

Update cache_kernels.cu

9a426c0

Update attention_kernels.cuh

2a67ed4

Update paged_attention_v1.cu

83b3d41

Update paged_attention_v2.cu

1056a5f

Update layer.py

e870565

Update selector.py

95effce

Update config.py

4e8ddb6

Update utils.py

cf98d49

Update model_runner.py

c5da7f5

merge the k_scale/v_scale and k_scaling_factor/v_scaling_factor into …

36afe99

…tensors

update to vllm-0.6.6

23592ee

YanyunDuanIEI requested review from DarkLight1337, LiuXiaoxuanPKU, simon-mo and tlrmchlsmth as code owners January 21, 2025 07:07

mergify bot added documentation Improvements or additions to documentation ci/build frontend labels Jan 21, 2025

Merge branch 'main' into int8-kv-cache

d1dffdb

mgoin reviewed Jan 28, 2025

View reviewed changes

mergify bot added the needs-rebase label Jan 28, 2025

mergify bot added the v1 label Feb 5, 2025

kahakuka mentioned this pull request Feb 18, 2025

[Feature]: kv cahce int8：Dynamic kv cache scaling factors computation #13478

Closed

1 task

github-actions bot added the stale Over 90 days of inactivity label Jun 10, 2025

mergify bot added the rocm Related to AMD ROCm label Jun 13, 2025

github-actions bot added unstale Recieved activity after being labelled stale and removed stale Over 90 days of inactivity labels Jun 14, 2025

hmellor closed this Jul 24, 2025

	def apply_int8_linear(
	input: torch.Tensor,
	weight: torch.Tensor,
	weight_scale: torch.Tensor,
	input_scale: Optional[torch.Tensor] = None,
	input_zero_point: Optional[torch.Tensor] = None,
	azp_adj: Optional[torch.Tensor] = None,
	bias: Optional[torch.Tensor] = None,
	):

Uh oh!

Conversation

YanyunDuanIEI commented Nov 15, 2024

Get the scaling factor by calibration

Using KV-Cache int8

Uh oh!

github-actions bot commented Nov 15, 2024

Uh oh!

mergify bot commented Nov 17, 2024

Uh oh!

Uh oh!

Uh oh!

YanyunDuanIEI commented Dec 11, 2024

Uh oh!

mergify bot commented Dec 17, 2024

Uh oh!

kahakuka commented Jan 20, 2025

Uh oh!

YanyunDuanIEI commented Jan 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

kahakuka commented Jan 20, 2025

Uh oh!

kahakuka commented Jan 20, 2025

Uh oh!

simon-mo commented Jan 21, 2025

Uh oh!

YanyunDuanIEI commented Jan 22, 2025

Uh oh!

kahakuka commented Jan 23, 2025

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

mergify bot commented Jan 28, 2025

Uh oh!

kahakuka commented Feb 8, 2025

Uh oh!

hmellor commented Mar 11, 2025

Uh oh!

github-actions bot commented Jun 10, 2025

Uh oh!

hmellor commented Jul 24, 2025

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

YanyunDuanIEI commented Jan 20, 2025 •

edited

Loading