Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Create INT8 KV Cache on Qserve #2446

Open
wants to merge 2 commits into
base: main
Choose a base branch
from

Conversation

dleunji
Copy link

@dleunji dleunji commented Nov 14, 2024

Hi,

Thanks for your contributions and updates of Qserve.

I added an INT8KV feature as well.

Previously, the scale factor was calculated using the maximum value among the outputs of q_proj, k_proj, and v_proj. (code)

However, I found that it is not working in Qserve.

It only works well in Qserve when the scale factor is calculated based solely on the outputs of k_proj and v_proj.
This is different from the INT8 KV Cache in Qserve paper, which uses a dynamic cache. However, this int8 kv cache is a sufficient alternative for Qserve with high accuracy.

[Reference]
Qserve retrieves the scale of the kv cache separately for k and v, treating each with its own scale. (code)
However, TensorRT-LLM merges the k and v scales into a single kv_cache_scaling_factor derived from the outputs of qkv_proj. This setup made it difficult to use the kv cache scaling style of Qserve in TensorRT-LLM. However, I modified the approach to obtain the kv cache scale without considering q_proj, making it more similar to Qserve.
And I got much higher quality of outputs.

@lkm2835
Copy link
Contributor

lkm2835 commented Nov 14, 2024

This is related to #2444

@hello-11 hello-11 added the triaged Issue has been triaged by maintainers label Nov 18, 2024
@bobboli
Copy link
Collaborator

bobboli commented Nov 18, 2024

Hi,
Thank you for your contribution! The current checkpoint conversion is implemented in the legacy path, whereas we plan to migrate to the unified converter in the future. After that we can handle the combination of KV cache quantization with w4a8 in a more unified way.

Since you modified load_weights_from_lmquant heavily which will be deprecated, we will not proceed with this PR. But we will refer to your observation of not using q_proj for calibration.

Thank you!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
triaged Issue has been triaged by maintainers
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants