Hi,
Thanks for your contributions and for the Qserve updates.
I added an INT8 KV cache feature as well.
Previously, the scale factor was calculated using the maximum value among the outputs of `q_proj`, `k_proj`, and `v_proj`. (code) However, I found that this does not work in Qserve.
It only works well in Qserve when the scale factor is calculated based solely on the outputs of `k_proj` and `v_proj`.
This differs from the INT8 KV cache in the Qserve paper, which uses a dynamic cache. However, this INT8 KV cache is a sufficient alternative for Qserve and gives high accuracy.
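To illustrate why leaving out `q_proj` matters, here is a minimal sketch in plain Python. It assumes a symmetric per-tensor INT8 scale of `amax / 127`; the helper names (`kv_cache_scale`, `quantize_int8`, `max_abs_error`) and the sample values are hypothetical and are not taken from the Qserve or TensorRT-LLM code.

```python
INT8_MAX = 127.0  # assumed symmetric INT8 range [-127, 127]


def abs_max(values):
    """Largest absolute value in a flat list of activations."""
    return max(abs(v) for v in values)


def kv_cache_scale(k_out, v_out, q_out=None):
    """Hypothetical helper: per-tensor scale = amax / 127.

    If q_out is given, its range is included as well (the previous behaviour
    described above); otherwise only k_proj and v_proj outputs are used.
    """
    amax = max(abs_max(k_out), abs_max(v_out))
    if q_out is not None:
        amax = max(amax, abs_max(q_out))
    return amax / INT8_MAX


def quantize_int8(values, scale):
    """Round to the nearest integer and clamp to the INT8 range."""
    return [max(-127, min(127, round(v / scale))) for v in values]


def max_abs_error(values, scale):
    """Worst-case error after an INT8 quantize/dequantize round trip."""
    quantized = quantize_int8(values, scale)
    return max(abs(v - q * scale) for v, q in zip(values, quantized))


# Made-up activations: q_proj has a large outlier that k_proj/v_proj lack.
q_out = [50.8, -3.0, 1.0]
k_out = [2.54, -1.0, 0.3]
v_out = [1.27, 0.5, -0.2]

scale_qkv = kv_cache_scale(k_out, v_out, q_out)  # dominated by the q outlier
scale_kv = kv_cache_scale(k_out, v_out)          # k and v only

err_qkv = max_abs_error(k_out, scale_qkv)
err_kv = max_abs_error(k_out, scale_kv)
print(f"scale with q: {scale_qkv:.4f}, k round-trip error: {err_qkv:.4f}")
print(f"scale k/v only: {scale_kv:.4f}, k round-trip error: {err_kv:.4f}")
```

In this toy case the `q_proj` outlier inflates the scale, so the cached keys lose most of their INT8 resolution. Only k and v are stored in the KV cache, so their range is the one that matters for the scale.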
[Reference]
Qserve retrieves the scale of the kv cache separately for k and v, treating each with its own scale. (code)
However, TensorRT-LLM merges the k and v scales into a single `kv_cache_scaling_factor` derived from the outputs of `qkv_proj`. This setup made it difficult to use Qserve's kv cache scaling style in TensorRT-LLM. Instead, I modified the approach to obtain the kv cache scale without considering `q_proj`, making it more similar to Qserve. With this change, I got much higher output quality.
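The difference between the two scaling styles can be sketched as follows. This is only an illustration under the assumption of symmetric `amax / 127` scales; `separate_scales`, `merged_scale`, and the dictionary keys are hypothetical names, not the actual Qserve or TensorRT-LLM API.

```python
INT8_MAX = 127.0  # assumed symmetric INT8 range


def amax(values):
    """Largest absolute value in a flat list of activations."""
    return max(abs(x) for x in values)


def separate_scales(k_out, v_out):
    """Qserve-style (as described above): k and v each get their own scale."""
    return {
        "k_scale": amax(k_out) / INT8_MAX,
        "v_scale": amax(v_out) / INT8_MAX,
    }


def merged_scale(k_out, v_out):
    """Single merged factor computed from k and v only, without q_proj."""
    return max(amax(k_out), amax(v_out)) / INT8_MAX


# Made-up activations for illustration only.
k_out = [2.54, -1.0, 0.3]
v_out = [1.27, 0.5, -0.2]

per_tensor = separate_scales(k_out, v_out)
single = merged_scale(k_out, v_out)
print(per_tensor, single)
```

Under these assumptions the merged factor equals the larger of the two separate scales, so the tensor with the smaller range is quantized slightly more coarsely than it would be with its own scale.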