It seems that the quantization scales of the three q, k, v linear layers are merged and shared, which may hurt the accuracy of the quantized model.
Will this be fixed in the future?
brisker changed the title from "Are the three linear layers--q,k,v , merged in TensorRT-LLM for llama model?" to "Are the three linear layers--q,k,v , merged in TensorRT-LLM for llama model? (potential bad influence on quantization accuracy)" on Jan 25, 2024
@byshiue
It seems that llama2 quantization (no matter whether smoothquant or kv-cache quantization) has some bugs that cause consistently bad accuracy, and whether the qkv scales are merged may not be the main reason (though it is still a contributing factor).
here:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/hf_llama_convert.py#L37
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/convert.py#L88 (the same output quantization scale is used for q, k, v)
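For illustration, here is a minimal sketch of why sharing one output scale across the fused qkv GEMM can hurt accuracy. It assumes symmetric absmax int8 quantization and that the merged scale is taken as the max over the per-tensor scales; the actual merging logic in the convert script may differ, so treat this only as a toy demonstration, not the library's behavior:

```python
import numpy as np

np.random.seed(0)

# Hypothetical tensors: q has a much larger dynamic range than k and v.
q = np.random.randn(4096).astype(np.float32) * 8.0
k = np.random.randn(4096).astype(np.float32) * 1.0
v = np.random.randn(4096).astype(np.float32) * 0.5

def absmax_scale(x):
    # Symmetric int8 scale from the absolute maximum.
    return np.abs(x).max() / 127.0

def fake_quant(x, scale):
    # Quantize to int8 with the given scale, then dequantize back to float.
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-tensor scales: each projection keeps its own scale.
for name, t in [("q", q), ("k", k), ("v", v)]:
    err = np.mean((t - fake_quant(t, absmax_scale(t))) ** 2)
    print(f"{name}: per-tensor scale MSE = {err:.3e}")

# Shared (merged) scale for the fused qkv output, assumed to be the max of the three.
shared = max(absmax_scale(q), absmax_scale(k), absmax_scale(v))
for name, t in [("q", q), ("k", k), ("v", v)]:
    err = np.mean((t - fake_quant(t, shared)) ** 2)
    print(f"{name}: shared scale MSE = {err:.3e}")
```

In this toy setup, k and v are quantized with q's much larger scale, so their quantization error grows noticeably compared to using their own per-tensor scales.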