It seems that the quantization scales of the three q, k, v linear layers are merged and shared, which may hurt the accuracy of the quantized model.
Will this be fixed in the future?
brisker changed the title from "Are the three linear layers--q,k,v , merged in TensorRT-LLM for llama model?" to "Are the three linear layers--q,k,v , merged in TensorRT-LLM for llama model? (potential bad influence on quantization accuracy)" on Jan 25, 2024
@byshiue
It seems that llama2 quantization (no matter whether smoothquant or kv-cache quantization) has some bugs that cause consistently bad accuracy, and whether the qkv scales are merged may not be the main reason (though it is still a contributing factor).
here:
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/hf_llama_convert.py#L37
https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/llama/convert.py#L88 (the same output quantization scale is used for q, k, v)
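For illustration, here is a minimal sketch of why sharing one output scale across the fused qkv GEMM can hurt accuracy. It assumes symmetric absmax int8 quantization and that the merged scale is taken as the max over the per-tensor scales; the actual merging logic in the convert script may differ, so treat this only as a toy demonstration, not the library's behavior:

```python
import numpy as np

np.random.seed(0)

# Hypothetical tensors: q has a much larger dynamic range than k and v.
q = np.random.randn(4096).astype(np.float32) * 8.0
k = np.random.randn(4096).astype(np.float32) * 1.0
v = np.random.randn(4096).astype(np.float32) * 0.5

def absmax_scale(x):
    # Symmetric int8 scale from the absolute maximum.
    return np.abs(x).max() / 127.0

def fake_quant(x, scale):
    # Quantize to int8 with the given scale, then dequantize back to float.
    return np.clip(np.round(x / scale), -127, 127) * scale

# Per-tensor scales: each projection keeps its own scale.
for name, t in [("q", q), ("k", k), ("v", v)]:
    err = np.mean((t - fake_quant(t, absmax_scale(t))) ** 2)
    print(f"{name}: per-tensor scale MSE = {err:.3e}")

# Shared (merged) scale for the fused qkv output, assumed to be the max of the three.
shared = max(absmax_scale(q), absmax_scale(k), absmax_scale(v))
for name, t in [("q", q), ("k", k), ("v", v)]:
    err = np.mean((t - fake_quant(t, shared)) ** 2)
    print(f"{name}: shared scale MSE = {err:.3e}")
```

In this toy setup, k and v are quantized with q's much larger scale, so their quantization error grows noticeably compared to using their own per-tensor scales.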