KV cache quantization back of the envelope calculations #539

msaroufim · 2024-07-25T05:57:35Z

I recently got interested in how to run Llama3 inference with large context lengths like 128K. For context, Llama2 had a max sequence length of 4096. One solution that always works is to go distributed with techniques like Ring Attention, where you split the sequence over multiple devices, but I'm instead interested in how to run large context windows on a single GPU.

For larger sequence lengths the primary VRAM bottleneck is not the model parameters but the KV cache, whose size has an analytical formula (the leading 2 accounts for storing both keys and values):

2 * layers * attention heads * head_dim * byte_per_element * batch_size * sequence_length

The model parameters have a simpler formula:

number_of_param * byte_per_element
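As a rough sketch, the two formulas can be written directly as Python functions. The model shape used in the example (32 layers, 32 attention heads, head_dim 128, 8B parameters) is my assumption for a Llama 8B style model and is not stated in the post; the function names are also just illustrative. GB here means 10^9 bytes.

```python
def kv_cache_bytes(layers, heads, head_dim, bytes_per_element, batch_size, seq_len):
    # The leading 2 accounts for storing both keys and values.
    return 2 * layers * heads * head_dim * bytes_per_element * batch_size * seq_len


def param_bytes(num_params, bytes_per_element):
    return num_params * bytes_per_element


GB = 10**9

# Assumed Llama 8B style shape (hypothetical values, not from the post).
kv = kv_cache_bytes(layers=32, heads=32, head_dim=128,
                    bytes_per_element=2, batch_size=1, seq_len=128 * 1024)
params = param_bytes(num_params=8 * 10**9, bytes_per_element=2)

print(f"KV cache: {kv / GB:.1f} GB, params: {params / GB:.1f} GB, "
      f"total: {(kv + params) / GB:.1f} GB")
# prints: KV cache: 68.7 GB, params: 16.0 GB, total: 84.7 GB
```

The `heads` argument follows the formula as written, using the number of attention heads. For a model whose K and V tensors use fewer heads than the queries (as in MQA), passing that smaller count instead shrinks the result proportionally.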

Below I plotted the model params + KV cache size as the sequence length increases.

A few things jump out

We can run a context length of 128K with an out-of-the-box GPT-fast implementation at fp16, but it needs an 80GB+ GPU
Int8 KV quantization is likely the most important problem we can prioritize, because it should allow us to support Llama8B inference even on a consumer GPU like a 3090 or 4090 with 24GB of VRAM
Keep in mind that the curve starts bending at a context length of around 16K; below that, KV cache size is not a huge worry
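Since the plot itself is not reproduced here, the following sketch tabulates the same quantity (params + KV cache) over a few sequence lengths. It assumes a Llama 8B style shape (32 layers, 32 heads, head_dim 128, 8B parameters, all my assumptions) and applies the same byte_per_element to both the parameters and the KV cache, which may not match how the original plot was generated.

```python
GB = 10**9

# Assumed Llama 8B style shape (hypothetical, not stated in the post).
LAYERS, HEADS, HEAD_DIM = 32, 32, 128
NUM_PARAMS = 8 * 10**9


def total_bytes(seq_len, bytes_per_element, batch_size=1):
    # Model parameters plus KV cache, both stored at the same precision.
    params = NUM_PARAMS * bytes_per_element
    kv = 2 * LAYERS * HEADS * HEAD_DIM * bytes_per_element * batch_size * seq_len
    return params + kv


def kv_fraction(seq_len, bytes_per_element, batch_size=1):
    # Share of the total footprint taken by the KV cache.
    params = NUM_PARAMS * bytes_per_element
    return 1 - params / total_bytes(seq_len, bytes_per_element, batch_size)


for seq_len in (4096, 16384, 32768, 65536, 131072):
    fp16 = total_bytes(seq_len, 2)
    int8 = total_bytes(seq_len, 1)
    print(f"{seq_len:>7} tokens  fp16 {fp16 / GB:6.1f} GB  int8 {int8 / GB:6.1f} GB  "
          f"KV share {kv_fraction(seq_len, 2):.0%}  "
          f"int8 fits 24GB: {int8 <= 24 * GB}")
```

Under these assumptions the KV cache is a small share of the total at 4K, roughly a third at 16K, and overtakes the parameters somewhere between 16K and 32K.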
The second plot is the exact same calculation for Llama70B

At this size

Fp16 is out of the question
int8 is feasible on an 80GB GPU at smaller sequence lengths
int4 is feasible on a 40GB GPU at smaller sequence lengths
int2 is feasible on a 24GB GPU at smaller sequence lengths
Running a 128K sequence length on a single GPU is only feasible with int2 and an 80GB GPU
With int8 and below we can do single-node inference: a node has up to 8 40GB GPUs, i.e. 320GB of VRAM, which is more than the roughly 250GB required, so again int8 is a nice sweet spot for KV cache quantization
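The feasibility claims above can be sanity-checked by solving the formula for the largest sequence length that fits a given VRAM budget. This is only a sketch: the Llama 70B style shape (80 layers, 64 heads, head_dim 128, 70B parameters) is my assumption, the same bit width is applied to both parameters and KV cache, and activations and other overheads are ignored.

```python
GB = 10**9

# Assumed Llama 70B style shape (hypothetical, not stated in the post).
LAYERS, HEADS, HEAD_DIM = 80, 64, 128
NUM_PARAMS = 70 * 10**9
BITS = {"fp16": 16, "int8": 8, "int4": 4, "int2": 2}


def max_seq_len(vram_gb, bits, batch_size=1):
    # Bytes left for the KV cache after loading the parameters.
    free = vram_gb * GB - NUM_PARAMS * bits // 8
    # KV cache bytes needed per token position (keys and values).
    per_token = 2 * LAYERS * HEADS * HEAD_DIM * bits * batch_size // 8
    return max(0, free // per_token)


for name, bits in BITS.items():
    row = "  ".join(f"{vram}GB: {max_seq_len(vram, bits):>6}"
                    for vram in (24, 40, 80, 320))
    print(f"{name:>4}  {row}")
```

A result of 0 means the parameters alone do not fit. The 320GB column corresponds to a single node with 8 40GB GPUs and ignores any cost of sharding across them.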

A few important caveats

These are analytical results, so we will need to run real experiments to see whether they hold up. For now the results above ignore accidental large intermediates
NVIDIA is unlikely to increase VRAM on consumer GPUs. NVIDIA is also disabling peer-to-peer access on its consumer GPUs, so it's unlikely consumer GPUs will be a thriving market for distributed inference
Int4/int2 quantization will likely come with massive perplexity loss unless the quantization is done in a clever way, similar to KVQuant, or leverages QAT. Int8 is unlikely to require as much cleverness
There are other techniques to reduce KV cache size, such as MQA or cross-layer KV cache sharing, but we can't (AFAIK) flexibly change attention patterns from training to inference and expect things to work, so some co-design with the training teams will be beneficial long term
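To make the "int8 is unlikely to require as much cleverness" point concrete, here is a minimal sketch of the simplest scheme one might try: symmetric absmax round-to-nearest quantization of a single key or value vector. This is my own illustration in plain Python, not the method used by KVQuant or by any particular library, and the input values are made up.

```python
def quantize_int8(values):
    # Symmetric absmax quantization: map [-absmax, absmax] onto [-127, 127].
    absmax = max(abs(v) for v in values)
    scale = absmax / 127 if absmax > 0 else 1.0
    q = [max(-127, min(127, round(v / scale))) for v in values]
    return q, scale


def dequantize_int8(q, scale):
    return [x * scale for x in q]


# Made-up values standing in for one head's key (or value) vector.
head = [0.12, -1.5, 0.73, 2.0, -0.04, 0.0, 1.1, -0.9]

q, scale = quantize_int8(head)
restored = dequantize_int8(q, scale)
max_err = max(abs(a - b) for a, b in zip(head, restored))

print("int8 codes:", q)
print(f"scale: {scale:.5f}, max abs error: {max_err:.5f}")
```

With round-to-nearest the per-element error is bounded by half the scale, so a single large outlier in the vector inflates the error for every other element. That sensitivity gets much worse with only 16 (int4) or 4 (int2) levels, which is one reason the lower bit widths likely need more careful schemes.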

msaroufim closed this as completed Oct 25, 2024

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024

add dtype tests for runner-aoti + runner-et (pytorch#539)

5e266fb

* add dtype tests for runner-aoti + runner-et * typo

yanbing-j pushed a commit to yanbing-j/ao that referenced this issue Dec 9, 2024

Revert "add dtype tests for runner-aoti + runner-et (pytorch#539)" (p…

afe6b0f

…ytorch#548) This reverts commit a7a24577a65be67ac9ae4dc05452f35d9c49e5d1.
