Add support for Mistral fp8 #935
Closed
jiminha wants to merge 100 commits into
Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>
Co-authored-by: Sayantan Sarkar <supersarkar@gmail.com>
* Expose Llama Fused OPs control from run_lora_clm.py
* Update as per review comments
* Enable internal KV bucket in llama
* Initialize bucket_internal for CI
* Make bucket_internal clearer
* Further perf optimization when max length is not a multiple of bucket size
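The bucket idea in the commits above can be illustrated with a minimal sketch. It assumes that a bucketed KV cache grows in multiples of the bucket size and is clipped at the maximum length when that length is not itself a multiple; the helper name `bucketed_len` is hypothetical and the clipping rule is an assumption, not the code from this PR.

```python
def bucketed_len(cur_len, bucket_size, max_len):
    """Round cur_len up to the next multiple of bucket_size, capped at max_len.

    Hypothetical helper for illustration only; the real bucket_internal logic
    lives in the model code and may differ.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    # Ceiling division without floats, then scale back up to a full bucket.
    rounded = -(-cur_len // bucket_size) * bucket_size
    # Assumed handling of a max length that is not a multiple of bucket_size.
    return min(rounded, max_len)


if __name__ == "__main__":
    for length in (128, 129, 4000):
        print(length, "->", bucketed_len(length, bucket_size=128, max_len=4000))
```

Rounding to a small set of lengths keeps the number of distinct tensor shapes low, which is presumably why the long-sequence runs in the PR description pass --bucket_internal together with --bucket_size 128.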
* [SW-173358] add first token prints
* [SW-173358] rename x to outputs
* [SW-173358] make style
* Enable Flash Attention in recompute and causal modes
* Add flash_attention_causal_mask to generation utils
* Propagate Flash Attention causal_mask to finetuning example
* Modify README example and provide additional description
* Add flash_attention_causal_mask to FT README
* Enable loading falcon-180b ckpt in .safetensors format
* Address comments, borrowing transformers' way of reading ckpt files
* Address comments
Co-authored-by: Sun Choi <schoi@habana.ai>
* Enable loading falcon-180b ckpt in .safetensors format
* Address comments, borrowing transformers' way of reading ckpt files
* Address comments
* Update ckpt loading: PR#15 reads the set of ckpt file names from the index JSON file. When OH downloads files from the hub instead of loading from a cache dir, get_repo_root() skips downloading the index JSON file, so PR#15 fails to load the file names. This PR instead scans the path and returns the list of names that match the pattern.
* Import modeling_utils from transformers
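As a rough illustration of the "scan the path and return matching names" approach described above, here is a minimal sketch. The function name `find_checkpoint_files` and the file-name pattern are assumptions made for this example; they are not taken from the PR.

```python
import re
from pathlib import Path

# Assumed pattern: a single-file checkpoint ("model.safetensors") or a shard
# such as "model-00001-of-00081.safetensors". The real pattern may differ.
SHARD_PATTERN = re.compile(r"^model(-\d{5}-of-\d{5})?\.safetensors$")


def find_checkpoint_files(checkpoint_dir):
    """Return the sorted paths of files in checkpoint_dir matching SHARD_PATTERN.

    Scanning the directory avoids relying on the index JSON file, which may be
    missing when the files were downloaded from the hub.
    """
    root = Path(checkpoint_dir)
    return sorted(
        str(p) for p in root.iterdir() if p.is_file() and SHARD_PATTERN.match(p.name)
    )
```

The index file itself ("model.safetensors.index.json") and unrelated files such as config.json are skipped because the pattern is anchored at both ends.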
A new backend, HPU_BACKEND, has been introduced in pytorch-integration. The deprecated backend (AOT_HPU_TRAINING_BACKEND) will no longer be available in optimum habana because it is being removed from pytorch-integration.
* Use KV cache till input seq len for prefill phase. Pad KV cache to full input + new tokens len for decode phase. Delete the KV cache used as inputs by HPU graphs after full prompt generation. Ensure the KV cache is not returned as an output tensor during decode phase. Deletion of the KV cache input tensor used by HPU graphs needs to be protected by the PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable. All the changes are protected by the bucket_internal flag.
  Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
* Revert initialization of KV cache
* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag
* Remove os import
* Remove commented print
---------
Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
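The prefill and decode cache lengths described above can be sketched as follows. This is a plain-Python illustration using lists in place of HPU tensors; `kv_cache_len` and `pad_kv_cache` are hypothetical names, not functions from optimum-habana.

```python
def kv_cache_len(phase, input_len, max_new_tokens):
    """Return the KV cache length used in each phase, per the commit text.

    prefill: cache covers only the input sequence.
    decode:  cache is padded to input length plus new tokens.
    """
    if phase == "prefill":
        return input_len
    if phase == "decode":
        return input_len + max_new_tokens
    raise ValueError(f"unknown phase: {phase!r}")


def pad_kv_cache(cache, target_len, pad_value=0.0):
    """Right-pad a per-position cache (modeled as a list) to target_len."""
    if len(cache) > target_len:
        raise ValueError("cache is already longer than target_len")
    return cache + [pad_value] * (target_len - len(cache))


if __name__ == "__main__":
    prefill_cache = [1.0] * kv_cache_len("prefill", 4, 3)
    decode_cache = pad_kv_cache(prefill_cache, kv_cache_len("decode", 4, 3))
    print(len(prefill_cache), len(decode_cache))
```

Keeping the prefill cache at the input length avoids attending over padding that cannot hold any tokens yet; the padding is only added once decoding starts.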
Co-authored-by: Soila Kavulya <skavulya@gmail.com>
* Sampling search: use KV cache till input seq len for prefill phase
* Remove redundant line
* Mark only scales as const
* Remove --fp8 flag usage from llama
* Remove usage of ENABLE_CONST_MARKING
Change-Id: I6dba8691d842fc62d09da5202ea1e61a111f5f18
---------
Co-authored-by: Eran Geva <egeva@habana.ai>
- FlanT5 shows a perf drop when "reduce_scatter" is set to True in the DS config
Signed-off-by: vineethanandh <vineethanandh@habana.ai>
* bf16 with disk offload protection
* comment fix
* Add ROUGE metric evaluation for llama 70B with orca datasets: use the ROUGE metric to evaluate the correctness of the model; it uses the OpenOrca dataset
What does this PR do?
Add support for Mistral fp8 text-generation.
Porting from: #918
Measurement
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size
Run
128x128xbs896
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
2048x128xbs120
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graph
Throughput (including tokenization) = 1368.34859681789 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.3 GB
Total memory available = 94.62 GB
Graph compilation duration = 71.46055008197436 seconds
2048x2048xbs44
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs
Throughput (including tokenization) = 3151.661188904903 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.44 GB
Total memory available = 94.62 GB
Graph compilation duration = 285.3293136919965 seconds
128x2048xbs120
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs
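The reported run outputs all follow a "name = value unit" shape, so they can be collected for comparison with a small parser. The sketch below is illustrative only: `parse_metrics` is a hypothetical helper, and it assumes the output lines look exactly like the ones quoted in this description.

```python
import re

# Matches lines such as "Memory allocated = 38.35 GB".
# Command lines are skipped because their first "=" is not followed by a number.
LINE_RE = re.compile(
    r"^(?P<key>[^=]+?)\s*=\s*(?P<value>[-+]?\d+(?:\.\d+)?)\s*(?P<unit>.*)$"
)


def parse_metrics(text):
    """Map each metric name to a (value, unit) tuple."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            metrics[match.group("key")] = (
                float(match.group("value")),
                match.group("unit").strip(),
            )
    return metrics


# Output of the 128x128xbs896 run, copied from this description.
SAMPLE = """\
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
"""

if __name__ == "__main__":
    for name, (value, unit) in parse_metrics(SAMPLE).items():
        print(f"{name}: {value} {unit}".rstrip())
```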