Add support for Mistral fp8#935

Closed
jiminha wants to merge 100 commits into
huggingface:mainfrom
HabanaAI:ae_mistral_fp8_new

Contributor

@jiminha jiminha commented Apr 30, 2024

What does this PR do?

Add support for Mistral fp8 text-generation.
Porting from: #918

Measurement

QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size

Run

128x128xbs896

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
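
To compare runs, the summary lines printed above can be parsed into numbers. The following is a minimal sketch and not part of this PR: the helper name `parse_run_summary` is hypothetical, and it assumes every summary line has the `name = value unit` shape shown above.

```python
import re

# Hypothetical helper (not part of run_generation.py): turns the
# "name = value unit" summary lines shown above into a dict of floats.
SUMMARY_LINE = re.compile(r"^(?P<name>[^=]+?)\s*=\s*(?P<value>[-+]?\d+(?:\.\d+)?)")


def parse_run_summary(log_text):
    """Return {metric name: numeric value} for every 'name = value' line."""
    metrics = {}
    for line in log_text.splitlines():
        match = SUMMARY_LINE.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


# Summary copied from the 128x128xbs896 run above.
log = """\
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
"""

metrics = parse_run_summary(log)
# Headroom between the peak allocation and the device total, in GB.
headroom_gb = metrics["Total memory available"] - metrics["Max memory allocated"]
print(f"headroom: {headroom_gb:.2f} GB")
```

The headroom figure is a quick way to see how close a batch size is to the memory limit.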

2048x128xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1368.34859681789 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.3 GB
Total memory available = 94.62 GB
Graph compilation duration = 71.46055008197436 seconds

2048x2048xbs44

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3151.661188904903 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.44 GB
Total memory available = 94.62 GB
Graph compilation duration = 285.3293136919965 seconds

128x2048xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

regisss and others added 30 commits January 26, 2024 08:51
Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>
Co-authored-by: Sayantan Sarkar <supersarkar@gmail.com>
* Expose Llama Fused OPs control from run_lora_clm.py

* Update as per review comments
* enable internal kv bucket in llama

* initialize bucket_internal for CI

* make bucket_internal more clear

* further perf optim while max length is not multiple of bucket size
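
The bucketing idea behind these commits can be illustrated with a small sketch. It shows the general technique only (padding a sequence length up to the next multiple of a bucket size so that fewer distinct shapes are needed) and is not the code from these commits. The names `round_up_to_bucket` and `bucket_boundaries` are hypothetical, and capping the last bucket at the maximum length is an assumed way of handling a maximum length that is not a multiple of the bucket size.

```python
def round_up_to_bucket(length, bucket_size):
    """Smallest multiple of bucket_size that is >= length (hypothetical helper)."""
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    # Ceiling division without floats, then scale back up.
    return -(-length // bucket_size) * bucket_size


def bucket_boundaries(max_length, bucket_size):
    """Bucket lengths needed to cover 1..max_length.

    Assumption for illustration: the last bucket is capped at max_length,
    which covers the case where max_length is not a multiple of bucket_size.
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    bounds = list(range(bucket_size, max_length, bucket_size))
    bounds.append(max_length)
    return bounds


rounded = round_up_to_bucket(130, 128)
bounds = bucket_boundaries(300, 128)
print(rounded, bounds)
```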
* [SW-173358] add first token prints

* [SW-173358] rename x to outputs

* [SW-173358] make style
* Enable Flash Attention in recompute and causal modes

* Add flash_attention_causal_mask to generation utils

* Propagate Flash Attention causal_mask to finetuning example

* Modify README example and provide additional description

* Add flash_attention_causal_mask to FT README
* enable loading falcon-180b ckpt in .safetensors format

* Address comments borrowing transformer's way of reading ckpt file

* address comments
Co-authored-by: Sun Choi <schoi@habana.ai>
* enable loading falcon-180b ckpt in .safetensors format

* Address comments borrowing transformer's way of reading ckpt file

* address comments

* Update ckpt loading

PR#15 reads the set of ckpt file names from the index json file.
When OH downloads files from the hub instead of loading from a cache dir, get_repo_root()
skips downloading the index json file, so PR#15 fails to load the file names.
This PR scans the path and returns a list of names that match the pattern

* import modeling_utils from transformers
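
Scanning a directory for checkpoint files, as the commit message above describes, can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `list_checkpoint_files` is a hypothetical name, and the `*.safetensors` default pattern and the file names in the demo are made up for the example.

```python
import fnmatch
import os
import tempfile


def list_checkpoint_files(checkpoint_dir, pattern="*.safetensors"):
    """Return sorted paths of regular files in checkpoint_dir matching pattern.

    Hypothetical helper: the default pattern is an assumption for
    illustration, not necessarily the one used in the PR.
    """
    names = [
        name
        for name in os.listdir(checkpoint_dir)
        if fnmatch.fnmatch(name, pattern)
        and os.path.isfile(os.path.join(checkpoint_dir, name))
    ]
    return [os.path.join(checkpoint_dir, name) for name in sorted(names)]


# Demo on a temporary directory with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for name in (
        "model-00002-of-00002.safetensors",
        "model-00001-of-00002.safetensors",
        "config.json",
    ):
        open(os.path.join(tmp, name), "w").close()
    found = [os.path.basename(path) for path in list_checkpoint_files(tmp)]

print(found)
```

Listing the directory does not depend on the index json file being present, which is the failure mode the commit message describes.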
dudilester and others added 18 commits April 3, 2024 18:46
A new backend, HPU_BACKEND, has been introduced to pytorch-integration. The deprecated backend (AOT_HPU_TRAINING_BACKEND) will no longer be available in optimum-habana, as it is going to be removed from pytorch-integration.
* Use KV cache till input seq len for prefill phase.

Pad KV cache to full input + new tokens len for decode phase.
Delete the KV cache used as inputs by HPU graphs after full prompt generation.
Ensure KV cache is not returned as output tensor during decode phase.
Deletion of KV cache input tensor used by HPU graphs needs to be protected by
PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable.
All the changes are protected by bucket internal flag.

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

* Revert initialization of KV cache

* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag

* remove os import

* remove commented print

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
Co-authored-by: Soila Kavulya <skavulya@gmail.com>
* Sampling search: use KV cache till input seq len for prefill phase

* Remove redundant line
* Mark only scales as const

* remove --fp8 flag usage from llama

* removed usage of ENABLE_CONST_MARKING

Change-Id: I6dba8691d842fc62d09da5202ea1e61a111f5f18

---------

Co-authored-by: Eran Geva <egeva@habana.ai>
- FlanT5 is giving a perf drop with "reduce_scatter"
value True in DS config

Signed-off-by: vineethanandh <vineethanandh@habana.ai>
* bf16 with disk offload protection

* comment fix
* Add rouge metric evaluation for llama 70B with orca datasets

use rouge metric to evaluate the correctness of the model, it uses
openorca dataset

* Add rouge metric evaluation for llama 70B with orca datasets

use rouge metric to evaluate the correctness of the model, it uses
openorca dataset

* Add rouge metric evaluation for llama 70B with orca datasets

use rouge metric to evaluate the correctness of the model, it uses
openorca dataset
@jiminha jiminha requested a review from a user April 30, 2024 20:23
@jiminha jiminha requested a review from regisss as a code owner April 30, 2024 20:23
@jiminha jiminha closed this Apr 30, 2024
@jiminha jiminha deleted the ae_mistral_fp8_new branch April 30, 2024 20:25
@jiminha jiminha restored the ae_mistral_fp8_new branch April 30, 2024 20:38