Add support for Mistral fp8 by jiminha · Pull Request #935 · huggingface/optimum-habana

jiminha · 2024-04-30T20:23:10Z

What does this PR do?

Add support for Mistral fp8 text-generation.
Porting from: #918

Measurement

QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size
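The measurement pass and the quantized runs point `QUANT_CONFIG` at different JSON files: `maxabs_measure_include_outputs.json` for measurement and `maxabs_quant.json` for the runs below. A minimal Python sketch of that selection follows. The helper name `quant_env` and the `PHASE_CONFIGS` mapping are hypothetical and not part of `run_generation.py`.

```python
import os

# Config paths exactly as they appear in the commands in this PR description.
PHASE_CONFIGS = {
    "measure": "./quantization_config/maxabs_measure_include_outputs.json",
    "quant": "./quantization_config/maxabs_quant.json",
}


def quant_env(phase, base_env=None):
    """Return a copy of the environment with QUANT_CONFIG set for the phase.

    Hypothetical helper for illustration only.
    """
    if phase not in PHASE_CONFIGS:
        raise ValueError(f"unknown phase: {phase!r}")
    # Copy so the caller's (or the process's) environment is not mutated.
    env = dict(os.environ if base_env is None else base_env)
    env["QUANT_CONFIG"] = PHASE_CONFIGS[phase]
    return env
```

The returned mapping can be passed as the `env` argument of `subprocess.run` when launching `run_generation.py`, once for the measurement pass and once for each quantized run.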

Run

128x128xbs896

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
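To compare runs, it can help to turn a report like the one above into numbers. The sketch below parses the `name = value unit` lines and derives two figures that the report does not print directly: the gap between total and peak memory, and throughput divided by the batch size of 896. The function name `parse_report` is hypothetical, and the regular expression assumes the line format shown here.

```python
import re

# Report lines copied from the 128x128xbs896 run above.
REPORT = """\
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
"""

# "<name> = <number> <optional unit>"; only the name and number are captured.
LINE = re.compile(r"^(?P<name>.+?) = (?P<value>[0-9.]+)")


def parse_report(text):
    """Map each 'name = value unit' line to a float keyed by its name."""
    metrics = {}
    for line in text.splitlines():
        match = LINE.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


metrics = parse_report(REPORT)
# Memory left between the peak allocation and the device total.
headroom_gb = metrics["Total memory available"] - metrics["Max memory allocated"]
# Aggregate throughput divided by the batch size used in this run.
per_sequence = metrics["Throughput (including tokenization)"] / 896
print(f"headroom: {headroom_gb:.2f} GB, per-sequence: {per_sequence:.2f} tokens/s")
```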

2048x128xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1368.34859681789 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.3 GB
Total memory available = 94.62 GB
Graph compilation duration = 71.46055008197436 seconds

2048x2048xbs44

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3151.661188904903 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.44 GB
Total memory available = 94.62 GB
Graph compilation duration = 285.3293136919965 seconds

128x2048xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs
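Judging from the flags in each command, the run labels appear to follow the pattern `<max_input_tokens>x<max_new_tokens>xbs<batch_size>`. For example, `2048x128xbs120` pairs with `--max_input_tokens 2048 --max_new_tokens 128 --batch_size 120`. The sketch below decodes a label into those three flags. `label_to_flags` is a hypothetical helper, and the label convention is inferred from this description rather than documented by the script.

```python
import re

# Inferred convention: <max_input_tokens>x<max_new_tokens>xbs<batch_size>
LABEL = re.compile(r"^(\d+)x(\d+)xbs(\d+)$")


def label_to_flags(label):
    """Translate a run label into the matching run_generation.py flags."""
    match = LABEL.match(label)
    if match is None:
        raise ValueError(f"unrecognized run label: {label!r}")
    max_input, max_new, batch = match.groups()
    return [
        "--batch_size", batch,
        "--max_new_tokens", max_new,
        "--max_input_tokens", max_input,
    ]
```

The helper covers only the three size flags. The two runs above that generate 2048 new tokens also pass `--bucket_internal --bucket_size 128`, which would still have to be added by hand.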

Co-authored-by: Sayantan Sarkar <sasarkar@habana.ai>
Co-authored-by: Libin Tang <litang@habana.ai>
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: Yeonsil Yoon <yyoon@habana.ai>
Co-authored-by: Sayantan Sarkar <supersarkar@gmail.com>

* Enable Flash Attention in recompute and causal modes
* Add flash_attention_causal_mask to generation utils
* Propagate Flash Attention causal_mask to finetuning example
* Modify README example and provide additional description
* Add flash_attention_causal_mask to FT README

* enable loading falcon-180b ckpt in .safetensors format
* Address comments borrowing transformer's way of reading ckpt file
* address comments
* Update ckpt loading. PR#15 reads a set of ckpt file names from the index json file. When OH downloads files from the hub instead of loading from a cache dir, get_repo_root() skips downloading the index json file, so PR#15 fails to load the file names. This PR scans the path and returns a list of names that match the pattern
* import modeling_utils from transformers

* Use KV cache till input seq len for prefill phase. Pad KV cache to full input + new tokens len for decode phase. Delete the KV cache used as inputs by HPU graphs after full prompt generation. Ensure KV cache is not returned as output tensor during decode phase. Deletion of KV cache input tensor used by HPU graphs needs to be protected by PT_HPUGRAPH_DISABLE_TENSOR_CACHE env variable. All the changes are protected by bucket internal flag.
  Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>
* Revert initialization of KV cache
* Set PT_HPUGRAPH_DISABLE_TENSOR_CACHE flag
* remove os import
* remove commented print

---------

Signed-off-by: Puneesh Khanna <pkhanna@habana.ai>

* Add rouge metric evaluation for llama 70B with orca datasets: use the rouge metric to evaluate the correctness of the model, using the openorca dataset
* Add rouge metric evaluation for llama 70B with orca datasets: use the rouge metric to evaluate the correctness of the model, using the openorca dataset
* Add rouge metric evaluation for llama 70B with orca datasets: use the rouge metric to evaluate the correctness of the model, using the openorca dataset

regisss and others added 30 commits January 26, 2024 08:51

Release: v1.10.0

c1154b2

Fix tests (huggingface#669)

90aa87f

Add Flan T5 to model table (huggingface#677)

a91559d

Bring back workaround for Falcon with SynapseAI 1.13 (huggingface#685)

3bc80cc

Fix version check when console output is disabled (huggingface#688)

6d6facd

Patch for Gaudi Text-Generation Pipeline (huggingface#690)

f68950b

Updated requirements for image-classification samples: datasets>=2.14.0 (huggingface#699)

f508f20

Add ControlNet Pipeline (huggingface#585)

58c4ea1

Change capture logic for HPU graphs in Diffusers pipelines (huggingface#679)

52d6c34

Update example diff file for image classification (huggingface#703)

6feb65a

Add instruction in README to checkout latest stable release (huggingface#705)

3c40b89

Release: v1.10.1

a26d748

Update diff file (huggingface#706)

94b1c6f

To fix LLAMA-V2-70B-FT-HF (8x) for eager mode (huggingface#709)

202f040

Adding a flag whether to save checkpoint or not in run_clm.py (huggingface#711)

188b09e

Pin Accelerate (huggingface#714)

d5926f3

Change for R1.10.2 (huggingface#719)

09de214

Release: v1.10.2

a6a88fa

Fix Llama initialization (huggingface#712)

a9f8ac3

Release: v1.10.3

ee7a0b3

Expose Llama Fused OPs control from run_lora_clm.py (#23)

0e56c6b

* Expose Llama Fused OPs control from run_lora_clm.py
* Update as per review comments

enable internal kv bucket in llama (#24)

6fea7b8

* enable internal kv bucket in llama
* initialize bucket_internal for CI
* make bucket_internal more clear
* further perf optim while max length is not multiple of bucket size

[SW-173358] add first token prints (#18)

50c3d13

* [SW-173358] add first token prints
* [SW-173358] rename x to outputs
* [SW-173358] make style

Fix inference command clip-roberta (#31)

763f609

Changing backend name (#32)

5169c64

enable falcon-180b inference (#15)

ef718ba

* enable loading falcon-180b ckpt in .safetensors format
* Address comments borrowing transformer's way of reading ckpt file
* address comments

Add support for safetensors and sharded checkpoints (#25)

e0d1de5

Co-authored-by: Sun Choi <schoi@habana.ai>

dudilester and others added 18 commits April 3, 2024 18:46

add flash_attention_causal_mask to run_lm_eval.py (#142)

512c715

Fix get_dtype and convert_into_dtypes (huggingface#769) (#144)

387e675

update kvcache mistral (#145)

69096d0

Remove deprecated AOT_HPU_TRAINING_BACKEND (#138)

e7a97c9

A new backend, HPU_BACKEND, has been introduced in pytorch-integration. The deprecated backend (AOT_HPU_TRAINING_BACKEND) will no longer be available in optimum-habana because it is going to be removed from pytorch-integration.

set all fusedrope inputs to bf16 (#140)

1d44433

Fixed bug when using reuse_cache (#146)

d563d70

Fix throughput calculation for diffusion models (huggingface#715) (#160)

df2f541

Co-authored-by: Soila Kavulya <skavulya@gmail.com>

Sampling search UseKV cache till input seq len for prefill phase (#161)

64efe5b

* Sampling search UseKV cache till input seq len for prefill phase
* Remove redundant line

Add nograd() in text-generation examples lm_eval (#151)

aa62805

Mark scale as const and remove --fp8 flag usage (#156)

8bfd6ef

* Mark only scales as const
* remove --fp8 flag usage from llama
* removed usage of ENABLE_CONST_MARKING

Change-Id: I6dba8691d842fc62d09da5202ea1e61a111f5f18

---------

Co-authored-by: Eran Geva <egeva@habana.ai>

Remove --fp8 flag from script (#171)

1586f4b

[SW-12028] - set ds config "reduce scatter" to false (#173)

c1a6274

- FlanT5 is giving a perf drop with "reduce_scatter" value True in DS config

Signed-off-by: vineethanandh <vineethanandh@habana.ai>

llama70b one card to infer device map with max memory limitation (#174)

aae78e9

bf16 with disk offload protection (#176)

b420a45

* bf16 with disk offload protection
* comment fix

Fix RoPE data type issue for gpt_neox and stablelm (#177)

a9bc76e

Added Mistral fp8 support

dd16172

jiminha requested review from ZhaiFeiyue, bhargaveede, hlahkar, libinta, mandy-li, ssarkar2 and vivekgoe as code owners April 30, 2024 20:23

jiminha requested a review from a user April 30, 2024 20:23

jiminha requested a review from regisss as a code owner April 30, 2024 20:23

jiminha closed this Apr 30, 2024

jiminha deleted the ae_mistral_fp8_new branch April 30, 2024 20:25

jiminha restored the ae_mistral_fp8_new branch April 30, 2024 20:38

jiminha wants to merge 100 commits into huggingface:main from HabanaAI:ae_mistral_fp8_new

20 participants
