Added Mistral fp8 support by jiminha · Pull Request #185 · HabanaAI/optimum-habana-fork

jiminha · 2024-04-30T20:42:52Z

What does this PR do?

Adds fp8 text-generation support for Mistral.
Ported from:
huggingface#918
huggingface#931

Measurement

QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size

Run

128x128xbs896

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
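For comparing runs, the stats block printed above can be turned into numbers. The following is a minimal sketch, not part of this PR: `parse_generation_stats` is a hypothetical helper, and it assumes every metric line has the `Name = value unit` shape shown in the logs here.

```python
import re


def parse_generation_stats(text: str) -> dict[str, float]:
    """Parse 'Name = value [unit]' lines into a dict keyed by metric name.

    Hypothetical helper: the line format is assumed from the logs in this PR.
    """
    pattern = re.compile(r"^\s*(?P<name>[^=]+?)\s*=\s*(?P<value>[-+]?\d+(?:\.\d+)?)")
    stats: dict[str, float] = {}
    for line in text.splitlines():
        match = pattern.match(line)
        if match:
            stats[match.group("name")] = float(match.group("value"))
    return stats


# Stats copied from the 128x128xbs896 run.
sample = """\
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
"""

stats = parse_generation_stats(sample)
# Memory left between the peak allocation and the device total.
headroom_gb = stats["Total memory available"] - stats["Max memory allocated"]
print(f"{stats['Throughput (including tokenization)']:.1f} tokens/s, "
      f"{headroom_gb:.2f} GB headroom")
```

Units are dropped on purpose, so callers need to know which metrics are in GB, seconds, or tokens/second.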

2048x128xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1368.34859681789 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.3 GB
Total memory available = 94.62 GB
Graph compilation duration = 71.46055008197436 seconds

2048x2048xbs44

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3151.661188904903 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.44 GB
Total memory available = 94.62 GB
Graph compilation duration = 285.3293136919965 seconds

128x2048xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7957.2094113766125 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.96 GB
Max memory allocated = 94.09 GB
Total memory available = 94.62 GB
Graph compilation duration = 278.0657788790413 seconds
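Judging by the commands above, the run labels follow the pattern `<max_input_tokens>x<max_new_tokens>xbs<batch_size>`. The sketch below maps a label to the matching `run_generation.py` flags. `flags_from_label` is a hypothetical helper, and the label convention is inferred from the commands in this PR rather than documented anywhere.

```python
import re


def flags_from_label(label: str) -> list[str]:
    """Map a label such as '2048x128xbs120' to run_generation.py size flags.

    Assumed convention: <max_input_tokens>x<max_new_tokens>xbs<batch_size>.
    """
    match = re.fullmatch(r"(\d+)x(\d+)xbs(\d+)", label)
    if match is None:
        raise ValueError(f"unrecognized run label: {label!r}")
    max_input, max_new, batch = match.groups()
    return [
        "--batch_size", batch,
        "--max_new_tokens", max_new,
        "--max_input_tokens", max_input,
    ]


print(" ".join(flags_from_label("2048x128xbs120")))
```

The other flags in each command (`--bucket_internal`, `--bucket_size`, `--limit_hpu_graphs`, and so on) vary per run and are not derivable from the label.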

32000x512xbs4

Flash attention with bf16

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 56.17350039654345 tokens/second
Number of HPU graphs = 17
Memory allocated = 90.82 GB
Max memory allocated = 90.84 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.64211278100265 seconds

Fused SDPA with fp8

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs

Throughput (including tokenization) = 42.383026178121916 tokens/second
Number of HPU graphs = 85
Memory allocated = 33.55 GB
Max memory allocated = 77.12 GB
Total memory available = 94.62 GB
Graph compilation duration = 199.3312814909732 seconds
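The two 32000x512xbs4 runs can be compared directly from the numbers reported above. This is a rough sketch using only those logged values: each configuration was run once here, so the ratios are indicative rather than a benchmark result. The variable names are illustrative.

```python
# Values copied from the two 32000x512xbs4 logs in this PR.
flash_run = {"throughput": 56.17350039654345, "max_mem_gb": 90.84}  # no QUANT_CONFIG
fp8_run = {"throughput": 42.383026178121916, "max_mem_gb": 77.12}   # maxabs_quant.json

# Relative throughput of the fp8 run versus the flash-attention run.
throughput_ratio = fp8_run["throughput"] / flash_run["throughput"]
# Difference in peak memory between the two runs.
peak_mem_saved_gb = flash_run["max_mem_gb"] - fp8_run["max_mem_gb"]

print(f"fp8 throughput is {throughput_ratio:.1%} of the flash-attention run")
print(f"fp8 peak memory is {peak_mem_saved_gb:.2f} GB lower")
```

On these logs, the fp8 run is slower at this sequence length but peaks at lower memory, which may matter when the flash-attention run is close to the 94.62 GB device limit.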

astachowiczhabana · 2024-06-12T09:38:58Z

huggingface#918

jiminha added 2 commits April 30, 2024 20:16

Added Mistral fp8 support

dd16172

Add flash_attention support for long sequences

a4b844b

skaulintel approved these changes Apr 30, 2024

Remove FusedSDPA option without flash_attention

53fda89

libinta approved these changes Apr 30, 2024

libinta merged commit 38491eb into habana-main Apr 30, 2024

astachowiczhabana pushed a commit that referenced this pull request May 6, 2024

Added Mistral fp8 support (#185)

cf6a82e

jiminha mentioned this pull request May 21, 2024

add mistral flash attention #172

Closed

libinta merged 3 commits into habana-main from ae_mistral_fp8_new

4 participants
