Added Mistral fp8 support #185

Merged: libinta merged 3 commits into habana-main from ae_mistral_fp8_new on Apr 30, 2024

@jiminha jiminha commented Apr 30, 2024

What does this PR do?

Add support for Mistral fp8 text-generation.
Porting from:
huggingface#918
huggingface#931

Measurement

QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size

Run

Each configuration below is labeled <max_input_tokens>x<max_new_tokens>xbs<batch_size>.

128x128xbs896

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
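The metric lines printed after each run share a simple `name = value unit` layout, so they can be collected into a dictionary for comparison across configurations. The following is a minimal sketch only: `parse_metrics` and `SAMPLE_LOG` are hypothetical names introduced here, not part of `run_generation.py`, and the line format is assumed from the output pasted in this PR.

```python
import re

# Metric lines copied from the 128x128xbs896 run output in this PR.
SAMPLE_LOG = """\
Throughput (including tokenization) = 13235.770108332108 tokens/second
Number of HPU graphs = 91
Memory allocated = 38.35 GB
Max memory allocated = 92.85 GB
Total memory available = 94.62 GB
Graph compilation duration = 72.49320212705061 seconds
"""

# Assumed layout: "<name> = <number> <optional unit>"
LINE_RE = re.compile(r"^(?P<name>[^=]+?)\s*=\s*(?P<value>[0-9.]+)\s*(?P<unit>\S*)\s*$")


def parse_metrics(text):
    """Hypothetical helper: map each metric name to its numeric value."""
    metrics = {}
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if match:
            metrics[match.group("name")] = float(match.group("value"))
    return metrics


metrics = parse_metrics(SAMPLE_LOG)
# Memory still free at peak, in GB, for this run.
headroom_gb = metrics["Total memory available"] - metrics["Max memory allocated"]
print(f"peak-memory headroom: {headroom_gb:.2f} GB")
```

Lines that do not match the assumed layout are skipped silently, so the helper can be fed a full log without filtering first.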

2048x128xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1368.34859681789 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.3 GB
Total memory available = 94.62 GB
Graph compilation duration = 71.46055008197436 seconds

2048x2048xbs44

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3151.661188904903 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.44 GB
Total memory available = 94.62 GB
Graph compilation duration = 285.3293136919965 seconds

128x2048xbs120

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7957.2094113766125 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.96 GB
Max memory allocated = 94.09 GB
Total memory available = 94.62 GB
Graph compilation duration = 278.0657788790413 seconds
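Comparing the configuration labels with the commands above, each label appears to encode the input length, the number of new tokens, and the batch size, in that order. A small sketch that decodes a label into the matching size flags is shown below. `label_to_flags` is a hypothetical helper, not something this PR adds, and it covers only the three size flags; the remaining options (for example `--bucket_internal`) still have to be chosen per run.

```python
import re

# Assumed label layout: <max_input_tokens>x<max_new_tokens>xbs<batch_size>
LABEL_RE = re.compile(r"^(\d+)x(\d+)xbs(\d+)$")


def label_to_flags(label):
    """Hypothetical helper: '2048x128xbs120' -> size flags for run_generation.py."""
    match = LABEL_RE.match(label)
    if match is None:
        raise ValueError(f"unrecognized label: {label!r}")
    max_input, max_new, batch = match.groups()
    return [
        "--batch_size", batch,
        "--max_new_tokens", max_new,
        "--max_input_tokens", max_input,
    ]


print(" ".join(label_to_flags("2048x128xbs120")))
```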

32000x512xbs4

Flash attention with bf16

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 56.17350039654345 tokens/second
Number of HPU graphs = 17
Memory allocated = 90.82 GB
Max memory allocated = 90.84 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.64211278100265 seconds

Fused SDPA with fp8

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs

Throughput (including tokenization) = 42.383026178121916 tokens/second
Number of HPU graphs = 85
Memory allocated = 33.55 GB
Max memory allocated = 77.12 GB
Total memory available = 94.62 GB
Graph compilation duration = 199.3312814909732 seconds
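To put the two 32000x512xbs4 runs side by side, the reported throughput and peak memory can be compared directly. This is a rough sketch using only the figures pasted above; the variable names are made up for illustration, and because the two commands differ in more than precision (attention flags, `--limit_hpu_graphs`), the ratio should not be read as a pure fp8 effect.

```python
# Figures copied from the two 32000x512xbs4 runs in this PR.
flash_attn_run = {"throughput": 56.17350039654345, "max_mem_gb": 90.84}
fp8_sdpa_run = {"throughput": 42.383026178121916, "max_mem_gb": 77.12}

# Throughput of the fp8 run relative to the flash-attention run.
throughput_ratio = fp8_sdpa_run["throughput"] / flash_attn_run["throughput"]

# Difference in peak memory (GB) between the two runs.
peak_mem_diff_gb = flash_attn_run["max_mem_gb"] - fp8_sdpa_run["max_mem_gb"]

print(f"fp8 / flash-attention throughput ratio: {throughput_ratio:.2f}")
print(f"peak memory difference: {peak_mem_diff_gb:.2f} GB")
```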

@libinta libinta merged commit 38491eb into habana-main Apr 30, 2024
astachowiczhabana pushed a commit that referenced this pull request May 6, 2024

huggingface#918
