add fp8 related changes to mistral for text-generation#918

Merged
regisss merged 23 commits into main from skaulintel/mistral_fp8 on May 7, 2024

@skaulintel skaulintel commented Apr 23, 2024

What does this PR do?

Initial mistral fp8 change
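
For readers unfamiliar with the term, the `maxabs` in the quantization config file name used in the commands below suggests max-abs scaling: a tensor is divided by a scale derived from its largest absolute value so that it fits the FP8 range. The PR does not show the contents of `maxabs_quant.json` or the quantization code itself, so the following is only a minimal pure-Python sketch of that general idea, not the toolkit's implementation. The names `maxabs_scale`, `scale_and_clamp`, and `rescale` are hypothetical, and `FP8_MAX = 240.0` is an assumed placeholder for the largest finite FP8 magnitude.

```python
from typing import List

# Assumption: largest finite magnitude of the target FP8 format. The value is
# a placeholder for illustration; the real limit depends on the FP8 variant
# and on the hardware.
FP8_MAX = 240.0


def maxabs_scale(values: List[float], fp8_max: float = FP8_MAX) -> float:
    """Return a per-tensor scale that maps max(|values|) onto fp8_max."""
    amax = max(abs(v) for v in values)
    # An all-zero tensor would give a zero scale, so fall back to 1.0.
    return amax / fp8_max if amax > 0 else 1.0


def scale_and_clamp(values: List[float], scale: float,
                    fp8_max: float = FP8_MAX) -> List[float]:
    """Divide by the scale and clamp into [-fp8_max, fp8_max].

    Rounding to the FP8 mantissa is deliberately left out of this sketch.
    """
    return [max(-fp8_max, min(fp8_max, v / scale)) for v in values]


def rescale(scaled: List[float], scale: float) -> List[float]:
    """Undo the scaling to return to the original value range."""
    return [q * scale for q in scaled]
```

In a real FP8 path the scaled values would additionally be rounded to the FP8 grid, which is where the precision loss comes from; the sketch only shows how the scale is chosen.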

Command Lines:

  1. 128x128xbs4

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 896 --fp8 --max_new_tokens 128 --max_input_tokens 128 --limit_hpu_graphs

Throughput (including tokenization) = 13250.825658116784 tokens/second
Number of HPU graphs = 85
Memory allocated = 38.37 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.98284676099138 seconds

  2. 2048x128

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 128 --max_input_tokens 2048 --limit_hpu_graphs

Throughput (including tokenization) = 1362.8371789032228 tokens/second
Number of HPU graphs = 85
Memory allocated = 74.29 GB
Max memory allocated = 93.82 GB
Total memory available = 94.62 GB
Graph compilation duration = 90.72206230499432 seconds

  3. 2048x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 44 --fp8 --max_new_tokens 2048 --max_input_tokens 2048 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 3105.9817365063354 tokens/second
Number of HPU graphs = 565
Memory allocated = 84.73 GB
Max memory allocated = 94.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 414.38635561900446 seconds

  4. 128x2048

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 120 --fp8 --max_new_tokens 2048 --max_input_tokens 128 --bucket_internal --bucket_size 128 --limit_hpu_graphs

Throughput (including tokenization) = 7738.114888711109 tokens/second
Number of HPU graphs = 565
Memory allocated = 74.97 GB
Max memory allocated = 94.61 GB
Total memory available = 94.62 GB
Graph compilation duration = 405.53613558399957 seconds
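
The four command lines above share most of their flags and differ only in batch size, input and output token lengths, and whether the bucketing flags are passed. The sketch below rebuilds each command from a small table and derives a rough per-sequence rate by dividing the reported throughput by the batch size. The `Run` class and the helper names are hypothetical, and the per-sequence figure is a derived number that `run_generation.py` does not report; the flag names and throughput values are copied from the runs listed above.

```python
import os
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Run:
    label: str
    batch_size: int
    max_input_tokens: int
    max_new_tokens: int
    bucketed: bool      # True when --bucket_internal --bucket_size 128 was passed
    throughput: float   # tokens/second, as reported in the PR description


# Values copied from the four runs in the PR description.
RUNS = [
    Run("128x128xbs4", 896, 128, 128, False, 13250.825658116784),
    Run("2048x128", 120, 2048, 128, False, 1362.8371789032228),
    Run("2048x2048", 44, 2048, 2048, True, 3105.9817365063354),
    Run("128x2048", 120, 128, 2048, True, 7738.114888711109),
]


def build_command(run: Run) -> List[str]:
    """Rebuild the argument list for one run, in the order used above."""
    cmd = [
        "python", "run_generation.py",
        "--model_name_or_path", "mistralai/Mistral-7B-Instruct-v0.2",
        "--attn_softmax_bf16", "--use_hpu_graphs", "--trim_logits",
        "--use_kv_cache", "--reuse_cache", "--bf16",
        "--batch_size", str(run.batch_size), "--fp8",
        "--max_new_tokens", str(run.max_new_tokens),
        "--max_input_tokens", str(run.max_input_tokens),
    ]
    if run.bucketed:
        cmd += ["--bucket_internal", "--bucket_size", "128"]
    cmd.append("--limit_hpu_graphs")
    return cmd


def build_env() -> Dict[str, str]:
    """Copy the current environment and add QUANT_CONFIG as in the commands."""
    return dict(os.environ, QUANT_CONFIG="./quantization_config/maxabs_quant.json")


def per_sequence_throughput(run: Run) -> float:
    """Reported tokens/second divided by batch size (a derived metric)."""
    return run.throughput / run.batch_size


if __name__ == "__main__":
    for run in RUNS:
        print(run.label, f"{per_sequence_throughput(run):.2f} tokens/s per sequence")
        print("  " + " ".join(build_command(run)))
```

To actually launch a run, the list could be passed to `subprocess.run(build_command(run), env=build_env(), check=True)` from the directory that contains `run_generation.py`; that step needs the target hardware and is not exercised here.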

@skaulintel skaulintel requested a review from regisss as a code owner April 23, 2024 17:54
@skaulintel skaulintel requested review from jiminha and libinta April 23, 2024 17:56
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@skaulintel skaulintel marked this pull request as draft April 23, 2024 22:00
@skaulintel skaulintel changed the title from "donotmerge: add fp8 related changes to mistral for text-generation" to "add fp8 related changes to mistral for text-generation" Apr 25, 2024
@skaulintel skaulintel marked this pull request as ready for review April 25, 2024 22:34
@libinta libinta added the run-test (Run CI for PRs from external contributors) label Apr 29, 2024
7 review comment threads on optimum/habana/transformers/models/mistral/modeling_mistral.py (4 marked Outdated)
@regisss regisss merged commit 9f6eba3 into main May 7, 2024
@regisss regisss deleted the skaulintel/mistral_fp8 branch May 7, 2024 22:16
ccrhx4 pushed a commit to ccrhx4/ccrhx4.optimum-habana that referenced this pull request May 11, 2024
Co-authored-by: Jimin Ha <jha@habana.ai>
Co-authored-by: regisss <15324346+regisss@users.noreply.github.com>
@mandy-li mandy-li requested review from schoi-habana and removed request for schoi-habana May 17, 2024 16:28

Labels

run-test (Run CI for PRs from external contributors), synapse1.16

5 participants