Support Mistral 32K input token by jiminha · Pull Request #931 · huggingface/optimum-habana

jiminha · 2024-04-30T03:53:53Z

What does this PR do?

Support long sequences (32k input tokens) at batch size 4 with Flash attention and FusedSDPA.

This PR also includes these fixes:

Support token lengths longer than 32k (max_position_embeddings).
Remove the use_fused_rope argument, which is no longer in use.
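
As an illustration of the first fix, a rotary-embedding cos/sin table can be grown on demand when a prompt is longer than max_position_embeddings, instead of indexing past the end of the table. The sketch below is a pure-Python toy under that assumption. The class name RotaryCache, its methods, the default base and the tiny dimensions are hypothetical and are not taken from this PR's modeling code.

```python
import math


class RotaryCache:
    """Toy cos/sin table for rotary position embeddings that grows on demand."""

    def __init__(self, head_dim, max_position_embeddings, base=10000.0):
        # One inverse frequency per pair of channels.
        self.inv_freq = [base ** (-(2 * i) / head_dim) for i in range(head_dim // 2)]
        self.max_seq_len_cached = 0
        self.cos = []
        self.sin = []
        # Pre-compute the table up to the configured maximum.
        self._extend(max_position_embeddings)

    def _extend(self, seq_len):
        # Append rows for positions [max_seq_len_cached, seq_len).
        for pos in range(self.max_seq_len_cached, seq_len):
            angles = [pos * f for f in self.inv_freq]
            self.cos.append([math.cos(a) for a in angles])
            self.sin.append([math.sin(a) for a in angles])
        self.max_seq_len_cached = max(self.max_seq_len_cached, seq_len)

    def get(self, seq_len):
        # Grow the table rather than failing on sequences longer than the cache.
        if seq_len > self.max_seq_len_cached:
            self._extend(seq_len)
        return self.cos[:seq_len], self.sin[:seq_len]


cache = RotaryCache(head_dim=8, max_position_embeddings=16)
cos, sin = cache.get(40)  # longer than the configured maximum of 16
print(len(cos), cache.max_seq_len_cached)
```

Growing the table lazily keeps the common short-prompt case cheap while still allowing prompts past the configured maximum.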

Before submitting

Without this change, any 32k-token input runs out of memory.
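
A back-of-the-envelope estimate suggests why a 32k prompt can exhaust memory when attention scores are fully materialized. The sketch below assumes a naive implementation that holds the whole batch x heads x query x key score matrix for one layer in bf16. The 32 attention heads come from the public Mistral-7B configuration, and the helper name is hypothetical. It is an estimate of one tensor, not a measurement of this code base.

```python
def attention_scores_bytes(batch_size, num_heads, q_len, kv_len, bytes_per_elem=2):
    """Size of a fully materialized attention score tensor for one layer."""
    return batch_size * num_heads * q_len * kv_len * bytes_per_elem


# Batch size 4, 32 heads (assumed from the public Mistral-7B config), 32000-token prompt, bf16.
naive = attention_scores_bytes(batch_size=4, num_heads=32, q_len=32000, kv_len=32000)
device_total_gb = 94.62  # "Total memory available" reported in the runs in this PR

print(f"naive score tensor: {naive / 1e9:.1f} GB vs {device_total_gb} GB available")
```

Fused or flash-style attention kernels avoid holding this tensor all at once, which is consistent with the runs in this PR fitting in device memory.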

Use Flash attention

Only enabled when the flag is set. It performs much better, but the generated tokens differ slightly (accuracy issue).

32000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.99861384701217 seconds

input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabel',)

32000x700xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 700 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 55.965171075990526 tokens/second
Number of HPU graphs = 17
Memory allocated = 87.35 GB
Max memory allocated = 87.37 GB
Total memory available = 94.62 GB
Graph compilation duration = 152.95444813597715 seconds
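
Part of the memory footprint reported above is presumably the KV cache, which grows linearly with batch size and sequence length. The sketch below is a rough estimate only. The layer count, number of KV heads and head dimension are assumptions taken from the public Mistral-7B configuration, the function name is hypothetical, and the result does not account for how this code base actually allocates or pads its cache.

```python
def kv_cache_bytes(batch_size, seq_len, num_layers=32, num_kv_heads=8,
                   head_dim=128, bytes_per_elem=2):
    """Estimated KV cache size: keys plus values for every layer and token."""
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_elem
    return batch_size * seq_len * per_token


# 32000 input tokens plus 700 new tokens at batch size 4, bf16 (2 bytes per element).
bf16_bytes = kv_cache_bytes(batch_size=4, seq_len=32000 + 700)
# The same cache stored in a 1-byte format such as fp8 would be half the size.
fp8_bytes = kv_cache_bytes(batch_size=4, seq_len=32000 + 700, bytes_per_elem=1)

print(f"bf16: {bf16_bytes / 1e9:.2f} GB, 1-byte: {fp8_bytes / 1e9:.2f} GB")
```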

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabeled data means that there is no corresponding output label for each input example. Instead, the model learns to find patterns and relationships in the data on its own. For example, in clustering, the model groups similar data points together based on their features.\n\nIn deep learning, supervised learning is typically used for tasks such as image classification, speech recognition, and natural language processing, where the input data comes with labeled output labels. Unsupervised learning is typically used for tasks such as anomaly detection, dimensionality reduction, and clustering, where the input data does not come with labeled output labels.\n\nOne key difference between supervised and unsupervised learning in deep learning is the type of neural network architecture used. For supervised learning, fully connected neural networks are commonly used, while for unsupervised learning, recurrent neural networks (RNNs) and autoencoders are more commonly',)

Use Flash attention (without causal mask)

Better accuracy, but slower than the run with causal_mask.

32000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)
Throughput (including tokenization) = 37.6014126683573 tokens/second
Number of HPU graphs = 15
Memory allocated = 47.62 GB
Max memory allocated = 87.27 GB
Total memory available = 94.62 GB
Graph compilation duration = 172.61339238239452 seconds
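
To put the two 32k-input, 512-new-token runs side by side, the stats blocks can be parsed into dictionaries and compared. The sketch below copies the numbers reported in this PR. The parser and its names are hypothetical helpers, not part of run_generation.py. Note that the run without the causal mask also used --limit_hpu_graphs, so the difference is not attributable to the causal mask alone.

```python
import re

# Matches lines such as "Memory allocated = 86.88 GB".
STAT_LINE = re.compile(r"^(?P<name>[^=\n]+?)\s*=\s*(?P<value>\d+(?:\.\d+)?)", re.MULTILINE)


def parse_stats(log_text):
    """Return {stat name: numeric value} for every 'name = number' line."""
    return {m["name"].strip(): float(m["value"]) for m in STAT_LINE.finditer(log_text)}


# Stats copied from the two batch-size-4 runs with 32000 input tokens and 512 new tokens.
WITH_CAUSAL_MASK = """\
Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
"""
WITHOUT_CAUSAL_MASK = """\
Throughput (including tokenization) = 37.6014126683573 tokens/second
Number of HPU graphs = 15
Memory allocated = 47.62 GB
Max memory allocated = 87.27 GB
"""

causal = parse_stats(WITH_CAUSAL_MASK)
no_causal = parse_stats(WITHOUT_CAUSAL_MASK)
key = "Throughput (including tokenization)"
speedup = causal[key] / no_causal[key]
print(f"throughput ratio (causal mask / no causal mask): {speedup:.2f}x")
```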

16000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 8 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Throughput (including tokenization) = 117.49020642629243 tokens/second
Number of HPU graphs = 15
Memory allocated = 40.65 GB
Max memory allocated = 77.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 113.2534202276729 seconds

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)

16000x512xbs4 + QUANT

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 24 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Throughput (including tokenization) = 121.95398079993026 tokens/second
Number of HPU graphs = 81
Memory allocated = 27.27 GB
Max memory allocated = 63.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 194.07247889088467 seconds

HuggingFaceDocBuilderDev · 2024-04-30T03:57:43Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

regisss · 2024-05-04T16:07:45Z

There is a lot of overlap with #918, let's wait for it to be merged.

regisss · 2024-05-08T07:28:46Z

@jiminha Can you rebase on main and fix the merge conflicts?

jiminha · 2024-05-09T21:18:23Z

Done. Please review.

regisss

Is this compatible only with 1.16?
When running

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs

I get the following error:

Traceback (most recent call last):
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 643, in <module>
    main()
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 289, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 407, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 168, in setup_model
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in prep_model
    prepare_model(model)  # registers hooks
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/__init__.py", line 47, in prepare_model
    return quantize_hooks(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/quantize.py", line 69, in quantize_hooks
    measurement=load_measurements(config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/measure.py", line 129, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/common.py", line 108, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz

regisss · 2024-05-09T21:40:42Z

        reuse_cache: Optional[bool] = False,
        cache_idx: Optional[int] = None,
        attn_softmax_bf16: Optional[bool] = False,
-        use_fused_rope: Optional[bool] = True,


Good catch! I missed all these use_fused_rope

jiminha · 2024-05-09T21:50:48Z

Have you run the measurement step yet?
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1

Actually, let me double-check whether docker15 supports this FusedSDPA fp8.
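
The FileNotFoundError above is raised when quantization starts before a measurement file exists. A small pre-flight check can surface that earlier with a clearer message. The sketch below only uses the path printed in the traceback (./hqt_output/measure_hooks_maxabs.npz). The helper name is hypothetical, and the check does not read or interpret the quantization config itself.

```python
import os


def missing_measurement_files(paths):
    """Return the subset of paths that do not exist as regular files."""
    return [p for p in paths if not os.path.isfile(p)]


# Path taken from the error message in the traceback; adjust to your own setup.
expected = ["./hqt_output/measure_hooks_maxabs.npz"]
missing = missing_measurement_files(expected)
if missing:
    print("Run the measurement step first; missing:", ", ".join(missing))
else:
    print("Measurement files found; quantized run can proceed.")
```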

jiminha · 2024-05-09T22:45:03Z

@regisss
I forgot to remove some outdated instructions. I removed only the FusedSDPA option from my code and kept the flash_attention option (which shows the best performance).

Try this command:
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

regisss

LGTM!

The generations with causal mask look a bit off to me at the beginning (right after the input):

input 1: ('DeepSpeed is a machine learning framework',)                                                                                                                                       
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term

input 2: ('He is working on',)                                                                                                                                                                
output 1: ('He is working onе of the most popular and widely used programming languages.

input 3: ('He has a',)                                                                                                                                                                        
output 1: ("He has aеоlоgу, аnd аndrоid аnd iоs аpps.\n\n## What is the difference between a VPN and a proxy server?

input 4: ('He got all',)
output 1: ("He got allеу оff the ground.\n\nThe 2018 season was a turning point for the team.

I don't think it is related to this PR though.

jiminha · 2024-05-10T16:11:35Z

Yes, it's a known issue. We expect a fix in the 1.16 release.
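
The odd characters right after each prompt are not plain ASCII: they are letters from another script that look like Latin ones. A quick way to flag such output is to scan for alphabetic characters whose Unicode name is not Latin. The sketch below is a hypothetical helper using only the standard library. The sample string mimics the "He is working on" output above, with the look-alike letter written as an explicit escape.

```python
import unicodedata


def find_non_latin_letters(text):
    """Return (index, character, Unicode name) for non-ASCII letters that are not Latin."""
    hits = []
    for index, ch in enumerate(text):
        if ch.isalpha() and ord(ch) > 127:
            name = unicodedata.name(ch, "UNKNOWN")
            if not name.startswith("LATIN"):
                hits.append((index, ch, name))
    return hits


# "\u0435" is a Cyrillic letter that renders like a Latin "e".
sample = "He is working on\u0435 of the most popular"
hits = find_non_latin_letters(sample)
for index, ch, name in hits:
    print(index, repr(ch), name)
```

A check like this could be run over generated text in a regression test to catch the issue without comparing full outputs.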

skaulintel and others added 11 commits April 23, 2024 10:49

add fp8 related changes to mistral for text-generation

2549290

add KVCache object

513aa10

Fix layer_idx warning issue

be79f0f

Style fix

e955c20

add reuse_cache and some other arguments to mistral inputs

ffaaa1d

Merge branch 'skaulintel/mistral_fp8' of https://github.com/huggingfa…

7b950fe

…ce/optimum-habana into skaulintel/mistral_fp8

style reformat

0ee0339

Update modeling_mistral.py

dc7afb4

remove padding_mask warning

style fix

2e20a3f

Merge branch 'main' into skaulintel/mistral_fp8

e661aac

Mistral flash attention support for longer input length

04daee2

jiminha requested a review from regisss as a code owner April 30, 2024 03:53

jiminha changed the title ~~Support Mistral 13K~~ Support Mistral 32K input token Apr 30, 2024

jiminha requested review from libinta and mandy-li April 30, 2024 16:24

jiminha mentioned this pull request Apr 30, 2024

Added Mistral fp8 support HabanaAI/optimum-habana-fork#185

Merged

libinta added the run-test label (Run CI for PRs from external contributors) May 3, 2024

libinta added the synapse1.16 label May 7, 2024

libinta approved these changes May 8, 2024

jiminha added 2 commits May 8, 2024 12:58

Merge remote-tracking branch 'remotes/origin/main' into jha/mistral30k

c14243e

Add fp8 support for FusedSDPA for Mistral

e95c4ba

jiminha requested review from bhargaveede, ssarkar2 and vivekgoe as code owners May 9, 2024 21:15

Remove unnecessary comment

e54f93c

regisss reviewed May 9, 2024

Style fix

87de2e5

regisss approved these changes May 10, 2024

regisss merged commit 2efe099 into main May 10, 2024

regisss deleted the jha/mistral30k branch May 10, 2024 09:54

astachowiczhabana mentioned this pull request Jun 12, 2024

FP8 FusedSDPA support for Mistral HabanaAI/optimum-habana-fork#195

Merged

regisss merged 15 commits into main from jha/mistral30k
