Support Mistral 32K input token #931
remove padding_mask warning
There is a lot of overlap with #918, let's wait for it to be merged.
@jiminha Can you rebase on main and fix the merge conflicts?
Done. Please review. |
regisss left a comment
Is this compatible only with 1.16?
When running
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs
I get the following error:
Traceback (most recent call last):
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 643, in <module>
    main()
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 289, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 407, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 168, in setup_model
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in prep_model
    prepare_model(model) # registers hooks
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/__init__.py", line 47, in prepare_model
    return quantize_hooks(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/quantize.py", line 69, in quantize_hooks
    measurement=load_measurements(config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/measure.py", line 129, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/common.py", line 108, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz
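The error above means the quantized run could not find a measurement file. A minimal pre-flight sketch, assuming the path shown in the traceback (the helper name `measurement_available` is hypothetical and not part of optimum-habana or the quantization toolkit):

```python
from pathlib import Path

# Path copied from the FileNotFoundError above; adjust it if your
# quantization config writes measurements somewhere else.
MEASURE_FILE = Path("./hqt_output/measure_hooks_maxabs.npz")


def measurement_available(path: Path = MEASURE_FILE) -> bool:
    """Return True if the measurement file exists and is not empty."""
    return path.is_file() and path.stat().st_size > 0


if __name__ == "__main__":
    if measurement_available():
        print(f"Found {MEASURE_FILE}; the quantized run should be able to load it.")
    else:
        print(f"{MEASURE_FILE} is missing; a measurement pass is needed first.")
```

Running this from the same working directory as `run_generation.py` shows whether the file is there before launching the long quantized run.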
reuse_cache: Optional[bool] = False,
cache_idx: Optional[int] = None,
attn_softmax_bf16: Optional[bool] = False,
use_fused_rope: Optional[bool] = True,
Good catch! I missed all of these use_fused_rope arguments.
Have you run the measurement step? Actually, let me double-check whether docker15 supports this FusedSDPA fp8.
@regisss Try this command:
regisss left a comment
LGTM!
The generations with causal mask look a bit off to me at the beginning (right after the input):
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term
input 2: ('He is working on',)
output 1: ('He is working onе of the most popular and widely used programming languages.
input 3: ('He has a',)
output 1: ("He has aеоlоgу, аnd аndrоid аnd iоs аpps.\n\n## What is the difference between a VPN and a proxy server?
input 4: ('He got all',)
output 1: ("He got allеу оff the ground.\n\nThe 2018 season was a turning point for the team.
I don't think it is related to this PR though.
Yes, it's a known issue. We are expecting a fix in the 16 release.
What does this PR do?
Supports long sequences (32k input tokens) at batch size 4 with Flash attention and FusedSDPA.
Additionally, this PR also adds these fixes
Before submitting
Out of memory for any 32k input token
Use Flash attention: only enabled when the flag is set. It performs much better, but the generated tokens are slightly different (accuracy issue).
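A back-of-the-envelope sketch of why a 32k input can run out of memory without Flash attention. It assumes the full attention score matrix is materialized in bf16 for a single layer, and it uses the published Mistral-7B shape (32 layers, 32 attention heads, 8 KV heads, head dimension 128), which is not stated in this PR. The function names are hypothetical.

```python
BF16_BYTES = 2  # bytes per bf16 element


def attention_scores_bytes(batch: int, heads: int, q_len: int, kv_len: int,
                           bytes_per_elem: int = BF16_BYTES) -> int:
    """Size of one layer's full (q_len x kv_len) score matrix for all heads."""
    return batch * heads * q_len * kv_len * bytes_per_elem


def kv_cache_bytes(batch: int, layers: int, kv_heads: int, head_dim: int,
                   seq_len: int, bytes_per_elem: int = BF16_BYTES) -> int:
    """Size of the key and value caches across all layers."""
    return 2 * batch * layers * kv_heads * head_dim * seq_len * bytes_per_elem


if __name__ == "__main__":
    # Assumed Mistral-7B shape; batch size and lengths come from the runs in this PR.
    scores = attention_scores_bytes(batch=4, heads=32, q_len=32000, kv_len=32000)
    cache = kv_cache_bytes(batch=4, layers=32, kv_heads=8, head_dim=128,
                           seq_len=32000 + 512)
    print(f"prefill attention scores, one layer: {scores / 1e9:.2f} GB")
    print(f"KV cache, all layers: {cache / 1e9:.2f} GB")
```

Under these assumptions the score matrix alone is about 262 GB for one layer, well above the 94.62 GB reported as available, while the KV cache is about 17 GB. Flash attention avoids materializing the full score matrix, which is consistent with the 32k runs fitting in memory once it is enabled.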
32000x512xbs4
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask
Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.99861384701217 seconds
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabel',)
32000x700xbs4
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 700 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask
Throughput (including tokenization) = 55.965171075990526 tokens/second
Number of HPU graphs = 17
Memory allocated = 87.35 GB
Max memory allocated = 87.37 GB
Total memory available = 94.62 GB
Graph compilation duration = 152.95444813597715 seconds
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabeled data means that there is no corresponding output label for each input example. Instead, the model learns to find patterns and relationships in the data on its own. For example, in clustering, the model groups similar data points together based on their features.\n\nIn deep learning, supervised learning is typically used for tasks such as image classification, speech recognition, and natural language processing, where the input data comes with labeled output labels. Unsupervised learning is typically used for tasks such as anomaly detection, dimensionality reduction, and clustering, where the input data does not come with labeled output labels.\n\nOne key difference between supervised and unsupervised learning in deep learning is the type of neural network architecture used. For supervised learning, fully connected neural networks are commonly used, while for unsupervised learning, recurrent neural networks (RNNs) and autoencoders are more commonly',)
Use Flash attention (without causal mask)
Better accuracy but performs slower than the one with causal_mask
32000x512xbs4
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)
Throughput (including tokenization) = 37.6014126683573 tokens/second
Number of HPU graphs = 15
Memory allocated = 47.62 GB
Max memory allocated = 87.27 GB
Total memory available = 94.62 GB
Graph compilation duration = 172.61339238239452 seconds
16000x512xbs4
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 8 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute
Throughput (including tokenization) = 117.49020642629243 tokens/second
Number of HPU graphs = 15
Memory allocated = 40.65 GB
Max memory allocated = 77.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 113.2534202276729 seconds
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)
16000x512xbs4 + QUANT
QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 24 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute
Throughput (including tokenization) = 121.95398079993026 tokens/second
Number of HPU graphs = 81
Memory allocated = 27.27 GB
Max memory allocated = 63.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 194.07247889088467 seconds
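To compare the runs above without reading the numbers by eye, here is a small sketch that parses the summary lines printed by the runs and computes the throughput ratio and memory headroom. The helper names (`parse_summary`, `speedup`, `headroom_gb`) are hypothetical; the values are copied from the two 32000-token, batch-size-4, 512-new-token runs in this description.

```python
import re

# Matches summary lines such as "Max memory allocated = 86.9 GB".
SUMMARY_RE = re.compile(r"^([A-Za-z ()/]+?)\s*=\s*([0-9.]+)\s*(.*)$")


def parse_summary(text: str) -> dict[str, float]:
    """Map each 'name = value unit' line to name -> value (unit dropped)."""
    metrics = {}
    for line in text.splitlines():
        match = SUMMARY_RE.match(line.strip())
        if match:
            metrics[match.group(1)] = float(match.group(2))
    return metrics


def speedup(fast: dict[str, float], slow: dict[str, float]) -> float:
    """Throughput ratio between two parsed runs."""
    key = "Throughput (including tokenization)"
    return fast[key] / slow[key]


def headroom_gb(run: dict[str, float]) -> float:
    """Memory left between the peak allocation and the reported total."""
    return run["Total memory available"] - run["Max memory allocated"]


# Summary of the run with --flash_attention_causal_mask.
CAUSAL = """Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.99861384701217 seconds"""

# Summary of the run without the causal mask flag.
NO_CAUSAL = """Throughput (including tokenization) = 37.6014126683573 tokens/second
Number of HPU graphs = 15
Memory allocated = 47.62 GB
Max memory allocated = 87.27 GB
Total memory available = 94.62 GB
Graph compilation duration = 172.61339238239452 seconds"""

if __name__ == "__main__":
    causal, no_causal = parse_summary(CAUSAL), parse_summary(NO_CAUSAL)
    print(f"causal mask speedup: {speedup(causal, no_causal):.2f}x")
    print(f"headroom with causal mask: {headroom_gb(causal):.2f} GB")
    print(f"headroom without causal mask: {headroom_gb(no_causal):.2f} GB")
```

On these numbers the causal-mask run is roughly 1.49x faster. The two commands also differ in `--limit_hpu_graphs`, so the ratio should not be attributed to the causal mask alone.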