Support Mistral 32K input token #931

Merged
regisss merged 15 commits into main from jha/mistral30k on May 10, 2024

jiminha (Contributor) commented Apr 30, 2024

What does this PR do?

Support long sequences (32k input tokens) at batch size 4 with Flash attention and FusedSDPA.

This PR also includes these fixes:

  • support token lengths longer than 32k (max_position_embeddings)
  • remove the use_fused_rope argument, which is no longer in use.
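
The description does not say how lengths beyond max_position_embeddings are handled. One common place this limit shows up is a precomputed rotary cos/sin table, so the following is only a minimal sketch of that general idea, not the code from this PR. The class name RotaryCache and its methods are hypothetical.

```python
import math


class RotaryCache:
    """Toy cos/sin table for rotary position embeddings (hypothetical, illustrative only)."""

    def __init__(self, dim, max_position_embeddings, base=10000.0):
        # One inverse frequency per pair of channels.
        self.inv_freq = [base ** (-(2 * i) / dim) for i in range(dim // 2)]
        self.cached_len = 0
        self.cos = []
        self.sin = []
        # Pre-fill the table up to the configured maximum.
        self._grow(max_position_embeddings)

    def _grow(self, seq_len):
        # Append rows for positions cached_len .. seq_len - 1.
        for pos in range(self.cached_len, seq_len):
            angles = [pos * freq for freq in self.inv_freq]
            self.cos.append([math.cos(a) for a in angles])
            self.sin.append([math.sin(a) for a in angles])
        self.cached_len = max(self.cached_len, seq_len)

    def get(self, seq_len):
        # Extend the table instead of failing when the request exceeds the cache.
        if seq_len > self.cached_len:
            self._grow(seq_len)
        return self.cos[:seq_len], self.sin[:seq_len]


cache = RotaryCache(dim=8, max_position_embeddings=16)
cos, sin = cache.get(20)  # longer than the initial table
print(len(cos), cache.cached_len)
```

The table grows on demand, so a request longer than the configured maximum extends the cache rather than indexing past its end.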

Before submitting

Out of memory for any 32k-token input.
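
A rough back-of-the-envelope estimate suggests why this can happen. The sketch assumes that attention without a fused kernel materializes the full score matrix for one layer, that the model has 32 attention heads (the Mistral-7B configuration), and that scores are stored in bf16 (2 bytes). The function name is illustrative, and the estimate ignores weights, KV cache, and activations.

```python
def attention_scores_bytes(batch_size, num_heads, seq_len, bytes_per_element=2):
    # Size of one layer's (seq_len x seq_len) score matrix across batch and heads.
    return batch_size * num_heads * seq_len * seq_len * bytes_per_element


naive = attention_scores_bytes(batch_size=4, num_heads=32, seq_len=32000)
print(f"{naive / 1e9:.1f} GB")  # 262.1 GB
```

Under these assumptions a single score matrix is far larger than the 94.62 GB the logs in this PR report as available, which is consistent with needing a kernel that avoids materializing it.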

Use Flash attention (with causal mask)

Only enabled when the flag is set. Performance is much better, but the generated tokens differ slightly (accuracy issue).

32000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.99861384701217 seconds
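
To compare runs like the ones in this PR, the statistics block can be parsed into numbers. This is a small sketch that assumes the "name = value unit" layout shown above. The parse_stats helper is hypothetical and not part of run_generation.py.

```python
import re

LOG = """\
Throughput (including tokenization) = 55.87124162386965 tokens/second
Number of HPU graphs = 17
Memory allocated = 86.88 GB
Max memory allocated = 86.9 GB
Total memory available = 94.62 GB
Graph compilation duration = 112.99861384701217 seconds
"""


def parse_stats(text):
    # Map each "name = number unit" line to {name: number}.
    stats = {}
    for line in text.splitlines():
        match = re.match(r"\s*(.+?)\s*=\s*([0-9.]+)", line)
        if match:
            stats[match.group(1)] = float(match.group(2))
    return stats


stats = parse_stats(LOG)
headroom = stats["Total memory available"] - stats["Max memory allocated"]
print(f"Memory headroom: {headroom:.2f} GB")
```

For the run above this leaves under 8 GB of headroom at peak, which helps explain why longer generations or larger batches are tight at 32k input tokens.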

input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabel',)

32000x700xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 700 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

Throughput (including tokenization) = 55.965171075990526 tokens/second
Number of HPU graphs = 17
Memory allocated = 87.35 GB
Max memory allocated = 87.37 GB
Total memory available = 94.62 GB
Graph compilation duration = 152.95444813597715 seconds

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term that can refer to any type of data that is used to represent information. In the context of machine learning, data refers to the input that is used to train a model or make predictions.\n\nIn the context of deep learning, data refers to the input that is used to train a neural network. This can be images, audio, text, or any other type of data that can be represented as numbers. The data is typically preprocessed and formatted into batches, which are then fed into the neural network during training.\n\nThe quality and quantity of the data used for training a deep learning model is a critical factor in determining the model's performance. High-quality data that is representative of the problem domain and free from noise and errors is essential for achieving accurate and reliable results. Additionally, having a large amount of data can help the model learn more complex patterns and relationships, leading to better performance.\n\nThere are various ways to obtain data for deep learning, including collecting it from the real world, generating it synthetically, or using publicly available datasets. In some cases, collecting and labeling data can be a time-consuming and expensive process, making it important to consider alternative sources or methods for obtaining data.\n\nOverall, data is a crucial component of deep learning, and the quality and quantity of the data used for training can have a significant impact on the model's performance. It is essential to carefully consider the data collection and preprocessing steps to ensure that the model is trained on high-quality data that accurately represents the problem domain. 2. What is the difference between supervised and unsupervised learning in deep learning?\n\nIn machine learning, supervised learning and unsupervised learning are two broad categories of learning algorithms based on the type of data used for training. 
In deep learning, these same concepts apply.\n\nSupervised learning refers to a type of machine learning where the model is trained on labeled data. Labeled data means that each input example comes with a corresponding output label that the model is trying to learn. For example, in image classification, each image is labeled with a class label such as "cat" or "dog." The model learns to map inputs to outputs based on the labeled examples it is given during training.\n\nUnsupervised learning, on the other hand, refers to a type of machine learning where the model is trained on unlabeled data. Unlabeled data means that there is no corresponding output label for each input example. Instead, the model learns to find patterns and relationships in the data on its own. For example, in clustering, the model groups similar data points together based on their features.\n\nIn deep learning, supervised learning is typically used for tasks such as image classification, speech recognition, and natural language processing, where the input data comes with labeled output labels. Unsupervised learning is typically used for tasks such as anomaly detection, dimensionality reduction, and clustering, where the input data does not come with labeled output labels.\n\nOne key difference between supervised and unsupervised learning in deep learning is the type of neural network architecture used. For supervised learning, fully connected neural networks are commonly used, while for unsupervised learning, recurrent neural networks (RNNs) and autoencoders are more commonly',)

Use Flash attention (without causal mask)

Better accuracy, but slower than the run with the causal mask.

32000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)
Throughput (including tokenization) = 37.6014126683573 tokens/second
Number of HPU graphs = 15
Memory allocated = 47.62 GB
Max memory allocated = 87.27 GB
Total memory available = 94.62 GB
Graph compilation duration = 172.61339238239452 seconds

16000x512xbs4

python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 8 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Throughput (including tokenization) = 117.49020642629243 tokens/second
Number of HPU graphs = 15
Memory allocated = 40.65 GB
Max memory allocated = 77.01 GB
Total memory available = 94.62 GB
Graph compilation duration = 113.2534202276729 seconds

Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework for distributed training and inference, developed by Meta. It is designed to be efficient on large-scale systems, and supports both PyTorch and TensorFlow models. DeepSpeed includes several components:\n\n1. Model Parallelism: DeepSpeed allows you to split your model across multiple GPUs or TPUs, which can help reduce memory usage and improve training speed.\n2. Data Parallelism: DeepSpeed supports standard data parallelism, where the model is replicated across multiple GPUs or TPUs and each replica processes a different batch of data.\n3. Pipeline Parallelism: DeepSpeed also supports pipeline parallelism, where the forward and backward passes are overlapped to improve utilization of GPUs or TPUs.\n4. Gradient Accumulation: DeepSpeed supports gradient accumulation, which allows you to accumulate gradients over multiple batches before performing a weight update. This can help reduce the number of communication rounds between GPUs or TPUs.\n5. Mixed Precision Training: DeepSpeed supports mixed precision training, which uses lower-precision data types (such as FP16) during training to reduce memory usage and improve training speed.\n6. Model Pruning: DeepSpeed includes tools for model pruning, which can help reduce the size of your model and improve inference speed.\n7. Distributed Inference: DeepSpeed supports distributed inference, which allows you to run your model on multiple GPUs or TPUs to improve inference speed.\n\nDeepSpeed is an open-source project, and you can find more information and documentation on the DeepSpeed GitHub page: https://github.com/microsoft/DeepSpeed.\n\nDeepSpeed is a machine learning framework developed by Meta for distributed training and inference. It supports both PyTorch and TensorFlow models and includes several components for efficient large-scale training:\n\n1. 
Model Parallelism: Splits the model across multiple GPUs or TPUs to reduce memory usage and improve training speed.\n2. Data Parallelism: Replicates the model across multiple GPUs or TPUs and processes different batches of data.\n3. Pipeline Parallelism: Overlaps forward and backward passes to improve GPU or TPU utilization.\n4. Gradient Accumulation: Accumulates gradients over multiple batches to reduce communication rounds.\n',)

16000x512xbs4 + QUANT

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 24 --max_new_tokens 512 --max_input_tokens 16000 --limit_hpu_graphs --use_flash_attention --flash_attention_recompute

Throughput (including tokenization) = 121.95398079993026 tokens/second
Number of HPU graphs = 81
Memory allocated = 27.27 GB
Max memory allocated = 63.62 GB
Total memory available = 94.62 GB
Graph compilation duration = 194.07247889088467 seconds

@jiminha jiminha requested a review from regisss as a code owner April 30, 2024 03:53
@jiminha jiminha changed the title from "Support Mistral 13K" to "Support Mistral 32K input token" Apr 30, 2024
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@jiminha jiminha requested review from libinta and mandy-li April 30, 2024 16:24
@libinta libinta added the run-test Run CI for PRs from external contributors label May 3, 2024
regisss (Collaborator) commented May 4, 2024

There is a lot of overlap with #918, let's wait for it to be merged.

regisss (Collaborator) commented May 8, 2024

@jiminha Can you rebase on main and fix the merge conflicts?

jiminha (Contributor Author) commented May 9, 2024

Done. Please review.

regisss (Collaborator) left a comment

Is this compatible only with 1.16?
When running

QUANT_CONFIG=./quantization_config/maxabs_quant.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --limit_hpu_graphs

I get the following error:

Traceback (most recent call last):
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 643, in <module>
    main()
  File "/root/workspace/optimum-habana/examples/text-generation/run_generation.py", line 289, in main
    model, tokenizer, generation_config = initialize_model(args, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 407, in initialize_model
    setup_model(args, model_dtype, model_kwargs, logger)
  File "/root/workspace/optimum-habana/examples/text-generation/utils.py", line 168, in setup_model
    habana_quantization_toolkit.prep_model(model)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/prepare_quant/prepare_model.py", line 14, in prep_model
    prepare_model(model)  # registers hooks
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/__init__.py", line 47, in prepare_model
    return quantize_hooks(model, mod_list)
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/quantize.py", line 69, in quantize_hooks
    measurement=load_measurements(config.cfg['measure_file'])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/measure.py", line 129, in load_measurements
    d = load_file(fname_np, np.ndarray, fail_on_file_not_exist=config['scale_method'] not in [ScaleMethod.WITHOUT_SCALE, ScaleMethod.UNIT_SCALE])
  File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_hook_method/common.py", line 108, in load_file
    raise FileNotFoundError(f"Failed to load file {fname}")
FileNotFoundError: Failed to load file ./hqt_output/measure_hooks_maxabs.npz

Comment thread optimum/habana/transformers/models/mistral/modeling_mistral.py Outdated
reuse_cache: Optional[bool] = False,
cache_idx: Optional[int] = None,
attn_softmax_bf16: Optional[bool] = False,
use_fused_rope: Optional[bool] = True,
Collaborator

Good catch! I missed all these use_fused_rope

jiminha (Contributor Author) commented May 9, 2024

Have you run the measurement step first?
QUANT_CONFIG=./quantization_config/maxabs_measure_include_outputs.json python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 1

Actually, let me double-check whether docker15 supports FusedSDPA in fp8.
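
The FileNotFoundError reported in the review points at ./hqt_output/measure_hooks_maxabs.npz, which suggests the quantized run expects a file produced by an earlier measurement run. A pre-flight check could look like the sketch below. The helper name is hypothetical, and the default path is copied from the traceback rather than from any documented setting.

```python
from pathlib import Path


def measurement_ready(measure_file="./hqt_output/measure_hooks_maxabs.npz"):
    # True only if the measurement file named in the traceback exists.
    return Path(measure_file).is_file()


if not measurement_ready():
    print("Measurement file not found: run the measurement command before the quantized run.")
```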

jiminha (Contributor Author) commented May 9, 2024

@regisss
I forgot to remove some outdated instructions. I removed the FusedSDPA-only option from my code and kept only the flash_attention option (which shows the best performance).

Try this command:
python run_generation.py --model_name_or_path mistralai/Mistral-7B-Instruct-v0.2 --attn_softmax_bf16 --use_hpu_graphs --trim_logits --use_kv_cache --reuse_cache --bf16 --batch_size 4 --max_new_tokens 512 --max_input_tokens 32000 --use_flash_attention --flash_attention_recompute --flash_attention_causal_mask

regisss (Collaborator) left a comment

LGTM!

The generations with causal mask look a bit off to me at the beginning (right after the input):

input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning frameworkе оn the other hand, is a more general term

input 2: ('He is working on',)
output 1: ('He is working onе of the most popular and widely used programming languages.

input 3: ('He has a',)
output 1: ("He has aеоlоgу, аnd аndrоid аnd iоs аpps.\n\n## What is the difference between a VPN and a proxy server?

input 4: ('He got all',)
output 1: ("He got allеу оff the ground.\n\nThe 2018 season was a turning point for the team.

I don't think it is related to this PR though.
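
The stray characters right after each prompt appear to be non-Latin look-alikes of Latin letters (as rendered here they look Cyrillic, which is an assumption about the pasted text). A small sketch for flagging them in generated output, using only the standard library and a hypothetical helper name:

```python
import unicodedata


def non_latin_letters(text):
    # Return (index, character, Unicode name) for every letter that is not Latin.
    hits = []
    for index, ch in enumerate(text):
        if ch.isalpha() and not unicodedata.name(ch, "").startswith("LATIN"):
            hits.append((index, ch, unicodedata.name(ch, "UNKNOWN")))
    return hits


# "\u0435" is CYRILLIC SMALL LETTER IE, which renders like a Latin "e".
sample = "He is working on\u0435 of the most popular"
for index, ch, name in non_latin_letters(sample):
    print(index, repr(ch), name)
```

Running this on the first few generated tokens of each output makes it easy to tell whether a given configuration reproduces the problem.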

@regisss regisss merged commit 2efe099 into main May 10, 2024
@regisss regisss deleted the jha/mistral30k branch May 10, 2024 09:54
jiminha (Contributor Author) commented May 10, 2024

Yes, it's a known issue. We expect a fix in the 1.16 release.

Labels

run-test Run CI for PRs from external contributors synapse1.16

5 participants