Support mixtral long sequence 32k with bs 4 #903

Merged
regisss merged 14 commits into huggingface:main from jychen21:support-long-sequence-with-bs4
May 10, 2024

@jychen21

What does this PR do?

Support 32k-token input sequences at batch size 4 by moving the Q (query) slicing inside the loop to save memory.
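The memory argument can be illustrated with a minimal, pure-Python sketch. This is not the code from this PR: the function names, the toy sizes, and the `block_size` parameter are assumptions chosen for illustration. It computes causal attention once over all query rows and once with a block of Q sliced inside the loop, and it records the largest score matrix that exists at any one time.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def attend(q_rows, keys, values, row_offset):
    """Causal attention for q_rows, whose first row sits at absolute
    position row_offset. Returns (outputs, number of score entries)."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    # Score matrix for this group of query rows only: len(q_rows) x len(keys).
    scores = [[sum(a * b for a, b in zip(q, k)) * scale for k in keys]
              for q in q_rows]
    outputs = []
    for i, row in enumerate(scores):
        visible = row_offset + i + 1  # causal mask: positions <= own position
        probs = softmax(row[:visible])
        outputs.append([sum(p * v[d] for p, v in zip(probs, values))
                        for d in range(len(values[0]))])
    return outputs, len(scores) * len(keys)


def full_attention(queries, keys, values):
    """All query rows at once: the score matrix is seq_len x seq_len."""
    return attend(queries, keys, values, 0)


def blockwise_attention(queries, keys, values, block_size):
    """Slice Q inside the loop so only one block of scores exists at a time."""
    outputs, peak_scores = [], 0
    for start in range(0, len(queries), block_size):
        q_block = queries[start:start + block_size]  # Q sliced per iteration
        block_out, n_scores = attend(q_block, keys, values, start)
        outputs.extend(block_out)
        peak_scores = max(peak_scores, n_scores)
    return outputs, peak_scores


def toy_matrix(rows, cols, seed):
    """Deterministic toy data (hypothetical stand-in for real activations)."""
    return [[math.sin(seed + r * cols + c) for c in range(cols)]
            for r in range(rows)]


if __name__ == "__main__":
    q, k, v = (toy_matrix(8, 4, s) for s in (0.1, 0.2, 0.3))
    _, full_peak = full_attention(q, k, v)
    _, block_peak = blockwise_attention(q, k, v, block_size=2)
    print("score entries held at once:", full_peak, "vs", block_peak)
```

Both paths produce the same outputs. In this sketch the blockwise version holds `block_size * seq_len` score entries at a time instead of `seq_len * seq_len`, which is the kind of saving that matters at 32k tokens.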

Before
Out of memory (OOM)

After
Basic Command (with --limit_hpu_graphs, --reuse_cache, --bucket_internal, --batch_size 4)

QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py --model_name_or_path mistralai/Mixtral-8x7B-v0.1 --use_hpu_graphs --limit_hpu_graphs --use_kv_cache --reuse_cache --bucket_internal --bucket_size ${bucket_size} --max_new_tokens ${max_new_tokens} --bf16 --fp8 --batch_size 4 --max_input_tokens 32000

Test case 1: --bucket_size 128 --max_new_tokens 128
input 4: ('He got all',)
output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or',)

Stats:
Throughput (including tokenization) = 8.158104216323297 tokens/second
Number of HPU graphs = 337
Memory allocated = 69.49 GB
Max memory allocated = 94.41 GB
Total memory available = 94.62 GB
Graph compilation duration = 464.618659114989 seconds

Test case 2: --bucket_size 256 --max_new_tokens 512
input 4: ('He got all',)
output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or five times, the European Championship once, and the European Golden Shoe four times.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 85 goals in 154 appearances for Portugal. He’s scored 450 goals in 438 appearances for Real Madrid.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying.',)

Stats:
Throughput (including tokenization) = 27.221884640574576 tokens/second
Number of HPU graphs = 369
Memory allocated = 69.68 GB
Max memory allocated = 94.58 GB
Total memory available = 94.62 GB
Graph compilation duration = 547.4480734140379 seconds

Test case 3: --bucket_size 256 --max_new_tokens 700
input 4: ('He got all',)
output 1: ('He got all the way to the top of the mountain, but he couldn’t get over the last hurdle.\n\nThe 2018 World Cup is over for Cristiano Ronaldo.\n\nThe Portugal star was unable to help his team advance to the quarterfinals, as Uruguay defeated Portugal 2-1 in the round of 16 on Saturday.\n\nRonaldo, 33, has never won a World Cup. He’s never even made it to the semifinals.\n\nRonaldo has won the Champions League five times, the Ballon d’Or five times, the European Championship once, and the European Golden Shoe four times.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 85 goals in 154 appearances for Portugal. He’s scored 450 goals in 438 appearances for Real Madrid.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. 
He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.\n\nBut he’s never won a World Cup.\n\nRonaldo has scored 15 goals in 17 appearances for Portugal in World Cup qualifying. He’s scored 15 goals in 14 appearances for Portugal in the World Cup.',)

Stats:
Throughput (including tokenization) = 34.39802785860282 tokens/second
Number of HPU graphs = 401
Memory allocated = 69.78 GB
Max memory allocated = 94.6 GB
Total memory available = 94.62 GB
Graph compilation duration = 670.873160321964 seconds
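To make the memory margin in the three Stats blocks above easier to compare, here is a small sketch that parses the `Name = value unit` lines and subtracts peak usage from the total. The helper names are hypothetical; the numbers are copied from the stats reported above.

```python
def parse_stats(text):
    """Parse 'Name = value [unit]' lines into a dict of floats."""
    stats = {}
    for line in text.strip().splitlines():
        name, _, value = line.partition(" = ")
        stats[name.strip()] = float(value.split()[0])
    return stats


def headroom_gb(stats):
    """Memory left at peak usage, rounded to the precision of the report."""
    return round(stats["Total memory available"] - stats["Max memory allocated"], 2)


# Memory lines copied from the three test cases above.
RUNS = {
    "bucket_size 128, max_new_tokens 128": """
Max memory allocated = 94.41 GB
Total memory available = 94.62 GB
""",
    "bucket_size 256, max_new_tokens 512": """
Max memory allocated = 94.58 GB
Total memory available = 94.62 GB
""",
    "bucket_size 256, max_new_tokens 700": """
Max memory allocated = 94.6 GB
Total memory available = 94.62 GB
""",
}

if __name__ == "__main__":
    for label, text in RUNS.items():
        print(f"{label}: {headroom_gb(parse_stats(text))} GB free at peak")
```

By this arithmetic, all three runs peak within roughly 0.2 GB of the available memory, so the 32k, batch size 4 configuration fits but leaves very little margin.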

Before submitting

  • This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • Did you make sure to update the documentation with your changes?
  • Did you write any new necessary tests?

@jychen21
Author

This PR is one of the smaller pieces split out of PR #836. It is based on PR #901.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mandy-li self-requested a review April 29, 2024 19:01
@mandy-li
Collaborator

@jychen-habana, please test rope_scaling with Mixtral and update the results here.

Comment thread optimum/habana/transformers/models/mixtral/modeling_mixtral.py
@jychen21
Author

jychen21 commented May 8, 2024

@jychen-habana, please test rope_scaling with Mixtral and update the results here.

Run with rope_scaling (add the following to config.json):
"rope_scaling": {"type":"linear","factor":2.0},
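For anyone reproducing this, the config.json change can also be scripted. The sketch below is only an illustration: the helper name `set_rope_scaling` and the stand-in config file are assumptions, and the real config.json is whichever one sits in the local model directory.

```python
import json
import tempfile
from pathlib import Path


def set_rope_scaling(config_path, scaling_type, factor):
    """Add or overwrite the rope_scaling entry in a model config.json."""
    path = Path(config_path)
    config = json.loads(path.read_text())
    config["rope_scaling"] = {"type": scaling_type, "factor": factor}
    path.write_text(json.dumps(config, indent=2) + "\n")
    return config


if __name__ == "__main__":
    # Stand-in config for demonstration; point config_path at the real
    # model's config.json instead (location depends on the local setup).
    with tempfile.TemporaryDirectory() as tmp:
        demo = Path(tmp) / "config.json"
        demo.write_text(json.dumps({"model_type": "mixtral"}))
        updated = set_rope_scaling(demo, "linear", 2.0)
        print(json.dumps(updated["rope_scaling"]))
```

Passing `"dynamic"` instead of `"linear"` as `scaling_type` gives the second configuration tested in this comment.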

Test case: --max_input_tokens 32000 --bucket_size 1024 --max_new_tokens 512 --batch_size 1
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that is developed by Microsoft to train large models with billions of parameters. It is a library that is built on top of PyTorcharm and is designed to train large models with billions of parameters. It is a library is built on top of PyTorcharm and is designed to train large models with billions of parameters. It is library is built on top Pycharm and designed to train large models with billions of parameters. It is library is built on Pycharm designed to train large models billions of parameters. It is library built on Pyarm to train large models billions parameters. It built Pyarm to train models billions. It arm train models. It train.\n\n\nDeepSpeed is a machine learning framework is developed by Microsoft to train large models with billions of parameters. It is a library is built on top PyTorarm is designed to train large models with billions parameters. It is library built on Pyarm is designed to train large models billions. It library is built arm to train billions. It is built to train.\n\n\nDeepSpeed is a machine framework is developed by Microsoft to train models billions. It is library is built Pyarm to train billions. It is built arm to train.\n\n\nDeep is machine framework developed Microsoft train billions. is library arm train.\n\n\nDeep is framework Microsoft billions.\n\n\nDeep is Microsoft\n\n\n...',)

Stats:
Throughput (including tokenization) = 18.225460417191343 tokens/second
Number of HPU graphs = 342
Memory allocated = 51.91 GB
Max memory allocated = 94.56 GB
Total memory available = 94.62 GB
Graph compilation duration = 314.0399896269664 seconds


Run with rope_scaling (add the following to config.json):
"rope_scaling": {"type":"dynamic","factor":2.0},

Test case: --max_input_tokens 32000 --bucket_size 1024 --max_new_tokens 512 --batch_size 1
Input/outputs:
input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What is DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## How does DeepSpeed work?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What are the benefits of using DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## How can I get started with DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks.\n\n## What are the limitations of DeepSpeed?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It is designed to be easy to use and efficient, and it supports a wide range of models and tasks. However, there are some limitations to DeepSpeed.\n\nFirst, DeepSpeed is only compatible with certain types of models. It does not support all types of models, so you may need to use another framework if you want to train a model that is not supported by DeepSpeed.\n\nSecond, DeepSpeed is only compatible with certain types of hardware. 
It requires 8 GPUs to work properly, so you will need to have access to 8 GPUs in order to use DeepSpeed.\n\nThird, DeepSpeed is only compatible with certain types of software. It requires the use of certain libraries in order to work properly, so you will need to have these libraries installed in order to use DeepSpeed.\n\n## How does DeepSpeed compare to other machine learning frameworks?\n\nDeepSpeed is a machine learning framework that enables training of large models on a single machine with 8 GPUs. It',)

Stats:
Throughput (including tokenization) = 18.225313735944003 tokens/second
Number of HPU graphs = 342
Memory allocated = 51.91 GB
Max memory allocated = 94.56 GB
Total memory available = 94.62 GB
Graph compilation duration = 310.87937180604786 seconds

@jychen21
Author

jychen21 commented May 8, 2024

@regisss @libinta @mandy-li, please help review and merge this PR. Thanks!

regisss added the run-test label (Run CI for PRs from external contributors) May 10, 2024
regisss merged commit 574702c into huggingface:main May 10, 2024

Labels

run-test (Run CI for PRs from external contributors), synapse1.16

5 participants