Update Mixtral-8x7B Optimization by jychen21 · Pull Request #836 · huggingface/optimum-habana

jychen21 · 2024-03-26T07:46:03Z

What does this PR do?

Update the Mixtral-8x7B optimizations:
- Add reuse_cache, FP8 KV cache, FP8 attention, bucket_internal ...
- Support long-sequence prompts
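As a rough illustration of why an FP8 KV cache matters for long prompts, the sketch below estimates the size of the KV cache alone. The layer count, KV head count, and head dimension are assumptions taken from the public Mixtral-8x7B config and are not stated in this PR; check them against the checkpoint you load. The helper name `kv_cache_bytes` is hypothetical.

```python
# Assumed Mixtral-8x7B attention shape (not stated in this PR; verify against
# the config.json of the checkpoint you actually load).
NUM_LAYERS = 32
NUM_KV_HEADS = 8
HEAD_DIM = 128


def kv_cache_bytes(batch_size: int, seq_len: int, bytes_per_element: int) -> int:
    """Bytes needed to hold keys and values for every layer."""
    # Factor 2 accounts for storing both K and V.
    elements_per_token = 2 * NUM_LAYERS * NUM_KV_HEADS * HEAD_DIM
    return batch_size * seq_len * elements_per_token * bytes_per_element


if __name__ == "__main__":
    # --max_input_tokens 32000 plus --max_new_tokens 100, with --batch_size 2
    seq_len = 32000 + 100
    for name, width in (("bf16", 2), ("fp8", 1)):
        gib = kv_cache_bytes(2, seq_len, width) / 2**30
        print(f"{name}: {gib:.2f} GiB")
```

Under these assumptions the cache for the long-sequence run is roughly 7.8 GiB in bf16 and half that in fp8. The estimate covers only the cache, not weights or activations.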

QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py \
--model_name_or_path mistralai/Mixtral-8x7B-v0.1  \
--use_hpu_graphs \
--use_kv_cache \
--limit_hpu_graphs \
--reuse_cache \
--bucket_size 128 \
--bucket_internal \
--max_new_tokens 100 \
--bf16 \
--batch_size 1

QUANT_CONFIG=./quantization_config/maxabs_quant_mixtral.json python run_generation.py \
--model_name_or_path mistralai/Mixtral-8x7B-v0.1  \
--use_hpu_graphs \
--use_kv_cache \
--limit_hpu_graphs \
--reuse_cache \
--bucket_internal \
--bucket_size 128 \
--max_new_tokens 100 \
--bf16 \
--fp8 \
--batch_size 2 \
--warmup 1 \
--n_iterations 1 \
--max_input_tokens 32000
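The two commands above form a measure-then-quantize pair that differ only in `QUANT_CONFIG` and a few trailing flags. A minimal sketch of assembling both from one shared flag list follows; it only builds the command strings and does not run anything. The helper names `build_command` and `as_shell` are hypothetical, and every flag is copied from the commands above.

```python
import shlex

# Flags shared by both runs, copied from the commands in this PR description.
COMMON_FLAGS = [
    "--model_name_or_path", "mistralai/Mixtral-8x7B-v0.1",
    "--use_hpu_graphs", "--use_kv_cache", "--limit_hpu_graphs",
    "--reuse_cache", "--bucket_size", "128", "--bucket_internal",
    "--max_new_tokens", "100", "--bf16",
]


def build_command(quant_config, extra_flags):
    """Return (env, argv) for one run_generation.py invocation."""
    env = {"QUANT_CONFIG": quant_config}
    argv = ["python", "run_generation.py", *COMMON_FLAGS, *extra_flags]
    return env, argv


def as_shell(env, argv):
    """Render (env, argv) as a single shell line for copy and paste."""
    prefix = " ".join(f"{key}={shlex.quote(value)}" for key, value in env.items())
    return f"{prefix} {shlex.join(argv)}"


# Step 1: measurement run in bf16.
MEASURE = build_command(
    "./quantization_config/maxabs_measure.json",
    ["--batch_size", "1"],
)

# Step 2: FP8 run with the long prompt.
QUANTIZE = build_command(
    "./quantization_config/maxabs_quant_mixtral.json",
    ["--fp8", "--batch_size", "2", "--warmup", "1",
     "--n_iterations", "1", "--max_input_tokens", "32000"],
)

if __name__ == "__main__":
    print(as_shell(*MEASURE))
    print(as_shell(*QUANTIZE))
```

Keeping the shared flags in one place makes it harder for the measurement and quantization runs to drift apart, for example by dropping `--reuse_cache` from only one of them.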

HuggingFaceDocBuilderDev · 2024-03-26T07:49:37Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

mandy-li

@jychen-habana, as we synced offline:

kv_cache_fp8 is the previous way to support fp8 inference and will be removed soon. FP8 inference for all models should use HQT.
Your current code in this PR causes a regression in HQT measurement.

mandy-li · 2024-03-29T16:20:51Z

@schoi-habana , please provide details of how you optimized Falcon-180b fp8 for Jinyan to follow to add to this model. thanks

schoi-habana · 2024-04-05T23:28:20Z

I tested this PR with run_generation.py in 1.16.0 docker. It could fit 30k input tokens but the generated output was empty. Did you check the output?

input 1: ('DeepSpeed is a machine learning framework',)
output 1: ('DeepSpeed is a machine learning framework',)
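The symptom above, where the output only echoes the prompt, can be checked mechanically. The sketch below is a hypothetical helper, not part of run_generation.py, and assumes the prompt and the decoded output are available as plain strings.

```python
def new_text(prompt: str, output: str) -> str:
    """Return the part of `output` that follows the echoed prompt."""
    if output.startswith(prompt):
        return output[len(prompt):]
    # Output does not echo the prompt, so treat all of it as generated text.
    return output


def is_empty_generation(prompt: str, output: str) -> bool:
    """True when nothing but whitespace was generated after the prompt."""
    return new_text(prompt, output).strip() == ""


if __name__ == "__main__":
    prompt = "DeepSpeed is a machine learning framework"
    print(is_empty_generation(prompt, prompt))
    print(is_empty_generation(prompt, prompt + " for training large models"))
```

A check like this could run after each configuration change so that a run which fits in memory but generates nothing is caught early.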

schoi-habana · 2024-04-08T06:43:49Z

@jychen-habana after you implement ScopedLinearAllreduce, please see if in-place addition in this PR HabanaAI#65 helps this model

jychen21 · 2024-04-09T02:45:45Z

> I tested this PR with run_generation.py in 1.16.0 docker. It could fit 30k input tokens but the generated output was empty. Did you check the output?
>
> input 1: ('DeepSpeed is a machine learning framework',)
> output 1: ('DeepSpeed is a machine learning framework',)

In my 1.15 setup environment, I didn't see this issue.

jychen21 · 2024-04-09T02:47:29Z

> @jychen-habana , as we sync off-line:
>
> kv_cache_fp8 is the previous way to support fp8 inference which will be removed soon. All the models fp8 inference should use HQT.
>
> Your current code in this PR causes regression for HQT measurement.

Fixed.

jychen21 · 2024-04-09T02:53:41Z

> @jychen-habana after you implement ScopedLinearAllreduce, please see if in-place addition in this PR HabanaAI#65 helps this model

Sure.

mandy-li · 2024-04-16T15:23:10Z

@jychen-habana , please post the performance measurements with/without this PR here.

mandy-li · 2024-04-16T17:46:11Z

@jychen-habana, please rebase onto the latest code in the OH main branch.

mandy-li · 2024-04-16T20:52:19Z

@jychen-habana, this PR doesn't work with the Synapse 1.15 release docker when running measurement.

QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path /mnt/weka/data/mixtral/models--mistralai--Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83/ --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 128 --batch_size 1 --bf16

Error:

File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 787, in forward
outputs = self.model(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 692, in forward
layer_outputs = decoder_layer(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 518, in forward
hidden_states, self_attn_weights, present_key_value = self.self_attn(
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
return self._call_impl(*args, **kwargs)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
result = forward_call(*args, **kwargs)
File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 356, in forward
key_states = self.k_cache.update(past_key_value[0], key_states, 2, token_idx, self.inp_seq_len)
File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_quant_common/helper_modules.py", line 264, in update
qinput = self.quant_input_0(cur)
File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1691, in __getattr__
raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
AttributeError: 'PatchedKVCache' object has no attribute 'quant_input_0'
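The last two frames show the generic mechanism behind this error: `torch.nn.Module.__getattr__` is only reached when normal attribute lookup fails, and it raises when the name is not among the module's registered parameters, buffers, or submodules. The sketch below imitates that behavior in plain Python with toy classes. `TinyModule` and `ToyPatchedKVCache` are hypothetical stand-ins, not the HQT classes, and the traceback alone does not show why `quant_input_0` was never registered.

```python
class TinyModule:
    """Toy stand-in that imitates how nn.Module resolves submodules."""

    def __init__(self):
        # Bypass attribute hooks so the registry lives in the instance dict.
        object.__setattr__(self, "_modules", {})

    def register(self, name, module):
        self._modules[name] = module

    def __getattr__(self, name):
        # Called only when normal lookup fails, as in nn.Module.
        modules = self.__dict__.get("_modules", {})
        if name in modules:
            return modules[name]
        raise AttributeError(
            f"'{type(self).__name__}' object has no attribute '{name}'"
        )


class ToyPatchedKVCache(TinyModule):
    def update(self, cur):
        # Mirrors the failing line: the quantizer must already be registered.
        return self.quant_input_0(cur)


if __name__ == "__main__":
    cache = ToyPatchedKVCache()
    try:
        cache.update(1.0)
    except AttributeError as err:
        print(err)
    # Once the submodule is registered, the same call succeeds.
    cache.register("quant_input_0", lambda x: x)
    print(cache.update(1.0))
```

In this toy, the call fails until something registers `quant_input_0`, which matches the shape of the error reported above.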

jychen21 · 2024-04-18T05:22:05Z

Do not merge! I will break this PR into smaller pieces: #898, #901, #903.

jychen21 · 2024-04-18T05:41:25Z

> @jychen-habana, this PR doesn't work with the Synapse 1.15 release docker when running measurement.
>
> QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path /mnt/weka/data/mixtral/models--mistralai--Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83/ --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 128 --batch_size 1 --bf16
>
> Error:
>
> File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 787, in forward
> outputs = self.model(
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
> return self._call_impl(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
> result = forward_call(*args, **kwargs)
> File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 692, in forward
> layer_outputs = decoder_layer(
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
> return self._call_impl(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
> result = forward_call(*args, **kwargs)
> File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 518, in forward
> hidden_states, self_attn_weights, present_key_value = self.self_attn(
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1514, in _wrapped_call_impl
> return self._call_impl(*args, **kwargs)
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1564, in _call_impl
> result = forward_call(*args, **kwargs)
> File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 356, in forward
> key_states = self.k_cache.update(past_key_value[0], key_states, 2, token_idx, self.inp_seq_len)
> File "/usr/local/lib/python3.10/dist-packages/habana_quantization_toolkit/_quant_common/helper_modules.py", line 264, in update
> qinput = self.quant_input_0(cur)
> File "/usr/local/lib/python3.10/dist-packages/torch/nn/modules/module.py", line 1691, in __getattr__
> raise AttributeError(f"'{type(self).__name__}' object has no attribute '{name}'")
> AttributeError: 'PatchedKVCache' object has no attribute 'quant_input_0'

Please add --reuse_cache when measuring with bf16. From my understanding, the KV cache needs to be an nn.Module before it can be measured.

For quantization mode, it's fine to just remove --reuse_cache.

If there is another solution, please let me know.

add reuse_cache support

600c735

jychen21 requested review from bhargaveede, regisss, ssarkar2 and vivekgoe as code owners March 26, 2024 07:46

Jinyan chen added 5 commits March 26, 2024 15:54

make style

7690614

make style

67535f0

remove debug code

838b55f

add fp8 support of non-sdpa attn

458375a

add bucket_internal support of Mixtral

ac3f004

mandy-li requested review from mandy-li and schoi-habana March 29, 2024 16:17

mandy-li requested changes Mar 29, 2024


schoi-habana requested changes Mar 29, 2024


Jinyan chen added 3 commits April 1, 2024 10:51

pick changes

d7e6877

fit to r1.15 and also fp8 sdpa

6816af0

make style

de9d07f

Jinyan Chen added 4 commits April 9, 2024 03:07

support long sequence prompt

4fa8741

Merge branch 'huggingface:main' into update-mixtral-optimizations

25356a4

make style

0a3b7ac

make style

cb74e20

schoi-habana reviewed Apr 9, 2024


update long seq support

ef96dc1

Jinyan Chen added 9 commits April 10, 2024 09:29

fix bug for w/o reuse_kvcache

09fcfed

make style

5fb5354

fix accuracy of NaiveFA

8d41302

update thresh of long seq

416720d

update bucket size for naive fa

6cdedcf

make style

745bb4b

tune bucket size

3e0ca19

update naive fa

9b0bad6

make style

e9fba18

ZhaiFeiyue added the run-test Run CI for PRs from external contributors label Apr 15, 2024

schoi-habana reviewed Apr 15, 2024


This was referenced Apr 17, 2024

Support mixtral kvcache reuse and remove kv_cache_fp8 #898

Merged

Support mixtral long sequence 32k with bs 2 #901

Closed

jychen21 mentioned this pull request Apr 18, 2024

Support mixtral long sequence 32k with bs 4 #903

Merged


libinta removed the run-test Run CI for PRs from external contributors label Apr 23, 2024

jychen21 closed this May 7, 2024
