Update Mixtral-8x7B Optimization #836
mandy-li
left a comment
@jychen-habana , as we synced offline:
- kv_cache_fp8 is the previous way of supporting FP8 inference and will be removed soon. FP8 inference for all models should use HQT.
- The current code in this PR causes a regression in HQT measurement.
|
@schoi-habana , please share the details of how you optimized Falcon-180b FP8 so that Jinyan can follow the same approach for this model. Thanks. |
|
I tested this PR with run_generation.py in the 1.16.0 docker. It could fit 30k input tokens, but the generated output was empty. Did you check the output? input 1: ('DeepSpeed is a machine learning framework',) |
|
@jychen-habana , after you implement ScopedLinearAllreduce, please check whether the in-place addition in PR HabanaAI#65 helps this model.
In the 1.15 setup environment, I didn't hit this issue. |
fixed. |
Sure. |
|
@jychen-habana , please post the performance measurements with/without this PR here. |
|
@jychen-habana , please rebase onto the latest code in the OH main branch. |
|
@jychen-habana , this PR doesn't work with the Synapse 1.15 release docker during measurement.
QUANT_CONFIG=./quantization_config/maxabs_measure.json python run_generation.py --model_name_or_path /mnt/weka/data/mixtral/models--mistralai--Mixtral-8x7B-Instruct-v0.1/snapshots/1e637f2d7cb0a9d6fb1922f305cb784995190a83/ --use_hpu_graphs --use_kv_cache --limit_hpu_graphs --bucket_size 128 --max_new_tokens 128 --batch_size 1 --bf16
Error: File "/home/jwang/test/optimum-habana-jychen/optimum/habana/transformers/models/mixtral/modeling_mixtral.py", line 787, in forward |
Please add --reuse_cache when measuring with bf16. From my understanding, the KV cache needs to be an 'nn.Module' so that it can be measured. For quantization mode, it's fine to just remove --reuse_cache. If there is another solution, please let me know. |
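To illustrate the point above about the KV cache needing to be an 'nn.Module': a minimal sketch, assuming the measurement step attaches observers to submodules in roughly the way PyTorch forward hooks do. KVCache, Attention, make_hook and the stats dict are hypothetical names for illustration only; this is not the optimum-habana or HQT implementation.

```python
import torch
from torch import nn


class KVCache(nn.Module):
    """Hypothetical cache wrapper (illustrative, not the optimum-habana class)."""

    def __init__(self):
        super().__init__()
        self.cache = None

    def allocate(self, shape, dtype=torch.float32):
        # Pre-allocate the full cache once so later steps can reuse it.
        self.cache = torch.zeros(shape, dtype=dtype)

    def forward(self, cur, start):
        # Write new keys/values in place along the sequence dimension (dim 2).
        self.cache[:, :, start:start + cur.shape[2]] = cur
        return self.cache


class Attention(nn.Module):
    """Hypothetical attention block that owns its caches as submodules."""

    def __init__(self):
        super().__init__()
        self.k_cache = KVCache()
        self.v_cache = KVCache()


# Assumed stand-in for a max-abs measurement pass: one hook per cache module.
stats = {}


def make_hook(name):
    def hook(module, inputs, output):
        stats[name] = max(stats.get(name, 0.0), output.abs().max().item())
    return hook


attn = Attention()
cache_names = []
for name, module in attn.named_modules():
    # Only objects registered as nn.Module show up here; a plain tensor
    # attribute would be invisible to this kind of traversal.
    if isinstance(module, KVCache):
        cache_names.append(name)
        module.register_forward_hook(make_hook(name))

attn.k_cache.allocate((1, 2, 8, 4))
attn.k_cache(torch.full((1, 2, 3, 4), 0.5), 0)
```

Because the cache is a submodule, named_modules() finds it and the hook runs on every cache update. If the cache were stored as a bare tensor, there would be nothing for such a hook to attach to.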
What does this PR do?
Update Mixtral-8x7B Optimization:
reuse_cache / enable FP8 KV Cache / FP8 Attn / bucket_internal ...
Support long-sequence prompts