On a system with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, bf16 inference with LLaMA-2-7B is no faster than float32. However, inference with sym_int8 weights is faster than float32. Why is bf16 inference slower?
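For reference, the comparison I am making is between the low-bit formats selected via `optimize_model`. A minimal sketch, assuming the ipex-llm / bigdl-llm `optimize_model` API and a placeholder model path:

```python
# Minimal sketch (assumes ipex-llm's optimize_model API; the model path is a placeholder)
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model  # older releases: from bigdl.llm import optimize_model

model_path = "meta-llama/Llama-2-7b-hf"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)

# Pick the low-bit format under test: 'sym_int8' (weight-only int8) or 'bf16'
model = optimize_model(model, low_bit="bf16")

inputs = tokenizer("Hello, my name is", return_tensors="pt")
with torch.inference_mode():
    out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```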
I also followed this benchmark_util to test on an Intel(R) Xeon(R) Platinum 8468, with the config below:
```yaml
# config.yaml
low_bit: 'bf16'  # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
  - '1024-128'
test_api:
  - "optimize_model"          # on Intel CPU, to test low_bit like 'sym_int4', 'sym_int8' or 'bf16'
  - "pytorch_autocast_bf16"   # on Intel CPU, to test 'fp32'
```

```bash
bash run-spr.sh
```
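To cross-check the benchmark_util numbers, a standalone timing comparison of plain fp32 versus CPU bf16 autocast could look roughly like the sketch below. This is a minimal sketch under assumptions: the Hugging Face model path is a placeholder, and `time_generate` is a hypothetical helper, not part of benchmark_util.

```python
# Rough fp32 vs. bf16-autocast timing on CPU (illustrative only)
import contextlib
import time
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "meta-llama/Llama-2-7b-hf"  # hypothetical path
tokenizer = AutoTokenizer.from_pretrained(model_path)
# Roughly matches the 1024-token input side of the '1024-128' pair
inputs = tokenizer("Hello " * 1000, return_tensors="pt", truncation=True, max_length=1024)

def time_generate(model, use_bf16_autocast):
    # Optionally wrap generation in CPU bf16 autocast, otherwise run plain fp32
    amp = torch.autocast("cpu", dtype=torch.bfloat16) if use_bf16_autocast else contextlib.nullcontext()
    with torch.inference_mode(), amp:
        model.generate(**inputs, max_new_tokens=8)  # warm-up
        t0 = time.time()
        model.generate(**inputs, max_new_tokens=128, min_new_tokens=128)
    return time.time() - t0

model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
print("fp32:", time_generate(model, use_bf16_autocast=False), "s")
print("bf16:", time_generate(model, use_bf16_autocast=True), "s")
```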
I installed the default GPU version of PyTorch provided in eagle-2, but I configured it to load the model into memory and run it on the CPU. After loading the model, I added two lines of code to ea_model.py:
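The original two lines are not shown above. For illustration only, an edit that moves a freshly loaded model onto the CPU in bf16 would typically look like the following; this is an assumption about their shape, not the user's actual change to ea_model.py.

```python
# Illustration only -- assumed shape of a "run on CPU in bf16" edit, not the actual two lines
model = model.to(torch.bfloat16)  # cast weights to bfloat16
model = model.to("cpu")           # keep the model on the CPU instead of the GPU
```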