Using bf16 for inference on a CPU is slower than using float32. #12472

Open

fousdfrf opened this issue Dec 2, 2024 · 2 comments

fousdfrf commented Dec 2, 2024

On a system with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, inference with LLaMA-2-7B in bf16 is not faster than in float32. However, inference with sym_int8 weights is faster than float32. Why does using bf16 result in slower inference?
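
For context, one thing to check is whether the CPU reports native bf16 instruction support. A minimal sketch (assuming a Linux host; avx512_bf16 and amx_bf16 are the /proc/cpuinfo flag names):

def cpu_bf16_flags(path="/proc/cpuinfo"):
    # Return which bf16-related ISA flags the kernel reports for this CPU.
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags for name in ("avx512_bf16", "amx_bf16")}
    return {}

print(cpu_bf16_flags())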

hzjane (Contributor) commented Dec 3, 2024

How did you test and come to this conclusion? I can't reproduce it.
I installed the conda env like this:

conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install omegaconf pandas

And I followed this benchmark_util to test on an Intel(R) Xeon(R) Platinum 8468.

# config.yaml
low_bit: 'bf16' # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
  - '1024-128'
test_api:
  - "optimize_model"            # on Intel CPU, to test low_bit like 'sym_int4' or 'sym_int8' or 'bf16'.
  - "pytorch_autocast_bf16"     # on Intel CPU, to test 'fp32'.

bash run-spr.sh

And got these first/next token latency (ms) results:

Llama-2-7b-chat-hf   first_token (ms)   next_token (ms)
sym_int8             1073.4             45.77
bf16                 906.13             89.45
fp32                 895.54             105.82

fousdfrf (Author) commented Dec 3, 2024

Here is my approach:

conda create -n llm python=3.9
conda activate llm 
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install --pre --upgrade ipex-llm[all]

I installed the default GPU build of PyTorch that EAGLE-2 provides, but I configured it to load the model into memory and run it on the CPU. After loading the model, I added two lines of code to ea_model.py:

from ipex_llm import optimize_model
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
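
For the comparison I simply timed generation with and without optimize_model applied, roughly along these lines (a minimal sketch; the prompt, lengths, and the tokenizer variable are illustrative, not the exact EAGLE setup):

import time
import torch

def time_generate(model, tokenizer, prompt="Hello, my name is", max_new_tokens=128):
    # Wall-clock latency of one greedy generation on CPU.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.inference_mode():
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return time.perf_counter() - start

# fp32 baseline: the base model as loaded by EAGLE, kept on CPU
# print("fp32:", time_generate(base_model, tokenizer))

# bf16: after applying the two lines above
# base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
# print("bf16:", time_generate(base_model, tokenizer))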

I found that:

  • When low_bit="sym_int8", inference was faster than plain float32 without ipex_llm.
  • However, when low_bit="bf16", inference was slower than plain float32 without ipex_llm.
