Using bf16 for inference on a CPU is slower than using float32. #12472

Open

fousdfrf opened this issue Dec 2, 2024 · 2 comments

fousdfrf commented Dec 2, 2024

On a system with an Intel(R) Xeon(R) Silver 4214R CPU @ 2.40GHz, inference with LLaMA-2-7B in bf16 is not faster than in float32. However, inference with sym_int8 weights is faster than float32. Why does using bf16 result in slower inference?
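
For context, one thing to check is whether the CPU reports native bf16 instruction support. A minimal sketch (assuming a Linux host; avx512_bf16 and amx_bf16 are the /proc/cpuinfo flag names):

def cpu_bf16_flags(path="/proc/cpuinfo"):
    # Return which bf16-related ISA flags the kernel reports for this CPU.
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags = set(line.split(":", 1)[1].split())
                return {name: name in flags for name in ("avx512_bf16", "amx_bf16")}
    return {}

print(cpu_bf16_flags())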

hzjane (Contributor) commented Dec 3, 2024

How did you test and come to this conclusion? I can't reproduce it.
I installed the conda env like this:

conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
pip install omegaconf pandas

And I followed this benchmark_util to test on an Intel(R) Xeon(R) Platinum 8468.

# config.yaml
low_bit: 'bf16' # 'sym_int4' or 'sym_int8' or 'bf16'
in_out_pairs:
  - '1024-128'
test_api:
  - "optimize_model"            # on Intel CPU, to test low_bit like 'sym_int4' or 'sym_int8' or 'bf16'.
  - "pytorch_autocast_bf16"     # on Intel CPU, to test 'fp32'.

bash run-spr.sh

And got these first/next token latency (ms) results:

Llama-2-7b-chat-hf   first_token (ms)   next_token (ms)
sym_int8             1073.4             45.77
bf16                 906.13             89.45
fp32                 895.54             105.82

fousdfrf (Author) commented Dec 3, 2024

Here is my approach:

conda create -n llm python=3.9
conda activate llm 
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install --pre --upgrade ipex-llm[all]

I installed the default GPU build of PyTorch that EAGLE-2 provides, but I configured it to load the model into memory and run it on the CPU. After loading the model, I added two lines of code to ea_model.py:

from ipex_llm import optimize_model
base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
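
For the comparison I simply timed generation with and without optimize_model applied, roughly along these lines (a minimal sketch; the prompt, lengths, and the tokenizer variable are illustrative, not the exact EAGLE setup):

import time
import torch

def time_generate(model, tokenizer, prompt="Hello, my name is", max_new_tokens=128):
    # Wall-clock latency of one greedy generation on CPU.
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.inference_mode():
        start = time.perf_counter()
        model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    return time.perf_counter() - start

# fp32 baseline: the base model as loaded by EAGLE, kept on CPU
# print("fp32:", time_generate(base_model, tokenizer))

# bf16: after applying the two lines above
# base_model = optimize_model(base_model, low_bit="bf16", optimize_llm=False)
# print("bf16:", time_generate(base_model, tokenizer))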

I found that:

  • When low_bit="sym_int8", inference was faster than plain float32 without ipex_llm.
  • However, when low_bit="bf16", inference was slower than plain float32 without ipex_llm.
