Eagle Speculative Sampling examples #11104
Conversation
@jenniew As discussed, I created a clean pull request for your review. In this request, the existing files under Speculative-Decoding are unchanged. Thank you.
@@ -0,0 +1,51 @@
# Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs
IPEX-LLM supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) which is a speculative sampling method that improves text generation speed.
In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency, a speculative sampling method that improves text generation speed) on Intel CPUs. See here to view the EAGLE paper and here for more info on EAGLE code.
updated both CPU and GPU READMEs
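For readers unfamiliar with the technique, here is a toy sketch of the speculative-sampling control flow that EAGLE builds on. It is illustrative only: `draft_propose` and `target_verify` are hypothetical stand-ins, and real systems (including EAGLE) accept or reject draft tokens by comparing draft and target probabilities rather than at random.

```python
import random

def draft_propose(tokens, k):
    # Stand-in for a cheap draft model: propose k candidate tokens.
    return [random.randint(0, 99) for _ in range(k)]

def target_verify(tokens, draft):
    # Stand-in for the large target model: in one forward pass, accept a
    # prefix of the draft and supply one corrected next token.
    n_accepted = random.randint(0, len(draft))
    return n_accepted, random.randint(0, 99)

def speculative_generate(prompt, max_new_tokens=16, k=4):
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new_tokens:
        draft = draft_propose(tokens, k)
        n_accepted, corrected = target_verify(tokens, draft)
        # Keep the accepted prefix plus the corrected token; the more draft
        # tokens accepted per pass, the larger the speedup.
        tokens.extend(draft[:n_accepted] + [corrected])
    return tokens

print(speculative_generate([1, 2, 3]))
```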
## Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench on Intel CPUs.
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.
updated both CPU and GPU READMEs
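As a concrete starting point, a run might look like the command below. This is a sketch that mirrors the GPU command quoted later in this conversation; the exact module path and arguments are defined by the README under review, and the model paths are placeholders.

```bash
python -m evaluation.gen_ea_answer_llama2chat \
    --ea-model-path <path-to-EAGLE-llama2-chat-7B> \
    --base-model-path <path-to-Llama-2-7b-chat-hf> \
    --enable-ipex-llm
```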
conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
# On Linux
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
# On Windows
pip install --pre --upgrade ipex-llm[all]
updated
@@ -0,0 +1,448 @@
#
# Copyright 2016 The BigDL Authors.
#
Please add license info under "Copyright 2016 The BigDL Authors" too.
fixed
#
# Copyright 2016 The BigDL Authors.
#
# This script was based on https://github.com/SafeAILab/EAGLE/blob/main/eagle/evaluation/gen_ea_answer_llama2chat.py
This script is adapted from
fixed
    temperature,
    tree_choices,
    enable_ipex_llm,
):
Remove the multi-GPU related options.
done
get_answers_func = get_model_answers

chunk_size = len(questions) // (num_gpus_total // num_gpus_per_model)  # // 2
ans_handles = []
Remove the multi-GPU related code if we didn't test on multiple XPUs.
done
if use_ray:
    ray.get(ans_handles)
Also remove the Ray-related code.
done
    temperature,
    tree_choices,
    enable_ipex_llm,
):
Remove the multi-GPU related options.
done
@@ -0,0 +1,58 @@
#
# Copyright 2016 The BigDL Authors.
#
Did you modify the original speed.py from the EAGLE GitHub repo? If not, we don't need the BigDL copyright here.
speed.py has been modified, so it is different from the speed.py in the EAGLE GitHub repo.
### Verified Hardware Platforms

- Intel Data Center GPU Max Series
Why only Intel Data Center GPU Max Series? I have done a simple test on Arc A770: llama2-7b with EAGLE can run on a 16GB Arc A770. Maybe add `- Intel Arc™ A-Series Graphics` and `- Intel Data Center GPU Flex Series`?
The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?
added a line as suggested
> The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?

Based on the example script, this is my running command: `python -m evaluation.gen_ea_answer_llama2chat --ea-model-path /home/arda/ruonan/ipex-llm/python/llm/example/GPU/Speculative-Decoding/Eagle/EAGLE-llama2-chat-7B --base-model-path /mnt/disk1/models/Llama-2-7b-chat-hf --enable-ipex-llm`. I observed only ~11G memory usage on Arc A770. `EAGLE-llama2-chat-7B` is downloaded from https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B/tree/main. I wonder if there is anything wrong in my usage?
Your test looks good. I have already added "- Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex" in the README. Would you be able to approve and merge? Thanks!
```
> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
</details>
Maybe also add a "For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series" part here.
Our tests so far indicate the current examples do not run on 16GB Arc A770 due to lack of GPU memory.
Added a section "For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series".
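For context, such a section typically exports a few runtime environment variables before launching the example. The variables below are an assumption based on other ipex-llm GPU examples for Arc/Flex, not necessarily the exact content of the section added in this PR:

```bash
# Assumed Arc/Flex runtime settings, as in other ipex-llm GPU examples;
# consult the README section actually added in this PR for the final list.
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
```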
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    # load_in_8bit=True,
    device_map="sequential"
I wonder in which case we need this `sequential` `device_map`?
Sometimes setting `device_map="auto"` returns an error, and this is when `"sequential"` is used to get around the error. No significant difference in inference speed is observed.
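For illustration, the loading call under discussion looks roughly like the following sketch; the model path is a placeholder, and `device_map` handling requires the `accelerate` package:

```python
import torch
from transformers import AutoModelForCausalLM

# "sequential" fills devices in order instead of balancing layers the way
# "auto" does, which can sidestep the occasional device_map="auto" errors.
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder model path
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    device_map="sequential",
)
```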
)
if enable_ipex_llm:
    model = optimize_model(model, optimize_llm=False)
model.to("xpu")
In my understanding, from line 197 to line 220 is the only change for ipex-llm? Maybe we can add a comment here to emphasize that the code is related to ipex-llm, so that it is easier to distinguish the original code from the modified code.
Strictly speaking, only line 219 is required to enable ipex-llm. I added a comment as suggested in each of the GPU and CPU files.
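In other words, enabling ipex-llm is essentially a one-line wrap of the already-loaded model. A minimal sketch, assuming the `optimize_model` import path used in ipex-llm examples:

```python
from ipex_llm import optimize_model

# The one line that enables ipex-llm optimizations on an existing model;
# optimize_llm=False matches the call used in this PR's scripts.
model = optimize_model(model, optimize_llm=False)
model = model.to("xpu")  # GPU variant only; requires the Intel GPU (XPU) stack
```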
    device_map="sequential"
)
if enable_ipex_llm:
    model = optimize_model(model, optimize_llm=False)
`optimize_model` can accept a `low_bit` parameter; it defaults to `sym_int4`, but it can be other precisions like `asym_int4` / `sym_int8` / `fp8` / ... Maybe we can expose this `low_bit` parameter to users. But I think this is not urgent now; maybe we can do it in the next PRs.
Good idea. I agree exposing the `low_bit` parameter can be implemented and tested in a future PR.
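If it were exposed, the parameter would pass straight through to `optimize_model`; a sketch of the possible future usage, with the precision values taken from the comment above:

```python
from ipex_llm import optimize_model

# Hypothetical future usage: let the user pick the quantization precision.
# low_bit defaults to "sym_int4"; other values mentioned above include
# "asym_int4", "sym_int8", and "fp8".
model = optimize_model(model, optimize_llm=False, low_bit="sym_int8")
```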
Description
Eagle Speculative Sampling examples
1. Why the change?
EAGLE provides significant speedup in addition to ipex-llm optimizations. Please see below:
Llama 7B, temperature=1.0, Intel CPU
- speed 27.445331381249126 TPS (optimized with both EAGLE and ipex-llm)
- speed 20.132597255230788 TPS (optimized with EAGLE only)
- speed 14.549053180428723 TPS (optimized with ipex-llm only)
- speed_base 10.275284471199816 TPS (baseline: not optimized)

Llama 7B, temperature=1.0, Intel GPU
- speed 60.68802901159256 TPS (EAGLE + ipex-llm, ratio: 3.74)
- speed 41.41260508527679 TPS (ipex-llm only, ratio: 2.55)
- speed 31.480931699222744 TPS (EAGLE only, ratio: 1.94)
- speed_base 16.220403337894584 TPS (baseline: not optimized)
2. User API changes
N/A
3. Summary of the change
Integrate with EAGLE (https://github.com/SafeAILab/EAGLE) and provide examples.
4. How to test?
Please follow the setup instructions and example commands in the README.
5. New dependencies
N/A