
Eagle Speculative Sampling examples #11104

Merged: 3 commits merged into intel-analytics:main on May 24, 2024

Conversation

jeanyu-habana (Contributor)

Description

Eagle Speculative Sampling examples

1. Why the change?

EAGLE provides significant speedup on top of ipex-llm optimizations. Please see below:

Llama 7B, temperature=1.0, Intel CPU
speed 27.45 TPS (optimized with both EAGLE and ipex-llm; ratio: 2.67)
speed 20.13 TPS (optimized with EAGLE only; ratio: 1.96)
speed 14.55 TPS (optimized with ipex-llm only; ratio: 1.42)
speed_base 10.28 TPS (baseline: not optimized)

Llama 7B, temperature=1.0, Intel GPU
speed 60.69 TPS (EAGLE + ipex-llm; ratio: 3.74)
speed 41.41 TPS (ipex-llm only; ratio: 2.55)
speed 31.48 TPS (EAGLE only; ratio: 1.94)
speed_base 16.22 TPS (baseline: not optimized)

2. User API changes

N/A

3. Summary of the change

Integrate with EAGLE (https://github.com/SafeAILab/EAGLE) and provide examples

4. How to test?

  • Application test
    Please follow the setup instructions and example commands in the README.

5. New dependencies

N/A

jeanyu-habana (Contributor Author):

@jenniew As discussed, I created a clean pull request for your review. In this request, the existing files under Speculative-Decoding are unchanged. Thank you.

@@ -0,0 +1,51 @@
# Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs
IPEX-LLM supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) which is a speculative sampling method that improves text generation speed.

Contributor:

In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency, a speculative sampling method that improves text generation speed) on Intel CPUs. See here to view the EAGLE paper and here for more info on the EAGLE code.

jeanyu-habana (Contributor Author), May 23, 2024:

updated both CPU and GPU READMEs


## Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench on Intel CPUs.

Contributor:

In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.

jeanyu-habana (Contributor Author), May 23, 2024:

updated both CPU and GPU READMEs

conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
Contributor:

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

Contributor:

# On Linux
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# On Windows
pip install --pre --upgrade ipex-llm[all]

jeanyu-habana (Contributor Author):

updated

@@ -0,0 +1,448 @@
#
# Copyright 2016 The BigDL Authors.
#
Contributor:

Please add the license info under "Copyright 2016 The BigDL Authors" too.

jeanyu-habana (Contributor Author):

fixed

#
# Copyright 2016 The BigDL Authors.
#
# This script was based on https://github.com/SafeAILab/EAGLE/blob/main/eagle/evaluation/gen_ea_answer_llama2chat.py
Contributor:

This script is adapted from

jeanyu-habana (Contributor Author):

fixed

temperature,
tree_choices,
enable_ipex_llm,
):
Contributor:

Remove multiple gpu related options

jeanyu-habana (Contributor Author):

done

get_answers_func = get_model_answers

chunk_size = len(questions) // (num_gpus_total // num_gpus_per_model) # // 2
ans_handles = []
Contributor:

Remove the multi-GPU related code if we didn't test on multiple XPUs.

jeanyu-habana (Contributor Author):

done


if use_ray:
ray.get(ans_handles)

Contributor:

Also remove the Ray-related code.

jeanyu-habana (Contributor Author):

done
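As a rough sketch (not the actual diff in this PR), the simplified single-process dispatch after removing the chunking and Ray branches might look like the following; the argument list is assumed from the surrounding context and may not match the final script exactly:

```python
# Hypothetical sketch: single-device evaluation with the multi-GPU chunking
# and Ray dispatch removed. All questions are handled in one direct call.
get_answers_func = get_model_answers  # no ray.remote() wrapper needed

get_answers_func(
    model_path,
    model_id,
    questions,        # full MT-bench question list; no per-worker chunking
    answer_file,
    max_new_token,
    num_choices,
    temperature,
    tree_choices,
    enable_ipex_llm,
)
```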

temperature,
tree_choices,
enable_ipex_llm,
):
Contributor:

remove multiple gpu related options

jeanyu-habana (Contributor Author):

done

@@ -0,0 +1,58 @@
#
# Copyright 2016 The BigDL Authors.
#
Contributor:

Did you modify the original speed.py from the EAGLE GitHub repo? If not, we don't need the BigDL copyright here.

jeanyu-habana (Contributor Author):

speed.py has been modified, so it is different from the speed.py in the EAGLE GitHub repo.

@jenniew requested a review from rnwang04 on May 22, 2024 22:40

### Verified Hardware Platforms

- Intel Data Center GPU Max Series
Contributor:

Why only Intel Data Center GPU Max Series?
I have done a simple test on Arc A770: llama2-7b with EAGLE can run on a 16GB Arc A770.
Maybe add "- Intel Arc™ A-Series Graphics" and "- Intel Data Center GPU Flex Series"?

jeanyu-habana (Contributor Author), May 23, 2024:

The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?

jeanyu-habana (Contributor Author):

added a line as suggested

Contributor:

> The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?

Based on the example script, this is my running command: python -m evaluation.gen_ea_answer_llama2chat --ea-model-path /home/arda/ruonan/ipex-llm/python/llm/example/GPU/Speculative-Decoding/Eagle/EAGLE-llama2-chat-7B --base-model-path /mnt/disk1/models/Llama-2-7b-chat-hf --enable-ipex-llm.
I observed only ~11GB memory usage on Arc A770.
EAGLE-llama2-chat-7B was downloaded from https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B/tree/main.
I wonder, is there anything wrong with my usage?

jeanyu-habana (Contributor Author):

Your test looks good. I have already added "- Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex" to the README. Would you be able to approve and merge? Thanks!

> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

Contributor:

Maybe also add a "For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series" part here.

jeanyu-habana (Contributor Author):

Our tests so far indicate the current examples do not run on 16GB Arc A770 due to lack of GPU memory.

jeanyu-habana (Contributor Author):

added a section For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series

torch_dtype=torch.float32,
low_cpu_mem_usage=True,
# load_in_8bit=True,
device_map="sequential"
Contributor:

I wonder in which case we need this sequential device_map?

jeanyu-habana (Contributor Author):

Sometimes setting device_map="auto" returns an error; "sequential" is used to work around that. No significant difference in inference speed was observed.
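For context, here is a minimal sketch of the loading call under discussion, assuming a standard transformers entry point and a placeholder model path (the PR's actual code may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="sequential" fills devices in order rather than balancing
# layers automatically, which avoids the occasional errors seen with "auto".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder base-model path
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    device_map="sequential",
)
```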

)
if enable_ipex_llm:
model = optimize_model(model, optimize_llm=False)
model.to("xpu")
Contributor:

In my understanding, lines 197 to 220 are the only change for ipex-llm? Maybe we can add a comment here to emphasize that this code is related to ipex-llm, so that it is easier to distinguish the original code from the modified code.

jeanyu-habana (Contributor Author):

Strictly speaking, only line 219 is required to enable ipex-llm. I added a comment as suggested in each of the GPU and CPU files.
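A minimal sketch of what the comment-marked change might look like, assuming the variable names from the snippet above:

```python
from ipex_llm import optimize_model

if enable_ipex_llm:
    # ===== ipex-llm change: the single line required to enable ipex-llm =====
    model = optimize_model(model, optimize_llm=False)
model.to("xpu")  # place the (optionally optimized) model on the Intel GPU
```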

device_map="sequential"
)
if enable_ipex_llm:
model = optimize_model(model, optimize_llm=False)
Contributor:

optimize_model can accept a low_bit parameter; it defaults to sym_int4, but it can be other precisions like asym_int4 / sym_int8 / fp8 / ...
Maybe we can expose this low_bit parameter to users. But I think this is not urgent now; it can be done in a follow-up PR.

jeanyu-habana (Contributor Author):

Good idea. I agree that exposing the low_bit parameter can be implemented and tested in a future PR.
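For the record, a hypothetical sketch of how low_bit could be exposed in a follow-up PR; the --low-bit flag name and wiring are assumptions, not code from this PR:

```python
import argparse
from ipex_llm import optimize_model

# Hypothetical CLI flag: let users pick the quantization precision.
parser = argparse.ArgumentParser()
parser.add_argument("--low-bit", type=str, default="sym_int4",
                    help="precision for ipex-llm, e.g. sym_int4 / asym_int4 / sym_int8 / fp8")
args = parser.parse_args()

# ... load `model` as before, then apply the user-selected precision:
model = optimize_model(model, low_bit=args.low_bit, optimize_llm=False)
```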

@jeanyu-habana requested reviews from rnwang04 and jenniew on May 23, 2024 17:42
@jenniew merged commit ab476c7 into intel-analytics:main on May 24, 2024 (30 checks passed)
@jeanyu-habana mentioned this pull request on May 24, 2024
@jeanyu-habana deleted the eagle-jean branch on November 5, 2024 17:58