
Eagle Speculative Sampling examples #11104

Merged: 3 commits merged into intel-analytics:main on May 24, 2024

Conversation

jeanyu-habana (Contributor)

Description

Eagle Speculative Sampling examples

1. Why the change?

EAGLE provides significant speedup on top of ipex-llm optimizations. Please see below:

Llama 7B, temperature=1.0, Intel CPU
speed 27.45 TPS (optimized with both EAGLE and ipex-llm; ratio: 2.67)
speed 20.13 TPS (optimized with EAGLE only; ratio: 1.96)
speed 14.55 TPS (optimized with ipex-llm only; ratio: 1.42)
speed_base 10.28 TPS (baseline: not optimized)

Llama 7B, temperature=1.0, Intel GPU
speed 60.69 TPS (EAGLE + ipex-llm; ratio: 3.74)
speed 41.41 TPS (ipex-llm only; ratio: 2.55)
speed 31.48 TPS (EAGLE only; ratio: 1.94)
speed_base 16.22 TPS (baseline: not optimized)

2. User API changes

N/A

3. Summary of the change

Integrate with EAGLE (https://github.com/SafeAILab/EAGLE) and provide examples

4. How to test?

  • Application test
    Please follow the setup instructions and example commands in the README.

5. New dependencies

N/A

jeanyu-habana (Contributor Author):

@jenniew As discussed, I created a clean pull request for your review. In this request, the existing files under Speculative-Decoding are unchanged. Thank you.

@@ -0,0 +1,51 @@
# Eagle - Speculative Sampling using IPEX-LLM on Intel CPUs
IPEX-LLM supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency) which is a speculative sampling method that improves text generation speed.

Contributor:

In this directory, you will find examples of how IPEX-LLM accelerates inference with speculative sampling using EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency, a speculative sampling method that improves text generation speed) on Intel CPUs. See here to view the EAGLE paper and here for more info on the EAGLE code.

jeanyu-habana (Contributor Author), May 23, 2024:

updated both CPU and GPU READMEs


## Example - EAGLE Speculative Sampling with IPEX-LLM on MT-bench
In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench on Intel CPUs.

Contributor:

In this example, we run inference for a Llama2 model to showcase the speed of EAGLE with IPEX-LLM on MT-bench data on Intel CPUs.

jeanyu-habana (Contributor Author), May 23, 2024:

updated both CPU and GPU READMEs

conda create -n llm python=3.11 # recommend to use Python 3.11
conda activate llm

pip install --pre --upgrade ipex-llm[all]
Contributor:

pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

Contributor:

# On Linux
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu

# On Windows
pip install --pre --upgrade ipex-llm[all]

jeanyu-habana (Contributor Author):

updated

@@ -0,0 +1,448 @@
#
# Copyright 2016 The BigDL Authors.
#
Contributor:

Please add the license info under "Copyright 2016 The BigDL Authors" too.

jeanyu-habana (Contributor Author):

fixed

#
# Copyright 2016 The BigDL Authors.
#
# This script was based on https://github.com/SafeAILab/EAGLE/blob/main/eagle/evaluation/gen_ea_answer_llama2chat.py
Contributor:

This script is adapted from

jeanyu-habana (Contributor Author):

fixed

temperature,
tree_choices,
enable_ipex_llm,
):
Contributor:

Remove multiple gpu related options

jeanyu-habana (Contributor Author):

done

get_answers_func = get_model_answers

chunk_size = len(questions) // (num_gpus_total // num_gpus_per_model) # // 2
ans_handles = []
Contributor:

Remove the multi-GPU related code if we didn't test on multiple XPUs.

jeanyu-habana (Contributor Author):

done


if use_ray:
ray.get(ans_handles)

Contributor:

Also remove the Ray-related code.

jeanyu-habana (Contributor Author):

done
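As a rough sketch (not the actual diff in this PR), the simplified single-process dispatch after removing the chunking and Ray branches might look like the following; the argument list is assumed from the surrounding context and may not match the final script exactly:

```python
# Hypothetical sketch: single-device evaluation with the multi-GPU chunking
# and Ray dispatch removed. All questions are handled in one direct call.
get_answers_func = get_model_answers  # no ray.remote() wrapper needed

get_answers_func(
    model_path,
    model_id,
    questions,        # full MT-bench question list; no per-worker chunking
    answer_file,
    max_new_token,
    num_choices,
    temperature,
    tree_choices,
    enable_ipex_llm,
)
```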

temperature,
tree_choices,
enable_ipex_llm,
):
Contributor:

remove multiple gpu related options

jeanyu-habana (Contributor Author):

done

@@ -0,0 +1,58 @@
#
# Copyright 2016 The BigDL Authors.
#
Contributor:

Did you modify the original speed.py from the EAGLE GitHub repo? If not, we don't need the BigDL copyright here.

jeanyu-habana (Contributor Author):

speed.py has been modified, so it is different from the speed.py in the EAGLE GitHub repo.

@jenniew requested a review from rnwang04 on May 22, 2024 22:40

### Verified Hardware Platforms

- Intel Data Center GPU Max Series
Contributor:

Why only Intel Data Center GPU Max Series?
I have done a simple test on Arc A770: llama2-7b with EAGLE can run on a 16GB Arc A770.
Maybe add "- Intel Arc™ A-Series Graphics" and "- Intel Data Center GPU Flex Series"?

jeanyu-habana (Contributor Author), May 23, 2024:

The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?

jeanyu-habana (Contributor Author):

added a line as suggested

Contributor:

> The examples included in this PR failed consistently on 16GB Arc A770 (out of memory). Can you explain a little more about the test you ran on 16GB Arc A770? For example, what's the script and command-line arguments?

Based on the example script, this is my running command: python -m evaluation.gen_ea_answer_llama2chat --ea-model-path /home/arda/ruonan/ipex-llm/python/llm/example/GPU/Speculative-Decoding/Eagle/EAGLE-llama2-chat-7B --base-model-path /mnt/disk1/models/Llama-2-7b-chat-hf --enable-ipex-llm.
I observed only ~11GB memory usage on Arc A770.
EAGLE-llama2-chat-7B was downloaded from https://huggingface.co/yuhuili/EAGLE-llama2-chat-7B/tree/main.
I wonder, is there anything wrong with my usage?

jeanyu-habana (Contributor Author):

Your test looks good. I have already added "- Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex" to the README. Would you be able to approve and merge? Thanks!

> Note: `libtcmalloc.so` can be installed with `conda install -c conda-forge -y gperftools=2.10`.

Contributor:

Maybe also add a "For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series" part here.

jeanyu-habana (Contributor Author):

Our tests so far indicate the current examples do not run on 16GB Arc A770 due to lack of GPU memory.

jeanyu-habana (Contributor Author):

added a section For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series

torch_dtype=torch.float32,
low_cpu_mem_usage=True,
# load_in_8bit=True,
device_map="sequential"
Contributor:

I wonder in which case we need this sequential device_map?

jeanyu-habana (Contributor Author):

Sometimes setting device_map="auto" returns an error; "sequential" is used to work around that. No significant difference in inference speed was observed.
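For context, here is a minimal sketch of the loading call under discussion, assuming a standard transformers entry point and a placeholder model path (the PR's actual code may differ):

```python
import torch
from transformers import AutoModelForCausalLM

# device_map="sequential" fills devices in order rather than balancing
# layers automatically, which avoids the occasional errors seen with "auto".
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-chat-hf",  # placeholder base-model path
    torch_dtype=torch.float32,
    low_cpu_mem_usage=True,
    device_map="sequential",
)
```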

)
if enable_ipex_llm:
model = optimize_model(model, optimize_llm=False)
model.to("xpu")
Contributor:

In my understanding, lines 197 to 220 are the only change for ipex-llm? Maybe we can add a comment here to emphasize that this code is related to ipex-llm, so that it is easier to distinguish the original code from the modified code.

jeanyu-habana (Contributor Author):

Strictly speaking, only line 219 is required to enable ipex-llm. I added a comment as suggested in each of the GPU and CPU files.
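A minimal sketch of what the comment-marked change might look like, assuming the variable names from the snippet above:

```python
from ipex_llm import optimize_model

if enable_ipex_llm:
    # ===== ipex-llm change: the single line required to enable ipex-llm =====
    model = optimize_model(model, optimize_llm=False)
model.to("xpu")  # place the (optionally optimized) model on the Intel GPU
```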

device_map="sequential"
)
if enable_ipex_llm:
model = optimize_model(model, optimize_llm=False)
Contributor:

optimize_model can accept a low_bit parameter; it defaults to sym_int4, but it can be other precisions like asym_int4 / sym_int8 / fp8 / ...
Maybe we can expose this low_bit parameter to users. But I think this is not urgent now; it can be done in a follow-up PR.

jeanyu-habana (Contributor Author):

Good idea. I agree that exposing the low_bit parameter can be implemented and tested in a future PR.
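For the record, a hypothetical sketch of how low_bit could be exposed in a follow-up PR; the --low-bit flag name and wiring are assumptions, not code from this PR:

```python
import argparse
from ipex_llm import optimize_model

# Hypothetical CLI flag: let users pick the quantization precision.
parser = argparse.ArgumentParser()
parser.add_argument("--low-bit", type=str, default="sym_int4",
                    help="precision for ipex-llm, e.g. sym_int4 / asym_int4 / sym_int8 / fp8")
args = parser.parse_args()

# ... load `model` as before, then apply the user-selected precision:
model = optimize_model(model, low_bit=args.low_bit, optimize_llm=False)
```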

@jeanyu-habana requested reviews from rnwang04 and jenniew on May 23, 2024 17:42
@jenniew merged commit ab476c7 into intel-analytics:main on May 24, 2024 (30 checks passed)
@jeanyu-habana mentioned this pull request on May 24, 2024
@jeanyu-habana deleted the eagle-jean branch on November 5, 2024 17:58