This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

Integrate EAGLE with ITREX #1504

Merged
merged 18 commits into from May 9, 2024

Conversation

siddhivelankar23
Contributor

Type of Change

Added a feature to use EAGLE (speculative sampling) with ITREX, as discussed with the ITREX team and Haim Barad from my team.
Added an example script showing how to use this feature
Added a README with instructions

API not changed

Description

Intel Extension for Transformers now supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed.
The EAGLE repository used and the research paper are linked in the README.

Expected Behavior & Potential Risk

When the example_eagle.py script is run in the recommended way, the generated text and the "tokens per second" metric are shown in the output.
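
For reference, here is a minimal sketch of how a "tokens per second" figure can be measured with a plain transformers-style `generate()` call. The model name and prompt are placeholders; example_eagle.py and the README remain the authoritative reference for the EAGLE-specific setup.

```python
import time
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder model for illustration; see example_eagle.py for the real setup.
model_name = "lmsys/vicuna-7b-v1.3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

inputs = tokenizer("Tell me about speculative decoding.", return_tensors="pt")

start = time.perf_counter()
outputs = model.generate(**inputs, max_new_tokens=128)
elapsed = time.perf_counter() - start

# Count only newly generated tokens, then report throughput.
new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
print(f"tokens per second: {new_tokens / elapsed:.2f}")
```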

How has this PR been tested?

Tested on Intel PVCs and CPUs

Signed-off-by: Siddhi Velankar <[email protected]>
@wenhuach21
Contributor

wenhuach21 commented Apr 28, 2024

I don't have any questions about the PR, but would you mind if I asked a few questions about the algorithm? I've only had a quick look at several papers in this domain.
1. Does the high acceptance rate bring the promised speedup? Based solely on the model structure, I would expect Medusa to be a little faster.
2. There is a lot of speed data in the paper; is there any way to compare accuracy, or should we just take the acceptance rate as the accuracy?
3. Is the attention tree structure general across models? I guess Medusa creates its structure based on machine learning, so it may differ from model to model.

@wenhuach21 wenhuach21 self-requested a review April 28, 2024 08:25
@Liyuhui-12

Liyuhui-12 commented Apr 30, 2024

Hello, I am Yuhui Li, the author of the EAGLE paper, and I am here to answer your question.

Does the high acceptance rate bring the promised speedup? Based solely on the model structure, I would expect Medusa to be a little faster.

The acceptance rate determines how many tokens the target LLM generates per forward pass. EAGLE's draft model is slower than Medusa's, but the target LLM accepts more tokens each time, so the overall acceleration ratio is higher. Using MT-Bench as the test dataset to speed up Vicuna 7B, EAGLE allows Vicuna 7B to accept an average of 3.86 tokens per forward pass, significantly higher than the 2.51 tokens with Medusa. Considering that the target LLM (Vicuna 7B) is much larger than the draft model, the gain from the higher acceptance rate more than offsets the cost of the slower draft model, making EAGLE about 1.5x faster than Medusa.
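
To make the trade-off concrete, here is a back-of-envelope estimate. Only the 3.86 and 2.51 acceptance figures come from the numbers above; the draft-cost fractions are illustrative assumptions, not measurements.

```python
# Back-of-envelope speedup model for speculative decoding:
# each cycle costs one target forward plus the draft overhead (expressed as a
# fraction of a target forward) and yields `accepted_per_forward` tokens,
# versus one token per forward for vanilla autoregressive decoding.

def estimated_speedup(accepted_per_forward: float, draft_cost_fraction: float) -> float:
    """Approximate speedup over vanilla decoding (1 token per target forward)."""
    return accepted_per_forward / (1.0 + draft_cost_fraction)

# Acceptance figures quoted above (Vicuna 7B, MT-Bench); draft-cost fractions
# below are assumed for illustration only.
eagle  = estimated_speedup(accepted_per_forward=3.86, draft_cost_fraction=0.15)
medusa = estimated_speedup(accepted_per_forward=2.51, draft_cost_fraction=0.05)

print(f"EAGLE ~{eagle:.2f}x, Medusa ~{medusa:.2f}x, ratio ~{eagle / medusa:.2f}x")
```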

There is a lot of speed data in the paper; is there any way to compare accuracy, or should we just take the acceptance rate as the accuracy?

Of course. We can use the output of the target model as the label, and the draft model for classification. EAGLE's top-1 accuracy is about 0.8, while Medusa's top-1 accuracy is about 0.6.
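
A toy sketch of this measurement, with random tensors standing in for the real target-model labels and draft-model logits:

```python
import torch

# Treat the target model's next-token choices as labels and score the draft
# model's top-1 predictions against them. Random stand-in data for illustration.
vocab, steps = 32000, 100
target_tokens = torch.randint(vocab, (steps,))   # "labels" from the target LLM
draft_logits = torch.randn(steps, vocab)          # draft model predictions
top1_accuracy = (draft_logits.argmax(-1) == target_tokens).float().mean()
print(f"draft top-1 accuracy: {top1_accuracy:.2f}")
```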

Is the attention tree structure general across models? I guess Medusa creates its structure based on machine learning, so it may differ from model to model.

Yes, the tree structure is general. Undoubtedly, tailoring the tree structure to each model would achieve the best results, but EAGLE already achieves quite good results with a general tree structure.

@kevinintel kevinintel merged commit e559929 into intel:main May 9, 2024
3 checks passed