Conversation
Signed-off-by: Siddhi Velankar <[email protected]>
I don't have any questions about the PR itself, but would you mind if I ask a few questions about the algorithm? I've only had a quick look at a few papers in this area.
Hello, I am Yuhui Li, the author of the EAGLE paper, and I am here to answer your questions.
The acceptance rate determines how many tokens the target LLM generates per forward pass. EAGLE's draft model is slower than Medusa's, but the target LLM accepts more tokens each time, so the overall acceleration ratio is higher. Using MT-bench as the test dataset to speed up Vicuna 7B, EAGLE allows Vicuna 7B to accept an average of 3.86 tokens per forward pass, significantly more than the 2.51 tokens with Medusa. Since the target LLM (Vicuna 7B) is much larger than the draft model, the gain from the higher acceptance rate more than offsets the cost of the slower draft model, making EAGLE about 1.5x faster than Medusa.
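The tradeoff above can be made concrete with a rough cost model. This is an illustrative back-of-envelope sketch, not the exact accounting from the EAGLE paper, and the overhead values are made-up assumptions:

```python
# Rough speedup model for speculative decoding (illustrative only).
# One "round" = one draft phase plus one target forward pass; the
# target accepts avg_accepted tokens per round on average.
def estimated_speedup(avg_accepted, draft_overhead):
    """avg_accepted: tokens the target accepts per forward pass.
    draft_overhead: cost of one draft phase relative to one target pass
    (a hypothetical knob, not a number from the paper)."""
    return avg_accepted / (1.0 + draft_overhead)

# Assumed overheads: EAGLE's autoregressive draft model costs more per
# round than Medusa's parallel heads, but accepts more tokens.
eagle = estimated_speedup(3.86, draft_overhead=0.4)
medusa = estimated_speedup(2.51, draft_overhead=0.1)
print(f"EAGLE ~{eagle:.2f}x vs Medusa ~{medusa:.2f}x over plain decoding")
```

Even with a noticeably higher per-round draft cost, the larger number of accepted tokens leaves EAGLE ahead in this toy model.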
Of course. We can use the target model's output as the label and treat the draft model's prediction as a classification task. EAGLE's top-1 accuracy is about 0.8, while Medusa's is about 0.6.
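The measurement described here reduces to a simple comparison: at each position, does the draft's top-1 prediction match the token the target actually produced? A minimal sketch with a hypothetical helper (the names are not from the ITREX or EAGLE codebases):

```python
def top1_accuracy(draft_preds, target_tokens):
    """Fraction of positions where the draft model's top-1 prediction
    matches the token emitted by the target model."""
    matches = sum(p == t for p, t in zip(draft_preds, target_tokens))
    return matches / len(target_tokens)

# Toy token IDs: the draft disagrees with the target at one of five positions.
draft = [5, 17, 3, 42, 8]
target = [5, 17, 9, 42, 8]
print(top1_accuracy(draft, target))  # 0.8
```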
Yes, the tree structure is general. Using a tree structure tuned per model would undoubtedly give the best results, but EAGLE already performs quite well with a general tree structure.
Type of Change
Added a feature to use EAGLE (speculative sampling) with ITREX, as discussed with the ITREX team and Haim Barad from my team.
Added an example script showing how to use this feature
Added a README with instructions
API not changed
Description
Intel Extension for Transformers now supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed.
The EAGLE repository used and the research paper are linked in the README.
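For readers new to the technique, here is a minimal sketch of greedy speculative decoding, the general idea behind EAGLE. It is a simplified toy: real EAGLE drafts from the target's hidden features and verifies a token tree, and a real implementation verifies all drafted tokens in a single target forward pass rather than one call per token. All names here are stubs, not ITREX APIs:

```python
def speculative_decode(target_step, draft_step, prompt, draft_len=4, max_new=16):
    """target_step/draft_step: callables mapping a token list to the
    next greedy token (stand-ins for the large and small models)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        proposed, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_step(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target verifies; keep the longest matching prefix.
        ctx = list(tokens)
        for t in proposed:
            if target_step(ctx) != t:
                break
            tokens.append(t)
            ctx.append(t)
        # 3. The target always contributes one token of its own, so each
        #    round yields between 1 and draft_len + 1 tokens.
        tokens.append(target_step(tokens))
    return tokens[len(prompt):]

# Toy models that both continue an arithmetic sequence, so every draft
# is accepted and each round yields draft_len + 1 tokens.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1
print(speculative_decode(target, draft, [0, 1, 2], max_new=8))
# → [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

When the draft and target agree, the target advances several tokens per forward pass; when they disagree, the loop degrades gracefully to one target token per round, which is why the acceptance rate drives the speedup.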
Expected Behavior & Potential Risk
Running the example_eagle.py script in the recommended way prints the generated text and the tokens-per-second throughput.
How has this PR been tested?
Tested on Intel PVC GPUs and CPUs