Conversation
Signed-off-by: Siddhi Velankar <[email protected]>
I don't have any questions about the PR itself, but would you mind if I ask a few questions about the algorithm? I've only had a quick look at a few papers in this area.
Hello, I am Yuhui Li, the author of the EAGLE paper, and I am here to answer your questions.
The acceptance rate determines how many tokens the target LLM generates per forward pass. EAGLE's draft model is slower than Medusa's, but the target LLM accepts more tokens each time, so the overall acceleration ratio is higher. Using MT-bench as the test dataset to speed up Vicuna 7B, EAGLE allows Vicuna 7B to accept an average of 3.86 tokens per forward pass, significantly more than the 2.51 tokens with Medusa. Since the target LLM (Vicuna 7B) is much larger than the draft model, the gain from the higher acceptance rate more than offsets the cost of the slower draft model, making EAGLE about 1.5x faster than Medusa.
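The tradeoff above can be made concrete with a rough cost model. This is an illustrative back-of-envelope sketch, not the exact accounting from the EAGLE paper, and the overhead values are made-up assumptions:

```python
# Rough speedup model for speculative decoding (illustrative only).
# One "round" = one draft phase plus one target forward pass; the
# target accepts avg_accepted tokens per round on average.
def estimated_speedup(avg_accepted, draft_overhead):
    """avg_accepted: tokens the target accepts per forward pass.
    draft_overhead: cost of one draft phase relative to one target pass
    (a hypothetical knob, not a number from the paper)."""
    return avg_accepted / (1.0 + draft_overhead)

# Assumed overheads: EAGLE's autoregressive draft model costs more per
# round than Medusa's parallel heads, but accepts more tokens.
eagle = estimated_speedup(3.86, draft_overhead=0.4)
medusa = estimated_speedup(2.51, draft_overhead=0.1)
print(f"EAGLE ~{eagle:.2f}x vs Medusa ~{medusa:.2f}x over plain decoding")
```

Even with a noticeably higher per-round draft cost, the larger number of accepted tokens leaves EAGLE ahead in this toy model.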
Of course. We can use the target model's output as the label and treat the draft model's prediction as a classification task. EAGLE's top-1 accuracy is about 0.8, while Medusa's is about 0.6.
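The measurement described here reduces to a simple comparison: at each position, does the draft's top-1 prediction match the token the target actually produced? A minimal sketch with a hypothetical helper (the names are not from the ITREX or EAGLE codebases):

```python
def top1_accuracy(draft_preds, target_tokens):
    """Fraction of positions where the draft model's top-1 prediction
    matches the token emitted by the target model."""
    matches = sum(p == t for p, t in zip(draft_preds, target_tokens))
    return matches / len(target_tokens)

# Toy token IDs: the draft disagrees with the target at one of five positions.
draft = [5, 17, 3, 42, 8]
target = [5, 17, 9, 42, 8]
print(top1_accuracy(draft, target))  # 0.8
```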
Yes, the tree structure is general. Using a tree structure tuned per model would undoubtedly give the best results, but EAGLE already performs quite well with a general tree structure.
Type of Change
Added a feature to use EAGLE (speculative sampling) with ITREX, as discussed with the ITREX team and Haim Barad from my team.
Added an example script showing how to use this feature
Added a README with instructions
API not changed
Description
Intel Extension for Transformers now supports EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed.
The EAGLE repository used and the research paper are linked in the README.
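For readers new to the technique, here is a minimal sketch of greedy speculative decoding, the general idea behind EAGLE. It is a simplified toy: real EAGLE drafts from the target's hidden features and verifies a token tree, and a real implementation verifies all drafted tokens in a single target forward pass rather than one call per token. All names here are stubs, not ITREX APIs:

```python
def speculative_decode(target_step, draft_step, prompt, draft_len=4, max_new=16):
    """target_step/draft_step: callables mapping a token list to the
    next greedy token (stand-ins for the large and small models)."""
    tokens = list(prompt)
    while len(tokens) - len(prompt) < max_new:
        # 1. The cheap draft model proposes draft_len tokens autoregressively.
        proposed, ctx = [], list(tokens)
        for _ in range(draft_len):
            t = draft_step(ctx)
            proposed.append(t)
            ctx.append(t)
        # 2. The target verifies; keep the longest matching prefix.
        ctx = list(tokens)
        for t in proposed:
            if target_step(ctx) != t:
                break
            tokens.append(t)
            ctx.append(t)
        # 3. The target always contributes one token of its own, so each
        #    round yields between 1 and draft_len + 1 tokens.
        tokens.append(target_step(tokens))
    return tokens[len(prompt):]

# Toy models that both continue an arithmetic sequence, so every draft
# is accepted and each round yields draft_len + 1 tokens.
target = lambda ctx: ctx[-1] + 1
draft = lambda ctx: ctx[-1] + 1
print(speculative_decode(target, draft, [0, 1, 2], max_new=8))
# → [3, 4, 5, 6, 7, 8, 9, 10, 11, 12]
```

When the draft and target agree, the target advances several tokens per forward pass; when they disagree, the loop degrades gracefully to one target token per round, which is why the acceptance rate drives the speedup.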
Expected Behavior & Potential Risk
Running the example_eagle.py script in the recommended way prints the generated text and the tokens-per-second throughput.
How has this PR been tested?
Tested on Intel PVC GPUs and CPUs