# Fast LLM Inference - Optimized Task Plan

I plan to implement several acceleration techniques for Large Language Models (LLMs). I enjoy building these myself and love the challenge of bringing research papers into real-world applications.

If there are any techniques you'd like to see developed or discussed, feel free to reach out. Thanks!

I'm excited to dive deeper into AI research!


## Update Log

### 2024

- 2024/12/16: Add the Medusa-1 Training Script v2.
- 2024/12/15: Add the Medusa-1 Training Script.
- 2024/12/12: Update the KV Cache support for Speculative Decoding.
- 2024/12/04: Add the Kangaroo Training Script v2.
- 2024/11/26: Add the Kangaroo Training Script.
- 2024/11/22: Update the Target Model Keep Generation Mechanism experiment.
- 2024/11/18: Update the Self-Speculative Decoding experiment results of google--gemma-2-9b-it.
- 2024/11/12: Reviewing implementation challenges for Self-Speculative Decoding and evaluating model compatibility for improved efficiency.
- 2024/11/10: Initial setup for Self-Speculative Decoding completed; data pipeline in place for testing draft-and-verify.
- 2024/11/08: Speculative Decoding successfully implemented; verified improved inference time with no noticeable accuracy degradation (a minimal draft-and-verify sketch follows this list).
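
For reference, below is a minimal sketch of the draft-and-verify loop that Speculative Decoding is built on. It is a simplified greedy-verification variant, not this repository's implementation: `draft_logits_fn`, `target_logits_fn`, and the `gamma` draft length are hypothetical names introduced for illustration, and the original method uses probabilistic acceptance rather than exact greedy matching.

```python
import torch

def speculative_decode(target_logits_fn, draft_logits_fn, input_ids,
                       gamma=4, max_new_tokens=64):
    """Greedy draft-and-verify loop (simplified sketch).

    target_logits_fn / draft_logits_fn are hypothetical callables mapping a
    1-D LongTensor of token ids to logits of shape (seq_len, vocab_size).
    """
    ids = input_ids.clone()
    while ids.numel() - input_ids.numel() < max_new_tokens:
        # 1. Draft: the small model proposes `gamma` tokens autoregressively.
        draft_ids = ids.clone()
        for _ in range(gamma):
            next_tok = draft_logits_fn(draft_ids)[-1].argmax()
            draft_ids = torch.cat([draft_ids, next_tok.view(1)])
        # 2. Verify: a single target forward pass scores every drafted position.
        tgt_logits = target_logits_fn(draft_ids)
        for i in range(gamma):
            tgt_tok = tgt_logits[ids.numel() - 1 + i].argmax()
            if tgt_tok != draft_ids[ids.numel() + i]:
                # 3. Mismatch: keep the accepted prefix plus the target's token.
                ids = torch.cat([draft_ids[: ids.numel() + i], tgt_tok.view(1)])
                break
        else:
            # All drafts accepted; append one bonus token from the target.
            ids = torch.cat([draft_ids, tgt_logits[-1].argmax().view(1)])
    return ids
```

The speed-up comes from step 2: one target forward pass validates up to `gamma` tokens at once, so the expensive model runs far fewer times than in plain autoregressive decoding.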

## Pending Decisions

- Batched Speculative Decoding
- Prompt lookup decoding: Determine the timeline after reviewing initial implementations (see the n-gram lookup sketch after this list).
- UAG Integration: Assess when to integrate after Medusa and Kangaroo are in place.
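
As context for the prompt lookup decoding item above, here is a rough sketch of its core idea: instead of running a draft model, draft tokens are proposed by matching the current n-gram suffix against earlier occurrences in the context. The function and parameter names are illustrative assumptions, not this repository's API; the returned candidates would then go through the same verify step as ordinary speculative decoding.

```python
def prompt_lookup_draft(input_ids, ngram_size=3, num_draft=5):
    """Propose draft tokens by matching the trailing n-gram earlier in the context.

    input_ids: list of token ids (the running sequence). Returns up to
    `num_draft` candidate continuation tokens, or [] when no match exists.

    e.g. prompt_lookup_draft([5, 7, 9, 2, 5, 7, 9]) -> [2, 5, 7, 9]
    """
    if len(input_ids) < ngram_size:
        return []
    tail = input_ids[-ngram_size:]
    # Scan right-to-left so the most recent earlier occurrence wins.
    for start in range(len(input_ids) - ngram_size - 1, -1, -1):
        if input_ids[start:start + ngram_size] == tail:
            cont = input_ids[start + ngram_size : start + ngram_size + num_draft]
            if cont:
                return cont
    return []
```

Because there is no draft model to run, this drafting step is essentially free; it pays off mainly on inputs with repeated spans, such as code editing or retrieval-heavy prompts.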

## TODO List

### November 2024

### Additional Enhancements