EAGLE3 is an encoder-decoder based speculative decoding method:

- Extracts features from the target model at specific layers
- Uses a feature fusion layer to compress target features
- Generates draft tokens with a single-layer decoder
- Maps the draft vocabulary to the target vocabulary via a `d2t` tensor

Key changes:

- Add `LLM_ARCH_EAGLE3` architecture
- Add EAGLE3 encoder/decoder graph (`src/models/eagle3.cpp`)
- Add feature extraction from target model layers
- Add `g_embeddings` handling for decoder input
- Add `GGML_TENSOR_FLAG_SYNC` for GPU synchronization
- Add `--eagle3` flag for the speculative-simple example
- Add EAGLE3 model conversion in `convert_hf_to_gguf.py`
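The draft-to-target vocabulary mapping can be pictured as a simple lookup: the `d2t` tensor is indexed by draft-vocabulary token id and yields the corresponding target-vocabulary id. A minimal sketch (the table contents and sizes here are made up for illustration; real EAGLE3 checkpoints ship `d2t` as a tensor sized to the draft vocabulary):

```python
# Hypothetical d2t lookup table: index = draft-vocab id, value = target-vocab id.
# A real EAGLE3 draft vocabulary is much larger; 4 entries are for illustration.
d2t = [0, 5, 17, 42]

def draft_to_target(draft_ids):
    """Map draft-model token ids into the target model's vocabulary."""
    return [d2t[i] for i in draft_ids]

print(draft_to_target([2, 0, 3]))  # [17, 0, 42]
```

Because the draft vocabulary can be much smaller than the target's, the draft head stays cheap while its outputs remain valid target tokens after the lookup.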
Performance Analysis Summary: PR #568 - EAGLE3 Speculative Decoding

Overview

PR #568 implements EAGLE3 speculative decoding support across 25 files with 1,119 additions. The changes introduce a new encoder-decoder architecture for draft token generation, enabling a 2-3x inference speedup through speculative execution. Analysis reveals localized performance variations in argument-parsing functions with negligible impact on core inference paths.

Key Findings

Performance-critical areas reviewed: core inference functions (no modifications detected), feature extraction, draft generation, and draft-to-target vocabulary mapping.

Argument Parsing Variations

Five argument-parsing operators show throughput increases ranging from 54 ns to 83 ns (absolute deltas). These are one-time startup costs during command-line processing. Operator E44 improved from 56 ns to 11 ns (-45 ns), demonstrating selective optimization. These nanosecond-level variations occur once at startup and do not affect inference performance.

Power Consumption Analysis

Binary-level power consumption changes remain under 0.14%. The minimal power consumption delta indicates an efficient implementation, with localized computational overhead concentrated in EAGLE3-specific code paths.

GPU Synchronization

Tokens Per Second Impact

Standard inference: zero impact; core tokenization and inference functions remain unmodified.

EAGLE3 mode: achieves a 2-3x speedup (e.g., 44.5 t/s → 146.2 t/s for LLaMA 3.1 8B BF16) through speculative decoding. The speedup derives from accepting multiple draft tokens per target model invocation, effectively reducing per-token latency despite the additional feature extraction and draft generation overhead.
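The "multiple draft tokens per target invocation" argument can be made concrete with a small, idealized cost model. The numbers below are illustrative assumptions, not measurements from this PR: we assume each target pass verifies a batch of drafts and that running the draft model costs some fraction of one target pass.

```python
def effective_speedup(avg_accepted, draft_cost_frac):
    """Idealized speculative-decoding speedup.

    Each target-model pass now yields (1 + avg_accepted) tokens instead of 1,
    at the price of also running the draft model, modeled here as a fraction
    of one target pass. Both inputs are assumptions for illustration.
    """
    tokens_per_pass = 1 + avg_accepted
    cost_per_pass = 1 + draft_cost_frac
    return tokens_per_pass / cost_per_pass

# e.g. ~3 accepted drafts per pass, draft overhead ~20% of a target pass:
print(round(effective_speedup(3, 0.2), 2))  # 3.33
```

Under these made-up parameters the model lands in the 2-3x+ range reported above; the real ratio depends on the measured acceptance rate and draft-model cost.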
Mirrored from ggml-org/llama.cpp#18039
As discussed in ggml-org/llama.cpp#15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp significantly improves inference performance, achieving a 2–3× speedup, and strengthens its competitiveness among leading inference frameworks.
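The core mechanism behind that speedup is the draft-and-verify loop: the draft model proposes several tokens cheaply, the target model verifies them in one pass, and the longest agreeing prefix is kept. A toy sketch of greedy verification (this is an illustration of the general technique, not llama.cpp's implementation):

```python
def accept_prefix(draft_tokens, target_tokens):
    """Keep the longest prefix of draft tokens the target model agrees with.

    Both arguments are equal-length lists of token ids: the draft model's
    proposals and the target model's greedy picks at the same positions.
    On the first disagreement, the target's own token replaces the draft
    token and speculation stops for this round.
    """
    accepted = []
    for d, t in zip(draft_tokens, target_tokens):
        if d == t:
            accepted.append(d)
        else:
            accepted.append(t)  # target's correction ends the speculation
            break
    return accepted

print(accept_prefix([7, 8, 9, 4], [7, 8, 2, 4]))  # [7, 8, 2]
```

Every accepted draft token is one target-model forward pass saved, which is where the per-token latency reduction comes from.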
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.
The following provides a brief overview of this PR:
EAGLE3 is an encoder-decoder based speculative decoding method:
Key changes:
EAGLE3 Architecture Overview:
How to run EAGLE3 in llama.cpp
Requirements
This PR currently supports only two EAGLE3 models:
Step 1: Convert Models to GGUF Format
Step 2: Compile llama.cpp
Step 3: Run EAGLE3 Speculative Decoding
Performance Evaluation (RTX A6000 48GB)
Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.
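To make the chat_template advice concrete, here is a sketch of rendering messages in a Llama-3-style format. The special-token strings follow the published Llama 3 template, but in practice the exact template must be taken from the model's own tokenizer configuration (e.g. via the tokenizer's chat-template support) rather than hard-coded like this:

```python
def llama3_style_prompt(messages):
    """Render chat messages with Llama-3-style headers (illustrative only;
    use the model's own chat_template in real runs)."""
    parts = ["<|begin_of_text|>"]
    for m in messages:
        parts.append(f"<|start_header_id|>{m['role']}<|end_header_id|>\n\n")
        parts.append(m["content"] + "<|eot_id|>")
    # Leave the assistant turn open so the model generates the reply.
    parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)

p = llama3_style_prompt([{"role": "user", "content": "Hi"}])
print(p.startswith("<|begin_of_text|><|start_header_id|>user"))  # True
```

Prompts formatted the way the draft and target models were trained keep their next-token distributions aligned, which is what raises the draft acceptance rate.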
Each target model was quantized with Q4_K_M, and its Eagle3 draft model with Q4_K_M as well.

Details of GGML backend modifications
In the Eagle3 decoder, two parallel inputs are processed:
When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto‑syncs between subgraphs).
Solution:
Use `ggml_set_sync()` to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing. This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.
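The effect of that synchronization point can be illustrated with a thread-based analogy (this is a conceptual sketch in Python, not ggml code): without an explicit barrier, two consumers of a shared buffer may interleave arbitrarily; an event after the first operation forces the second to observe its result, mirroring how the sync flag forces a split boundary before the second RMS_NORM runs.

```python
import threading

buffer = [0]
first_done = threading.Event()
log = []

def first_norm():
    buffer[0] += 1              # first RMS_NORM-like pass writes the buffer
    log.append(("first", buffer[0]))
    first_done.set()            # the "sync point" after the first op

def second_norm():
    first_done.wait()           # forced to run only after the sync point
    log.append(("second", buffer[0]))

t2 = threading.Thread(target=second_norm)
t2.start()
t1 = threading.Thread(target=first_norm)
t1.start()
t1.join()
t2.join()
print(log)  # [('first', 1), ('second', 1)]
```

Without the event, the second reader could observe the buffer before or during the first write; the explicit ordering point removes the race, which is the same role the split boundary plays on the GPU.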
Example results
Future Steps