UPSTREAM PR #18039: [Speculative decoding] feat: add EAGLE3 speculative decoding support #568

Open
loci-dev wants to merge 1 commit into main from upstream-PR18039-branch_ichbinhandsome-eagle3-adapt-new-arch
Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#18039

As discussed in ggml-org/llama.cpp#15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp strengthens its competitiveness among leading inference frameworks: with Eagle3 speculative decoding, inference performance improves significantly, achieving a 2–3× speedup.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.

The following provides a brief overview of this PR:

EAGLE3 is an encoder-decoder based speculative decoding method:

  • Extracts features from target model at specific layers
  • Uses feature fusion layer to compress target features
  • Generates draft tokens with single-layer decoder
  • Maps draft vocabulary to target vocabulary via d2t tensor
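The last point deserves a note: the draft model samples over its own, smaller vocabulary, and every draft token id has to be translated to a target-vocabulary id before verification. Below is a minimal sketch of that lookup; whether d2t stores absolute target ids or offsets relative to the draft id is a detail of the conversion, so absolute ids are assumed here and the function name is illustrative rather than the PR's API.

#include <cstdint>
#include <vector>

// Translate a token id sampled from the EAGLE3 draft vocabulary into the
// target model's vocabulary using the d2t table shipped with the draft model.
static int32_t draft_to_target(const std::vector<int32_t> & d2t, int32_t draft_id) {
    // One entry per draft-vocabulary token, assumed to hold the
    // corresponding absolute target-vocabulary id.
    return d2t.at(draft_id);
}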

Key changes:

  • Add LLM_ARCH_EAGLE3 architecture
  • Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
  • Add feature extraction from target model layers
  • Add g_embeddings handling for decoder input
  • Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
  • Add --eagle3 flag for speculative-simple example
  • Add EAGLE3 model conversion in convert_hf_to_gguf.py

EAGLE3 Architecture Overview:

┌─────────────────────────────────────────────────────────────────┐
│                    EAGLE3 Overview                              │
└─────────────────────────────────────────────────────────────────┘

  Target Model          EAGLE3 Encoder         EAGLE3 Decoder
  (LLaMA 8B)              (FC Layer)           (1-layer Transformer)
       │                      │                       │
       │                      │                       │
       ▼                      ▼                       ▼
┌─────────────┐        ┌─────────────┐        ┌─────────────────┐
│  Generate   │        │  Compress   │        │  Generate Draft │
│  Features   │───────►│  Features   │───────►│  Tokens Fast    │
│  [12288]    │        │  [4096]     │        │  [k tokens]     │
└─────────────┘        └─────────────┘        └────────┬────────┘
                                                       │
                                                       ▼
                                              ┌─────────────────┐
                                              │  Verify Drafts  │
                                              │  with Target    │
                                              └─────────────────┘
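The final "Verify Drafts with Target" step amounts to committing the longest prefix of the draft that the target model itself would have produced. A minimal sketch of that greedy acceptance rule follows, with illustrative names rather than the PR's actual API.

#include <cstddef>
#include <vector>

// Compare the k draft tokens against the target model's own greedy pick at
// each of those positions (all obtained from a single batched target forward
// pass) and return how many leading draft tokens can be committed.
static size_t accept_draft_prefix(const std::vector<int> & draft_tokens,
                                  const std::vector<int> & target_greedy) {
    size_t n_accept = 0;
    while (n_accept < draft_tokens.size() &&
           n_accept < target_greedy.size() &&
           draft_tokens[n_accept] == target_greedy[n_accept]) {
        ++n_accept;
    }
    // The target's token at position n_accept is then appended as well,
    // so at least one token is committed per verification pass.
    return n_accept;
}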

How to run EAGLE3 in llama.cpp

Requirements

This PR currently supports only two EAGLE3 models:

Step 1: Convert Models to GGUF Format

  • Convert Target Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
TARGET_MODEL_GGUF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct_bf16.gguf"

python convert_hf_to_gguf.py \
    "${TARGET_MODEL_HF}" \
    --outtype bf16 \
    --outfile "${TARGET_MODEL_GGUF}"
  • Convert EAGLE3 Draft Model
TARGET_MODEL_HF="${MODELS_DIR}/Meta-Llama-3.1-8B-Instruct"
EAGLE3_MODEL_HF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B"
EAGLE3_MODEL_GGUF="${MODELS_DIR}/EAGLE3-LLaMA3.1-Instruct-8B_fp16.gguf"

python convert_hf_to_gguf.py \
    "${EAGLE3_MODEL_HF}" \
    --outtype f16 \
    --target-model-dir "${TARGET_MODEL_HF}" \
    --outfile "${EAGLE3_MODEL_GGUF}"

Step 2: Compile llama.cpp

cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release

Step 3: Run EAGLE3 Speculative Decoding

for prompt in \
    "Write a quicksort algorithm in Python. Write code only." \
    "Explain the Pythagorean theorem" \
    "Plan a 1 day trip to DC"; do
  echo "=== Prompt: $prompt ==="
    ./build/bin/llama-speculative-simple \
      -m "${TARGET_MODEL_GGUF}" \
      -md "${EAGLE3_MODEL_GGUF}" \
      --eagle3 -p "$prompt" -n 256 --draft 8 \
      --temp 0 --top-k 1 --seed 42 -ngl 99 -ngld 99 
done

Performance Evaluation (RTX A6000 48GB)

Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.

  • LLaMA3.1-Instruct-8B in BF16, with its EAGLE3 draft in FP16

| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 44.5 t/s | 146.2 t/s | 80.6% | 3.28x |
| Explain the Pythagorean theorem | 44.5 t/s | 126.8 t/s | 77.7% | 2.85x |
| Plan a 1 day trip to DC | 44.5 t/s | 111.8 t/s | 78.4% | 2.51x |

  • LLaMA3.1-Instruct-8B in Q4_K_M, with its EAGLE3 draft in Q4_K_M

| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 121.5 t/s | 260.5 t/s | 83.6% | 2.14x |
| Explain the Pythagorean theorem | 121.4 t/s | 232.4 t/s | 78.6% | 1.91x |
| Plan a 1 day trip to DC | 121.4 t/s | 186.8 t/s | 71.5% | 1.54x |

  • LLaMA3.3-Instruct-70B in Q4_K_M, with its EAGLE3 draft in Q4_K_M

| Prompt | Baseline (llama-cli) | EAGLE3 (draft_size=8) | Accept Rate | Speedup |
| --- | --- | --- | --- | --- |
| Write a quicksort algorithm in Python. Write code only. | 15.6 t/s | 31.7 t/s | 67.7% | 2.03x |
| Explain the Pythagorean theorem | 15.6 t/s | 37.1 t/s | 80.8% | 2.38x |
| Plan a 1 day trip to DC | 15.6 t/s | 29.9 t/s | 73.5% | 1.92x |
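As a rough sanity check on these tables (not part of the PR), the standard speculative-decoding estimate for the expected number of tokens committed per target-model pass, assuming independent per-token acceptance with rate a and draft size k, is:

E[tokens per target pass] = (1 - a^(k+1)) / (1 - a)

With a ≈ 0.8 and k = 8 this gives roughly 4.3 tokens per pass, so the measured 1.5-3.3x end-to-end speedups are plausible once the draft model's own forward passes and the vocabulary-mapping overhead are accounted for.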

Details of GGML backend modifications

In the Eagle3 decoder, two parallel inputs are processed:

input_embeds ──→ RMS_NORM ──┐
                            ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ──→ RMS_NORM ──┘

When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto‑syncs between subgraphs).

Solution:
Use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing.

input_embeds ──→ RMS_NORM ──→ [SYNC] ──┐
                                       ├──→ CONCAT ──→ Transformer Decoder
g_embeddings ─────────────→ RMS_NORM ──┘
         (split 1)            |         (split 2)
                           barrier

This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.
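A minimal sketch of how the sync point can be placed when building the decoder graph is shown below. It assumes the PR exposes a ggml_set_sync(tensor) helper that sets GGML_TENSOR_FLAG_SYNC, analogous to ggml_set_input()/ggml_set_output(); this is not the PR's literal graph-building code.

#include "ggml.h"

// Build the concatenated decoder input from the two parallel branches.
static struct ggml_tensor * build_eagle3_decoder_input(
        struct ggml_context * ctx,
        struct ggml_tensor  * input_embeds,   // token embeddings [n_embd, n_tokens]
        struct ggml_tensor  * g_embeddings,   // fused target features [n_embd, n_tokens]
        float                 rms_eps) {
    // Branch 1: normalize the token embeddings.
    struct ggml_tensor * a = ggml_rms_norm(ctx, input_embeds, rms_eps);

    // Force a split boundary after this node so the scheduler synchronizes
    // before both branches feed the concat (avoids the GPU race described above).
    ggml_set_sync(a); // assumed helper introduced by this PR

    // Branch 2: normalize the fused target-model features.
    struct ggml_tensor * b = ggml_rms_norm(ctx, g_embeddings, rms_eps);

    // Concatenate along the embedding dimension and hand off to the decoder.
    return ggml_concat(ctx, a, b, 0);
}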

Example results

  • Prompt: "Write a quicksort algorithm in Python. Write code only." (output screenshot)
  • Prompt: "Explain the Pythagorean theorem" (output screenshot)
  • Prompt: "Plan a 1 day trip to DC" (output screenshot)

Future Steps

  • Support more Eagle3 models
  • Currently, Eagle3 is integrated only into llama-speculative-simple; support may need to be extended to other APIs where possible
  • Support context-dependent tree sampling (tree attention) as described in the Eagle3 paper to improve accept rate
  • Support batch processing (batch size > 1) with Eagle3 speculative decoding

@loci-review

loci-review bot commented Dec 14, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #568 - EAGLE3 Speculative Decoding

Overview

PR #568 implements EAGLE3 speculative decoding support across 25 files with 1,119 additions. The changes introduce a new encoder-decoder architecture for draft token generation, enabling 2-3x inference speedup through speculative execution. Analysis reveals localized performance variations in argument parsing functions with negligible impact on core inference paths.

Key Findings

Performance-Critical Area Impact

Core Inference Functions: No modifications detected to llama_decode, llama_encode, or llama_tokenize. The PR adds parallel EAGLE3 execution paths without altering existing inference logic. Response time and throughput for standard inference remain unchanged, indicating zero impact on tokens per second for non-EAGLE3 workloads.

Feature Extraction: New extract_eagle3_features() function extracts 3 layers of embeddings (approximately 49 KB per token for LLaMA 3.1 8B). This operation occurs during target model forward pass and is amortized across multiple draft tokens generated by EAGLE3 decoder.
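(For reference: 3 layers × 4096 hidden dimensions × 4 bytes, assuming the extracted embeddings are kept in f32, is 49,152 bytes ≈ 49 KB per token, matching the figure above.)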

Draft Generation: gen_eagle3_draft() implements KV cache reuse, processing only incremental tokens (n_new) rather than full sequences. With acceptance rates of 67-84%, the draft overhead is offset by reduced target model invocations.

Vocabulary Mapping: Draft-to-target vocabulary mapping in llama_context::decode() adds per-token overhead through synchronous tensor retrieval and sparse mapping loop. The d2t mapping is cached after first use, limiting repeated overhead.

Argument Parsing Variations

Five argument parsing operators show changes in per-call time; four regressed, with absolute deltas ranging from 27 ns to 127 ns. These are one-time startup costs incurred during command-line processing:

  • Operator E54: 32 ns → 115 ns (+83 ns)
  • Operator E0: 116 ns → 243 ns (+127 ns)
  • Operator E46: 51 ns → 106 ns (+55 ns)
  • Operator E51: 50 ns → 77 ns (+27 ns)

Operator E44 improved from 56 ns to 11 ns (-45 ns), demonstrating selective optimization. These nanosecond-level variations occur once at startup and do not affect inference performance.

Power Consumption Analysis

Binary-level power consumption changes remain under 0.14%:

  • build.bin.llama-cvector-generator: +0.138%
  • build.bin.llama-run: +0.093%
  • build.bin.llama-tts: +0.073%
  • build.bin.libllama.so: -0.025%

The minimal power consumption delta indicates efficient implementation with localized computational overhead concentrated in EAGLE3-specific code paths.

GPU Synchronization

Introduction of GGML_TENSOR_FLAG_SYNC mechanism forces split boundaries in EAGLE3 decoder to prevent race conditions during parallel RMS_NORM operations. This targeted synchronization applies only to specific tensors, avoiding global GPU pipeline stalls. Benchmarks confirm the synchronization overhead does not compromise the 2-3x speedup.

Tokens Per Second Impact

Standard Inference: Zero impact. Core tokenization and inference functions remain unmodified.

EAGLE3 Mode: Achieves 2-3x speedup (e.g., 44.5 t/s → 146.2 t/s for LLaMA 3.1 8B BF16) through speculative decoding. The speedup derives from accepting multiple draft tokens per target model invocation, effectively reducing per-token latency despite additional feature extraction and draft generation overhead.
