[Speculative decoding] feat: add EAGLE3 speculative decoding support#18039
ichbinhandsome wants to merge 12 commits into ggml-org:master
Conversation
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor (a rough sketch of this mapping is shown below)

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
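For readers unfamiliar with the d2t step, here is a minimal sketch of the idea. This is a hypothetical helper, not code from this PR, and whether d2t stores per-token offsets or direct ids is an assumption that depends on the checkpoint:

```cpp
// Hypothetical sketch of the draft-to-target vocabulary mapping (not the PR's code).
// Assumption: d2t is a per-draft-token offset table loaded from the draft GGUF,
// so target_id = draft_id + d2t[draft_id]. If a checkpoint stores direct ids
// instead, the lookup would simply be d2t[draft_id].
#include <cstdint>
#include <vector>

int32_t eagle3_draft_to_target(const std::vector<int32_t> & d2t, int32_t draft_id) {
    return draft_id + d2t[draft_id];
}
```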
Judging by the description of this PR, I believe many models with multi-token prediction also have the same strategy of reusing hidden features from the main model. It could be quite interesting to generalize this feature to support other models. I would expect some kind of sub-…
I will definitely be looking at refactoring the implementation to become more generic before merging it. The initial results in terms of performance are really great, but we'll need to work on cleaning up the code and reducing the special-casing in several places. I'll try to provide insights on how to do that in the next days.
Thanks @ggerganov @ngxson for your inputs. Definitely looking forward to hearing your feedback and improving this PR.
```cpp
// TODO: refactor into llm_graph_input
ggml_tensor * inp_g = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd, n_tokens);
ggml_set_input(inp_g);
cb(inp_g, "inp_g_embeddings", -1); // TODO: do not change the name! refactor into llm_graph_input
```
I will change this to llm_graph_input in order to remove the extra "set input" logic in llama_context::process_ubatch.
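A rough sketch of what that refactor could look like, assuming the usual llm_graph_input pattern in the codebase; class name, members, and signatures here are illustrative assumptions, not the final implementation:

```cpp
// Illustrative only: an input object that owns the g_embeddings tensor and copies
// the staged host data in set_input(), so llama_context::process_ubatch needs no
// EAGLE3-specific branch.
#include <vector>

class llm_graph_input_g_embd : public llm_graph_input_i {
public:
    ggml_tensor *      g_embd = nullptr;  // [n_embd, n_tokens]
    std::vector<float> staged;            // filled by the caller before decode

    void set_input(const llama_ubatch * ubatch) override {
        GGML_UNUSED(ubatch);
        GGML_ASSERT(g_embd != nullptr && staged.size() == (size_t) ggml_nelements(g_embd));
        ggml_backend_tensor_set(g_embd, staged.data(), 0, ggml_nbytes(g_embd));
    }
};
```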
```cpp
// EAGLE3: Extract intermediate layer features from target model at layer INPUT
if (eagle3 && cparams.eagle3_extract_enabled && !eagle3->extract_layer_indices.empty()) {
    static const char * eagle3_extract_names[] = {"eagle3_extract_0", "eagle3_extract_1", "eagle3_extract_2"};
    for (size_t i = 0; i < eagle3->extract_layer_indices.size() && i < 3; ++i) {
        if (eagle3->extract_layer_indices[i] == il) {
            cb(inpL, eagle3_extract_names[i], il);
            break;
        }
    }
}
```
I will next look into removing this ad hoc logic and generalizing it in some way, likely by passing the extraction points in a more generic way during llama_context creation. TBD
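One hypothetical shape for that generic mechanism, purely as an illustration (not part of this PR; the struct and field names are assumptions):

```cpp
// Illustrative only: the caller asks for certain layer outputs to be captured at
// context-creation time, instead of the graph build hard-coding eagle3_extract_* names.
#include <cstdint>
#include <cstddef>

struct llama_layer_capture_params {
    const int32_t * layer_indices; // which layers' input hidden states to capture
    size_t          n_layers;      // number of entries in layer_indices
};

// usage sketch: capture the three layers an EAGLE3 draft model needs
// const int32_t layers[] = {2, 16, 29};              // example indices, model-dependent
// llama_layer_capture_params cap = { layers, 3 };
```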
```cpp
// EAGLE3 draft model - target model hidden size
uint32_t eagle3_target_hidden_size = 0;
```
This can become more generic by renaming it to n_embd_enc and utilizing the n_embd_inp() call.
```cpp
// Get pointer to target model features extracted for EAGLE3 encoder
// Returns NULL if no features are available
// Format: [3*n_embd, n_tokens] - use model.hparams.n_embd and batch.n_tokens for dimensions
LLAMA_API const float * llama_get_eagle3_target_features(struct llama_context * ctx);
```
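A hedged usage sketch of this call, based only on the format documented in the comment above; the surrounding decode loop and error handling are omitted, and llama_model_n_embd is assumed to be the standard getter:

```cpp
#include <cstdint>
#include <vector>

// Copy the captured target features into a host buffer; returns an empty vector
// if nothing was captured for the last batch.
std::vector<float> fetch_target_features(llama_context * tgt_ctx, const llama_model * tgt_model, int32_t n_tokens) {
    const float * feats = llama_get_eagle3_target_features(tgt_ctx);
    if (feats == nullptr) {
        return {};
    }
    const int32_t n_embd = llama_model_n_embd(tgt_model); // assumed standard getter
    // layout per the comment above: n_tokens columns of 3*n_embd floats each
    return std::vector<float>(feats, feats + (size_t) 3*n_embd*n_tokens);
}
```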
This call should become more generic and not Eagle3 specific. Will be looking how to achieve this in the best way.
```cpp
// Set g_embeddings from EAGLE3 encoder output for decoder input
// g_embd: pointer to encoder output embeddings
LLAMA_API void llama_set_eagle3_g_embeddings(
        struct llama_context * ctx,
        const float * g_embd,
        int32_t n_embd,
        int32_t n_tokens);
```
Might be possible to avoid this API if we combine the Eagle encoder and decoder in a single context. TBD
When combining the Eagle3 encoder and decoder into a single context, note that the Eagle3 encoder is used only to fuse the extracted features from the target model, i.e. it is invoked as many times as the target model itself. The Eagle3 decoder, on the other hand, is solely responsible for generating draft tokens in an autoregressive way.
llama_set_eagle3_g_embeddings() sets the g_embeddings both from the Eagle3 encoder (used in the first generation step of the Eagle3 decoder) and from the Eagle3 decoder itself (used in subsequent generation steps).
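To make that interaction concrete, here is a rough sketch of one draft-generation round under the description above. run_encoder() and decode_one_draft() are placeholders (not real API), and all sampling and batching details are elided:

```cpp
void eagle3_generate_drafts(llama_context * tgt_ctx, llama_context * enc_ctx, llama_context * dec_ctx,
                            int32_t n_embd, int32_t n_tokens, int32_t n_draft) {
    // 1) fuse the features captured from the target model with the Eagle3 encoder
    const float * feats = llama_get_eagle3_target_features(tgt_ctx); // [3*n_embd, n_tokens]
    run_encoder(enc_ctx, feats, n_tokens);                           // placeholder

    // 2) the first decoder step is seeded with the encoder output ...
    llama_set_eagle3_g_embeddings(dec_ctx, llama_get_embeddings(enc_ctx), n_embd, n_tokens);
    decode_one_draft(dec_ctx);                                       // placeholder

    // 3) ... subsequent steps feed the decoder's own output embeddings back
    for (int32_t i = 1; i < n_draft; ++i) {
        llama_set_eagle3_g_embeddings(dec_ctx, llama_get_embeddings(dec_ctx), n_embd, 1);
        decode_one_draft(dec_ctx);                                   // placeholder
    }
}
```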
Yup, I noticed this interaction. We don't have a previous use case similar to this, but I think the enc-dec context could be adapted accordingly.
Bumping, is there any progress on this? It's probably one of the more coveted features to have right now.
I'm currently side-tracked by some graph reallocation optimizations. Will probably come back to this after that.
Eagle3 checkpoints for the Qwen3 series (including both dense and MoE models) are now supported, see the updated PR description for details.
One question: it seems that CUDA Graphs are disabled when the input has n_tokens > 1. During the target model verification stage of speculative decoding, CUDA Graphs are therefore always disabled for the target model, since it is only used for verification with more than one draft token. However, could we fix the number of draft tokens (e.g., by using padding) to make it constant and thus enable CUDA Graphs (this may require removing the n_tokens > 1 constraint)? @ggerganov

Context: I'm testing GPT-OSS-120B Eagle3 with llama.cpp, and I found that even with Eagle3 (accept rate 86%), the performance is worse than the naive llama-cli. After profiling, I discovered that CUDA Graphs are consistently disabled for the target model during speculative decoding, whereas they remain enabled in llama-cli. This results in the target model's verification (prefill) phase being roughly more than 5x slower compared to a normal autoregressive decoding step. I've only observed this performance issue with GPT-OSS-120B Eagle3. For other models, even without CUDA Graphs enabled for the target model during Eagle3 speculative decoding, the performance remains great.
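A minimal sketch of the padding idea mentioned above; this is not implemented in this PR, and the choice of pad token and the masking strategy are assumptions:

```cpp
#include <cstdint>
#include <vector>

// Keep the verification batch at a fixed size so its graph shape stays constant
// across steps (a prerequisite for reusing a captured CUDA graph).
std::vector<int32_t> pad_draft_batch(std::vector<int32_t> draft, int32_t pad_token, size_t n_fixed) {
    while (draft.size() < n_fixed) {
        draft.push_back(pad_token); // logits at padded positions are simply ignored
    }
    return draft;
}
```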
I think the small-batch …
Possibly, but to me this sounds like a second-order optimization. Optimizing the …
Hm, this is a bit of a surprising observation. Can you run:

```
llama-batched-bench -m [gpt-oss-120b] -c 65536 -b 2048 -ub 512 -npp 1024 -ntg 32 -npl 1,2,3,4,5,6,7,8
```
Thanks very much for your inputs! @ggerganov
I double-checked the run today. The previous statement about CUDA Graphs was incorrect due to instability and concurrent CPU activity in my test environment, sorry about that! Currently, enabling or disabling CUDA Graphs doesn't have much impact in llama-cli for the GPT-OSS-120B model. (I am testing on DGX Spark.)
Also, the results for llama-batched-bench:
I agree. CUDA Graphs could be a second-order optimization.
For MoE models, prefill becomes the main performance bottleneck because more active experts are involved. As a result, the assumption that "processing multiple draft tokens concurrently is as fast as processing a single token" no longer holds, which is an important condition for effective speculative decoding. I also saw that as the draft token length increases, the verification cost of the target model also rises. Do you have any rough idea of how much performance gain we can get through improving …
I suppose the explanation is that for MoE models, at low batch sizes the amount of data we need to read from the weights for each batch increases linearly with the batch size (i.e. each extra token in the batch activates more experts, and at small batch sizes the experts for each token are very likely different from each other). So it's probably normal that TG for MoE does not scale as well as TG for dense models as a function of the batch size.
Yeah, that's my guess as well. Do we have some references to cross-check this? Do the Eagle3 authors discuss its performance for MoE models? Do we have sample numbers for …
Hm, not sure. Thinking about it now, I feel like …
I tested the Baichuan-M3-235B model that was released yesterday (draft here). It's a finetune of the Qwen3 model above. It quantized successfully but failed due to having a different tensor shape (even in the original weights). I haven't looked into how often this happens in finetunes of the same model, especially in the context of eagle3. However, the shapes of the tensors changing might be something to account for in the implementation (in this case Qwen3). Unless those will be treated as completely new models, in which case please disregard this comment.
I spent some time analyzing the Baichuan-EAGLE3 draft model. It has a slightly different architecture compared to the standard Qwen3-EAGLE3 model.
This is because Baichuan-EAGLE3 uses an Attention Output Gate mechanism, which is not present in the standard EAGLE3 model. In this variant:
This is essentially a variant architecture of EAGLE3, not just a tensor shape difference. Supporting this variant would require:
I would suggest we focus this PR on the standard EAGLE3 model first. Once merged, we can consider adding support for this gated variant in a follow-up PR. Have you tested the standard Qwen3-EAGLE3 model as well? Does it work well with the current implementation? If yes, could you please share the t/s and speedup you got with eagle3? @arch-btw
Since EAGLE3 can vary quite a lot for each model, maybe a better way is to consider it as an adapter (the same logic as a lora adapter), instead of a dedicated arch? That way, it can hook into existing models more easily, making internal data like KV state, gates, etc., accessible to the draft model.
Good point. However, Eagle3 doesn’t vary much across models. So far, except for Baichuan-Eagle3, all other models essentially use the same Eagle3 architecture. Please refer to the supported models listed in the PR description. I’d say the majority of models share the same Eagle3 architecture, with only a few exceptions. This standalone Eagle3 architecture strategy is also adopted in TensorRT-LLM, vLLM, and SGLang.
I doubt that. In theory, nothing prevents them or another team from making a variant of eagle3 that gets the state of more than 3 layers, or even reuses the KV state from earlier layers. The possibilities are endless, and that's why it's important to think about the bigger picture instead of just trying to make it work with one single existing architecture. I think a more model-agnostic approach via the adapter API (or another API based on that form) will ultimately be the way to go. It will allow computing both the next token and the draft token in one pass, allowing even higher performance than this approach.
Could you please share some examples or real-world use cases of this? I’d like to better understand how such an approach might be applied in practice.
The main problem with this PR and #15225 is that both assume that MTP (multi-token prediction) works this way:
(Note: the dashed line indicates that this may not be the case for all models; some only use the last hidden state.)
While it does work for the moment, this approach doesn't address the true nature of MTP models. In other words, it is not truly model-agnostic. The main drawback is that you must manually pass the embeddings between the 2 models, so you must know where to get the embeddings, their shapes, etc. Instead, we should look at MTP models as a normal LLM with multiple output heads:
In this POV, it doesn't matter what the implementation of the mtp_head is. In practice, the mtp_head(s) can be:
Now, returning to your question:
If you already get the idea above, then consider gemma3n: the model has 30 layers, but only 20 layers have KV projection. The last 10 layers reuse the KV from the 20th layer. Some models also implement this idea, notably GLM and bailing. The same idea can be applied to MTP layers. Future models may have MTP layers that reuse not just the layer's output hidden state, but also the projected KV inside the layer. While there are no models in the wild currently doing that, Baichuan-EAGLE3 (as you showed) is already heading in this direction by exposing both the Q and the gate to the MTP model.
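A conceptual sketch of the "multiple output heads" view described above, using hypothetical types rather than llama.cpp code: the caller sees one forward pass that yields the main logits plus any number of draft logits, regardless of how each mtp_head is realized internally (an extra decoder layer, KV reuse, etc.).

```cpp
#include <cstdint>
#include <vector>

struct llm_step_output {
    std::vector<float>              logits;      // main LM head, size n_vocab
    std::vector<std::vector<float>> mtp_logits;  // one entry per MTP head, each size n_vocab
};

// Single forward pass over the new tokens; how mtp_logits are produced is an
// implementation detail hidden from the caller.
llm_step_output forward_step(const std::vector<int32_t> & tokens);
```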
(I have to split up my comment, otherwise it's too long.) My proposal is that we must design this function and the API in a way that is flexible enough for future models. For EAGLE3, the MTP model is technically a … For the API, we must avoid leaking information about the implementation under the hood. The downstream code must only know how many tokens can be generated; it doesn't need to know how to generate these extra tokens. So, a set of APIs along the following lines should be enough:
All the info about embeddings and the draft model must be kept private. CC @ggerganov, maybe this is helpful for you.
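Purely as an illustration of that shape (hypothetical names; this is not the author's concrete proposal, which is not reproduced in this thread), such an API could look like:

```cpp
// Attach a draft/MTP model to an existing context; how extra tokens are produced
// stays private to the implementation.
LLAMA_API int32_t llama_mtp_attach (struct llama_context * ctx, struct llama_model * mtp_model);

// How many extra tokens can be proposed per decode step.
LLAMA_API int32_t llama_mtp_n_draft(const struct llama_context * ctx);

// Fill `tokens` with up to `n_max` proposed tokens following the last decoded position;
// returns the number of tokens written.
LLAMA_API int32_t llama_mtp_draft  (struct llama_context * ctx, llama_token * tokens, int32_t n_max);
```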
As far as I know, the Eagle3 authors did not discuss MoE model performance in their paper. I am currently cross-checking the performance of GPT-OSS-120B Eagle3 on DGX Spark using SGLang, which essentially employs the same GPT-OSS-120B-Eagle3 draft model as I used for llama.cpp testing. The running commands I used are as follows:

• Baseline:

```
python3 -m sglang.launch_server --model-path gpt-oss-120b --host 0.0.0.0 --port 30000 --trust-remote-code
```

• Eagle3: Set the draft size to 8 and disable tree decoding to ensure a fair comparison with our tests on llama.cpp.

```
python3 -m sglang.launch_server --model gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 8 --speculative-eagle-topk 1 --speculative-num-draft-tokens 8 --trust-remote-code --host 0.0.0.0 --port 30000
```

I am measuring throughput with:

```
curl -sS -X POST http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Write a quicksort algorithm in Python. Write code only.",
    "sampling_params": {
      "max_new_tokens": 256
    }
  }' | python3 -c "
import sys, json
d = json.load(sys.stdin)
tokens = d['meta_info']['completion_tokens']
latency = d['meta_info']['e2e_latency']
tps = tokens / latency
print(f'completion_tokens: {tokens}')
print(f'e2e_latency: {latency:.3f}s')
print(f'token/s: {tps:.2f}')
"
```

Here are the test results on DGX Spark:
I also tested shorter draft sizes using the following command:

```
python3 -m sglang.launch_server --model /home/nvidia/models/gpt-oss-120b --speculative-algorithm EAGLE3 --speculative-draft-model-path /home/nvidia/ruixiangw/models/lmsys/EAGLE3-gpt-oss-120b-bf16 --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4 --trust-remote-code --host 0.0.0.0 --port 30000
```

The results:
From the tables above, we observed a similar performance degradation for GPT-OSS-120B-Eagle3 on a single GPU device in SGLang as well. However, in their blog post, they claimed to have achieved some speedups for GPT-OSS-120B-Eagle3 inference using …

In summary, I believe that for large MoE models such as GPT-OSS-120B, Eagle3 may not provide a performance gain on a single GPU device with a single-prompt use case. However, this does not apply to all MoE models; for example, we observed a performance improvement with Qwen3-30B-A3B_eagle3. This might be related to the number of active experts per token and the overall model size, where loading active experts (a memory-bound operation) dominates the inference time.
Thank you very much for taking the time to write this insightful proposal. Although we discussed the Eagle3 design (#15902 (reply in thread)) several months ago, it’s still great to hear your perspective. @ggerganov These might be things worth considering.
The mentioned discussion only discusses the internal design, not the public API design. It's probably best to open a dedicated discussion on the public API design to avoid going too far in the wrong direction. Even after reading #15902, …
```cpp
llm_build_eagle3_encode::llm_build_eagle3_encode(const llama_model & model, const llm_graph_params & params) : llm_graph_context(params) {
    ggml_tensor * cur = nullptr;

    cur = build_inp_embd();

    // Feature fusion layer
    cur = build_lora_mm(model.fc, cur);
    cb(cur, "fc_out", -1);

    // Output: g_embeddings e.g. [4096, n_tokens]
    res->t_embd = cur;

    ggml_build_forward_expand(gf, cur);
}
```
if the whole point of the encoder is to just do a projection, I think it isn't truly an encoder in transformer terms.
an encoder is responsible for populating the KV cache. here, we do not touch the KV cache at all. Instead, I believe this projection can be part of the decoder.
if we need to allow larger input embeddings than n_embd, there is an interface called n_embd_inp that allows doing just that
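A minimal sketch of that suggestion, assuming a graph-build helper with the fusion weight in scope (the function and variable names are illustrative): with n_embd_inp == 3*n_embd, the projection becomes the first op of the decoder graph and no separate encoder graph is needed.

```cpp
ggml_tensor * build_fused_input(ggml_context * ctx0, ggml_tensor * fc_w,
                                int64_t n_embd_inp, int64_t n_tokens) {
    // wide input: concatenated low/mid/high target-model features, [n_embd_inp, n_tokens]
    ggml_tensor * inp = ggml_new_tensor_2d(ctx0, GGML_TYPE_F32, n_embd_inp, n_tokens);
    ggml_set_input(inp);
    // fc_w has ne = {n_embd_inp, n_embd}; the product is [n_embd, n_tokens]
    return ggml_mul_mat(ctx0, fc_w, inp);
}
```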
```cpp
// Single decoder layer (il = 0)
const int il = 0;
{
```
Hmm ok, I thought that we could fuse this cgraph with the main LLM cgraph. But that won't work very well because we need to call the sampling system to sample a new token for each decoding pass of eagle3.
In such a case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it in the public API).
I think the best approach could be to have a notion of a sub-llama_context, where one llama_context can encapsulate another llama_context. Will see if this is something that can easily be implemented or not.
> In such a case, keeping it as a dedicated model seems ok, although I believe that in terms of API design, we must keep llama_set_eagle3_g_embeddings private (not exposing it in the public API).
I think it can be avoided using an enc-dec context:
It's not necessary, because my comment above suggests that eagle3 is not exactly an enc-dec model, but more like a decoder-only model with n_embd_inp > n_embd.
What I'm suggesting here is to pass the embeddings from the main LLM to the smaller speculative LLM. Because they currently live in 2 different llama_contexts, we have no better way than passing them via a public API (which makes it less future-proof).
(I think I'm commenting on the wrong line, this comment should be placed on llama_get_eagle3_target_features)
I looked deeper into the GLM-4.6 implementation today, and I'm pretty confident that eagle3 is almost the same as the MTP model of GLM-4.6.
The "encoder" here is basically equivalent to nextn.eh_proj. It is not an enc-dec in transformer terms (i.e. unlike T5), just bad naming.
And the rest is the same as the deepseekv3 MTP style, except that instead of passing the hidden state from one MTP pass to another MTP pass, eagle3 uses the KV cache.
I'm playing around with an implementation on my side that will expose just one single llama_decode_mtp call and handle hidden-state passing under the hood (based on llama_cross), so you can think of the main LLM as the encoder, which populates the cross, and the MTP as the decoder, in transformer terms.
Will push it when I have a working version.
In any case, I'm still not convinced that the linear projection should be a dedicated "encoder" cgraph. As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model.
Solution 2 in my last comment seems the most feasible; I will try to implement that in my PR.
> As I mentioned, the performance loss in this PR could also be due to the backend synchronization that happens between the encode and decode passes of the eagle3 model
No, it is not. As mentioned earlier, the performance degradation occurs only with the MoE model #18039 (comment). This is because the MoE model requires significantly more time for draft token verification compared to the dense model.
If you perform profiling, you will notice that the backend synchronization between the encode and decode passes of the Eagle3 model is relatively negligible.

Of course it is negligible if you compare it to the time it takes for the verification pass, but I don't believe it is negligible if you compare it to the time it takes to generate one single draft token. The draft model is very small, and CPU time can have a significant impact on it.
But even if you say that's not important for whatever reason, the more important thing is that copying data to host memory is redundant. At this point, I think it's a better use of my time to just improve this in my implementation instead of arguing here.
Also, from your profiling screenshot, it seems like there is a big gap between the large cudaMemcpyAsync and the run after it (I suppose that's the encoder pass of eagle3). I'm curious what happens in that big gap, probably some calculations on the CPU?
> I'm curious what happens in that big gap, probably some calculations on the CPU?
Yes. It is the rejection sampling phase during speculative decoding. Once we obtain the logits for the draft tokens from the target model, we need to verify which tokens are accepted and which need to be rejected, and prepare these as input for the draft model. Note that the token_id-to-embedding mapping also happens on the CPU.
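For context, here is a greedy simplification of that verification step (shapes are assumptions; the actual example code uses proper rejection sampling rather than a plain argmax comparison):

```cpp
#include <cstdint>
#include <vector>

// Accept draft tokens while they match the target model's argmax at the same
// position; the first mismatch ends the accepted prefix.
size_t count_accepted(const std::vector<int32_t> & draft,
                      const std::vector<std::vector<float>> & target_logits) {
    size_t n_accept = 0;
    for (size_t i = 0; i < draft.size() && i < target_logits.size(); ++i) {
        const std::vector<float> & row = target_logits[i];
        int32_t best = 0;
        for (int32_t t = 1; t < (int32_t) row.size(); ++t) {
            if (row[t] > row[best]) {
                best = t;
            }
        }
        if (best != draft[i]) {
            break; // first rejected token ends the accepted prefix
        }
        ++n_accept;
    }
    return n_accept;
}
```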
@ichbinhandsome thank you for looking into the Baichuan model.
It took me a bit because I had to download the gguf of Qwen3. It does appear to work, but I'm noticing somewhat of a slowdown:
With EAGLE3: (results in collapsed section)
Without EAGLE3: (results in collapsed section)
By the way, it says "inf tokens per second" under eval time; is that being replaced with the "decoded" section on top? Just making sure I'm reading it correctly.
Thank you very much for testing this! The slowdown may be due to the short prompt (“Hello!”) or potential MoE performance issues mentioned in this comment. Could you try running the experiments using the same prompts I provided as examples in this PR? I’d expect a higher accept rate with those prompts, which might result in some speedups.
I'm using the same metrics as the original code. I think the reason for the inf value is that the target model is only used for draft token verification (prefill) rather than autoregressive decoding. Since no actual decode steps are performed, the eval time is recorded as 0 ms, resulting in inf t/s.
@ichbinhandsome No problem!
Additional info for Qwen3-235B-A22B EAGLE3: (collapsed section)
Additional info for Qwen3-1.7B EAGLE3: (collapsed section)
Thanks for testing! Glad to see the model works, though the speedup for these models is relatively small (or even negative), and the acceptance rate is quite low.
This was just linked on Reddit today: https://z-lab.ai/projects/dflash/ and https://github.com/z-lab/dflash. It seems worth thinking about for any future MTP/Eagle API:


As discussed in #15902, Eagle3 represents the current SOTA in speculative decoding and is widely adopted across the industry. Integrating Eagle3 into llama.cpp significantly improves inference performance, achieving a 2–3× speedup, and strengthens its competitiveness among leading inference frameworks.
This enhancement is the result of close collaboration between the NVIDIA and GGML teams, showcasing a strong technical partnership.
The following provides a brief overview of this PR:
EAGLE3 is an encoder-decoder based speculative decoding method:
- Extracts features from target model at specific layers
- Uses feature fusion layer to compress target features
- Generates draft tokens with single-layer decoder
- Maps draft vocabulary to target vocabulary via d2t tensor

Key changes:
- Add LLM_ARCH_EAGLE3 architecture
- Add EAGLE3 encoder/decoder graph (src/models/eagle3.cpp)
- Add feature extraction from target model layers
- Add g_embeddings handling for decoder input
- Add GGML_TENSOR_FLAG_SYNC for GPU synchronization
- Add --eagle3 flag for speculative-simple example
- Add EAGLE3 model conversion in convert_hf_to_gguf.py
EAGLE3 Architecture Overview:
How to run EAGLE3 in llama.cpp
Requirements
This PR currently supports the following EAGLE3 models:

The following eagle3 models should also work out of the box, though they haven’t been tested yet:
Step 1: Convert Models to GGUF Format
Step 2: Compile llama.cpp
[Optional] Step 3: Quantize the GGUF model
Step 4: Run EAGLE3 Speculative Decoding
Performance Evaluation (RTX A6000 48GB)
Note: Using the chat_template for each model version can improve acceptance rates. Always apply the model’s corresponding chat_template when constructing prompts.
Quantization used per model (target model, its Eagle3 draft):
- BF16, its Eagle3 with FP16
- Q4_K_M, its Eagle3 with Q4_K_M
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16
- BF16, its Eagle3 with BF16
- Q4_K_M, its Eagle3 with Q4_K_M
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, speedup might be better on other hardware)
- BF16, its Eagle3 with BF16 (tested on NVIDIA DGX Spark 128GB, similar performance issue as GPT-OSS-120B Eagle3)

Details of GGML backend modifications (fixed, no longer needed):
In the Eagle3 decoder, two parallel inputs are processed. When both RMS_NORM operations run in the same GPU split, a lack of synchronization causes buffer contention and race conditions (CPU execution is fine as it auto-syncs between subgraphs).
Solution: use ggml_set_sync() to add a synchronization point after the first RMS_NORM, forcing the scheduler to create a split boundary and synchronize before continuing. This ensures correct execution and can be applied to any parallel path that needs synchronization, not just Eagle3.

Example results
Future Steps