
Prevent overwriting drafter's lm_head and embed_tokens #27737

Closed
eldarkurtic wants to merge 6 commits into vllm-project:main from
eldarkurtic:fix-eagle3-drafter-init

Conversation

@eldarkurtic
Contributor

@eldarkurtic eldarkurtic commented Oct 29, 2025

Some EAGLE3 drafters ship their own lm_head and/or embed_tokens layers. The existing codebase ignores this and always overwrites them with the target model's layers.
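The fix records, while the drafter's checkpoint is being loaded, whether it ships its own copies of these layers. A minimal standalone sketch of that idea (the flag names has_own_lm_head / has_own_embed_tokens come from the PR diff; the class and loader here are simplified stand-ins, not the actual vLLM code):

```python
class DrafterStub:
    """Simplified stand-in for an EAGLE3 drafter's weight loader."""

    def __init__(self):
        # Default: assume the drafter shares lm_head/embed_tokens
        # with the target (verifier) model.
        self.has_own_lm_head = False
        self.has_own_embed_tokens = False

    def load_weights(self, weights):
        loaded = {}
        for name, weight in weights:
            loaded[name] = weight
            # The checkpoint ships its own copy of this layer, so the
            # target model's layer must not overwrite it later.
            if "lm_head" in name:
                self.has_own_lm_head = True
            if "embed_tokens" in name:
                self.has_own_embed_tokens = True
        return loaded


drafter = DrafterStub()
drafter.load_weights([("model.layers.0.qkv_proj.weight", None),
                      ("lm_head.weight", None)])
print(drafter.has_own_lm_head)       # True: checkpoint has its own head
print(drafter.has_own_embed_tokens)  # False: embeddings come from target
```

A drafter whose checkpoint contains neither layer keeps both flags False and behaves exactly as before, which is what Test-case 1 below checks.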

Test-case 1: for a model that needs lm_head and embed_tokens copied from the verifier model, the behavior should not change.

Eval command to verify:

CUDA_VISIBLE_DEVICES=0,1 python examples/offline_inference/spec_decode.py \
  --method "eagle3" \
  --tp 2 \
  --model-dir "openai/gpt-oss-120b" \
  --eagle-dir "nvidia/gpt-oss-120b-Eagle3" \
  --dataset_name "hf" \
  --dataset_path "philschmid/mt-bench" \
  --num-spec-tokens 3

Before this PR:

--------------------------------------------------
total_num_output_tokens: 241888
num_drafts: 105426
num_draft_tokens: 316278
num_accepted_tokens: 136330
mean acceptance length: 2.29
--------------------------------------------------
acceptance at token 0: 0.65
acceptance at token 1: 0.40
acceptance at token 2: 0.24

After this PR:

--------------------------------------------------
total_num_output_tokens: 241168
num_drafts: 105678
num_draft_tokens: 317034
num_accepted_tokens: 135362
mean acceptance length: 2.28
--------------------------------------------------
acceptance at token 0: 0.65
acceptance at token 1: 0.39
acceptance at token 2: 0.24

Test-case 2: for a model that has its own lm_head and embed_tokens, and therefore does not require copying from the target's layers, acceptance rates improve significantly.

Before this PR:

--------------------------------------------------
total_num_output_tokens: 247973
num_drafts: 187566
num_draft_tokens: 562698
num_accepted_tokens: 59726
mean acceptance length: 1.32
--------------------------------------------------
acceptance at token 0: 0.23
acceptance at token 1: 0.07
acceptance at token 2: 0.02

After this PR:

--------------------------------------------------
total_num_output_tokens: 247974
num_drafts: 99354
num_draft_tokens: 298062
num_accepted_tokens: 148677
mean acceptance length: 2.50
--------------------------------------------------
acceptance at token 0: 0.70
acceptance at token 1: 0.48
acceptance at token 2: 0.31

Note: idea for lm_head check inspired by #27688

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a more robust mechanism for handling the lm_head and embed_tokens layers in EAGLE3 drafter models. By replacing the fragile shape-matching heuristic with explicit flags set during weight loading, the change correctly prevents overwriting these layers when the drafter model provides its own. This is a significant improvement for correctness and maintainability. My review includes a suggestion to further strengthen the flag-setting logic to prevent potential edge cases.

Contributor

@rahul-tuli rahul-tuli left a comment

LGTM!

@ProExpertProg ProExpertProg added the ready ONLY add when PR is ready to merge/full CI is needed label Oct 29, 2025
@eldarkurtic
Copy link
Contributor Author

eldarkurtic commented Oct 30, 2025

Moved all attributes into the SupportsEagle3 interface. Could you please re-review? @NickLucche @rahul-tuli @dsikka

@robertgshaw2-redhat robertgshaw2-redhat enabled auto-merge (squash) October 30, 2025 13:40
Collaborator

@NickLucche NickLucche left a comment

Thanks for the fix @eldarkurtic ! Let's get CI green real quick before merging this one

# To prevent overriding with target model's layers
if "lm_head" in name:
    self.has_own_lm_head = True
if "embed_tokens" in name:
    self.has_own_embed_tokens = True
Collaborator

Is this a change from the default? I thought that EAGLE3 heads usually share the embedding with the base model?

Contributor Author

@eldarkurtic eldarkurtic Oct 30, 2025

Eagle3 by default doesn’t train these layers. But there is no reason not to train them. This doesn’t affect the standard eagle3 flow, just extends it to support this new use case

@benchislett
Collaborator

There are other EAGLE1-based models which do not have the EAGLE3 mixin, causing the CI failures. Please update the logic to cover EAGLE1 as well

logger.info(
    "Assuming the EAGLE head shares the same vocab embedding "
    "with the target model. "
    "Draft model embed_tokens are uninitialized."
)
Collaborator

This was originally done as a memory optimization since the original EAGLE3 models are released with an embedding layer included, but having the same weights as the base model.

Ideally we would have another check here to delete them if they are present but identical to those of the base model, but that's tricky to implement cleanly. I'm fine with doing it this way for now, but it should be noted that the behaviour is changing
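The identity check suggested above could look roughly like the following sketch (the helper drop_if_identical and the plain-dict weight maps are hypothetical illustrations, not part of this PR or of vLLM):

```python
import torch


def drop_if_identical(drafter_weights: dict, target_weights: dict,
                      key: str) -> bool:
    """Drop the drafter's copy of a layer when it is bit-identical to the
    target model's, so both models can share a single tensor in memory."""
    if (key in drafter_weights and key in target_weights
            and torch.equal(drafter_weights[key], target_weights[key])):
        del drafter_weights[key]
        return True
    return False


# Drafter checkpoint ships an embedding identical to the target's.
drafter_weights = {"embed_tokens.weight": torch.zeros(4, 8)}
target_weights = {"embed_tokens.weight": torch.zeros(4, 8)}
dropped = drop_if_identical(drafter_weights, target_weights,
                            "embed_tokens.weight")
print(dropped)  # True: identical copies, drafter's tensor removed
```

As the reviewer notes, doing this cleanly in the real loader is tricky (the comparison needs both sets of weights resident at once), which is why the flag-based approach was accepted for now.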

@eldarkurtic
Contributor Author

eldarkurtic commented Oct 30, 2025

Do they use some other mixin similar to SupportsEagle3?

@@ -922,6 +922,16 @@ class SupportsEagle3(Protocol):
MRO of your model class.
"""

has_own_lm_head: ClassVar[bool] = False
Collaborator

These are not class variables, they are instance variables that are set per-model based on the weight loading.

https://typing.python.org/en/latest/spec/class-compat.html
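The distinction matters for type checkers: ClassVar declares an attribute as class-level and shared, so assigning it on an instance during weight loading is a typing error even though it runs. A small standalone illustration (class names here are stand-ins, not the actual vLLM definitions):

```python
from typing import ClassVar


class WithClassVar:
    # ClassVar says "class-level, shared across all instances"; a type
    # checker will reject `instance.has_own_lm_head = True`.
    has_own_lm_head: ClassVar[bool] = False


class WithInstanceVar:
    # Plain annotation with a class-level default: overriding it on a
    # specific instance during weight loading is type-correct.
    has_own_lm_head: bool = False


m1, m2 = WithInstanceVar(), WithInstanceVar()
m1.has_own_lm_head = True  # set while loading this drafter's weights
print(m1.has_own_lm_head, m2.has_own_lm_head)  # True False
```

Setting the flag on m1 leaves m2 (and the class default) untouched, which is exactly the per-model behavior the reviewer is pointing at.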

@benchislett
Collaborator

The proper fix here is to add a base SupportsEagle mixin and have SupportsEagle3 inherit from that. Then the other EAGLE classes can inherit from the new base mixin that will give them a reasonable default.
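The proposed hierarchy might be sketched as follows (the base class name SupportsEagle is taken from the comment above; the attribute defaults mirror the PR, and the classes here are illustrative, not the actual vLLM interfaces):

```python
class SupportsEagle:
    """Base mixin: default flags for all EAGLE-style drafters."""
    has_own_lm_head: bool = False
    has_own_embed_tokens: bool = False


class SupportsEagle3(SupportsEagle):
    """EAGLE3-specific interface, inheriting the shared defaults."""
    # EAGLE3-only hooks (e.g. auxiliary hidden-state handling) go here.


class LlamaEagleDrafter(SupportsEagle):
    """An EAGLE1-style drafter now also gets sensible defaults."""


print(LlamaEagleDrafter().has_own_lm_head)  # False: default from base mixin
```

With this layout, EAGLE1 models that lack the EAGLE3 mixin still expose the flags, which is what the CI failures mentioned above were hitting.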

@@ -328,6 +328,12 @@ def load_weights(self, weights: Iterable[tuple[str, torch.Tensor]]):
            includes_embed_tokens = True
        model_weights[name] = loaded_weight

        # To prevent overriding with target model's layers
        if "lm_head" in name:
            self.has_own_lm_head = True
Collaborator

This should be added to all the EAGLE classes, not just llama_eagle3.

You might want to refactor this into the mixin to reuse the code between them.
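The suggested refactor could pull the flag-setting into the shared mixin so each EAGLE drafter's load_weights reuses it (the method name _mark_own_layers is hypothetical, invented for this sketch):

```python
class SupportsEagle:
    """Shared mixin with the default flags and the flag-setting helper."""
    has_own_lm_head: bool = False
    has_own_embed_tokens: bool = False

    def _mark_own_layers(self, weight_name: str) -> None:
        # Called once per checkpoint weight; records that the drafter
        # ships its own copy so the target's layers won't overwrite it.
        if "lm_head" in weight_name:
            self.has_own_lm_head = True
        if "embed_tokens" in weight_name:
            self.has_own_embed_tokens = True


class SomeEagleDrafter(SupportsEagle):
    def load_weights(self, weights):
        for name, _ in weights:
            self._mark_own_layers(name)


d = SomeEagleDrafter()
d.load_weights([("model.embed_tokens.weight", None)])
print(d.has_own_embed_tokens)  # True
```

Each drafter class then only needs to call the helper instead of duplicating the two if-checks.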

@hjjq
Contributor

hjjq commented Nov 6, 2025

Hi @eldarkurtic @robertgshaw2-redhat , just wondering what is the status of this PR? Are we close to getting it merged? Thanks!

auto-merge was automatically disabled November 12, 2025 10:04

Head branch was pushed to by a user without write access

@mergify mergify bot added the deepseek Related to DeepSeek models label Nov 12, 2025
@mergify

mergify bot commented Nov 12, 2025

This pull request has merge conflicts that must be resolved before it can be
merged. Please rebase the PR, @eldarkurtic.

https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/working-with-forks/syncing-a-fork

@mergify

mergify bot commented Nov 12, 2025

Documentation preview: https://vllm--27737.org.readthedocs.build/en/27737/

@mergify mergify bot added documentation Improvements or additions to documentation ci/build frontend multi-modality Related to multi-modality (#4194) new-model Requests to new models performance Performance-related issues qwen Related to Qwen models gpt-oss Related to GPT-OSS models nvidia labels Nov 12, 2025
@mergify mergify bot added rocm Related to AMD ROCm tpu Related to Google TPUs kv-connector labels Nov 12, 2025
eldarkurtic and others added 4 commits November 12, 2025 10:12
…ialized

Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
…ration (vllm-project#27670)

Signed-off-by: KevinCheung2259 <2651309292@qq.com>
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
Signed-off-by: Eldar Kurtic <8884008+eldarkurtic@users.noreply.github.com>
@eldarkurtic eldarkurtic force-pushed the fix-eagle3-drafter-init branch from e69c88f to f766651 Compare November 12, 2025 10:12
@mergify mergify bot removed the tpu Related to Google TPUs label Nov 12, 2025
@eldarkurtic
Contributor Author

Closed in favor of a slightly cleaner approach in #28549

@github-project-automation github-project-automation bot moved this from To Triage to Done in gpt-oss Issues & Enhancements Nov 12, 2025
@github-project-automation github-project-automation bot moved this to Done in NVIDIA Nov 12, 2025

Labels

ci/build
deepseek Related to DeepSeek models
documentation Improvements or additions to documentation
frontend
gpt-oss Related to GPT-OSS models
kv-connector
llama Related to Llama models
multi-modality Related to multi-modality (#4194)
needs-rebase
new-model Requests to new models
nvidia
performance Performance-related issues
qwen Related to Qwen models
ready ONLY add when PR is ready to merge/full CI is needed
rocm Related to AMD ROCm
speculative-decoding
v1

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

8 participants