[V1] [Spec decode] Llama4 type eagle support in v1 #18369
Conversation
Signed-off-by: Ronald Xu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Ready for review now
@RonaldBXu Looks good to me overall. Could you please add a test? Also, is there any available EAGLE head we can test this on?
I found this from NVIDIA: https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3, but it seems they are using the EAGLE3 architecture.
Hi @WoosukKwon, when you say add a test, do you mean an e2e test like in https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py or https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py#L407? I think I'd have to open-source a compatible eagle head first, right? Could you point me to other tests I could work on while I wait for approval for a compatible eagle head? Thanks!
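For reference, a minimal sketch of what such an e2e check could look like, loosely modeled on the style of tests/v1/e2e/test_spec_decode.py; the target model and draft-head names below are placeholders, and the `speculative_config` dict layout is assumed rather than taken from this PR:

```python
# Hypothetical e2e sanity check: greedy outputs with and without the draft
# head should match. Names marked "placeholder" are not real checkpoints
# from this thread.
from vllm import LLM, SamplingParams

def test_llama4_eagle_correctness():
    prompts = ["The capital of France is", "Speculative decoding works by"]
    greedy = SamplingParams(temperature=0.0, max_tokens=32)

    # Reference run: target model only.
    ref_llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
    ref_texts = [o.outputs[0].text for o in ref_llm.generate(prompts, greedy)]
    del ref_llm  # free GPU memory before the second engine comes up

    # Spec-decode run: same target plus a Llama4-type EAGLE draft head.
    spec_llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "your-org/llama4-eagle-draft",  # placeholder
            "num_speculative_tokens": 3,
        },
    )
    spec_texts = [o.outputs[0].text for o in spec_llm.generate(prompts, greedy)]

    # With temperature 0, speculation should not change the generated text.
    assert ref_texts == spec_texts
```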
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Hi @RonaldBXu, the PR looks good to me overall, but we'd like to have a test, or at least a way to run the code. Please refer to https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_eagle.py and tests/v1/e2e/test_spec_decode.py (line 109, commit ee1531b).
Yes, we need an eagle head for Llama 4. Could we use https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3 (which @aarnphm mentioned)?
Thanks, I'll look at those tests. I don't think we can use that head since it is EAGLE3, but the good news is that I got approval to release a compatible eagle head for my code. I should hopefully have it ready sometime next week!
Signed-off-by: Ronald Xu <[email protected]>
Hi @WoosukKwon, I added the tests. Just wanted to call out that for Llama 4 Maverick, tp=1 was not sufficient (CUDA out-of-memory error), so I made my test initialize the LLM with tp=8. Although I guess I could change it to Llama 4 Scout. Please let me know what you think would be the best option here. Thanks!
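For context, a rough sketch of the kind of initialization being described, assuming the `speculative_config` dict interface; the draft-head name is a placeholder and the exact test code may differ:

```python
from vllm import LLM

# tp=1 hit CUDA OOM for Maverick, so the test spreads the target model
# across 8 GPUs. The draft-head name is a placeholder.
llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-draft",  # placeholder
        "num_speculative_tokens": 3,
    },
)
```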
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Oh wait, I know what the problem is.
Yeah, you should update the oracle.
Signed-off-by: Ronald Xu <[email protected]>
I think the oracle is fine; the problem is that since the initialization test doesn't have a "method" field in the speculative config, the oracle falls back to V0 (which is correct, so all the existing eagle models are being tested in V0). However, my implementation is not compatible with V0 (unlike the existing eagle models). I added two new fields to the initialization tests.
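A small illustration of the point above; the exact config keys used by the test are assumed, not quoted from the PR:

```python
# Without an explicit "method", the spec-decode oracle may fall back to the
# V0 path, which this Llama4 EAGLE implementation does not support.
ambiguous_cfg = {
    "model": "your-org/llama4-eagle-draft",  # placeholder
    "num_speculative_tokens": 3,
}

# Spelling out the method keeps the test on the V1 EAGLE path.
v1_cfg = {
    "method": "eagle",
    "model": "your-org/llama4-eagle-draft",  # placeholder
    "num_speculative_tokens": 3,
}
```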
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Using Maverick as the target model leads to OOM; changing to Scout.
Signed-off-by: Ronald Xu <[email protected]>
It seems like we still get OOM. I'm reducing the max_model_len. I think the test is being run on small hardware, so it will probably still OOM. Is there a way to designate larger hardware for this specific test? By the way, this test works on my local machine now.
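Illustratively, the memory-related knobs involved might look like the following; the values are guesses, not the ones committed in the PR:

```python
# Shrinking max_model_len reduces the KV cache that must be allocated up
# front; gpu_memory_utilization leaves extra headroom on smaller CI GPUs.
test_llm_kwargs = dict(
    max_model_len=4096,
    gpu_memory_utilization=0.8,
)
```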
Also, I noticed there is an existing FIXME in the initialization test mentioning OOM/memory leaks. I wonder if this is related to me getting OOM in the CI.
Hi @aarnphm, what do you think?
Hi @aarnphm @benchislett @WoosukKwon, what do you think is the best course of action here? The updated initialization test for my eagle head works locally, but I get OOM in the CI. Thanks! Edit: for now, since I saw another test was skipped, I added mine to the skip list.
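A generic pytest pattern for that kind of skip is sketched below; the architecture name is hypothetical, and the actual mechanism in the vLLM initialization test may differ:

```python
import pytest

# Hypothetical architecture name for the Llama4 EAGLE draft model.
CI_OOM_SKIPS = {"EagleLlama4ForCausalLM"}

def maybe_skip_for_ci(model_arch: str) -> None:
    if model_arch in CI_OOM_SKIPS:
        pytest.skip(f"{model_arch} OOMs on CI hardware; verified locally on 8 GPUs")
```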
Signed-off-by: Ronald Xu <[email protected]>
I'm good with merging this in for now. We will probably have something in the works soon.
Hi @WoosukKwon, could you review this again?
Some acceptance rate results from running spec_decode.py
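For readers unfamiliar with the metric: acceptance rate here is the fraction of drafted tokens the target model accepts, as in the trivial helper below (this is not the spec_decode.py script itself):

```python
def acceptance_rate(accepted_draft_tokens: int, proposed_draft_tokens: int) -> float:
    # Fraction of the draft head's proposed tokens that the target accepts.
    return accepted_draft_tokens / max(proposed_draft_tokens, 1)
```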
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @RonaldBXu, thanks for all of the hard work. But from the other thread, we might want to have Meta help with the implementation here. Sorry about this.
This PR adds the capability for Llama4-type EAGLE heads to be used for speculative decoding in vLLM v1. This is my first major PR in vLLM, so feedback is appreciated :)
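A minimal usage sketch of the feature, assuming the `speculative_config` dict interface; the draft-head checkpoint name is a placeholder, since no Llama4 EAGLE head is linked in this thread yet:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-draft",  # placeholder
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```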