[V1] [Spec decode] Llama4 type eagle support in v1 #18369
Conversation
Signed-off-by: Ronald Xu <[email protected]>
👋 Hi! Thank you for contributing to the vLLM project. 💬 Join our developer Slack at https://slack.vllm.ai to discuss your PR in #pr-reviews, coordinate on features in #feat- channels, or join special interest groups in #sig- channels. Just a reminder: PRs do not trigger a full CI run by default; only a small, essential subset of CI tests runs automatically to catch errors quickly. Once the PR is approved and ready to go, your PR reviewer(s) can run CI to test the changes comprehensively before merging. To run CI, PR reviewers can either add the ready label to the PR or enable auto-merge. 🚀
Ready for review now
@RonaldBXu Looks good to me overall. Could you please add a test? Also, is there any available EAGLE head we can test this on?
I found this from NVIDIA: https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3, but it seems they are using the EAGLE3 architecture.
Hi @WoosukKwon, when you say add a test, do you mean an e2e test like in https://github.com/vllm-project/vllm/blob/main/tests/spec_decode/e2e/test_eagle_correctness.py or https://github.com/vllm-project/vllm/blob/main/tests/models/registry.py#L407? I think I'd have to open-source a compatible eagle head first, right? Could you point me to other tests I could work on while I wait for approval for a compatible eagle head? Thanks!
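For reference, a minimal sketch of what such an e2e check could look like, loosely modeled on the style of tests/v1/e2e/test_spec_decode.py; the target model and draft-head names below are placeholders, and the `speculative_config` dict layout is assumed rather than taken from this PR:

```python
# Hypothetical e2e sanity check: greedy outputs with and without the draft
# head should match. Names marked "placeholder" are not real checkpoints
# from this thread.
from vllm import LLM, SamplingParams

def test_llama4_eagle_correctness():
    prompts = ["The capital of France is", "Speculative decoding works by"]
    greedy = SamplingParams(temperature=0.0, max_tokens=32)

    # Reference run: target model only.
    ref_llm = LLM(model="meta-llama/Llama-4-Scout-17B-16E-Instruct")
    ref_texts = [o.outputs[0].text for o in ref_llm.generate(prompts, greedy)]
    del ref_llm  # free GPU memory before the second engine comes up

    # Spec-decode run: same target plus a Llama4-type EAGLE draft head.
    spec_llm = LLM(
        model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
        speculative_config={
            "method": "eagle",
            "model": "your-org/llama4-eagle-draft",  # placeholder
            "num_speculative_tokens": 3,
        },
    )
    spec_texts = [o.outputs[0].text for o in spec_llm.generate(prompts, greedy)]

    # With temperature 0, speculation should not change the generated text.
    assert ref_texts == spec_texts
```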
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Hi @RonaldBXu, the PR looks good to me overall, but we'd like to have a test, or at least a way to run the code. Please refer to https://github.com/vllm-project/vllm/blob/main/tests/v1/spec_decode/test_eagle.py and tests/v1/e2e/test_spec_decode.py (line 109, commit ee1531b).
Yes, we need an eagle head for Llama 4. Could we use https://huggingface.co/nvidia/Llama-4-Maverick-17B-128E-Eagle3 (which @aarnphm mentioned)?
Thanks, I'll look at those tests. I don't think we can use that head since it is EAGLE3, but the good news is that I got approval to release a compatible eagle head for my code. I should hopefully have it ready sometime next week!
Signed-off-by: Ronald Xu <[email protected]>
Hi @WoosukKwon, I added the tests. Just wanted to call out that for Llama 4 Maverick, tp=1 was not sufficient (CUDA out-of-memory error), so I made my test initialize the LLM with tp=8. Although I guess I could change it to Llama 4 Scout. Please let me know what you think would be the best option here. Thanks!
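For context, a rough sketch of the kind of initialization being described, assuming the `speculative_config` dict interface; the draft-head name is a placeholder and the exact test code may differ:

```python
from vllm import LLM

# tp=1 hit CUDA OOM for Maverick, so the test spreads the target model
# across 8 GPUs. The draft-head name is a placeholder.
llm = LLM(
    model="meta-llama/Llama-4-Maverick-17B-128E-Instruct",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-draft",  # placeholder
        "num_speculative_tokens": 3,
    },
)
```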
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Oh wait, I know what the problem is.
Yeah, you should update the oracle.
Signed-off-by: Ronald Xu <[email protected]>
I think the oracle is fine; the problem is that since the initialization test doesn't have a "method" field in the speculative config, the oracle falls back to V0 (which is correct, so all the existing eagle models are being tested in V0). However, my implementation is not compatible with V0 (unlike the existing eagle models). I added two new fields to the initialization tests.
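A small illustration of the point above; the exact config keys used by the test are assumed, not quoted from the PR:

```python
# Without an explicit "method", the spec-decode oracle may fall back to the
# V0 path, which this Llama4 EAGLE implementation does not support.
ambiguous_cfg = {
    "model": "your-org/llama4-eagle-draft",  # placeholder
    "num_speculative_tokens": 3,
}

# Spelling out the method keeps the test on the V1 EAGLE path.
v1_cfg = {
    "method": "eagle",
    "model": "your-org/llama4-eagle-draft",  # placeholder
    "num_speculative_tokens": 3,
}
```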
Signed-off-by: Ronald Xu <[email protected]>
Signed-off-by: Ronald Xu <[email protected]>
Using Maverick as the target model leads to OOM; changing to Scout.
Signed-off-by: Ronald Xu <[email protected]>
It seems like we still get OOM. I'm reducing the max_model_len. I think the test is being run on small hardware, so it will probably still OOM. Is there a way to designate larger hardware for this specific test? By the way, this test works on my local machine now.
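Illustratively, the memory-related knobs involved might look like the following; the values are guesses, not the ones committed in the PR:

```python
# Shrinking max_model_len reduces the KV cache that must be allocated up
# front; gpu_memory_utilization leaves extra headroom on smaller CI GPUs.
test_llm_kwargs = dict(
    max_model_len=4096,
    gpu_memory_utilization=0.8,
)
```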
Also, I noticed there is an existing FIXME in the initialization test mentioning OOM/memory leaks. I wonder if this is related to me getting OOM in the CI.
Hi @aarnphm, what do you think?
Hi @aarnphm @benchislett @WoosukKwon, what do you think is the best course of action here? The updated initialization test for my eagle head works locally, but I get OOM in the CI. Thanks! Edit: for now, since I saw another test was skipped, I added mine to the skip list.
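A generic pytest pattern for that kind of skip is sketched below; the architecture name is hypothetical, and the actual mechanism in the vLLM initialization test may differ:

```python
import pytest

# Hypothetical architecture name for the Llama4 EAGLE draft model.
CI_OOM_SKIPS = {"EagleLlama4ForCausalLM"}

def maybe_skip_for_ci(model_arch: str) -> None:
    if model_arch in CI_OOM_SKIPS:
        pytest.skip(f"{model_arch} OOMs on CI hardware; verified locally on 8 GPUs")
```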
Signed-off-by: Ronald Xu <[email protected]>
I'm good with merging this in for now. We will probably have something in the works soon.
Hi @WoosukKwon, could you review this again?
Some acceptance rate results from running spec_decode.py
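For readers unfamiliar with the metric: acceptance rate here is the fraction of drafted tokens the target model accepts, as in the trivial helper below (this is not the spec_decode.py script itself):

```python
def acceptance_rate(accepted_draft_tokens: int, proposed_draft_tokens: int) -> float:
    # Fraction of the draft head's proposed tokens that the target accepts.
    return accepted_draft_tokens / max(proposed_draft_tokens, 1)
```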
This pull request has merge conflicts that must be resolved before it can be merged.
This pull request has merge conflicts that must be resolved before it can be merged.
Hi @RonaldBXu, thanks for all of the hard work. But from the other thread, we might want to have Meta help with the implementation here. Sorry about this.
This PR adds the capability for Llama4-type EAGLE heads to be used for speculative decoding in vLLM v1. This is my first major PR in vLLM, so feedback is appreciated :)
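A minimal usage sketch of the feature, assuming the `speculative_config` dict interface; the draft-head checkpoint name is a placeholder, since no Llama4 EAGLE head is linked in this thread yet:

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-4-Scout-17B-16E-Instruct",
    tensor_parallel_size=8,
    speculative_config={
        "method": "eagle",
        "model": "your-org/llama4-eagle-draft",  # placeholder
        "num_speculative_tokens": 3,
    },
)
outputs = llm.generate(["Hello, my name is"], SamplingParams(max_tokens=32))
print(outputs[0].outputs[0].text)
```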