[Spec Decode] Add support for EAGLE3 heads that do not use_aux_hidden_states #27688
benchislett merged 15 commits into vllm-project:main from
Conversation
Code Review
This pull request adds support for EAGLE3 heads that may not use auxiliary hidden states or may have their own lm_head. It also enhances the speculative configuration hash for eagle3 to differentiate between various draft models. The modifications in llama_eagle3.py, eagle.py, and gpu_model_runner.py seem to correctly implement the necessary logic for these new EAGLE3 variations. However, I've identified a critical issue in speculative.py where the hashing logic for non-eagle3 speculative methods has been altered, which could lead to CUDA graph cache collisions and subsequent runtime errors.
Signed-off-by: hjjq <hanjieq@nvidia.com>
benchislett left a comment
I assume that when eagle3_use_aux_hidden_state is False, we never use self.fc. Can you make sure that this weight never gets initialized if the flag is set, and also log a warning and continue gracefully if the "fc" weight is found in the checkpoint?
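A minimal sketch of the behavior requested above: skip creating the fc projection when aux hidden states are not used, and skip (with a warning) any stale "fc" weight found in the checkpoint. The class and method names here are illustrative, not vLLM's actual API.

```python
import logging

logger = logging.getLogger(__name__)


class Eagle3DraftHead:
    """Illustrative sketch only; names do not match vLLM's real classes."""

    def __init__(self, use_aux_hidden_state: bool):
        self.use_aux_hidden_state = use_aux_hidden_state
        # Only allocate the fc projection when aux hidden states are
        # consumed, so the unused weight is never initialized.
        self.fc = object() if use_aux_hidden_state else None

    def load_weights(self, weights: dict) -> list:
        loaded = []
        for name in weights:
            if name.startswith("fc.") and not self.use_aux_hidden_state:
                # Warn and continue gracefully instead of raising on a
                # checkpoint weight that will never be used.
                logger.warning("Ignoring unused checkpoint weight %r", name)
                continue
            loaded.append(name)  # normal weight loading elided
        return loaded
```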
Signed-off-by: Benjamin Chislett <bchislett@nvidia.com>
Head branch was pushed to by a user without write access
I have tested that the compilation hash change is no longer needed on top of tree, potentially due to changes in #26468. As a result, the CI failure should also be gone.
…_states (vllm-project#27688) Signed-off-by: hjjq <hanjieq@nvidia.com> Signed-off-by: Benjamin Chislett <bchislett@nvidia.com> Co-authored-by: Benjamin Chislett <bchislett@nvidia.com>
Some EAGLE3 heads (e.g., nvidia/gpt-oss-120b-Eagle3-v2) do not use auxiliary hidden states and instead directly use the last-layer output, just like EAGLE-1.
Currently, vLLM assumes all EAGLE3 heads use aux hidden states. This PR removes that assumption and instead checks the draft model config.
Different draft models may also have different shapes for their params; if they share the same torch.compile cache, there will be errors when running them one after another. So this PR also adds the draft model hash to the SpeculativeConfig hash. (No longer needed; see #27688 (comment).)
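For reference, the originally proposed change amounts to keying the compilation cache on the draft model as well. A hedged sketch of that idea (not vLLM's actual SpeculativeConfig code, and ultimately dropped from this PR):

```python
import hashlib


def speculative_config_hash(method: str, draft_model: str) -> str:
    """Sketch: include the draft model in the speculative config hash.

    Two drafts with different parameter shapes then get distinct hashes,
    so they never collide in a shared torch.compile / CUDA-graph cache.
    """
    h = hashlib.sha256()
    h.update(method.encode("utf-8"))
    h.update(b"\x00")  # separator so field boundaries are unambiguous
    h.update(draft_model.encode("utf-8"))
    return h.hexdigest()
```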