[V1][Spec decode] Move drafter to model runner #13363
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu>
LiuXiaoxuanPKU left a comment:
LGTM! Thanks for moving the propose logic to the end of model runner!
Signed-off-by: Woosuk Kwon <woosuk.kwon@berkeley.edu> Signed-off-by: Louis Ulmer <ulmerlouis@gmail.com>
This PR moves the speculative decoding drafter from the engine core to the model runner. At the end of each step, the drafter takes the large model's outputs or hidden states and generates extra "unverified" tokens. Those tokens are scheduled and verified in the next step, and during that same step the drafter generates a new set of draft tokens.
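
The step loop described above can be illustrated with a toy sketch. This is not vLLM code: `target_next`, `draft_propose`, and `model_runner_step` are hypothetical names, the "models" are trivial arithmetic functions, and greedy exact-match verification is assumed here for simplicity. A real target model would also score all scheduled tokens in a single forward pass rather than one at a time.

```python
from typing import List, Sequence, Tuple


def target_next(context: Sequence[int]) -> int:
    """Hypothetical stand-in for the large model: greedy next token."""
    return (context[-1] + 1) % 10


def draft_propose(context: Sequence[int], k: int) -> List[int]:
    """Hypothetical drafter: counts upward but never wraps around,
    so its guesses are wrong once the target wraps from 9 to 0."""
    drafts = []
    last = context[-1]
    for _ in range(k):
        last = last + 1
        drafts.append(last)
    return drafts


def model_runner_step(
    context: Sequence[int], scheduled_drafts: Sequence[int], k: int
) -> Tuple[List[int], List[int]]:
    """One step: verify the drafts scheduled from the previous step,
    then propose new drafts at the end of the same step."""
    tokens = list(context)
    for draft in scheduled_drafts:
        expected = target_next(tokens)
        # The target's token is always kept; it equals the draft on a match
        # and acts as the correction on a mismatch.
        tokens.append(expected)
        if draft != expected:
            break  # remaining drafts are discarded
    else:
        # All drafts accepted (or none scheduled): the target still
        # contributes one more token.
        tokens.append(target_next(tokens))

    # Drafting happens here, at the end of the step, using the step's outputs.
    new_drafts = draft_propose(tokens, k)
    return tokens, new_drafts


tokens, drafts = [7], []
for _ in range(3):
    tokens, drafts = model_runner_step(tokens, drafts, k=2)
    print(tokens, drafts)
```

In this sketch the first step has no drafts to verify and produces a single token, the second step rejects one draft at the wraparound, and the third step accepts both drafts and adds one more token from the target.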