[Misc] Tidy up some spec decode logic in GPUModelRunner#31591
njhill merged 4 commits into vllm-project:main
Conversation
Code Review
This pull request refactors the speculative decoding logic in `GPUModelRunner.sample_tokens`. The changes simplify the code by grouping all speculative decoding logic within a check for `spec_config is not None`, which avoids unnecessary computations when speculative decoding is disabled. The logic for when to propose draft tokens has been clarified by introducing a new boolean flag, `propose_drafts_after_bookkeeping`, which correctly preserves the original behavior. The refactoring improves code readability and maintainability without introducing any functional changes. The changes look good.
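The control-flow pattern the review describes can be sketched as follows. This is an illustrative stand-in, not the actual vLLM code: the class, method bodies, and config attributes are hypothetical; only the gating structure and the `propose_drafts_after_bookkeeping` flag mirror the PR discussion.

```python
# Sketch of the refactored flow: all spec-decode work sits under a single
# `spec_config is not None` check, and a boolean flag decides whether draft
# tokens are proposed before or after bookkeeping. Names are illustrative.
class ModelRunnerSketch:
    def __init__(self, spec_config=None, drafts_after_bookkeeping=False):
        self.spec_config = spec_config  # hypothetical stand-in for vLLM's config
        self.drafts_after_bookkeeping = drafts_after_bookkeeping
        self.calls = []  # records the order of steps for illustration

    def _propose_drafts(self):
        self.calls.append("propose_drafts")

    def _bookkeeping(self):
        self.calls.append("bookkeeping")

    def sample_tokens(self):
        propose_after = False
        if self.spec_config is not None:
            # Spec-decode-only logic is computed here and nowhere else,
            # so nothing runs when speculative decoding is disabled.
            propose_after = self.drafts_after_bookkeeping
            if not propose_after:
                self._propose_drafts()
        self._bookkeeping()
        if self.spec_config is not None and propose_after:
            self._propose_drafts()
        return self.calls
```

With speculative decoding disabled, only bookkeeping runs; with it enabled, the flag controls whether drafting happens before or after bookkeeping.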
CI failures are unrelated; they are also happening on main.
Signed-off-by: njhill <nickhill123@gmail.com>
force-pushed from fece86d to f6229e0
This pull request has merge conflicts that must be resolved before it can be merged.
# Conflicts:
#   vllm/v1/worker/gpu_model_runner.py
Signed-off-by: Nick Hill <nickhill123@gmail.com>
self.num_spec_tokens = 0
if self.speculative_config:
    self.num_spec_tokens = self.speculative_config.num_speculative_tokens
    draft_config = self.speculative_config.draft_model_config
Can you just set `self.draft_config` and shortcut the multiple `None` checks when we need to access it?
if draft_config is not None and draft_config.max_model_len is not None:
    self.effective_drafter_max_model_len = draft_config.max_model_len
else:
    self.effective_drafter_max_model_len = self.max_model_len
Thoughts on making this `min(self.max_model_len, draft max_model_len)`? We have been seeing some logs where the drafter has a very high max_model_len even when the base model doesn't.
Also, if you do this clamping you can move it into a helper fn to share the logic with the update function below.
I was just aiming to keep the existing logic; I'm not sure what makes the most sense, so I'd defer to your judgement.
Going to merge since the CI is green, and will open a follow-on PR for the nits.
…#31591) Signed-off-by: Nick Hill <nickhill123@gmail.com>
…#31591) Signed-off-by: Nick Hill <nickhill123@gmail.com> Signed-off-by: dsuhinin <suhinin.dmitriy@gmail.com>
Simplify messy top-level logic in `GPUModelRunner.sample_tokens`, avoid computing `effective_drafter_max_model_len` every step, and only execute this spec-decoding-specific logic when spec decoding is actually enabled.