[Feature] Add spec v2 (overlap scheduling) to DFlash speculative decoding support #20547
dcw02 wants to merge 74 commits into sgl-project:main from
Conversation
…ock-size to server args
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a significant performance improvement for DFlash speculative decoding by integrating a new version of overlap scheduling, referred to as spec v2. The changes add specialized worker logic and data structures for DFlash, optimize KV-cache operations with fused Triton kernels, and enable auxiliary hidden state capture in various models. This update aims to boost token-generation throughput, as demonstrated by the provided benchmarks, while laying the groundwork for more advanced speculative decoding capabilities.
Activity
Currently the only thing missing compared to v1 is non-greedy decoding support; it is being worked on.
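The greedy path that v2 does support follows the standard speculative-decoding verification rule: drafted tokens are accepted as long as they match the target model's argmax, the first mismatch is replaced by the target's token, and a fully accepted draft earns one bonus token. A minimal illustrative sketch of that rule (not the PR's actual implementation; `greedy_verify` and its inputs are hypothetical):

```python
def greedy_verify(draft_tokens, target_argmax):
    """Accept drafted tokens while they match the target model's greedy choice.

    target_argmax[i] is the target model's argmax after the prefix plus
    draft_tokens[:i], so it has one more entry than draft_tokens.
    """
    accepted = []
    for i, tok in enumerate(draft_tokens):
        if tok == target_argmax[i]:
            accepted.append(tok)
        else:
            # First mismatch: emit the target's token and stop.
            accepted.append(target_argmax[i])
            break
    else:
        # All draft tokens matched; the target's extra token is a free bonus.
        accepted.append(target_argmax[len(draft_tokens)])
    return accepted
```

Non-greedy (sampling-based) verification instead accepts each draft token with a probability ratio between target and draft distributions, which is the part still missing here.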
Code Review
The pull request introduces DFLASH speculative decoding, adding a specialized DFlashDraftModel and updating core components like ModelRunner, CudaGraphRunner, and Scheduler to support its specific requirements, including auxiliary hidden state capture and handling DFLASH-specific server arguments. New data structures (DFlashDraftInput, DFlashVerifyInput) and worker implementations (DFlashWorker, DFlashWorkerV2) manage the drafting and verification process, with optimizations like fused KV materialization. A new benchmark script is also included. An improvement opportunity exists in the scheduler to refactor duplicated logic for aborting requests with unsupported DFLASH features into a helper method for better maintainability.
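The "fused KV materialization" mentioned above is, at its core, a scatter of draft-token key/value vectors into pre-allocated slots of the target model's KV cache, done in one Triton kernel instead of separate copies. A minimal unfused NumPy sketch of the effect (the function name, shapes, and cache layout here are illustrative assumptions, not the PR's actual API):

```python
import numpy as np

def materialize_draft_kv(kv_cache, draft_k, draft_v, slot_indices):
    """Scatter draft-token K/V into pre-allocated target-cache slots.

    kv_cache:     (2, num_slots, head_dim) array; [0] holds keys, [1] values.
    draft_k/v:    (num_draft_tokens, head_dim) draft-model outputs.
    slot_indices: (num_draft_tokens,) destination slot per draft token.
    """
    # Fancy-index assignment performs the scatter in one vectorized step;
    # the real code fuses the equivalent work into a single Triton kernel.
    kv_cache[0][slot_indices] = draft_k
    kv_cache[1][slot_indices] = draft_v
    return kv_cache
```

The win from fusing is fewer kernel launches and memory passes when materializing many small per-token copies per decode step.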
```python
if self.spec_algorithm.is_dflash() and req.return_logprob:
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support return_logprob yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if (
    self.spec_algorithm.is_dflash()
    and self.enable_overlap
    and req.return_hidden_states
):
    req.set_finish_with_abort(
        "DFLASH spec-v2 phase 1 does not support return_hidden_states yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if self.spec_algorithm.is_dflash() and (
    req.sampling_params.json_schema is not None
    or req.sampling_params.regex is not None
    or req.sampling_params.ebnf is not None
    or req.sampling_params.structural_tag is not None
):
    req.set_finish_with_abort(
        "DFLASH speculative decoding does not support grammar-constrained decoding yet."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
if (
    self.spec_algorithm.is_dflash()
    and self.enable_overlap
    and (
        req.sampling_params.top_k > 1
        or req.sampling_params.frequency_penalty != 0.0
        or req.sampling_params.presence_penalty != 0.0
        or req.sampling_params.repetition_penalty != 1.0
        or req.sampling_params.logit_bias is not None
        or req.custom_logit_processor is not None
    )
):
    req.set_finish_with_abort(
        "DFLASH spec-v2 phase 1 only supports plain greedy decoding for now. "
        "Non-greedy sampling, penalties, logit_bias, and custom logit processors are not supported."
    )
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
    return
```
The logic for aborting requests with unsupported DFLASH features is duplicated across several if blocks. This can be refactored into a helper method to reduce code repetition and improve maintainability.
For example, you could create a helper like this:
```python
def _abort_dflash_request(self, req: Req, message: str):
    req.set_finish_with_abort(message)
    self.init_req_max_new_tokens(req)
    self._add_request_to_queue(req)
```

Then you can simplify the checks:
```python
if self.spec_algorithm.is_dflash():
    if req.return_logprob:
        self._abort_dflash_request(
            req, "DFLASH speculative decoding does not support return_logprob yet."
        )
        return
    if self.enable_overlap and req.return_hidden_states:
        self._abort_dflash_request(
            req, "DFLASH spec-v2 phase 1 does not support return_hidden_states yet."
        )
        return
    # ... and so on
```

…project#20547) Cherry-pick from sgl-project#20547, resolved conflicts with PR sgl-project#16818. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…project#20547) Cherry-pick from sgl-project#20547 onto v0.5.9, resolved conflicts. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…pec v2 and overlap plan stream), needs clean-up
@dcw02 Does it currently support PCG?
I've enabled it without issues with
I'm closing this PR and will reopen it soon from another branch; I have some extra improvements.
Motivation
Add spec v2 path for DFlash. Should be merged after #16818
TLDR
B200, GSM8K, qwen3-8b, tp size 1, concurrency 32, max new tokens 2k, greedy decoding
9,688.26 tok/s → 12,360.49 tok/s (~27.6% higher throughput)

Modifications
Adds v2 worker and related files
Accuracy and Benchmarks
Tested on a gcp b200 machine
Commands:
v1 performance
overlap scheduling (spec v2) performance
Checklist
Review Process
`/tag-run-ci-label`, `/rerun-failed-ci`, `/tag-and-rerun-ci`