[Feature] Speculative decoding support lookahead #9873
zhyncs merged 29 commits into sgl-project:main
Conversation
Summary of Changes
Hello @a4zhangfei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new speculative decoding algorithm, 'Lookahead', designed to significantly enhance the inference speed of large language models, particularly those with high output locality. The changes span across the core runtime, integrating a novel C++-based token cache and a dedicated worker to manage the speculative generation and verification process. This feature aims to provide substantial performance gains for suitable model architectures and use cases.
Highlights
- New Speculative Decoding Algorithm: Lookahead: Introduces a new speculative decoding algorithm called 'Lookahead' to improve inference performance, especially for large language models with good output locality. This is a significant addition to the existing 'EAGLE', 'EAGLE3', and 'NEXTN' algorithms.
- Core C++ Implementation for Lookahead Cache: Adds a new C++ implementation for the Lookahead cache, including a Trie-based data structure for efficient pattern matching and insertion of token sequences. This forms the backbone of the Lookahead algorithm's ability to predict future tokens.
- Integration into SGLang Runtime (SRT): Extensively integrates the Lookahead algorithm into the SGLang Runtime (SRT) by modifying various components such as the attention layers, batch scheduling, model worker, CUDA graph runner, and tokenizer manager to support the new speculative decoding flow and its specific data structures.
- Dedicated Lookahead Worker and Utilities: Implements a dedicated LOOKAHEADWorker to manage the Lookahead cache, prepare draft tokens, and handle the verification process. New Python and CUDA utilities (lookahead_utils.py and lookahead_utils.cu) are added for the specific verification logic of the Lookahead algorithm.
- New Server Arguments for Lookahead Configuration: Introduces several new command-line arguments and server configurations to fine-tune the Lookahead algorithm's behavior, including parameters for match window size, BFS breadth, branch length, cache capacity, and match type.
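To illustrate the trie-based token cache described in the highlights, here is a minimal Python sketch. This is not the PR's C++ implementation; the class and method names (`TokenTrie`, `insert`, `match`) are hypothetical, and a draft is proposed only along unambiguous (single-child) paths for simplicity.

```python
# Minimal sketch (illustrative, not the PR's C++ cache) of a trie keyed
# by token IDs: insert observed token sequences, then match a prefix of
# recent output to propose draft tokens.
class TrieNode:
    def __init__(self):
        self.children = {}  # token_id -> TrieNode

class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Insert a token sequence so future prefixes can extend it."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match(self, prefix, max_draft=4):
        """Walk the trie along `prefix`, then greedily follow any
        single-child path to propose up to `max_draft` draft tokens."""
        node = self.root
        for t in prefix:
            if t not in node.children:
                return []
            node = node.children[t]
        draft = []
        while len(draft) < max_draft and len(node.children) == 1:
            (t, node), = node.children.items()
            draft.append(t)
        return draft

trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])
print(trie.match([1, 2]))  # -> [3, 4, 5]
```

The real cache additionally bounds its capacity and supports BFS-style multi-branch matching, per the server arguments listed above.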
Code Review
This pull request introduces a significant new feature: lookahead speculative decoding. It's a comprehensive change that adds a C++ implementation for the lookahead cache, Python wrappers, and integrates the new algorithm into the serving stack. The implementation is well-structured. I've found a couple of issues, mainly related to CUDA graph compatibility, which I've detailed in the comments below.
Great work! Could you please explain how you built the sgl-kernel? The command "export PYTHONPATH=sglang/sgl-kernel/python:$PYTHONPATH" doesn't seem to solve the issue. When I commented out the line
However, the cpp/h files are missing when using it. A patch like reyoung@d34ac30 is needed.
Ok, thanks, I will fix it.
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Awesome! When is the release scheduled to come out?
Why do we call it lookahead? Lookahead is a very confusing name. Can we call it ngram?

Sure, no problem.
@a4zhangfei Thanks! Please submit a rename PR.

Ok, I will submit the PR within the next few weeks.
@a4zhangfei can you do it this week? We do not want to make breaking server arg name changes, so we need to rename it ASAP.

Ok, I'll do it this week.
Great! Please rename most of them. Specifically,
@merrymercy

> merge sgl-kernel/csrc/speculative/lookahead_utils.cu and sgl-kernel/csrc/speculative/eagle_utils.cu into speculative/tree_utils.cu

We won't do this item for now, as it hasn't been tested on AMD.
This is the PR: #11010
TODO:
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
How do I use this feature? Are there docs on this?
Motivation
For large language models with good output locality, lookahead decoding can achieve significant performance improvements at relatively low cost. This PR references #2790.
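To make the draft-and-verify idea concrete, here is a hedged sketch of the lookahead-style verification step. All names are illustrative (not the PR's API), and a toy deterministic stub stands in for the target model: drafted tokens that match what the target would emit are accepted, plus one bonus token from the target at the first mismatch.

```python
# Toy sketch of lookahead-style draft verification (illustrative names,
# not the PR's API). The "target model" is a deterministic stub:
# next token = last token + 1 (mod 100).
def target_next(ctx):
    return (ctx[-1] + 1) % 100

def verify(ctx, draft):
    """Accept the longest prefix of `draft` that the target model agrees
    with, plus the target's own token at the first mismatch (the usual
    "bonus" token). A real implementation scores all draft positions in
    one batched forward pass rather than this sequential loop."""
    accepted = []
    for d in draft:
        t = target_next(ctx + accepted)
        if d != t:
            accepted.append(t)  # bonus token from the target model
            break
        accepted.append(d)
    else:
        # every draft token matched; take one more token from the target
        accepted.append(target_next(ctx + accepted))
    return accepted

print(verify([10], [11, 12, 99]))  # -> [11, 12, 13]: two drafts accepted
```

When the output has high locality, long draft prefixes match, so each target forward pass emits several tokens at once; that is where the speedup comes from.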
Modifications
Accuracy Tests
This feature has been running in several of our business scenarios for over two months, and its stability and acceleration effects have been verified.
Benchmarking and Profiling
Test results on our own model:
We apply this feature to smaller-scale language models used for functions like user intent recognition and behavior routing. These models are characterized by short outputs and high locality, making them well suited to lookahead. The image below shows the accepted length of each forward pass when using lookahead.
The optimal acceleration result is as follows, with an average accepted length of approximately 4.8 at num prompts = 256:
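A back-of-envelope way to read the accepted-length figure: with a mean accepted length of about 4.8, each verified target forward pass emits ~4.8 tokens instead of 1, which bounds the ideal speedup (ignoring draft and verification overhead). A toy computation with made-up per-pass data:

```python
# Made-up per-pass accepted lengths (toy data) averaging 4.8, matching
# the figure reported above; not the actual measurements.
accept_lengths = [5, 4, 6, 5, 4]

mean_len = sum(accept_lengths) / len(accept_lengths)
total_tokens = sum(accept_lengths)
passes = len(accept_lengths)

# 24 tokens produced in 5 verified passes instead of 24 autoregressive
# passes: an ideal ~4.8x reduction in target-model forward passes.
print(mean_len, total_tokens, passes)  # -> 4.8 24 5
```

The realized end-to-end speedup is lower than this bound because cache lookups and verification add per-step cost.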
Acceleration effect on the public test set
model: Qwen2.5-Coder-7B-Instruct
dataset: qwen2.5_test_python
num prompt: 1024
NOTE: Before testing, the dataset format needs to be converted from "parquet" to "ShareGPT".
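A hedged sketch of the parquet-to-ShareGPT conversion mentioned in the note. The column names (`prompt`, `response`) and the exact ShareGPT record shape are assumptions; adjust them to the actual dataset schema.

```python
import json

# Hypothetical column names ("prompt", "response"); the real dataset's
# schema may differ.
def to_sharegpt(rows):
    """Convert (prompt, response) pairs into ShareGPT-style records."""
    return [
        {
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": response},
            ]
        }
        for prompt, response in rows
    ]

# The parquet side would look roughly like this (pandas + pyarrow assumed):
#   df = pd.read_parquet("qwen2.5_test_python.parquet")
#   data = to_sharegpt(zip(df["prompt"], df["response"]))
#   json.dump(data, open("sharegpt.json", "w"), ensure_ascii=False)

example = to_sharegpt([("write a sort", "def sort(xs): ...")])
print(json.dumps(example))
```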
Question
Compared with the full-type tree mask, the qlen-type mask offers an 8% performance improvement. However, the qlen-type mask requires support from FlashInfer. Are there any plans for FlashInfer to support this feature?
Checklist