[Feature] Speculative decoding support lookahead #9873
zhyncs merged 29 commits into sgl-project:main
Conversation
Summary of Changes
Hello @a4zhangfei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!
This pull request introduces a new speculative decoding algorithm, 'Lookahead', designed to significantly enhance the inference speed of large language models, particularly those with high output locality. The changes span across the core runtime, integrating a novel C++-based token cache and a dedicated worker to manage the speculative generation and verification process. This feature aims to provide substantial performance gains for suitable model architectures and use cases.
Highlights
- New Speculative Decoding Algorithm: Lookahead: Introduces a new speculative decoding algorithm called 'Lookahead' to improve inference performance, especially for large language models with good output locality. This is a significant addition to the existing 'EAGLE', 'EAGLE3', and 'NEXTN' algorithms.
- Core C++ Implementation for Lookahead Cache: Adds a new C++ implementation for the Lookahead cache, including a Trie-based data structure for efficient pattern matching and insertion of token sequences. This forms the backbone of the Lookahead algorithm's ability to predict future tokens.
- Integration into SGLang Runtime (SRT): Extensively integrates the Lookahead algorithm into the SGLang Runtime (SRT) by modifying various components such as the attention layers, batch scheduling, model worker, CUDA graph runner, and tokenizer manager to support the new speculative decoding flow and its specific data structures.
- Dedicated Lookahead Worker and Utilities: Implements a dedicated LOOKAHEADWorker to manage the Lookahead cache, prepare draft tokens, and handle the verification process. New Python and CUDA utilities (lookahead_utils.py and lookahead_utils.cu) are added for the specific verification logic of the Lookahead algorithm.
- New Server Arguments for Lookahead Configuration: Introduces several new command-line arguments and server configurations to fine-tune the Lookahead algorithm's behavior, including parameters for match window size, BFS breadth, branch length, cache capacity, and match type.
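To illustrate the trie-based token cache described in the highlights, here is a minimal Python sketch. This is not the PR's C++ implementation; the class and method names (`TokenTrie`, `insert`, `match`) are hypothetical, and a draft is proposed only along unambiguous (single-child) paths for simplicity.

```python
# Minimal sketch (illustrative, not the PR's C++ cache) of a trie keyed
# by token IDs: insert observed token sequences, then match a prefix of
# recent output to propose draft tokens.
class TrieNode:
    def __init__(self):
        self.children = {}  # token_id -> TrieNode

class TokenTrie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, tokens):
        """Insert a token sequence so future prefixes can extend it."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())

    def match(self, prefix, max_draft=4):
        """Walk the trie along `prefix`, then greedily follow any
        single-child path to propose up to `max_draft` draft tokens."""
        node = self.root
        for t in prefix:
            if t not in node.children:
                return []
            node = node.children[t]
        draft = []
        while len(draft) < max_draft and len(node.children) == 1:
            (t, node), = node.children.items()
            draft.append(t)
        return draft

trie = TokenTrie()
trie.insert([1, 2, 3, 4, 5])
print(trie.match([1, 2]))  # -> [3, 4, 5]
```

The real cache additionally bounds its capacity and supports BFS-style multi-branch matching, per the server arguments listed above.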
Code Review
This pull request introduces a significant new feature: lookahead speculative decoding. It's a comprehensive change that adds a C++ implementation for the lookahead cache, Python wrappers, and integrates the new algorithm into the serving stack. The implementation is well-structured. I've found a couple of issues, mainly related to CUDA graph compatibility, which I've detailed in the comments below.
Great work! Could you please explain how you built the sgl-kernel? The command "export PYTHONPATH=sglang/sgl-kernel/python:$PYTHONPATH" doesn't seem to solve the issue. When I commented out the line
However, the cpp/h files are missing when using it. A patch like reyoung@d34ac30 is needed.
Ok, thanks, I will fix it.
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
Awesome! When is the release scheduled to come out?
Why do we call it lookahead? Lookahead is a very confusing name. Can we call it ngram?

Sure, no problem.
@a4zhangfei Thanks! Please submit a rename PR.

Ok, I will submit the PR within the next few weeks.
@a4zhangfei can you do it this week? We do not want to make breaking server arg name changes, so we need to rename it ASAP.

Ok, I'll do it this week.
Great! Please rename most of them. Specifically,
@merrymercy

> merge sgl-kernel/csrc/speculative/lookahead_utils.cu and sgl-kernel/csrc/speculative/eagle_utils.cu into speculative/tree_utils.cu

We won't do this item for now, as it hasn't been tested on AMD.
This is the PR: #11010
TODO:
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
How do I use this feature? Are there docs on this?
Motivation
For large language models with good output locality, lookahead decoding can achieve significant performance improvements at relatively low cost. This PR references #2790.
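To make the draft-and-verify idea concrete, here is a hedged sketch of the lookahead-style verification step. All names are illustrative (not the PR's API), and a toy deterministic stub stands in for the target model: drafted tokens that match what the target would emit are accepted, plus one bonus token from the target at the first mismatch.

```python
# Toy sketch of lookahead-style draft verification (illustrative names,
# not the PR's API). The "target model" is a deterministic stub:
# next token = last token + 1 (mod 100).
def target_next(ctx):
    return (ctx[-1] + 1) % 100

def verify(ctx, draft):
    """Accept the longest prefix of `draft` that the target model agrees
    with, plus the target's own token at the first mismatch (the usual
    "bonus" token). A real implementation scores all draft positions in
    one batched forward pass rather than this sequential loop."""
    accepted = []
    for d in draft:
        t = target_next(ctx + accepted)
        if d != t:
            accepted.append(t)  # bonus token from the target model
            break
        accepted.append(d)
    else:
        # every draft token matched; take one more token from the target
        accepted.append(target_next(ctx + accepted))
    return accepted

print(verify([10], [11, 12, 99]))  # -> [11, 12, 13]: two drafts accepted
```

When the output has high locality, long draft prefixes match, so each target forward pass emits several tokens at once; that is where the speedup comes from.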
Modifications
Accuracy Tests
This feature has been running in several of our business scenarios for over two months, and its stability and acceleration effects have been verified.
Benchmarking and Profiling
Test results on our own model:
We apply this feature to smaller-scale language models used for functions like user intent recognition and behavior routing. These models are characterized by short outputs and high locality, making them well suited to lookahead. The image below shows the accepted length of each forward pass when using lookahead.
The optimal acceleration result is as follows, with an average accepted length of approximately 4.8 at num prompts = 256:
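A back-of-envelope way to read the accepted-length figure: with a mean accepted length of about 4.8, each verified target forward pass emits ~4.8 tokens instead of 1, which bounds the ideal speedup (ignoring draft and verification overhead). A toy computation with made-up per-pass data:

```python
# Made-up per-pass accepted lengths (toy data) averaging 4.8, matching
# the figure reported above; not the actual measurements.
accept_lengths = [5, 4, 6, 5, 4]

mean_len = sum(accept_lengths) / len(accept_lengths)
total_tokens = sum(accept_lengths)
passes = len(accept_lengths)

# 24 tokens produced in 5 verified passes instead of 24 autoregressive
# passes: an ideal ~4.8x reduction in target-model forward passes.
print(mean_len, total_tokens, passes)  # -> 4.8 24 5
```

The realized end-to-end speedup is lower than this bound because cache lookups and verification add per-step cost.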
Acceleration effect on the public test set
model: Qwen2.5-Coder-7B-Instruct
dataset: qwen2.5_test_python
num prompt: 1024
NOTE: Before testing, the dataset format needs to be converted from "parquet" to "ShareGPT".
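A hedged sketch of the parquet-to-ShareGPT conversion mentioned in the note. The column names (`prompt`, `response`) and the exact ShareGPT record shape are assumptions; adjust them to the actual dataset schema.

```python
import json

# Hypothetical column names ("prompt", "response"); the real dataset's
# schema may differ.
def to_sharegpt(rows):
    """Convert (prompt, response) pairs into ShareGPT-style records."""
    return [
        {
            "conversations": [
                {"from": "human", "value": prompt},
                {"from": "gpt", "value": response},
            ]
        }
        for prompt, response in rows
    ]

# The parquet side would look roughly like this (pandas + pyarrow assumed):
#   df = pd.read_parquet("qwen2.5_test_python.parquet")
#   data = to_sharegpt(zip(df["prompt"], df["response"]))
#   json.dump(data, open("sharegpt.json", "w"), ensure_ascii=False)

example = to_sharegpt([("write a sort", "def sort(xs): ...")])
print(json.dumps(example))
```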
Question
Compared with the full-type tree mask, the qlen-type mask offers an 8% performance improvement. However, the qlen-type mask requires support from FlashInfer. Are there any plans for FlashInfer to support this feature?
Checklist