
[Feature] Speculative decoding support lookahead #9873

Merged
zhyncs merged 29 commits into sgl-project:main from a4zhangfei:lookahead
Sep 18, 2025
Conversation

@a4zhangfei (Contributor) commented Sep 1, 2025

Motivation

For large language models with good output locality, lookahead decoding can achieve significant performance improvements at relatively low cost. This PR references #2790.
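The technique exploits repeated n-grams in generated text: match the last few output tokens against earlier occurrences of the same sequence and propose the continuation as draft tokens for the target model to verify in one forward pass. A toy Python sketch of the proposal step (function and parameter names are illustrative only; the PR's actual implementation uses a C++ trie cache):

```python
def propose_draft(history, match_window, num_draft):
    """Propose draft tokens by matching the tail of `history`
    against an earlier occurrence of the same n-gram (toy sketch)."""
    for w in range(match_window, 0, -1):          # prefer longer matches
        suffix = history[-w:]
        # scan earlier positions, most recent first, for the same window
        for i in range(len(history) - w - 1, -1, -1):
            if history[i:i + w] == suffix:
                return history[i + w:i + w + num_draft]
    return []

tokens = [5, 7, 9, 2, 5, 7, 9]       # "5 7 9" occurred earlier
print(propose_draft(tokens, match_window=3, num_draft=2))  # → [2, 5]
```

If the draft matches the target model's own next tokens, several positions are accepted per forward pass, which is where the reported accepted lengths come from.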

Modifications

Accuracy Tests

This feature has been running in several of our business scenarios for over two months, and its stability and acceleration effects have been verified.

Benchmarking and Profiling

Test results on our own model:

We apply this feature to smaller-scale language models used for functions like user intent recognition and behavior routing. These models are characterized by short outputs and high locality, making them well suited to lookahead. The image below shows the accepted length of each forward pass when using lookahead.

[Figure: accepted length per forward pass with lookahead enabled]

The optimal acceleration effect is as follows, with an average accepted length of approximately 4.8 and 256 prompts:

| concurrency | base | lookahead | speedup |
|---|---|---|---|
| 1 | 413.14 | 1079.52 | 2.61 |
| 2 | 672.43 | 1489.14 | 2.21 |
| 3 | 859.09 | 1650.29 | 1.92 |
| 4 | 996.53 | 1799.86 | 1.81 |

Acceleration effect on the public test set

model: Qwen2.5-Coder-7B-Instruct
dataset: qwen2.5_test_python
num prompt: 1024
NOTE: Before the test, the dataset needs to be converted from parquet format to ShareGPT format.
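The conversion script is not included in the PR; a minimal sketch of what it might look like, assuming each source row carries prompt and response fields (hypothetical column names and output filename — adjust to the real dataset schema):

```python
import json

def to_sharegpt(rows):
    """Convert (prompt, response) pairs into the ShareGPT JSON schema
    expected by bench_serving's --dataset-name sharegpt loader."""
    return [
        {"conversations": [
            {"from": "human", "value": r["prompt"]},
            {"from": "gpt", "value": r["response"]},
        ]}
        for r in rows
    ]

# rows would normally come from e.g. pandas.read_parquet(...).to_dict("records")
rows = [{"prompt": "def add(a, b):", "response": "    return a + b"}]
with open("qwen2.5_test_python.sharegpt.json", "w") as f:
    json.dump(to_sharegpt(rows), f)
```

The resulting file can then be passed to the benchmark via --dataset-path.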

# h20, tp 2
python3 -m sglang.launch_server \
    --host ${h20_4} \
    --port 33337 \
    --model-path ${Model_Path} \
    --tp 2 \
    --mem-fraction-static 0.9 \
    --max-prefill-tokens 131072 \
    --log-level info \
    --speculative-algorithm ${speculative_algo}  --speculative-num-draft-tokens 16 \
    --speculative-lookahead-min-match-window-size 1 \
    --speculative-lookahead-max-match-window-size 16 \
    --speculative-lookahead-min-bfs-breadth 1 \
    --speculative-lookahead-max-bfs-breadth 10 \
    --speculative-lookahead-branch-length 18 \
    --trust-remote-code 2>&1 | tee ${0}.log
# bench
declare -a request_rates=(256 256 256 256)
declare -a max_concurrency=(1 2 3 4)
declare -a num_prompts=(1024 1024 1024 1024)

for i in "${!request_rates[@]}"; do
    RATE=${request_rates[$i]}
    CONCURRENCY=${max_concurrency[$i]}
    PROMPTS=${num_prompts[$i]}

    echo "Running benchmark with request rate $RATE, concurrency $CONCURRENCY, prompts $PROMPTS"
    python3  /sgl-workspace/sglang/python/sglang/bench_serving.yuanshi.py \
        --apply-chat-template \
        --dataset-name sharegpt \
        --dataset-path ${dataset_path} \
        --disable-ignore-eos \
        --backend sglang \
        --flush-cache \
        --model $Model_Path \
        --sharegpt-output-len 9999 \
        --request-rate $RATE \
        --max-concurrency $CONCURRENCY \
        --num-prompts $PROMPTS \
        --host $addr --port $port
    wait
done
| concurrency | base | lookahead | accept | speedup |
|---|---|---|---|---|
| 1 | 265 | 429 | 1.45 | 1.62 |
| 2 | 507 | 782 | 1.47 | 1.54 |
| 3 | 746 | 987 | 1.43 | 1.32 |
| 4 | 971 | 1126 | 1.43 | 1.16 |
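The speedup column is simply the lookahead throughput divided by the baseline throughput at each concurrency level:

```python
# Throughput figures copied from the public-test-set table above
base = [265, 507, 746, 971]
lookahead = [429, 782, 987, 1126]
for c, (b, l) in enumerate(zip(base, lookahead), start=1):
    print(f"concurrency {c}: speedup {l / b:.2f}")
```

Note how the speedup shrinks as concurrency grows: at higher load the GPU is already well utilized, so the extra verified tokens per pass buy less.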

Question

Compared with the full-type tree mask, the qlen-type mask offers an 8% performance improvement. However, the qlen-type mask requires support from FlashInfer. Is there any plan for FlashInfer to support this feature?
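For context, a full tree mask lets each draft token attend to itself and its ancestors in the draft tree, while a qlen-type mask only needs per-request query lengths. A toy sketch of building a full tree mask from parent indices (illustrative only, not the kernel's actual layout):

```python
def build_tree_mask(parents):
    """parents[i] is the index of draft token i's parent, or -1 for a root.
    mask[i][j] is True iff token i may attend to token j (itself or an ancestor)."""
    n = len(parents)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        j = i
        while j != -1:       # walk up to the root, enabling each ancestor
            mask[i][j] = True
            j = parents[j]
    return mask

# a root with two children and one grandchild: 0 <- 1, 0 <- 2, 1 <- 3
print(build_tree_mask([-1, 0, 0, 1]))
```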

Checklist

@gemini-code-assist (bot) left a comment


Summary of Changes

Hello @a4zhangfei, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new speculative decoding algorithm, 'Lookahead', designed to significantly enhance the inference speed of large language models, particularly those with high output locality. The changes span across the core runtime, integrating a novel C++-based token cache and a dedicated worker to manage the speculative generation and verification process. This feature aims to provide substantial performance gains for suitable model architectures and use cases.

Highlights

  • New Speculative Decoding Algorithm: Lookahead: Introduces a new speculative decoding algorithm called 'Lookahead' to improve inference performance, especially for large language models with good output locality. This is a significant addition to the existing 'EAGLE', 'EAGLE3', and 'NEXTN' algorithms.
  • Core C++ Implementation for Lookahead Cache: Adds a new C++ implementation for the Lookahead cache, including a Trie-based data structure for efficient pattern matching and insertion of token sequences. This forms the backbone of the Lookahead algorithm's ability to predict future tokens.
  • Integration into SGLang Runtime (SRT): Extensively integrates the Lookahead algorithm into the SGLang Runtime (SRT) by modifying various components such as the attention layers, batch scheduling, model worker, CUDA graph runner, and tokenizer manager to support the new speculative decoding flow and its specific data structures.
  • Dedicated Lookahead Worker and Utilities: Implements a dedicated LOOKAHEADWorker to manage the Lookahead cache, prepare draft tokens, and handle the verification process. New Python and CUDA utilities (lookahead_utils.py and lookahead_utils.cu) are added for the specific verification logic of the Lookahead algorithm.
  • New Server Arguments for Lookahead Configuration: Introduces several new command-line arguments and server configurations to fine-tune the Lookahead algorithm's behavior, including parameters for match window size, BFS breadth, branch length, cache capacity, and match type.
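The trie-based cache highlighted above can be sketched in Python as follows (a toy illustration only; the PR's actual cache is implemented in C++ and additionally tracks frequencies, BFS breadth, and match windows):

```python
class NgramTrie:
    """Toy sketch of a trie-based lookahead cache: insert observed token
    sequences, then match a suffix and return a likely continuation."""

    def __init__(self):
        self.root = {}

    def insert(self, tokens, branch_length=8):
        # index every suffix of the sequence, capped at branch_length
        for start in range(len(tokens)):
            node = self.root
            for tok in tokens[start:start + branch_length]:
                node = node.setdefault(tok, {})

    def match(self, suffix, num_draft=4):
        node = self.root
        for tok in suffix:               # descend along the query suffix
            if tok not in node:
                return []
            node = node[tok]
        draft = []
        while node and len(draft) < num_draft:
            tok = next(iter(node))       # greedy: follow the first branch
            draft.append(tok)
            node = node[tok]
        return draft

cache = NgramTrie()
cache.insert([1, 2, 3, 4, 5])
print(cache.match([2, 3]))  # → [4, 5]
```

The server arguments listed above (min/max match window, BFS breadth, branch length) bound exactly these operations: how long a suffix to match and how wide and deep the draft tree may grow.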

@gemini-code-assist (bot) left a comment


Code Review

This pull request introduces a significant new feature: lookahead speculative decoding. It's a comprehensive change that adds a C++ implementation for the lookahead cache, Python wrappers, and integrates the new algorithm into the serving stack. The implementation is well-structured. I've found a couple of issues, mainly related to CUDA graph compatibility, which I've detailed in the comments below.

@valorix25:

Great work! Could you please explain how you built the sgl-kernel? The command "export PYTHONPATH=sglang/sgl-kernel/python:$PYTHONPATH" doesn't seem to solve the issue.

When I commented out the line # from sgl_kernel import common_ops in sglang/sgl-kernel/python/sgl_kernel/__init__.py, I encountered the following error:

Traceback (most recent call last):
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/scripts/test_lookahead.py", line 39, in <module>
    main()
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/scripts/test_lookahead.py", line 22, in main
    llm = sgl.Engine(model_path=model_path, speculative_algorithm='LOOKAHEAD',
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/sglang/python/sglang/utils.py", line 313, in __call__
    return module(*args, **kwargs)
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/sglang/python/sglang/srt/entrypoints/engine.py", line 127, in __init__
    tokenizer_manager, template_manager, scheduler_info = _launch_subprocesses(
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/sglang/python/sglang/srt/entrypoints/engine.py", line 715, in _launch_subprocesses
    _set_envs_and_config(server_args)
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/sglang/python/sglang/srt/entrypoints/engine.py", line 682, in _set_envs_and_config
    assert_pkg_version(
  File "/nfs/ofs-llm-ssd/user/chenjiaxing/sglang_fork/sglang/python/sglang/srt/utils.py", line 820, in assert_pkg_version
    raise Exception(
Exception: sgl-kernel is installed with version 0.3.7, which is less than the minimum required version 0.3.7.post1. Please reinstall the latest version with `pip install sgl-kernel --force-reinstall`

@reyoung (Contributor) commented Sep 16, 2025

I referred to this code:

However, the cpp/h files are missing when using pip install and in the wheel package.

A patch like reyoung@d34ac30 is needed.

@a4zhangfei (Contributor, Author):

> I referred to this code:
>
> However, the cpp/h files are missing when using pip install and in the wheel package.

OK, thanks, I will fix it.

@zhyncs zhyncs merged commit e7bc600 into sgl-project:main Sep 18, 2025
129 of 146 checks passed
chenxu140 added a commit to ping1jing2/sglang that referenced this pull request Sep 20, 2025
* origin/qwen3: (30 commits)
  chore: bump sgl-kernel 0.3.11 (sgl-project#10630)
  feat: add fused moe config for Qwen3-Next-80B-A3B-Instruct on B200 (sgl-project#10631)
  model support: Sarashina2VisionForCausalLM (sgl-project#10632)
  [Performance] Qwen3-Next: speed up update_mamba_state_after_mtp_verify by 10x; e2e up to 3.54% faster (sgl-project#10586)
  [Performance] Qwen3-Next: replace arange to cached query_start_loc_li… (sgl-project#10553)
  [Feature] Speculative decoding support lookahead (sgl-project#9873)
  refactor: use registry for _get_attention_backend_from_str (sgl-project#10629)
  [router] refactor worker to builder pattern 1/n (sgl-project#10628)
  Garbage collector regression in the online server (sgl-project#10621)
  feat: Add FlexAttention Backend for Efficient Sparse Attention (sgl-project#9947)
  Fix bias handling in TritonMoeQuantInfo within quantization/mxfp4.py (sgl-project#10579)
  [Performance] qwen3-next improve causal conv1d in prefill phase (sgl-project#10595)
  Fix sgl_kernel import failure on devices other than CUDA (sgl-project#10610)
  support qwen3-next-fp8 deepep (sgl-project#10622)
  update deepep version for qwen3-next deepep moe (sgl-project#10624)
  Feat/add heartbeat mechanism for nixl conn (sgl-project#10222)
  [RL] Add destroy process group api (sgl-project#9979)
  fix deepep assert when PD disaggregation == null (sgl-project#8274)
  Scale kkt after reduction (sgl-project#10604)
  [improvement] add average input/output token length for hicache benchmark stats output (sgl-project#10525)
  ...
@a4zhangfei a4zhangfei deleted the lookahead branch September 20, 2025 08:52
lifuhuang pushed a commit that referenced this pull request Sep 20, 2025
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
@MrKiven commented Sep 22, 2025

Awesome! When is the release scheduled to come out?

@merrymercy (Contributor):

Why do we call it lookahead? Lookahead is a very confusing name. Can we call it ngram?

@a4zhangfei (Contributor, Author):

> Why do we call it lookahead? Lookahead is a very confusing name. Can we call it ngram?

Sure, no problem.

@merrymercy (Contributor):

@a4zhangfei Thanks! Please submit a rename PR

@a4zhangfei (Contributor, Author):

> @a4zhangfei Thanks! Please submit a rename PR

Ok, I will submit the PR within the next few weeks.

@merrymercy (Contributor):

@a4zhangfei can you do it this week? We do not want to make breaking server arg name changes, so we need to rename it ASAP

@a4zhangfei (Contributor, Author):

> @a4zhangfei can you do it this week? We do not want to make breaking server arg name changes, so we need to rename it ASAP

Ok, I'll do it this week.

@merrymercy (Contributor):

Great! Please rename most lookahead to ngram in both code and filenames. Thanks for your help.

Specifically,

  • rename all lookahead server args to ngram
  • rename srt/speculative/cpp_lookahead/*.h to cpp_ngram/*.h
  • rename speculative/lookahead_worker.py, speculative/lookahead_utils.py to ngram_worker.py and ngram_utils.py
  • rename filenames lookahead_utils.py, lookahead_worker.py
  • merge sgl-kernel/csrc/speculative/lookahead_utils.cu and sgl-kernel/csrc/speculative/eagle_utils.cu into speculative/tree_utils.cu
  • merge sgl-kernel/tests/speculative/test_lookahead_utils.py and sgl-kernel/tests/speculative/test_eagle_utils.py into test_tree_utils.py

@a4zhangfei (Contributor, Author):

a4zhangfei commented Sep 28, 2025 via email

@a4zhangfei (Contributor, Author):

> Great! Please rename most lookahead to ngram in both code and filenames. Thanks for your help.
>
> Specifically,
>
>   • rename all lookahead server args to ngram
>   • rename srt/speculative/cpp_lookahead/*.h to cpp_ngram/*.h
>   • rename speculative/lookahead_worker.py, speculative/lookahead_utils.py to ngram_worker.py and ngram_utils.py
>   • rename filenames lookahead_utils.py, lookahead_worker.py
>   • merge sgl-kernel/csrc/speculative/lookahead_utils.cu and sgl-kernel/csrc/speculative/eagle_utils.cu into speculative/tree_utils.cu
>   • merge sgl-kernel/tests/speculative/test_lookahead_utils.py and sgl-kernel/tests/speculative/test_eagle_utils.py into test_tree_utils.py

This is the PR: #11010

@merrymercy
Copy link
Copy Markdown
Contributor

TODO:

  1. Support a user-passed reference argument as draft input, similar to the OpenAI API: https://platform.openai.com/docs/api-reference/chat/create#chat-create-prediction
  2. @yzh119, can you support a qlen_only tree mask for efficient verification?
  3. Get a profile via python3 -m sglang.profiler
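For item 1, OpenAI's Chat Completions API accepts predicted output text via a prediction field; a request body of that shape looks like the sketch below (an SGLang equivalent is hypothetical and not yet designed):

```python
# Shape of OpenAI's predicted-outputs request field; a comparable SGLang
# interface could accept reference text for the ngram cache the same way
# (hypothetical — no such SGLang parameter exists yet).
request = {
    "model": "gpt-4o",
    "messages": [{"role": "user", "content": "Refactor this function..."}],
    "prediction": {"type": "content", "content": "def refactor(): ..."},
}
print(request["prediction"]["type"])  # → content
```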

HanHan009527 pushed a commit to HanHan009527/sglang that referenced this pull request Oct 9, 2025
Co-authored-by: a4zhangfei <a4zhangfei@qq.com>
Co-authored-by: Qiaolin-Yu <liin1211@outlook.com>
@kevinlu1248:

How do I use this feature? Are there docs on this?

@merrymercy merrymercy mentioned this pull request Oct 23, 2025