
[Eagle] Refactor eagle speculative decoding #3986

Merged

zhyncs merged 8 commits into main from ying-eagle, Mar 5, 2025

Conversation

@Ying1123 (Contributor) commented Mar 2, 2025

Prefix caching and chunked prefill will be compatible with eagle speculative decoding after this PR.

Co-authored-by: SangBin Cho rkooo567@gmail.com
Co-authored-by: Sehoon Kim kssteven418@gmail.com
Co-authored-by: Lianmin Zheng lianminzheng@gmail.com

@Ying1123 Ying1123 marked this pull request as draft March 2, 2025 02:56
@Ying1123 Ying1123 force-pushed the ying-eagle branch 5 times, most recently from e80b2de to 9c28d33, March 2, 2025 10:20
@Ying1123 Ying1123 marked this pull request as ready for review March 3, 2025 01:50
@Ying1123 Ying1123 requested a review from HaiShaw as a code owner March 3, 2025 01:50
@zhyncs (Collaborator) commented Mar 3, 2025

@Ying1123 Could you help resolve the conflicts? Thanks!

@Ying1123 Ying1123 force-pushed the ying-eagle branch 3 times, most recently from 5e9c58a to 172c25f, March 3, 2025 05:46
@xiezhq-hermann (Collaborator) commented on the diff:
I think style-wise this is a bit confusing, if we define the allocator to be in charge of the agnostic memory operations and define another memory pool class for the underlying layouts, we should be using allocators consistently in scheduler and only use memory pool at lower level codes.
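The layering this comment argues for can be sketched roughly as follows. This is an illustrative sketch only — the class and method names are hypothetical stand-ins, not the actual sglang API: the allocator owns layout-agnostic free-slot bookkeeping and is what the scheduler talks to, while the pool only knows the underlying KV-cache layout.

```python
class TokenToKVPool:
    """Lower level: owns the KV buffers; knows nothing about who holds slots."""

    def __init__(self, size: int):
        self.size = size
        self.kv_data = [None] * size  # stand-in for the real KV tensors


class TokenToKVPoolAllocator:
    """Higher level: hands out and reclaims slot indices into the pool."""

    def __init__(self, pool: TokenToKVPool):
        self.pool = pool
        self.free_slots = list(range(pool.size))

    def alloc(self, n: int):
        # Return n slot indices, or None if not enough free memory.
        if n > len(self.free_slots):
            return None
        out, self.free_slots = self.free_slots[:n], self.free_slots[n:]
        return out

    def free(self, indices):
        # Return slot indices to the free list.
        self.free_slots.extend(indices)


# The scheduler would interact with the allocator only; only lower-level
# attention/KV code would touch the pool's layout directly.
allocator = TokenToKVPoolAllocator(TokenToKVPool(8))
locs = allocator.alloc(3)
allocator.free(locs)
```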

@merrymercy (Contributor) commented Mar 5, 2025:
Re @xiezhq-hermann: Let us merge this first to reduce the code divergence. Feel free to refactor it later with a better design.

A contributor replied:
how about token_to_kv_indices_pool 😂

@mpjlu (Contributor) commented Mar 3, 2025

commit a574770: there is an illegal memory access bug when running DeepSeek:

```
  File "/data/peng/sglang/python/sglang/srt/managers/scheduler.py", line 1218, in run_batch
    ) = self.draft_worker.forward_batch_speculative_generation(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 189, in forward_batch_speculative_generation
    spec_info, to_free_cache_loc = self.draft(batch)
  File "/data/peng/sglang/python/sglang/srt/speculative/eagle_worker.py", line 244, in draft
    assign_draft_cache_locs[(num_seqs,)](
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 345, in
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/triton/runtime/jit.py", line 691, in run
    kernel.run(grid_0, grid_1, grid_2, stream, kernel.function, kernel.packed_metadata, launch_metadata,
  File "/usr/local/lib/python3.10/dist-packages/triton/backends/nvidia/driver.py", line 365, in __call__
    self.launch(*args, **kwargs)
RuntimeError: Triton Error [CUDA]: an illegal memory access was encountered
```

@mpjlu Thanks for reporting. Could you provide the reproducible command? @ispobock @zhyncs Could you also help take a look?

The following command reproduces the error:
```shell
python3 -m sglang.launch_server \
  --model-path $model_path \
  --tp $tp_size \
  --dist-init-addr 29.224.56.106:5000 \
  --nnodes 2 \
  --node-rank 0 \
  --trust-remote-code \
  --mem-fraction-static 0.6 \
  --max-running-requests 64 \
  --speculative-draft-model-path $draft_path \
  --speculative-algorithm NEXTN \
  --speculative-num-steps 2 \
  --speculative-eagle-topk 2 \
  --speculative-num-draft-tokens 4 \
  --disable-cuda-graph
```

test.py

```python
import time

import openai

client = openai.Client(base_url="http://127.0.0.1:30000/v1", api_key="EMPTY")

# Chat completion
start = time.time()
for i in range(100):
    response = client.chat.completions.create(
        model="default",
        messages=[
            # Prompt (Chinese): "Please write a 1000-word essay on the theme of integrity."
            {"role": "user", "content": "请以诚信为主题写一篇1000字作文?"},
        ],
        temperature=0.6,
        max_tokens=1000,
        extra_body={"top_p": 0.6, "top_k": 50},
    )
    print(response)
print("dur=", time.time() - start)
```

@zhyncs zhyncs mentioned this pull request Mar 3, 2025
@ispobock (Collaborator) commented Mar 4, 2025

I verified e19e733 on 8*H200 for DeepSeek-V3 model with nextn enabled and it works fine.

@mpjlu I cannot reproduce your error in my environment. I am not sure if it's an error for multi-node setting since I tried the same args as your command but on one node.

@mpjlu (Contributor) commented Mar 4, 2025

> I verified e19e733 on 8*H200 for DeepSeek-V3 model with nextn enabled and it works fine.
>
> @mpjlu I cannot reproduce your error in my environment. I am not sure if it's an error for multi-node setting since I tried the same args as your command but on one node.

Thanks very much.
We can also run with 8*H20, but we cannot run with 16*H20 with TP 16.

```python
self.disable_radix_cache = True
self.chunked_prefill_size = -1
if self.max_running_requests is None:
    self.max_running_requests = 32
```
A collaborator commented on the snippet above:
This setting may hurt throughput, especially for a throughput-oriented model like DeepSeek. I tried request rate 16 on the ShareGPT dataset; with this limit, TTFT is higher and throughput is lower. We may need a solution that enables larger batch sizes.

@ispobock (Collaborator) commented Mar 4, 2025

> > I verified e19e733 on 8*H200 for DeepSeek-V3 model with nextn enabled and it works fine.
> >
> > @mpjlu I cannot reproduce your error in my environment. I am not sure if it's an error for multi-node setting since I tried the same args as your command but on one node.
>
> Thanks very much. We can also run with 8*H20, but we cannot run with 16*H20 with TP 16.

We will verify it for TP 16.

@ispobock (Collaborator) commented Mar 5, 2025

@mpjlu Could you help test the latest commit again?

@Ying1123 Ying1123 changed the title Refactor eagle speculative decoding [Eagle] Support prefix caching and chunked prefill for eagle speculative decoding Mar 5, 2025
@Ying1123 Ying1123 changed the title [Eagle] Support prefix caching and chunked prefill for eagle speculative decoding [Eagle] Refactor eagle speculative decoding Mar 5, 2025

@zhyncs zhyncs merged commit d3d4d76 into main Mar 5, 2025
33 of 36 checks passed
@zhyncs zhyncs deleted the ying-eagle branch March 5, 2025 16:06

```diff
-class BaseTokenToKVPool:
+class TokenToKVPoolAllocator:
     """A memory pool that maps a token location to its kv cache data."""
```
A contributor commented on the docstring:
How about "A memory pool that stores free slots in the kv cache data"?
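Under the suggested wording, the renamed class is essentially a free-list over slot indices in the KV cache. A minimal illustrative sketch of that semantics — hypothetical names, not the actual sglang implementation:

```python
class FreeSlotPool:
    """A memory pool that stores free slots in the kv cache data."""

    def __init__(self, num_slots: int):
        # Every slot index starts out free.
        self._free = list(range(num_slots))

    def available(self) -> int:
        return len(self._free)

    def take(self, n: int) -> list:
        # Hand out up to n free slot indices.
        n = min(n, len(self._free))
        taken, self._free = self._free[:n], self._free[n:]
        return taken

    def give_back(self, slots) -> None:
        # Reclaim slot indices once their KV entries are no longer needed.
        self._free.extend(slots)
```

The point of the docstring debate is that this class never touches KV data itself; it only tracks which slot indices are free.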

7 participants