[Spec] Route seq_lens through FutureMap; drop verify_done.wait #25879

Merged
hnyls2002 merged 25 commits into main from lsyin/draft-prefix-lens
May 21, 2026

@hnyls2002
Collaborator

@hnyls2002 hnyls2002 commented May 20, 2026

Summary

  • Drop verify_done.wait() cross-stream barrier — route seq_lens through FutureMap so schedule-stream consumers gate on a forward-stream publish_ready event instead
  • Split FutureMap buf writes by consumer: publish (schedule-consumed new_seq_lens_buf, fence-gated) vs stash (forward-only fields, FIFO-covered)
  • Fence recorded at verify-end via worker on_verify_complete callback (fires between sample and draft_extend), preserving schedule prep / draft_extend overlap

Mechanism

Between iters (schedule stream)

  • batch.seq_lens = -future_indices.indices — schedule-stream sentinel
  • FutureMap.resolve_seq_lens_cpu(batch) pulls seq_lens_cpu from new_seq_lens_buf via D2H, gated on publish_ready
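
A minimal sketch of the negative-index sentinel described above, using plain Python lists in place of device tensors. The helper names `make_sentinel` and `resolve` are hypothetical, and leaving slot 0 unused (so that `-0` is never mistaken for a sentinel) is an assumption of this sketch, not something the PR states.

```python
# Illustrative only: plain lists stand in for GPU tensors.

def make_sentinel(indices):
    """Encode FutureMap slot indices as negative placeholders."""
    return [-i for i in indices]


def resolve(seq_lens, new_seq_lens_buf):
    """Replace each negative placeholder with the value stored in its slot."""
    return [new_seq_lens_buf[-v] if v < 0 else v for v in seq_lens]


# Assumption: slot 0 is reserved, so every real slot index is >= 1.
new_seq_lens_buf = [0, 17, 42, 9]

sentinel = make_sentinel([1, 3])                 # what the batch carries between iters
resolved = resolve(sentinel, new_seq_lens_buf)   # what the consumer reads back
```

The batch only carries slot references between iterations; the real lengths are looked up once the buffer has been written.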

Inside isolation (forward stream)

  • resolve_future reassigns batch.seq_lens from new_seq_lens_buf[indices]
  • Worker fires on_verify_complete(new_seq_lens) between sample-end and draft_extend; FutureMap.publish then writes new_seq_lens_buf and records publish_ready
  • After draft_extend, FutureMap.stash writes forward-only fields (topk / hidden / bonus); same-stream FIFO covers the next iter's resolve_future read
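
The publish / stash split can be modeled roughly as below. This is a toy under stated assumptions: `ToyFutureMap` is hypothetical, a `threading.Event` stands in for the `publish_ready` CUDA event, a Python thread stands in for the forward stream, and the event is never re-recorded across iterations as a real fence would be.

```python
import threading


class ToyFutureMap:
    """Toy model of the publish/stash split (not the real FutureMap)."""

    def __init__(self, size):
        self.new_seq_lens_buf = [0] * size       # eager, fixed shape
        self.forward_only = None                 # lazy, created on first stash
        self.publish_ready = threading.Event()   # stand-in for the CUDA event

    def publish(self, indices, new_seq_lens):
        """Write schedule-consumed data, then record the fence."""
        for i, v in zip(indices, new_seq_lens):
            self.new_seq_lens_buf[i] = v
        self.publish_ready.set()

    def stash(self, indices, **fields):
        """Write forward-only fields; no fence, same-stream order suffices."""
        if self.forward_only is None:
            self.forward_only = {}
        for name, values in fields.items():
            slot = self.forward_only.setdefault(name, {})
            for i, v in zip(indices, values):
                slot[i] = v

    def resolve_seq_lens_cpu(self, indices, timeout=5.0):
        """Schedule-side read, gated on the publish fence."""
        if not self.publish_ready.wait(timeout):
            raise TimeoutError("publish_ready was never recorded")
        return [self.new_seq_lens_buf[i] for i in indices]


def forward_iteration(fmap, indices):
    new_seq_lens = [12, 30]                  # pretend verify + sample produced these
    fmap.publish(indices, new_seq_lens)      # the on_verify_complete point
    # draft_extend would run here, overlapping with schedule prep
    fmap.stash(indices, topk=[3, 5])


fmap = ToyFutureMap(size=4)
indices = [1, 2]
worker = threading.Thread(target=forward_iteration, args=(fmap, indices))
worker.start()
seq_lens_cpu = fmap.resolve_seq_lens_cpu(indices)   # waits only for publish
worker.join()
```

The schedule side blocks only until `publish` has run, not until the whole forward iteration finishes, which is the overlap the fence placement is meant to preserve.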

Changes

FutureMap (overlap_utils.py)

  • New publish / stash methods (replace store_to_map)
  • New resolve_seq_lens_cpu for schedule-stream D2H of new_seq_lens_buf
  • new_seq_lens_buf eager-allocated (fixed shape/dtype); forward-only bufs stay lazy
  • publish_ready event lives on FutureMap (no per-FutureIndices event)

Workers (eagle_worker_v2.py, multi_layer_eagle_worker_v2.py)

  • forward_batch_generation accepts on_verify_complete kwarg
  • Both extend and decode branches fire the callback after sample-end

Scheduler (scheduler.py)

  • Sets batch.seq_lens = -future_indices.indices between iters
  • Calls resolve_seq_lens_cpu pre-isolation, gated by batch.is_spec_v2
  • Uses functools.partial to bind future_indices for the callback
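
How the callback binding might look, as a hedged sketch: the function bodies below are stand-ins, the `published` dict is a hypothetical substitute for `FutureMap.publish`, and only the `on_verify_complete` kwarg name and the use of `functools.partial` come from the description above.

```python
from functools import partial

published = {}  # hypothetical stand-in for the FutureMap buffer


def on_verify_complete(future_indices, new_seq_lens):
    # In the PR this step would call FutureMap.publish; here we just record it.
    published[tuple(future_indices)] = list(new_seq_lens)


def forward_batch_generation(batch, on_verify_complete=None):
    # Stand-in worker: verify + sample, fire the callback, then draft_extend.
    new_seq_lens = [n + 1 for n in batch]
    if on_verify_complete is not None:
        on_verify_complete(new_seq_lens)   # worker only supplies new_seq_lens
    return new_seq_lens


# The scheduler binds future_indices up front, so the worker never sees them.
callback = partial(on_verify_complete, [1, 2])
result = forward_batch_generation([10, 20], on_verify_complete=callback)
```

Binding the indices on the scheduler side keeps the worker signature limited to the value it actually produces.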

ScheduleBatch (schedule_batch.py)

  • Drop refresh_seq_lens_cpu helper; inline seq_lens_sum = int(seq_lens_cpu.sum()) at call sites
  • Drop maybe_wait_verify_done (replaced by the FutureMap fence)

EagleDraftInput (eagle_info.py)

  • Drop verify_done field (fence moved to FutureMap)

CI States

Latest PR Test (Base): 🚫 Run #26205726688
Latest PR Test (Extra): ❌ Run #26205726614

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request updates the prepare_for_extend_to_fill_draft_kvcache function in eagle_info_v2.py to materialize sequence lengths from the GPU to the CPU using a single transfer. This change calculates seq_lens_cpu, seq_lens_sum, and prefix_lens directly from the materialized data, which removes the previous dependency on refresh_seq_lens_cpu and simplifies the batch metadata updates. I have no feedback to provide as there were no review comments.

Base automatically changed from lsyin/lazy-seq-lens to main May 20, 2026 11:42
@hnyls2002
Collaborator Author

/tag-and-rerun-ci extra

@hnyls2002 hnyls2002 force-pushed the lsyin/draft-prefix-lens branch 2 times, most recently from dea0b25 to d3331c9 Compare May 21, 2026 00:21
@hnyls2002 hnyls2002 closed this May 21, 2026
@hnyls2002 hnyls2002 force-pushed the lsyin/draft-prefix-lens branch from d3331c9 to 512d164 Compare May 21, 2026 01:02
@hnyls2002 hnyls2002 reopened this May 21, 2026
@hnyls2002 hnyls2002 changed the title spec_v2 draft-extend: localize D2H, drop seq_lens_cpu mirror read [Spec] Route seq_lens through FutureMap; drop verify_done.wait May 21, 2026
hnyls2002 added a commit that referenced this pull request May 21, 2026
SGLANG_SPEC_V2_NO_VERIFY_SYNC=ON fully skips the remaining D2H sync
on top of #25879's FutureMap design:

- scheduler.py: gate FutureMap.resolve_seq_lens_cpu on the env so
  batch.seq_lens_cpu stays None across the schedule prep
- eagle_info_v2.prepare_for_extend_to_fill_draft_kvcache: add gpu_only
  branch (triggered when batch.seq_lens_cpu is None) that produces
  extend_lens / prefix_lens as device tensors directly, avoiding
  .tolist() + later H2D inside ForwardBatch.init_new
- forward_batch_info.init_new: tolerate None seq_lens_cpu and accept
  Tensor extend_seq_lens / extend_prefix_lens unchanged
- eagle_worker_v2 / multi_layer_eagle_worker_v2: lazily compute
  seq_lens_sum just before build_tree_kernel_efficient when no
  preallocated mask buf forces the value
@hnyls2002 hnyls2002 merged commit baeac17 into main May 21, 2026
575 of 640 checks passed
@hnyls2002 hnyls2002 deleted the lsyin/draft-prefix-lens branch May 21, 2026 08:51
fzyzcjy added a commit to fzyzcjy/sglang that referenced this pull request May 25, 2026
schedule_batch.py: drop self.maybe_wait_verify_done() call in merge_batch —
  upstream removed verify_done.wait via FutureMap routing (sgl-project#25879); keep our
  branch's assert against chunked/dllm reqs in other.reqs.
test/registered/unit/managers/test_scheduler_chunked_req_gate.py: keep
  HEAD's deletion (v1 gate removed in v2); upstream's array.array
  migration is moot since the file goes away.