
spec : discard last drafted token with low prob #22506

Merged

ggerganov merged 1 commit into master from gg/spec-draft-discard-low-prob-token on Apr 29, 2026

Conversation

Member

@ggerganov ggerganov commented Apr 29, 2026

Overview

In the majority of drafts, the last token has low probability. For non-recurrent models, keeping that token is not a big issue. But for recurrent models, it frequently forces the logic to restore checkpoints and re-evaluate the same draft (minus the last token).

This PR simply discards the low-prob token from the draft. This should significantly improve draft-based speculative decoding for recurrent models, since the speculative checkpoint has to be restored less often.

Also fix the stats for the number of drafted/accepted tokens - on master, we always incorrectly report a 100% acceptance rate.


@ggerganov ggerganov requested review from a team as code owners April 29, 2026 08:50
@ggerganov ggerganov merged commit 683c5ac into master Apr 29, 2026
46 checks passed
@ggerganov ggerganov deleted the gg/spec-draft-discard-low-prob-token branch April 29, 2026 14:00
tekintian added a commit to tekintian/llama.cpp that referenced this pull request May 1, 2026
* 'master' of github.com:tekintian/llama.cpp: (659 commits)
  ggml-webgpu: Improve performance of mat-vec and mat-mat for MUL_MAT_ID (ggml-org#22464)
  Update llama-mmap to use ftello/fseeko (ggml-org#22497)
  common : check for null getpwuid in hf-cache (ggml-org#22550)
  vulkan: add get/set tensor 2d functions (ggml-org#22514)
  spec: fix argument typo (ggml-org#22552)
  ci : bump ty to 0.0.33 (ggml-org#22535)
  vendor : update cpp-httplib to 0.43.2 (ggml-org#22548)
  CUDA: fix tile FA kernel on Pascal (ggml-org#22541)
  scripts : add wc2wt.sh - create worktree from current HEAD (ggml-org#22513)
  add fast matmul iquants (ggml-org#22504)
  spec : fix draft model checkpoints (ggml-org#22521)
  spec : fix vocab compat checks in spec example (ggml-org#22426)
  common : do not pass prompt tokens to reasoning budget sampler (ggml-org#22488)
  hexagon: make vmem and buffer-size configurable (ggml-org#22487)
  CUDA: fuse SSM_CONV + ADD(bias) + SILU (ggml-org#22478)
  spec : disacard last drafted token with low prob (ggml-org#22506)
  sync : ggml
  ggml : bump version to 0.10.1 (ggml/1469)
  webui: fix slow mic stop and WAV encode (ggml-org#22480)
  ggml-cpu : disable tiled matmul on AIX to fix page boundary segfault (ggml-org#22293)
  ...

# Conflicts:
#	.gitignore
rsenthilkumar6 pushed a commit to rsenthilkumar6/llama.cpp that referenced this pull request May 1, 2026
samuraieng pushed a commit to samuraieng/llama.cpp that referenced this pull request May 6, 2026
ljubomirj pushed a commit to ljubomirj/llama.cpp that referenced this pull request May 6, 2026
meh pushed a commit to meh/llama.cpp that referenced this pull request May 10, 2026


3 participants