mtp: use inp_out_ids for skipping logit computation #23433

Merged: am17an merged 1 commit into ggml-org:master from am17an:mtp-inp-out-ids on May 21, 2026

@am17an
Contributor

@am17an am17an commented May 20, 2026

When doing a follow-up decode for the draft model, we were always computing the logits even though they are not required. Thanks to the comment at #23230 (comment) for pointing this out.
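
A minimal sketch of the idea in plain Python, not the ggml graph API. It assumes, going by the PR title, that `inp_out_ids` acts as a list of row indices selecting which token positions are projected to logits. The names `project_selected_rows`, `hidden`, `out_ids`, and `lm_head` are hypothetical and do not appear in the PR.

```python
def project_selected_rows(hidden, out_ids, lm_head):
    """Compute logits only for the token rows listed in out_ids.

    hidden:  n_tokens rows, each a list of n_embd floats
    out_ids: indices of the rows whose logits are actually needed
    lm_head: n_vocab rows, each a list of n_embd weights
    Returns (logits, multiply_adds).
    """
    selected = [hidden[i] for i in out_ids]  # row gather before the projection
    logits = []
    multiply_adds = 0
    for row in selected:
        logits.append(
            [sum(h * w for h, w in zip(row, weights)) for weights in lm_head]
        )
        multiply_adds += len(lm_head) * len(row)
    return logits, multiply_adds


# Toy sizes (illustrative only): 4 tokens, n_embd = 2, n_vocab = 3.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]]
lm_head = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

all_logits, cost_all = project_selected_rows(hidden, [0, 1, 2, 3], lm_head)
last_logits, cost_last = project_selected_rows(hidden, [3], lm_head)
no_logits, cost_none = project_selected_rows(hidden, [], lm_head)
```

With an empty `out_ids`, the projection does no multiply-adds at all, which is the saving the description refers to for a decode whose logits are never read.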

when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required.
@am17an am17an requested a review from CISC as a code owner May 20, 2026 17:04
@Bushido76

Bushido76 commented May 20, 2026

@am17an Thank you very much for your very quick reaction and first implementation. I'm a newbie regarding the llama.cpp software architecture, but I did apply and compile your implementation locally. Unfortunately, at least for now, it seems to slow down token generation speed (tg128) by around 15-20%. Given the purpose of the fix (avoiding unnecessary logit computation), I would have expected the opposite.

Testing on an AMD Unified Memory APU (Strix Halo, Vulkan/RADV) ... passing zero-sized tensors down the graph pipeline seems to cause severe pipeline stalls and driver overhead in Vulkan, compared to cleanly stopping graph expansion on the host side when n_outputs == 0.
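
The distinction drawn in this comment can be sketched as a toy op planner. This only illustrates the two strategies being contrasted (scheduling output ops with zero rows versus returning early on the host); it says nothing about actual Vulkan or driver cost, and `plan_output_ops`, `stop_on_host`, and the op names are all hypothetical.

```python
def plan_output_ops(n_tokens, n_outputs, stop_on_host):
    """Return a list of (op_name, n_rows) pairs for one decode step."""
    ops = [("transformer_layers", n_tokens)]
    if stop_on_host and n_outputs == 0:
        # Host-side early exit: no output ops are scheduled at all.
        return ops
    # Otherwise the output ops are scheduled, possibly with zero rows.
    ops += [
        ("gather_output_rows", n_outputs),
        ("output_norm", n_outputs),
        ("lm_head", n_outputs),
    ]
    return ops


with_guard = plan_output_ops(4, 0, stop_on_host=True)
without_guard = plan_output_ops(4, 0, stop_on_host=False)
```

Here `with_guard` contains only the layer op, while `without_guard` still carries three output ops that each operate on zero rows.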

@am17an
Contributor Author

am17an commented May 20, 2026

@Bushido76 indeed, there is probably some error in how you measure. This PR avoids extra computations, so it should have either zero or a positive effect.

@github-actions github-actions Bot added the model Model specific label May 21, 2026
@engrtipusultan

> @Bushido76 indeed, there is probably some error in how you measure. This PR avoids extra computations, so it should have either zero or a positive effect.

Yes it is certainly faster.

Before this PR. Your modified bench with extra variance.
https://gist.github.com/engrtipusultan/3c5985c6ff58a56bb5f7fc0c02e26f40

Before Bench

python3 mtp-bench.py --url http://127.0.0.1:52497
  code_python        pred= 192 draft= 132 acc= 124 rate=0.939 tok/s=16.6
  code_cpp           pred=  54 draft=  36 acc=  36 rate=1.000 tok/s=17.2
  explain_concept    pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=11.5
  summarize          pred=  48 draft=  40 acc=  28 rate=0.700 tok/s=13.8
  qa_factual         pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.2
  translation        pred=  17 draft=  12 acc=  10 rate=0.833 tok/s=16.1
  creative_short     pred=  30 draft=  32 acc=  14 rate=0.438 tok/s=10.7
  stepwise_math      pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.2
  long_code_review   pred= 192 draft= 201 acc=  90 rate=0.448 tok/s=10.8
  rust_boilerplate   pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=15.6
  yaml_automation    pred= 192 draft= 163 acc= 109 rate=0.669 tok/s=13.5
  json_extraction    pred= 106 draft=  70 acc=  70 rate=1.000 tok/s=17.4
  js_async_retry     pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=15.6
  regex_parsing      pred= 192 draft= 133 acc= 124 rate=0.932 tok/s=16.5
  system_tuning_explanation pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=13.6
  creative_unpredictable pred= 192 draft= 202 acc=  89 rate=0.441 tok/s=10.8

Aggregate: {
  "n_requests": 16,
  "total_predicted": 2367,
  "total_draft": 1940,
  "total_draft_accepted": 1375,
  "aggregate_accept_rate": 0.7088,
  "wall_s_total": 202.54
}

PR Bench.

python3 mtp-bench.py --url http://127.0.0.1:39879
  code_python        pred= 192 draft= 132 acc= 124 rate=0.939 tok/s=17.4
  code_cpp           pred=  54 draft=  36 acc=  36 rate=1.000 tok/s=18.2
  explain_concept    pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=12.2
  summarize          pred=  48 draft=  40 acc=  28 rate=0.700 tok/s=14.5
  qa_factual         pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.7
  translation        pred=  17 draft=  12 acc=  10 rate=0.833 tok/s=15.9
  creative_short     pred=  30 draft=  32 acc=  14 rate=0.438 tok/s=10.5
  stepwise_math      pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.3
  long_code_review   pred= 192 draft= 201 acc=  90 rate=0.448 tok/s=11.4
  rust_boilerplate   pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=16.3
  yaml_automation    pred= 192 draft= 163 acc= 109 rate=0.669 tok/s=14.1
  json_extraction    pred= 106 draft=  70 acc=  70 rate=1.000 tok/s=18.2
  js_async_retry     pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=16.3
  regex_parsing      pred= 192 draft= 133 acc= 124 rate=0.932 tok/s=17.0
  system_tuning_explanation pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=14.1
  creative_unpredictable pred= 192 draft= 202 acc=  89 rate=0.441 tok/s=11.3

Aggregate: {
  "n_requests": 16,
  "total_predicted": 2367,
  "total_draft": 1940,
  "total_draft_accepted": 1375,
  "aggregate_accept_rate": 0.7088,
  "wall_s_total": 194.94
}
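
To put a number on the difference between the two runs, the figures above can be compared directly. This is a rough sketch: the wall times and per-prompt tok/s values are copied from the two bench outputs, the variable names are illustrative, and a single run per build is not enough to estimate noise.

```python
# Aggregate wall time in seconds, from the "Aggregate" blocks of each run.
wall_before = 202.54
wall_after = 194.94

speedup = wall_before / wall_after
saved_pct = (wall_before - wall_after) / wall_before * 100

# Per-prompt tok/s in the order listed in the bench output.
tps_before = [16.6, 17.2, 11.5, 13.8, 15.2, 16.1, 10.7, 15.2,
              10.8, 15.6, 13.5, 17.4, 15.6, 16.5, 13.6, 10.8]
tps_after = [17.4, 18.2, 12.2, 14.5, 15.7, 15.9, 10.5, 15.3,
             11.4, 16.3, 14.1, 18.2, 16.3, 17.0, 14.1, 11.3]

improved = sum(1 for b, a in zip(tps_before, tps_after) if a > b)
regressed = sum(1 for b, a in zip(tps_before, tps_after) if a < b)

print(f"speedup: {speedup:.3f}x, wall time saved: {saved_pct:.2f}%")
print(f"prompts faster: {improved}, slower: {regressed}")
```

On these numbers the PR build uses about 3.75% less wall time overall, with 14 of the 16 prompts faster and 2 (translation and creative_short) slightly slower. The draft and acceptance counts are identical in both runs, so the difference is not explained by a change in acceptance rate.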

@am17an
Contributor Author

am17an commented May 21, 2026

can I get another approval?

@am17an am17an merged commit 12e5d99 into ggml-org:master May 21, 2026
48 of 49 checks passed
@am17an am17an deleted the mtp-inp-out-ids branch May 21, 2026 07:23
ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 21, 2026
when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required.
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 21, 2026
* origin/master: (138 commits)
fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
tests : move save-load-state from examples to tests (ggml-org#23336)
server: expose prompt token counts in /slots endpoint (ggml-org#23454)
metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
doc: fix spec mtp typo (ggml-org#23435)
ui: Improve Git Hooks for UI development (ggml-org#23403)
ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
hexagon: ssm-conv fix for large prompts (ggml-org#23307)
app : show version (ggml-org#23426)
mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
ui: Add max image size option (ggml-org#22849)
Move to backend sampling for MTP draft path (ggml-org#23287)
opencl: refactor backend initilization (ggml-org#23318)
common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
...
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request May 22, 2026
Notable upstream changes:
- MTP cleanup: rename state→impl, accept(is_other), p_min re-enabled,
  top_k=10, backend sampling (ggml-org#23287, ggml-org#23269)
- fit_params accounts for mmproj memory via mtmd_get_memory_usage (ggml-org#21489)
- Free draft/MTP resources on sleep (ggml-org#23461)
- MTP inp_out_ids optimization (ggml-org#23433)
- PDL for Hopper+ (ggml-org#22522)
- SWA-only model null-buffer fix (ggml-org#23131)
- Perplexity integer overflow fix (ggml-org#23496)

Fork conflict resolutions:
- speculative.cpp: updated fork classes (suffix, copyspec, recycle, dflash)
  to 3-arg accept() signature; renamed state→impl references
- server-context.cpp: integrated upstream mmproj memory measurement for
  non-swap path; kept fork's pre-doubling auto-fit for mmproj-gpu-swap
  path (now uses mtmd_get_memory_usage instead of file-size heuristic);
  added upstream's mtmd_helper_log_set to mmproj init

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>