mtp: use inp_out_ids for skipping logit computation by am17an · Pull Request #23433 · ggml-org/llama.cpp

am17an · 2026-05-20T17:04:41Z

When doing a follow-up decode for the draft model, we were always computing the logits even though they are not required. Thanks to the comment at #23230 (comment) for pointing this out.
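As a rough illustration of the idea (a toy sketch, not llama.cpp code), the snippet below projects only the hidden-state rows that are actually requested as outputs. All names here (`project_outputs`, `hidden`, `lm_head`, `out_ids`) are hypothetical; `out_ids` is assumed to play the role that `inp_out_ids` plays in the PR title, namely the list of token positions whose logits are needed.

```python
def project_outputs(hidden, lm_head, out_ids):
    """Toy output projection.

    hidden  : list of per-token hidden vectors (n_tokens x n_embd)
    lm_head : list of per-vocab weight vectors (n_vocab x n_embd)
    out_ids : indices of the tokens whose logits are actually needed

    Returns (logits, macs), where macs counts multiply-adds performed.
    """
    # Gather only the requested rows before the expensive projection.
    selected = [hidden[i] for i in out_ids]

    logits = []
    macs = 0
    for h in selected:
        row = []
        for w in lm_head:
            row.append(sum(a * b for a, b in zip(h, w)))
            macs += len(h)
        logits.append(row)
    return logits, macs


# Hypothetical sizes: 3 tokens, embedding size 2, vocabulary size 4.
hidden = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
lm_head = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]

all_logits, all_macs = project_outputs(hidden, lm_head, [0, 1, 2])
last_logits, last_macs = project_outputs(hidden, lm_head, [2])
no_logits, no_macs = project_outputs(hidden, lm_head, [])

print(all_macs, last_macs, no_macs)
```

With an empty `out_ids`, no projection work is done at all in this toy model, which is the saving the description is after when a follow-up decode does not need logits.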

Requirements

I have read and agree with the contributing guidelines
AI usage disclosure: NO

Bushido76 · 2026-05-20T19:08:26Z

@am17an Thank you very much for your very quick reaction and first implementation. I'm a newbie regarding the llama.cpp software architecture, but I applied and compiled your implementation locally. Unfortunately, at least for now, it seems to slow down the token processing speed (tg128) by around 15-20%. Given the purpose of the fix (avoiding unnecessary logit computation), imo the opposite should be the case.

Testing on an AMD Unified Memory APU (Strix Halo, Vulkan/RADV) ... passing zero-sized tensors down the graph pipeline seems to cause severe pipeline stalls and driver overhead in Vulkan, compared to cleanly stopping graph expansion on the host side when n_outputs == 0.

am17an · 2026-05-20T23:46:06Z

@Bushido76 indeed there is probably some error in how you measure. This PR avoids extra computations, so it should have either zero or a positive effect.

engrtipusultan · 2026-05-21T06:46:43Z

> @Bushido76 indeed there is probably some error in how you measure. This PR avoids extra computations, so it should have either zero or a positive effect.

Yes it is certainly faster.

Before this PR, using a modified version of your bench with extra variance:
https://gist.github.com/engrtipusultan/3c5985c6ff58a56bb5f7fc0c02e26f40

Before Bench

 bash  python3 mtp-bench.py --url http://127.0.0.1:52497
  code_python        pred= 192 draft= 132 acc= 124 rate=0.939 tok/s=16.6
  code_cpp           pred=  54 draft=  36 acc=  36 rate=1.000 tok/s=17.2
  explain_concept    pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=11.5
  summarize          pred=  48 draft=  40 acc=  28 rate=0.700 tok/s=13.8
  qa_factual         pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.2
  translation        pred=  17 draft=  12 acc=  10 rate=0.833 tok/s=16.1
  creative_short     pred=  30 draft=  32 acc=  14 rate=0.438 tok/s=10.7
  stepwise_math      pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.2
  long_code_review   pred= 192 draft= 201 acc=  90 rate=0.448 tok/s=10.8
  rust_boilerplate   pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=15.6
  yaml_automation    pred= 192 draft= 163 acc= 109 rate=0.669 tok/s=13.5
  json_extraction    pred= 106 draft=  70 acc=  70 rate=1.000 tok/s=17.4
  js_async_retry     pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=15.6
  regex_parsing      pred= 192 draft= 133 acc= 124 rate=0.932 tok/s=16.5
  system_tuning_explanation pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=13.6
  creative_unpredictable pred= 192 draft= 202 acc=  89 rate=0.441 tok/s=10.8

Aggregate: {
  "n_requests": 16,
  "total_predicted": 2367,
  "total_draft": 1940,
  "total_draft_accepted": 1375,
  "aggregate_accept_rate": 0.7088,
  "wall_s_total": 202.54
}
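The aggregate block can be cross-checked against the per-prompt rows. The sketch below assumes the aggregate accept rate is simply total accepted draft tokens divided by total drafted tokens; how `mtp-bench.py` actually computes it is not shown in this thread, and the `aggregate` helper is a hypothetical name.

```python
# (prompt, predicted, drafted, accepted), copied from the "Before Bench" rows above.
ROWS = [
    ("code_python", 192, 132, 124),
    ("code_cpp", 54, 36, 36),
    ("explain_concept", 192, 189, 95),
    ("summarize", 48, 40, 28),
    ("qa_factual", 192, 144, 118),
    ("translation", 17, 12, 10),
    ("creative_short", 30, 32, 14),
    ("stepwise_math", 192, 144, 118),
    ("long_code_review", 192, 201, 90),
    ("rust_boilerplate", 192, 140, 120),
    ("yaml_automation", 192, 163, 109),
    ("json_extraction", 106, 70, 70),
    ("js_async_retry", 192, 140, 120),
    ("regex_parsing", 192, 133, 124),
    ("system_tuning_explanation", 192, 162, 110),
    ("creative_unpredictable", 192, 202, 89),
]


def aggregate(rows):
    """Sum the per-prompt counters and derive the overall accept rate."""
    total_pred = sum(r[1] for r in rows)
    total_draft = sum(r[2] for r in rows)
    total_acc = sum(r[3] for r in rows)
    return {
        "n_requests": len(rows),
        "total_predicted": total_pred,
        "total_draft": total_draft,
        "total_draft_accepted": total_acc,
        # Assumption: accept rate = accepted / drafted, rounded to 4 places.
        "aggregate_accept_rate": round(total_acc / total_draft, 4),
    }


summary = aggregate(ROWS)
print(summary)
```

Under that assumption the recomputed totals match the reported aggregate, so the two runs can be compared on wall time alone: the token counts are identical in both.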

PR Bench.

 bash  python3 mtp-bench.py --url http://127.0.0.1:39879
  code_python        pred= 192 draft= 132 acc= 124 rate=0.939 tok/s=17.4
  code_cpp           pred=  54 draft=  36 acc=  36 rate=1.000 tok/s=18.2
  explain_concept    pred= 192 draft= 189 acc=  95 rate=0.503 tok/s=12.2
  summarize          pred=  48 draft=  40 acc=  28 rate=0.700 tok/s=14.5
  qa_factual         pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.7
  translation        pred=  17 draft=  12 acc=  10 rate=0.833 tok/s=15.9
  creative_short     pred=  30 draft=  32 acc=  14 rate=0.438 tok/s=10.5
  stepwise_math      pred= 192 draft= 144 acc= 118 rate=0.819 tok/s=15.3
  long_code_review   pred= 192 draft= 201 acc=  90 rate=0.448 tok/s=11.4
  rust_boilerplate   pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=16.3
  yaml_automation    pred= 192 draft= 163 acc= 109 rate=0.669 tok/s=14.1
  json_extraction    pred= 106 draft=  70 acc=  70 rate=1.000 tok/s=18.2
  js_async_retry     pred= 192 draft= 140 acc= 120 rate=0.857 tok/s=16.3
  regex_parsing      pred= 192 draft= 133 acc= 124 rate=0.932 tok/s=17.0
  system_tuning_explanation pred= 192 draft= 162 acc= 110 rate=0.679 tok/s=14.1
  creative_unpredictable pred= 192 draft= 202 acc=  89 rate=0.441 tok/s=11.3

Aggregate: {
  "n_requests": 16,
  "total_predicted": 2367,
  "total_draft": 1940,
  "total_draft_accepted": 1375,
  "aggregate_accept_rate": 0.7088,
  "wall_s_total": 194.94
}
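Since both runs predicted the same 2367 tokens, the overall effect can be estimated from `wall_s_total`. This is a minimal sketch of that arithmetic; it treats wall time as end-to-end time for the whole script (prompt processing and request overhead included), which is an assumption about what `mtp-bench.py` measures.

```python
def throughput(total_tokens, wall_s):
    """End-to-end tokens per second over the whole benchmark run."""
    return total_tokens / wall_s


total_predicted = 2367   # identical in both runs
before_wall_s = 202.54   # "Before Bench" aggregate
after_wall_s = 194.94    # "PR Bench" aggregate

before_tps = throughput(total_predicted, before_wall_s)
after_tps = throughput(total_predicted, after_wall_s)

wall_reduction_pct = 100.0 * (before_wall_s - after_wall_s) / before_wall_s
throughput_gain_pct = 100.0 * (after_tps / before_tps - 1.0)

print(f"before: {before_tps:.2f} tok/s, after: {after_tps:.2f} tok/s")
print(f"wall time reduction: {wall_reduction_pct:.2f}%")
print(f"throughput gain:     {throughput_gain_pct:.2f}%")
```

On these numbers the PR run is a little under 4% faster overall. Two short prompts (`translation`, `creative_short`) were marginally slower in the per-prompt rows, which may simply be run-to-run noise.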

am17an · 2026-05-21T07:16:18Z

can I get another approval?

* origin/master: (138 commits)
  fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
  tests : move save-load-state from examples to tests (ggml-org#23336)
  server: expose prompt token counts in /slots endpoint (ggml-org#23454)
  metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
  server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
  server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
  app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
  mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
  vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
  doc: fix spec mtp typo (ggml-org#23435)
  ui: Improve Git Hooks for UI development (ggml-org#23403)
  ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
  llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
  hexagon: ssm-conv fix for large prompts (ggml-org#23307)
  app : show version (ggml-org#23426)
  mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
  ui: Add max image size option (ggml-org#22849)
  Move to backend sampling for MTP draft path (ggml-org#23287)
  opencl: refactor backend initilization (ggml-org#23318)
  common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
  ...

Notable upstream changes:
- MTP cleanup: rename state→impl, accept(is_other), p_min re-enabled, top_k=10, backend sampling (ggml-org#23287, ggml-org#23269)
- fit_params accounts for mmproj memory via mtmd_get_memory_usage (ggml-org#21489)
- Free draft/MTP resources on sleep (ggml-org#23461)
- MTP inp_out_ids optimization (ggml-org#23433)
- PDL for Hopper+ (ggml-org#22522)
- SWA-only model null-buffer fix (ggml-org#23131)
- Perplexity integer overflow fix (ggml-org#23496)

Fork conflict resolutions:
- speculative.cpp: updated fork classes (suffix, copyspec, recycle, dflash) to 3-arg accept() signature; renamed state→impl references
- server-context.cpp: integrated upstream mmproj memory measurement for non-swap path; kept fork's pre-doubling auto-fit for mmproj-gpu-swap path (now uses mtmd_get_memory_usage instead of file-size heuristic); added upstream's mtmd_helper_log_set to mmproj init

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mtp: use inp_out_ids for skipping logit computation

649e9a8

when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required.

am17an requested a review from CISC as a code owner May 20, 2026 17:04

CISC approved these changes May 20, 2026

ggerganov mentioned this pull request May 20, 2026

Move to backend sampling for MTP draft path #23287

Merged

github-actions bot added the model (Model specific) label May 21, 2026

0cc4m approved these changes May 21, 2026

am17an merged commit 12e5d99 into ggml-org:master May 21, 2026
48 of 49 checks passed

am17an deleted the mtp-inp-out-ids branch May 21, 2026 07:23

nyo16 mentioned this pull request May 21, 2026

Bump llama.cpp to 52fb93a2b (30 commits) nyo16/llama_cpp_ex#42

Merged

4 tasks

drauh mentioned this pull request May 22, 2026

Misc. bug: Could save ~3 GB VRAM in graph_reserve when caller doesn't need logits (big-vocab models at large ub) #23527

Closed

a-ghorbani mentioned this pull request May 25, 2026

chore(deps): upgrade llama.rn to 0.12.4 a-ghorbani/pocketpal-ai#743

Merged

7 tasks

am17an merged 1 commit into ggml-org:master from am17an:mtp-inp-out-ids