mtp: use inp_out_ids for skipping logit computation #23433
when doing a follow-up decode for the draft model, we were always doing the logit computation even though it is not required.
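To illustrate why selecting output rows before the final projection saves work, here is a minimal standalone sketch. It is not the llama.cpp implementation: `project_selected_rows`, the row-major layouts, and the dot-product counter are all hypothetical, and the plain loops stand in for the graph operations that `inp_out_ids` feeds in the real code. The only point is that the cost of the logit computation scales with the number of selected rows, and drops to zero when no outputs are requested.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the output projection (not llama.cpp code).
// hidden : n_tokens x n_embd, row-major (one hidden state per token)
// w_out  : n_vocab  x n_embd, row-major (output projection weights)
// out_ids: indices of the token rows whose logits are actually needed
// n_dot_products receives the number of dot products performed.
static std::vector<float> project_selected_rows(
        const std::vector<float> & hidden,
        const std::vector<float> & w_out,
        const std::vector<int>   & out_ids,
        size_t   n_embd,
        size_t   n_vocab,
        size_t & n_dot_products) {
    const size_t n_tokens = hidden.size() / n_embd;

    std::vector<float> logits;
    logits.reserve(out_ids.size() * n_vocab);
    n_dot_products = 0;

    // Only the selected rows reach the projection; an empty out_ids
    // means the projection is skipped entirely.
    for (int id : out_ids) {
        assert(id >= 0 && (size_t) id < n_tokens);
        const float * h = hidden.data() + (size_t) id * n_embd;
        for (size_t v = 0; v < n_vocab; ++v) {
            const float * w = w_out.data() + v * n_embd;
            float acc = 0.0f;
            for (size_t e = 0; e < n_embd; ++e) {
                acc += h[e] * w[e];
            }
            logits.push_back(acc);
            ++n_dot_products;
        }
    }
    return logits;
}
```

With 3 tokens and a vocabulary of 3 entries, requesting all rows costs 9 dot products, requesting only the last row costs 3, and requesting none costs 0. In a real model the vocabulary is large, so the projection is the part worth skipping when the logits are never read.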
@am17an Thank you very much for your very quick reaction and first implementation. I'm a newbie regarding the llama.cpp software architecture, but I did implement and compile your implementation locally. Unfortunately, at least for now, it seems to slow down the token generation speed (tg128) by around 15-20%. Given the "background" of the fix (avoiding unnecessary logit computation), imo the opposite should be the case. Testing on an AMD Unified Memory APU (Strix Halo, Vulkan/RADV) ... passing zero-sized tensors down the graph pipeline seems to cause severe pipeline stalls and driver overhead in Vulkan, compared to cleanly stopping graph expansion on the host side when n_outputs == 0.
@Bushido76 indeed, there is probably some error in how you measure. This PR avoids extra computation, so it should have either zero or a positive effect.
Yes, it is certainly faster. (Collapsed benchmark results: before this PR, using your modified bench with extra variance; and with this PR.)
Can I get another approval?
* origin/master: (138 commits)
  fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
  tests : move save-load-state from examples to tests (ggml-org#23336)
  server: expose prompt token counts in /slots endpoint (ggml-org#23454)
  metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
  server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
  server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
  app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
  mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
  vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
  doc: fix spec mtp typo (ggml-org#23435)
  ui: Improve Git Hooks for UI development (ggml-org#23403)
  ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
  llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
  hexagon: ssm-conv fix for large prompts (ggml-org#23307)
  app : show version (ggml-org#23426)
  mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
  ui: Add max image size option (ggml-org#22849)
  Move to backend sampling for MTP draft path (ggml-org#23287)
  opencl: refactor backend initilization (ggml-org#23318)
  common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
  ...
Notable upstream changes:
- MTP cleanup: rename state→impl, accept(is_other), p_min re-enabled, top_k=10, backend sampling (ggml-org#23287, ggml-org#23269)
- fit_params accounts for mmproj memory via mtmd_get_memory_usage (ggml-org#21489)
- Free draft/MTP resources on sleep (ggml-org#23461)
- MTP inp_out_ids optimization (ggml-org#23433)
- PDL for Hopper+ (ggml-org#22522)
- SWA-only model null-buffer fix (ggml-org#23131)
- Perplexity integer overflow fix (ggml-org#23496)

Fork conflict resolutions:
- speculative.cpp: updated fork classes (suffix, copyspec, recycle, dflash) to 3-arg accept() signature; renamed state→impl references
- server-context.cpp: integrated upstream mmproj memory measurement for non-swap path; kept fork's pre-doubling auto-fit for mmproj-gpu-swap path (now uses mtmd_get_memory_usage instead of file-size heuristic); added upstream's mtmd_helper_log_set to mmproj init

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
When doing a follow-up decode for the draft model, we were always doing the logits computation even though it is not required. Thanks to the comment at #23230 (comment) for pointing this out.