Move to backend sampling for MTP draft path #23287

Merged
gaugarg-nv merged 3 commits into ggml-org:master from gaugarg-nv:mtp_backend_sampling on May 20, 2026

@gaugarg-nv
Contributor

@gaugarg-nv gaugarg-nv commented May 18, 2026

The MTP draft model is quite small, so draft sampling can dominate the draft execution time. This PR optimizes MTP draft sampling.

Replace D2H logit copies and CPU-side sort with argmax on the backend.

Make backend sampling more robust and fall back to CPU in failure cases, such as with "-sm tensor" or when a backend doesn't support ARG_MAX.

Introduce a new greedy sampler in the API that does not expose logits: llama_sampler_init_greedy_token_only.

Performance on 2x RTX 5090 with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf and --spec-draft-n-max 3 improves by ~7%.

Command: ./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp --spec-draft-n-max 3

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=279.4
  code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=244.6
  explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=244.9
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=296.3
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=293.9
  translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=272.9
  creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=249.6
  stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=278.3
  long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=261.3

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=298.4
  code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=262.0
  explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=262.2
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=315.7
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=313.9
  translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=291.5
  creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=266.1
  stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=297.6
  long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=280.1
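
The "~7%" figure can be checked directly from the two reports above. The following is a minimal sketch, not part of the PR: it parses the `tok/s=` field of each line and averages the per-benchmark gain. The helper names `parse_tok_s` and `mean_gain_percent` are made up for this illustration.

```python
import re

# tok/s lines copied from the "Master" and "PR" runs above
MASTER = """\
code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=279.4
code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=244.6
explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=244.9
summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=296.3
qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=293.9
translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=272.9
creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=249.6
stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=278.3
long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=261.3
"""

PR = """\
code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=298.4
code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=262.0
explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=262.2
summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=315.7
qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=313.9
translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=291.5
creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=266.1
stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=297.6
long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=280.1
"""


def parse_tok_s(report):
    """Map each benchmark name to its tok/s value."""
    result = {}
    for line in report.splitlines():
        match = re.match(r"\s*(\S+)\s.*tok/s=([\d.]+)", line)
        if match:
            result[match.group(1)] = float(match.group(2))
    return result


def mean_gain_percent(master_report, pr_report):
    """Average per-benchmark throughput gain of PR over master, in percent."""
    master = parse_tok_s(master_report)
    pr = parse_tok_s(pr_report)
    gains = [(pr[name] / master[name] - 1.0) * 100.0 for name in master]
    return sum(gains) / len(gains)


gain = mean_gain_percent(MASTER, PR)
print(f"mean tok/s gain: {gain:.1f}%")
```

The draft and acceptance counts are identical in both runs, so the difference in tok/s comes from the sampling path rather than from a change in accepted tokens.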

Requirements

  • I have read and agree with the contributing guidelines
  • AI usage disclosure: Yes, code review and further iteration on the code were done with AI after initial implementation.

@gaugarg-nv gaugarg-nv requested review from a team and ggerganov as code owners May 18, 2026 16:14
@gaugarg-nv gaugarg-nv requested a review from am17an May 18, 2026 16:15
@am17an
Contributor

am17an commented May 18, 2026

I discussed this with @ggerganov earlier, and he said backend sampling has issues we need to resolve first. Pasting his reply:

In theory yes, but it needs quite some work to get usable:

  • Currently asserts with tensor parallel - someone who is familiar with the Meta backend has to take a look and fix it
  • The Metal top-k kernels are slow, so it currently does not help with performance on Mac
  • Need support for multiple-sequences in the batch: sampling : support multiple outputs per sequence #19833

@gaugarg-nv
Contributor Author

  • Currently asserts with tensor parallel - someone who is familiar with the Meta backend has to take a look and fix it

In this PR, I decided to fall back to CPU sampling when tensor parallelism is used. Enabling backend sampling for "tensor parallel" is quite an involved change, and I plan to work on it as a follow-up.

  • The Metal top-k kernels are slow, so it currently does not help with performance on Mac

ok, I was not aware of it. This PR just makes use of argmax. Is that slow too, @ggerganov ?

Could you explain what is currently missing, and does #19833 address this?

@ggerganov
Member

ggerganov commented May 19, 2026

Could you explain what is currently missing, and does #19833 address this?

So this is actually needed if we want to run the backend sampling for the target model. But here the idea is to run it on the draft. So for the time being, we can ignore this point.

  • The Metal top-k kernels are slow, so it currently does not help with performance on Mac

ok, I was not aware of it. This PR just makes use of argmax. Is that slow too, @ggerganov ?

Argmax is OK, I think. But with the merge of #23269, we want to use a top-k sampler so that we can utilize the --spec-draft-p-min parameter (we need a probability estimate for the greedy token). On CUDA, does argmax make a big difference compared to a regular top-k sampler? The latter should transfer just 10 probabilities to the host, which we use for the p-min threshold.
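
To illustrate why a top-k result is enough for the p-min check while a bare argmax is not, here is a rough sketch of the host-side step. It assumes the backend returns k (token id, logit) pairs and that the probability is a softmax over those k values only, which overestimates the true probability over the full vocabulary. How llama.cpp actually normalizes is not stated in this thread, and the function names are hypothetical.

```python
import math


def greedy_with_prob(top_k):
    """Pick the greedy token from k (token_id, logit) pairs.

    Returns (token_id, prob), where prob is a softmax over the k logits only
    (an assumption of this sketch, not necessarily what llama.cpp does).
    """
    max_logit = max(logit for _, logit in top_k)
    # Subtract the max logit before exponentiating for numerical stability.
    weights = [(tok, math.exp(logit - max_logit)) for tok, logit in top_k]
    total = sum(w for _, w in weights)
    tok, weight = max(weights, key=lambda pair: pair[1])
    return tok, weight / total


def draft_tokens(steps, p_min):
    """Keep greedy draft tokens until one falls below the p_min threshold."""
    drafted = []
    for top_k in steps:
        tok, prob = greedy_with_prob(top_k)
        if prob < p_min:
            break
        drafted.append(tok)
    return drafted


# Two draft steps with k=3 for brevity: a confident one and an uncertain one.
steps = [
    [(11, 5.0), (7, 1.0), (3, 0.5)],
    [(42, 2.0), (8, 1.9), (5, 1.8)],
]
print(draft_tokens(steps, p_min=0.5))
```

With an argmax-only sampler, the host would receive just the token id and would have no estimate to compare against p_min.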

@ORippler
Collaborator

On CUDA, does argmax make a big difference compared to a regular top-k sampler? The latter should transfer just 10 probabilities to the host, which we use for the p-min threshold.

Argmax is simpler to compute than top-k, so yes, performance implications are expected:

ARGMAX(type=f32,ne=[200000,16,1,1]):                 40275 runs -    25.97 us/run -    12500 kB/run -  459.07 GB/s
ARGSORT(type=f32,ne=[200000,16,1,1],order=0):                 1343 runs -   905.10 us/run -    25000 kB/run -   26.34 GB/s
TOP_K(type=f32,ne=[200000,16,1,1],k=10,ties=0):               5370 runs -   503.69 us/run -    12500 kB/run -   23.67 GB/s

all of these should be significantly faster on CUDA than on CPU based on my previous involvement in backend sampling
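
The numbers above can be cross-checked with a little arithmetic. This sketch assumes "kB" means 1024 bytes and "GB/s" means GiB/s, which is an interpretation that happens to reproduce the printed figures. The input tensor of 200000 x 16 f32 values is 12,800,000 bytes, i.e. 12500 kB under that reading.

```python
KIB = 1024
GIB = 1024 ** 3

# (op, microseconds per run, kB per run), copied from the output above
RESULTS = [
    ("ARGMAX", 25.97, 12500),
    ("ARGSORT", 905.10, 25000),
    ("TOP_K", 503.69, 12500),
]


def bandwidth_gib_s(us_per_run, kib_per_run):
    """Effective bandwidth: bytes moved per run divided by seconds per run."""
    return (kib_per_run * KIB) / (us_per_run * 1e-6) / GIB


def slowdown_vs_argmax(op):
    """How many times slower an op is than ARGMAX on the same input."""
    times = {name: us for name, us, _ in RESULTS}
    return times[op] / times["ARGMAX"]


for name, us, kib in RESULTS:
    print(f"{name:8s} {bandwidth_gib_s(us, kib):7.2f} GiB/s  "
          f"{slowdown_vs_argmax(name):5.1f}x ARGMAX time")
```

On this input, TOP_K with k=10 takes roughly 19 times as long as ARGMAX, and a full ARGSORT roughly 35 times as long.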

@gaugarg-nv gaugarg-nv force-pushed the mtp_backend_sampling branch from 61f9de9 to 4c330e3 Compare May 19, 2026 15:47
@gaugarg-nv
Contributor Author

I have rebased the PR on top of master. It now moves top_k(10) to the backend. On the RTX 5090, I'm seeing an improvement of ~8%.

Master: ./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp --spec-draft-n-max 3

python3 mtp-bench.py
  code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=283.1
  code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=248.5
  explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=249.2
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=299.3
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=297.6
  translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=276.5
  creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=253.5
  stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=282.3
  long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=264.6

PR: ./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp --spec-draft-n-max 3

python3 mtp-bench.py
  code_python        pred= 192 draft= 177 acc= 131 rate=0.740 tok/s=303.7
  code_cpp           pred= 192 draft= 210 acc= 121 rate=0.576 tok/s=266.9
  explain_concept    pred= 192 draft= 204 acc= 121 rate=0.593 tok/s=267.7
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=322.4
  qa_factual         pred= 192 draft= 171 acc= 133 rate=0.778 tok/s=320.2
  translation        pred= 192 draft= 186 acc= 128 rate=0.688 tok/s=297.3
  creative_short     pred= 192 draft= 203 acc= 122 rate=0.601 tok/s=272.2
  stepwise_math      pred= 192 draft= 182 acc= 130 rate=0.714 tok/s=304.0
  long_code_review   pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=285.2

Master: ./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.5

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 121 rate=0.680 tok/s=232.9
  code_cpp           pred= 192 draft= 159 acc= 128 rate=0.805 tok/s=270.7
  explain_concept    pred= 192 draft= 160 acc= 120 rate=0.750 tok/s=233.5
  summarize          pred= 192 draft= 151 acc= 133 rate=0.881 tok/s=289.4
  qa_factual         pred= 192 draft= 161 acc= 131 rate=0.814 tok/s=273.8
  translation        pred= 192 draft= 159 acc= 117 rate=0.736 tok/s=231.2
  creative_short     pred= 192 draft= 141 acc= 119 rate=0.844 tok/s=243.7
  stepwise_math      pred= 192 draft= 167 acc= 126 rate=0.754 tok/s=260.9
  long_code_review   pred= 192 draft= 158 acc= 118 rate=0.747 tok/s=239.0

PR: ./llama-server -m Qwen3.6-35B-A3B-UD-Q4_K_M.gguf --spec-type draft-mtp --spec-draft-n-max 3 --spec-draft-p-min 0.5

python3 mtp-bench.py
  code_python        pred= 192 draft= 178 acc= 121 rate=0.680 tok/s=253.8
  code_cpp           pred= 192 draft= 159 acc= 128 rate=0.805 tok/s=294.0
  explain_concept    pred= 192 draft= 160 acc= 120 rate=0.750 tok/s=255.4
  summarize          pred= 192 draft= 151 acc= 133 rate=0.881 tok/s=312.9
  qa_factual         pred= 192 draft= 161 acc= 131 rate=0.814 tok/s=296.7
  translation        pred= 192 draft= 159 acc= 117 rate=0.736 tok/s=248.6
  creative_short     pred= 192 draft= 141 acc= 119 rate=0.844 tok/s=261.4
  stepwise_math      pred= 192 draft= 167 acc= 126 rate=0.754 tok/s=279.7
  long_code_review   pred= 192 draft= 158 acc= 118 rate=0.747 tok/s=256.9

@ggerganov Can you see how it performs on Metal?

@ggerganov
Member

This change has no noticeable impact on Metal performance. I think the cost of the unoptimized top-k kernel is offset by the reduced data transfer and CPU sampling. Once the kernel is optimized, we should likely see some improvement there as well.

Comment thread src/llama-sampler.cpp Outdated
@ggerganov
Member

all of these should be significantly faster on CUDA than on CPU based on my previous involvement in backend sampling

OK, let's merge this version that uses top-k first, since it is a very simple change and does not involve changes to the libllama API. After that, we can consider adding an argmax-based sampler as originally proposed.

@ggerganov ggerganov self-assigned this May 20, 2026
@am17an
Contributor

am17an commented May 20, 2026

This does not require --backend-sampling to be passed?

@gaugarg-nv
Contributor Author

This does not require --backend-sampling to be passed?

No, it uses backend sampling by default for MTP draft. It doesn't modify the sampling for the target model.

@am17an
Contributor

am17an commented May 20, 2026

Does it make sense to add a toggle for this? --spec-draft-backend-sampling which is by default on perhaps.

@gaugarg-nv
Contributor Author

Vulkan performance on RTX 5090: Average gain of ~4%.

Master:

python3 mtp-bench.py
  code_python        pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=242.1
  code_cpp           pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=233.8
  explain_concept    pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=239.2
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=267.1
  qa_factual         pred= 192 draft= 183 acc= 129 rate=0.705 tok/s=253.8
  translation        pred= 192 draft= 187 acc= 127 rate=0.679 tok/s=245.7
  creative_short     pred= 192 draft= 203 acc= 123 rate=0.606 tok/s=230.1
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=257.2
  long_code_review   pred= 192 draft= 193 acc= 126 rate=0.653 tok/s=234.8

PR:

python3 mtp-bench.py
  code_python        pred= 192 draft= 174 acc= 133 rate=0.764 tok/s=272.5
  code_cpp           pred= 192 draft= 199 acc= 124 rate=0.623 tok/s=238.2
  explain_concept    pred= 192 draft= 192 acc= 126 rate=0.656 tok/s=246.9
  summarize          pred= 192 draft= 170 acc= 134 rate=0.788 tok/s=273.8
  qa_factual         pred= 192 draft= 183 acc= 129 rate=0.705 tok/s=261.8
  translation        pred= 192 draft= 187 acc= 127 rate=0.679 tok/s=254.7
  creative_short     pred= 192 draft= 203 acc= 123 rate=0.606 tok/s=236.8
  stepwise_math      pred= 192 draft= 178 acc= 131 rate=0.736 tok/s=266.0
  long_code_review   pred= 192 draft= 193 acc= 126 rate=0.653 tok/s=245.6

@gaugarg-nv
Contributor Author

Does it make sense to add a toggle for this? --spec-draft-backend-sampling which is by default on perhaps.

I don't know how stable backend sampling is across all backends. So, it might be a good idea for debugging. @ggerganov ?

@ggerganov
Member

Ok, let's add the argument - I am also not sure how stable the implementation is.

@gaugarg-nv
Contributor Author

Added the argument. Kept it ON by default.

Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits.

Make backend sampling more robust and fall back to CPU in failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.
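
The fallback behavior described above can be summarized as a small decision function. This is only a sketch of the rules stated in this thread, not the actual llama.cpp code: `DraftConfig`, its fields, and `choose_draft_sampler` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class DraftConfig:
    """Hypothetical view of the settings that affect the draft sampler."""
    backend_sampling: bool = True        # --spec-draft-backend-sampling (on by default)
    split_mode: str = "layer"            # "tensor" corresponds to "-sm tensor"
    backend_supports_top_k: bool = True  # whether the backend implements TOP_K


def choose_draft_sampler(cfg):
    """Return "backend" when top_k(10) can run on the backend, else "cpu"."""
    if not cfg.backend_sampling:
        return "cpu"  # the user turned the toggle off
    if cfg.split_mode == "tensor":
        return "cpu"  # backend sampling is not enabled for tensor parallel
    if not cfg.backend_supports_top_k:
        return "cpu"  # the backend has no TOP_K op
    return "backend"


print(choose_draft_sampler(DraftConfig()))
print(choose_draft_sampler(DraftConfig(split_mode="tensor")))
```

In every "cpu" case, drafting still works. It just uses the previous path, which copies the full logits row to the host and samples there.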
@gaugarg-nv gaugarg-nv force-pushed the mtp_backend_sampling branch from 9fd101a to b0061f7 Compare May 20, 2026 12:04
@gaugarg-nv gaugarg-nv merged commit ad27757 into ggml-org:master May 20, 2026
49 checks passed
@homemdesgraca

homemdesgraca commented May 20, 2026

Have any changes related to "--fit" calculations been made?

Before this PR, Qwen-35B-A3B loaded fine with "-fitt 1536" and ran with about 90% of its context full without crashing. Now it crashes on the first prompt:

Full logs
0.00.281.580 I log_info: verbosity = 3 (adjust with the `-lv N` CLI arg)
0.00.281.583 I device_info:
0.00.549.637 I   - CUDA0   : NVIDIA GeForce RTX 3060 (11902 MiB, 10767 MiB free)
0.00.549.647 I   - CPU     : AMD Ryzen 5 4500 6-Core Processor (31980 MiB, 31980 MiB free)
0.00.549.730 I system_info: n_threads = 6 (n_threads_batch = 6) / 12 | CUDA : ARCHS = 860 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | BMI2 = 1 | LLAMAFILE = 1 | OPENMP = 1 | REPACK = 1 | 
0.00.549.771 I srv          init: running without SSL
0.00.549.806 I srv          init: using 11 threads for HTTP server
0.00.549.899 I srv         start: binding port with default address family
0.00.551.036 I srv  llama_server: loading model
0.00.551.043 I srv    load_model: loading model './Models/Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf'
0.00.551.094 I common_init_result: fitting params to device memory ...
0.00.551.096 I common_init_result: (for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on)
0.25.024.939 W llama_context: n_ctx_seq (100096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.25.290.114 I common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
0.26.695.606 I srv    load_model: creating MTP draft context against the target model './Models/Qwen3.6-35B-A3B-MTP-UD-Q3_K_XL.gguf'
0.26.695.646 W llama_context: n_ctx_seq (100096) < n_ctx_train (262144) -- the full capacity of the model will not be utilized
0.26.915.019 I srv    load_model: initializing slots, n_slots = 1
0.27.957.500 I common_context_can_seq_rm: the context supports bounded partial sequence removal
0.27.963.489 I common_speculative_impl_draft_mtp: adding speculative implementation 'draft-mtp'
0.27.963.498 I common_speculative_impl_draft_mtp: - n_max=1, n_min=0, p_min=0.00, n_embd=2048, backend_sampling=1
0.27.963.501 I common_speculative_impl_draft_mtp: - gpu_layers=-1, cache_k=q4_0, cache_v=q4_0, ctx_tgt=yes, ctx_dft=yes, devices=[default]
0.27.964.025 I srv    load_model: speculative decoding context initialized
0.27.964.029 I slot   load_model: id  0 | task -1 | new slot, n_ctx = 100096
0.27.964.153 I srv    load_model: prompt cache is enabled, size limit: 8192 MiB
0.27.964.155 I srv    load_model: use `--cache-ram 0` to disable the prompt cache
0.27.964.155 I srv    load_model: for more info see https://github.com/ggml-org/llama.cpp/pull/16391
0.27.964.379 I srv          init: idle slots will be saved to prompt cache and cleared upon starting a new task
0.27.997.901 I init: chat template, example_format: '<|im_start|>system
You are a helpful assistant<|im_end|>
<|im_start|>user
Hello<|im_end|>
<|im_start|>assistant
Hi there<|im_end|>
<|im_start|>user
How are you?<|im_end|>
<|im_start|>assistant
<think>

</think>

'
0.28.014.635 I srv          init: init: chat template, thinking = 0
0.28.014.698 I srv  llama_server: model loaded
0.28.014.710 I srv  llama_server: server is listening on http://0.0.0.0:5000
0.28.014.719 I srv  update_slots: all slots are idle
0.33.119.110 I srv  params_from_: Chat format: peg-native
0.33.122.564 I slot get_availabl: id  0 | task -1 | selected slot by LRU, t_last = -1
0.33.122.569 I srv  get_availabl: updating prompt cache
0.33.122.574 I srv          load:  - looking for better prompt, base f_keep = -1.000, sim = 0.000
0.33.122.577 I srv        update:  - cache state: 0 prompts, 0.000 MiB (limits: 8192.000 MiB, 100096 tokens, 8589934592 est)
0.33.122.578 I srv  get_availabl: prompt cache update took 0.01 ms
0.33.128.209 I reasoning-budget: activated, budget=2147483647 tokens
0.33.128.214 I reasoning-budget: deactivated (natural end)
0.33.128.238 I slot launch_slot_: id  0 | task 0 | processing task, is_child = 0
0.38.433.052 E ggml_backend_cuda_buffer_type_alloc_buffer: allocating 1964.95 MiB on device 0: cudaMalloc failed: out of memory
0.38.433.057 E ggml_gallocr_reserve_n_impl: failed to allocate CUDA0 buffer of size 2060396544
0.38.433.067 E graph_reserve: failed to allocate compute buffers
[New LWP 60463]
[New LWP 60462]
[New LWP 60461]
[New LWP 60460]
[New LWP 60459]
[New LWP 60319]
[New LWP 60318]
[New LWP 60317]
[New LWP 60316]
[New LWP 60315]
[New LWP 60314]
[New LWP 60313]
[New LWP 60312]
[New LWP 60311]
[New LWP 60310]
[New LWP 60309]
[New LWP 60308]
[New LWP 60307]
[New LWP 60306]
[New LWP 60305]
[New LWP 60304]

This GDB supports auto-downloading debuginfo from the following URLs:
  <https://debuginfod.archlinux.org>
  <https://debuginfod.cachyos.org>
Enable debuginfod for this session? (y or [n]) [answered N; input not from terminal]
Debuginfod has been disabled.
To make this setting permanent, add 'set debuginfod enabled off' to .gdbinit.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/usr/lib/libthread_db.so.1".
0x00007f30988b4e22 in ?? () from /usr/lib/libc.so.6
#0  0x00007f30988b4e22 in ?? () from /usr/lib/libc.so.6
#1  0x00007f30988a8178 in ?? () from /usr/lib/libc.so.6
#2  0x00007f309892fa6b in wait4 () from /usr/lib/libc.so.6
#3  0x00007f30a2f4f3bb in ggml_print_backtrace () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libggml-base.so.0
#4  0x00007f30a2f627a9 in ggml_uncaught_exception() () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libggml-base.so.0
#5  0x00007f3098cbb4bc in ?? () from /usr/lib/libstdc++.so.6
#6  0x00007f3098c9a654 in std::terminate() () from /usr/lib/libstdc++.so.6
#7  0x00007f3098cbb79d in __cxa_throw () from /usr/lib/libstdc++.so.6
#8  0x00007f30a2c89f87 in llama_context::sched_reserve() [clone .cold] () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libllama.so.0
#9  0x00007f30a2cd1fad in llama_context::decode(llama_batch const&) () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libllama.so.0
#10 0x00007f30a2cd436e in llama_decode () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libllama.so.0
#11 0x00007f30a32f73a2 in common_speculative_impl_draft_mtp::process(llama_batch const&) () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libllama-common.so.0
#12 0x00007f30a32f1de4 in common_speculative_process(common_speculative*, llama_batch const&) () from /home/homemdesgraca/Misc/llama.cpp/build/bin/libllama-common.so.0
#13 0x00005649fc4d9652 in server_context_impl::update_slots() ()
#14 0x00005649fc57c591 in server_queue::start_loop(long) ()
#15 0x00005649fc42be8c in llama_server(int, char**) ()
#16 0x00007f3098827c8e in ?? () from /usr/lib/libc.so.6
#17 0x00007f3098827dcb in __libc_start_main () from /usr/lib/libc.so.6
#18 0x00005649fc424d75 in _start ()
[Inferior 1 (process 60302) detached]
terminate called after throwing an instance of 'std::runtime_error'
  what():  failed to allocate compute pp buffers

It loads with "-fitt 2048". Even with less VRAM I got a ~10% improvement in TG, wow!

@ggerganov
Member

The extra VRAM usage might be fixed with #23433 - need to check

@homemdesgraca

The extra VRAM usage might be fixed with #23433 - need to check

Yep! #23433 allowed me to load the model with "-fitt 1024" without crashing. Compared to the results before #23287 + #23433, PP is ~20% higher (probably because of more free VRAM) and TG is ~18% better. Amazing job!!

ProTekk pushed a commit to ProTekk/buun-llama-cpp that referenced this pull request May 20, 2026
* Move to backend sampling for MTP draft path

Run top_k(10) on the draft backend. D2H transfers happen only for the top 10 logits

Make backend sampling more robust and fallback to CPU on failure cases, such as with "-sm tensor" or when a backend doesn't support TOP_K.

* Allow sampler chains to be partially offloaded to backend

* Add --spec-draft-backend-sampling argument. Enabled by default.
dbrain pushed a commit to dbrain/hbd-llama-cpp-turboquant that referenced this pull request May 21, 2026
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request May 21, 2026
* origin/master: (138 commits)
fix(flash-attn): replace f32 with kv_type and q_type (ggml-org#23372)
tests : move save-load-state from examples to tests (ggml-org#23336)
server: expose prompt token counts in /slots endpoint (ggml-org#23454)
metal : optimize concat kernel and fix set kernel threads (ggml-org#23411)
server : free draft/MTP resources on sleep to fix VRAM leak (ggml-org#23461)
server: re-inject subcommand when router spawns children under unified binary (ggml-org#23442)
app : add batched-bench, fit-params, quantize & perplexity (ggml-org#23459)
mtp: use inp_out_ids for skipping logit computation (ggml-org#23433)
vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
doc: fix spec mtp typo (ggml-org#23435)
ui: Improve Git Hooks for UI development (ggml-org#23403)
ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
hexagon: ssm-conv fix for large prompts (ggml-org#23307)
app : show version (ggml-org#23426)
mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
ui: Add max image size option (ggml-org#22849)
Move to backend sampling for MTP draft path (ggml-org#23287)
opencl: refactor backend initilization (ggml-org#23318)
common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
...
spiritbuun added a commit to spiritbuun/buun-llama-cpp that referenced this pull request May 22, 2026
Notable upstream changes:
- MTP cleanup: rename state→impl, accept(is_other), p_min re-enabled,
  top_k=10, backend sampling (ggml-org#23287, ggml-org#23269)
- fit_params accounts for mmproj memory via mtmd_get_memory_usage (ggml-org#21489)
- Free draft/MTP resources on sleep (ggml-org#23461)
- MTP inp_out_ids optimization (ggml-org#23433)
- PDL for Hopper+ (ggml-org#22522)
- SWA-only model null-buffer fix (ggml-org#23131)
- Perplexity integer overflow fix (ggml-org#23496)

Fork conflict resolutions:
- speculative.cpp: updated fork classes (suffix, copyspec, recycle, dflash)
  to 3-arg accept() signature; renamed state→impl references
- server-context.cpp: integrated upstream mmproj memory measurement for
  non-swap path; kept fork's pre-doubling auto-fit for mmproj-gpu-swap
  path (now uses mtmd_get_memory_usage instead of file-size heuristic);
  added upstream's mtmd_helper_log_set to mmproj init

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baramofme pushed a commit to baramofme/llama-cpp-turboquant that referenced this pull request May 23, 2026
Jcfunk added a commit to Jcfunk/llama.cpp that referenced this pull request May 23, 2026
* upstream/HEAD: (38 commits)
  vocab : add Carbon-3B (HybridDNATokenizer) support (ggml-org#23410)
  doc: fix spec mtp typo (ggml-org#23435)
  ui: Improve Git Hooks for UI development (ggml-org#23403)
  ggml : Check the right iface method before using the fallback 2d get (ggml-org#23306)
  llama-graph: fix null-buffer crash in llm_graph_input_attn_kv_iswa for SWA-only models (ggml-org#23131)
  hexagon: ssm-conv fix for large prompts (ggml-org#23307)
  app : show version (ggml-org#23426)
  mtmd, model : merge HunyuanOCR into HunyuanVL and fix OCR vision precision (ggml-org#23329)
  ui: Add max image size option (ggml-org#22849)
  Move to backend sampling for MTP draft path (ggml-org#23287)
  opencl: refactor backend initilization (ggml-org#23318)
  common/speculative : fix nullptr crash in get_devices_str (ggml-org#23386)
  mtmd : DeepSeek-OCR image processing fixes, img_tool::resize padding refactor (ggml-org#23345)
  vulkan: optimize operations in the IM2COL shader (ggml-org#22685)
  feat: Add WAV MIME type variants and improve audio format detection (ggml-org#23396)
  hexagon: HMX quantized matmul rework (ggml-org#23368)
  Programmatic Dependent Launch (PDL) for more performance on newer NVIDIA GPUs (Hopper+) (ggml-org#22522)
  app : introduce the llama unified executable (ggml-org#23296)
  refactor: Move text attachments up before the message content in chat completions payload (ggml-org#23406)
  mtmd: fit_params now take into account mmproj (ggml-org#21489)
  ...
srossitto79 pushed a commit to srossitto79/llama.cpp that referenced this pull request May 23, 2026
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request May 24, 2026
ggml-org#23287 (commit 0bbdec4) flipped the default to true, offloading
draft top-k(10) to the backend with D2H of only the top 10 logits.
The optimization regressed V-J accept rate on Qwen3.5-35B-A3B-MTP-IQ4_XS
from 79% (anchor 71a0c46) to 0.20% (tip 67edf5e) — root-caused by
worker mtp-accept-rate-bisect-2026-05-24 to a host-side argmax that no
longer matches the in-graph argmax over the same logits row.

This default-flip is a STOP-GAP — restores pre-0bbdec4aa semantics
(host pulls full vocab via D2H + samples locally). Once the actual
backend top-k bug is fixed, the optimization can be safely re-enabled.
Users who want it now can pass --spec-draft-backend-sampling on the CLI.

Brief: kernel-work/orchestrator-inbox/completed/processed/orchestrator-brief-mtp-accept-rate-bisect-2026-05-24.md

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
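
The mismatch described in this commit message (a host-side greedy pick that disagrees with the argmax over the same logits row) can be tested with a small consistency check. The sketch below is a hypothetical diagnostic, not code from either repository. It deliberately does not assume that the ids returned by a backend top-k are sorted by logit.

```python
def host_argmax(logits):
    """Index of the largest logit in a full-vocabulary row."""
    return max(range(len(logits)), key=lambda i: logits[i])


def greedy_from_top_k(logits, ids):
    """Greedy token among the returned top-k ids, without assuming any order."""
    return max(ids, key=lambda i: logits[i])


def matches_host_argmax(logits, ids):
    """True if the greedy pick from the top-k ids equals the full-row argmax."""
    return greedy_from_top_k(logits, ids) == host_argmax(logits)


logits = [0.1, 2.5, -1.0, 2.4, 0.0]

# A correct but unsorted top-3 still yields token 1.
print(matches_host_argmax(logits, [3, 1, 0]))

# A top-3 that misses the true maximum is flagged.
print(matches_host_argmax(logits, [3, 0, 2]))
```

Running such a check over a handful of logits rows would show whether the backend result itself is wrong or whether the host is only misreading the order of the returned ids.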
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
jimbothigpen pushed a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
jimbothigpen added a commit to jimbothigpen/llama.cpp that referenced this pull request May 25, 2026
@gaugarg-nv gaugarg-nv deleted the mtp_backend_sampling branch May 30, 2026 20:34
ssam18 added a commit to ssam18/llama.cpp that referenced this pull request May 30, 2026
PR ggml-org#23287 enabled backend draft sampling by default for the MTP path, attaching a per-seq_id sampler chain (top_k=10) to the draft context. This adds compute-buffer footprint that scales with n_seq, so configs that fit comfortably in VRAM at --parallel N>1 on b9246 now OOM during the first decode on b9410+ (see ggml-org#23903 for the bisect: b9246 fit two slots in 15.6 GB, while b9426 needs essentially the full 16 GB for one slot under the same model and flags).

Default the new behavior off so the regression does not fire on configs that worked before. Users wanting backend sampling can opt back in with --spec-draft-backend-sampling (already wired by PR ggml-org#23287).

The help text auto-reflects the new default via
string_format("default: %s", ... ? "enabled" : "disabled").
fewtarius pushed a commit to fewtarius/llama.cpp that referenced this pull request May 30, 2026
turbo-tan pushed a commit to turbo-tan/llama.cpp-tq3 that referenced this pull request Jun 2, 2026