studio: spec-draft-p-min knob + ngram-map-k/k4v power-user wire values by danielhanchen · Pull Request #5623 · unslothai/unsloth

danielhanchen · 2026-05-19T14:57:27Z

Summary

Builds on PR #5582's 5-mode Speculative Decoding dropdown by adding two upstream-driven knobs that landed in llama.cpp #23269 (MTP clean-up):

spec_draft_p_min — minimum draft probability for MTP speculative decoding (--spec-draft-p-min). Drafts below this probability are rejected. Non-functional pre-#23269; now defaults to 0.0 upstream. Surfaced in Chat Settings as a Draft p-min numeric input below Draft Tokens, visible only when the dropdown is MTP or MTP+Ngram.
ngram-map-k / ngram-map-k4v — new ngram spec types added alongside ngram-mod. Each carries its own knob triplet (--spec-{variant}-size-n/m/min-hits). NOT in the dropdown (kept lean); accepted via the load API for power users. The resolver emits the correct knob triplet when the binary's probe advertises support.
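
The knob-triplet template above can be expanded mechanically. A minimal sketch, assuming the flag names follow `--spec-{variant}-size-n`, `-size-m` and `-min-hits` exactly as the description writes them. The function names and the numeric values are hypothetical placeholders, not taken from the PR's code:

```python
from typing import Optional

# Variant names come from the PR description; everything else is illustrative.
NGRAM_MAP_VARIANTS = ("ngram-map-k", "ngram-map-k4v")


def canonicalize_variant(raw: str) -> Optional[str]:
    """Return the lower-case variant name, or None if it is not an ngram-map mode."""
    name = raw.strip().lower()
    return name if name in NGRAM_MAP_VARIANTS else None


def build_knob_triplet(variant: str, size_n: int, size_m: int, min_hits: int) -> list[str]:
    """Expand --spec-{variant}-size-n/m/min-hits into a flat argv fragment."""
    prefix = f"--spec-{variant}"
    return [
        f"{prefix}-size-n", str(size_n),
        f"{prefix}-size-m", str(size_m),
        f"{prefix}-min-hits", str(min_hits),
    ]


# Mixed-case input is accepted; the knob values below are placeholders only.
variant = canonicalize_variant("NGram-Map-K4V")
flags = build_knob_triplet(variant, size_n=12, size_m=48, min_hits=1)
print(flags)
```

Keeping the variant name as the only varying part of the prefix means both spec types share one builder, which is presumably why the PR describes a single helper with a `variant=` argument.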

Backend

_canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v.
New helper _build_ngram_map_k_flags(caps, variant=...) emits the knob triplet only when the binary advertises the knobs as real flags (not removal stubs).
_build_speculative_flags grows two new branches (one per variant) and an inline _maybe_emit_p_min helper that flows p_min through MTP paths only. Auto on an MTP GGUF still gets p_min applied because the resolved emission is draft-mtp.
LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded through routes/inference.py at the four wire sites and _request_matches_loaded_settings.
_already_in_target_state accepts spec_draft_p_min so a changed p_min bounces a reload even on the Auto-promoted path.
probe_server_capabilities now reports spec_draft_p_min_flag, supports_ngram_map_k, supports_ngram_map_k4v.
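
The p_min gating described in this list can be sketched as follows. This is not the PR's `_maybe_emit_p_min`; the signature, the `mtp_engaged` argument, and the assumption that `spec_draft_p_min_flag` is a truthy value in the capabilities dict are all illustrative:

```python
from typing import Optional


def maybe_emit_p_min(mtp_engaged: bool, p_min: Optional[float], caps: dict) -> list[str]:
    """Return the --spec-draft-p-min argv fragment, or [] when it should not be emitted."""
    # Only MTP paths carry the knob; this includes Auto once it has resolved to draft-mtp.
    if not mtp_engaged or p_min is None:
        return []
    # Graceful degrade: a binary that does not advertise the flag gets nothing.
    if not caps.get("spec_draft_p_min_flag"):
        return []
    # The request field is documented as 0..1.
    if not 0.0 <= p_min <= 1.0:
        raise ValueError(f"spec_draft_p_min out of range: {p_min}")
    return ["--spec-draft-p-min", str(p_min)]


caps_new = {"spec_draft_p_min_flag": True}
print(maybe_emit_p_min(True, 0.2, caps_new))   # MTP engaged, flag supported
print(maybe_emit_p_min(False, 0.2, caps_new))  # ngram / off: nothing emitted
print(maybe_emit_p_min(True, 0.2, {}))         # older binary: nothing emitted
```

Passing the resolved "is MTP engaged" decision in, rather than the requested mode string, is what lets the Auto-promoted case share the same code path.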

Frontend

chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter.
use-chat-model-runtime: hydrates p_min from /api/inference/status and the load response; resets p_min alongside spec mode + n_max when switching to a different model.
chat-settings-sheet: new Draft p-min number input (range 0..1, step 0.05), visible when speculativeType is mtp or mtp+ngram. Wired into the Reset and dirty-state machinery.

Tests

12 new assertions in test_llama_cpp_mtp_detection.py:

p_min emission matrix: MTP modes only; never for auto/ngram/off; auto-promoted draft-mtp still gets p_min; graceful degrade when binary lacks --spec-draft-p-min.
ngram-map-k / ngram-map-k4v emission with the correct knob triplet; no-emit when the binary doesn't advertise support.
Canonicalise recognises both ngram-map variants (incl. mixed-case).

373 backend tests pass (was 361 before).

Test plan

python -m pytest backend/tests/test_llama_cpp_mtp_detection.py backend/tests/test_llama_server_args.py backend/tests/test_gguf_reload_inheritance.py backend/tests/test_kv_cache_estimation.py -q -> 373 passed
npx tsc --noEmit -p tsconfig.app.json -> 0 errors
Manual probe against live Studio with a fresh llama.cpp build (post-#23269) confirming Draft p-min input emits --spec-draft-p-min and ngram-map-k4v via API emits the knob triplet.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4e3d05d81


chatgpt-codex-connector · 2026-05-19T15:00:40Z

+        backend_mode in ("mtp", "mtp+ngram")
+        and request.spec_draft_p_min is not None
+        and abs(
+            float(request.spec_draft_p_min) - (llama_backend.spec_draft_p_min or 0.0)
+        )


Compare spec_draft_p_min when auto resolves to MTP

This check gates spec_draft_p_min comparison on backend_mode in ("mtp", "mtp+ngram"), but for auto-promoted MTP loads requested_spec_mode stays "auto". In that case a follow-up /load with a different spec_draft_p_min is treated as "already loaded" and skips reload, so the new threshold is silently ignored. This affects API/power-user flows that send speculative_type="auto" with spec_draft_p_min on MTP GGUFs.
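
One way to address this, sketched under assumptions: gate the comparison on whether the loaded backend actually resolved to MTP instead of on the requested mode string. The helper names, the `resolved_is_mtp` argument, and the tolerance are hypothetical, since the quoted diff is cut off before its threshold:

```python
from typing import Optional


def p_min_changed(requested: Optional[float], loaded: Optional[float],
                  tol: float = 1e-9) -> bool:
    """True when the request carries a p_min that differs from the loaded one."""
    if requested is None:
        return False  # caller did not ask for a specific threshold
    return abs(float(requested) - (loaded or 0.0)) > tol


def needs_reload_for_p_min(resolved_is_mtp: bool,
                           requested: Optional[float],
                           loaded: Optional[float]) -> bool:
    # Keyed on the resolved emission, so speculative_type="auto" that was
    # promoted to MTP is compared the same way as an explicit "mtp" request.
    return resolved_is_mtp and p_min_changed(requested, loaded)


print(needs_reload_for_p_min(True, 0.3, None))   # auto-promoted MTP, new threshold
print(needs_reload_for_p_min(True, None, 0.3))   # no threshold requested
print(needs_reload_for_p_min(False, 0.3, None))  # MTP not engaged
```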


danielhanchen · 2026-05-19T15:02:11Z

Bench: post-#23269 chain vs MTP-only on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)

Re-ran the clean-sweep harness (each prompt once after two unrelated warmups; n_predict=128; temp=0; top_k=1) against a llama.cpp build that includes #23269 (MTP clean-up + chain accept-token propagation fix).

The chain (ngram+mtp) is no longer the bug-ridden laggard PR #5582's bench saw. n=4 chain jumped from 90.4 -> 113.7 t/s (+25.8%); n=5 chain jumped 84.5 -> 108.5 (+28.5%). MTP-only is essentially unchanged. The fix isolated the regression to the chain path.

config	pre (#22822)	post (#23269)	delta	x vs OFF
OFF	78.8 t/s	78.9 t/s	+0.1%	1.00x
MTP n=2	112.1 t/s	114.9 t/s	+2.4%	1.46x
MTP n=4	115.4 t/s	116.6 t/s	+1.1%	1.48x
MTP n=5	109.1 t/s	109.3 t/s	+0.2%	1.39x
ngram-only	76.1 t/s	76.1 t/s	-0.3%	0.96x
ngram+mtp n=2	99.2 t/s	111.0 t/s	+11.9%	1.41x
ngram+mtp n=4	90.4 t/s	113.7 t/s	+25.8%	1.44x
ngram+mtp n=5	84.5 t/s	108.5 t/s	+28.5%	1.37x
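
For reference, the delta and "x vs OFF" columns can be recomputed from the rounded t/s figures. A small sketch, using three rows copied from the table; because the inputs are already rounded, a recomputed value can differ from the table by 0.1 in the last digit:

```python
def delta_pct(pre: float, post: float) -> float:
    """Percent change from the pre-#23269 build to the post-#23269 build."""
    return (post / pre - 1.0) * 100.0


def speedup(tps: float, off_tps: float) -> float:
    """Throughput relative to the spec-OFF baseline on the same build."""
    return tps / off_tps


# (pre, post) t/s pairs copied from the table above.
rows = {
    "OFF": (78.8, 78.9),
    "MTP n=4": (115.4, 116.6),
    "ngram+mtp n=4": (90.4, 113.7),
}
off_post = rows["OFF"][1]

for name, (pre, post) in rows.items():
    print(f"{name:14s} {delta_pct(pre, post):+6.1f}%  {speedup(post, off_post):.2f}x")
```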

Implications for Studio's Auto policy on GPU

MTP-only n=2/4 (114.9 / 116.6 t/s) still beats the chain n=2/4 (111.0 / 113.7) by ~2.5-3.5%. So Auto's current "MTP-only on GPU, chain on CPU" policy stays correct; no policy flip needed. The chain is now a viable user-forced option (via the new MTP+Ngram dropdown choice) that lands within a few percent of MTP-only instead of well below it.

n_max=2 stays the safe universal default. n=4 squeezes ~1-2% more on most prompts but loses 15-20% on essay (low-acceptance, long-form). The conservative n=2 covers the worst-case better.

Raw CSV: temp/bench_post_23269_27b.csv.

gemini-code-assist

Code Review

This pull request implements support for ngram-map-k and ngram-map-k4v speculative decoding modes and adds a spec_draft_p_min parameter for MTP decoding. These changes involve backend capability detection, server flag construction, and frontend UI updates to allow users to configure the new parameter. A review comment suggests refactoring the capability key lookup in llama_cpp.py to use an f-string for improved conciseness.

gemini-code-assist · 2026-05-19T15:02:18Z

+            supported = map_caps.get(
+                "supports_ngram_map_k4v"
+                if effective_mode == "ngram-map-k4v"
+                else "supports_ngram_map_k"
+            )


This logic for determining the capability key can be simplified using an f-string, which would make the code more concise and easier to read.

supported = map_caps.get(f"supports_{effective_mode.replace('-', '_')}")
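
A quick check of the suggestion, as a sketch: the two forms produce the same key for the two ngram-map modes but diverge for any other mode string. Whether `effective_mode` can hold anything else at this point in `llama_cpp.py` is not visible in the quoted hunk, so treat the last line as a caveat rather than a bug report. The wrapper function names are hypothetical:

```python
def key_conditional(effective_mode: str) -> str:
    # Form used in the diff: anything that is not k4v falls back to the "k" key.
    return (
        "supports_ngram_map_k4v"
        if effective_mode == "ngram-map-k4v"
        else "supports_ngram_map_k"
    )


def key_fstring(effective_mode: str) -> str:
    # Suggested form: derive the key from the mode name.
    return f"supports_{effective_mode.replace('-', '_')}"


for mode in ("ngram-map-k", "ngram-map-k4v"):
    print(mode, key_conditional(mode) == key_fstring(mode))

# For any other mode the two forms look up different keys.
print(key_conditional("ngram-mod"), key_fstring("ngram-mod"))
```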

danielhanchen · 2026-05-19T16:29:25Z

Bench: --spec-draft-p-min sweep 0.0..1.0 on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)

Re-ran the clean-sweep harness 22 times (one llama-server per cell) on the post-#23269 binary, fixing n_max=4 and stepping --spec-draft-p-min from 0.0 to 1.0 in 0.1 increments. Same methodology as the earlier bench: each prompt once after two unrelated warmups; n_predict=128; temp=0; top_k=1.

Mean throughput per (mode, p_min)

p_min	MTP-only n=4	delta-MTP	MTP+Ngram chain n=4	delta-chain
0.0	116.6 t/s	+0.0%	111.9 t/s	+0.0%
0.1	116.8 t/s	+0.2%	113.1 t/s	+1.1%
0.2	116.6 t/s	+0.1%	113.5 t/s	+1.4%
0.3	114.0 t/s	-2.2%	110.1 t/s	-1.6%
0.4	107.1 t/s	-8.1%	107.1 t/s	-4.3%
0.5	104.6 t/s	-10.2%	104.1 t/s	-7.0%
0.6	105.1 t/s	-9.8%	102.6 t/s	-8.3%
0.7	99.0 t/s	-15.0%	99.0 t/s	-11.5%
0.8	96.9 t/s	-16.9%	95.5 t/s	-14.7%
0.9	90.5 t/s	-22.4%	91.3 t/s	-18.4%
1.0	60.2 t/s	-48.3%	60.1 t/s	-46.3%

Takeaways

p_min has a clear negative effect on throughput above ~0.2, monotonic apart from a small MTP-only blip at 0.6 (105.1 t/s vs 104.6 t/s at 0.5). A stricter threshold rejects more drafts, so the draft cost amortizes over fewer accepted tokens.
Safe band staying within 2% of the p_min=0.0 baseline: MTP-only [0.0, 0.2], chain [0.0, 0.3]. Chain is slightly more robust because even when MTP drafts are rejected the ngram-mod path still contributes.
p_min=1.0 collapses to ~60 t/s on both modes, about 24% below spec OFF (~78.9 t/s on the same model + binary in the earlier bench). At p_min=1.0 the server is still doing draft work but rejecting it, so it pays draft cost for zero acceptance. Effectively the worst-of-both setting.
Upstream's old default (0.75) lands in the -15% range on Qwen3.6-27B; #23269 dropped the default to 0.0, which lands in the optimal zone for this model.
The Studio UI knob is genuinely useful for users who care about quality-vs-speed tradeoffs (raising p_min keeps only high-confidence drafts), but the throughput-optimal default is 0.0 (matches upstream post-#23269), which is also Studio's effective default since the field is None until the user sets it. No policy change suggested.
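
The "within 2% of baseline" safe bands can be reproduced from the mean-throughput table. A minimal sketch using the rounded t/s values copied from that table; the function name and the `tolerance` parameter are illustrative:

```python
P_MIN = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# Mean t/s per p_min at n_max=4, copied from the table above.
MTP_ONLY = [116.6, 116.8, 116.6, 114.0, 107.1, 104.6, 105.1, 99.0, 96.9, 90.5, 60.2]
CHAIN = [111.9, 113.1, 113.5, 110.1, 107.1, 104.1, 102.6, 99.0, 95.5, 91.3, 60.1]


def safe_band(tps: list[float], tolerance: float = 0.02) -> list[float]:
    """p_min values whose throughput stays within `tolerance` of the p_min=0.0 cell."""
    floor = tps[0] * (1.0 - tolerance)
    return [p for p, t in zip(P_MIN, tps) if t >= floor]


print("MTP-only:", safe_band(MTP_ONLY))
print("chain:   ", safe_band(CHAIN))
```

On these numbers the MTP-only band ends at 0.2 and the chain band at 0.3, matching the takeaway above.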

CSV: temp/pmin_sweep.csv (22 cells x 9 prompts = 198 rows).

danielhanchen · 2026-05-20T08:47:46Z

Bench: full cartesian p_min x n_max x {MTP-only, MTP+Ngram chain} across all 9 MTP GGUF models (B200, post-#23269)

594 cells total (66 per model x 9 models), 5346 timing rows. Each cell = mean t/s across 9 prompts (essay/code/story/math/science/json/summary/translate/reasoning), run once after 2 unrelated warmups, fresh llama-server per cell. n_max in {0..6}, p_min in {0.0, 0.2, 0.5, 0.8, 1.0}. (mode=mtp, n_max=0) is the spec-OFF baseline.
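
The cell counts can be reproduced by enumerating the grid. Note the assumption, which is an inference and not stated in the comment: the spec-OFF baseline (mode=mtp, n_max=0) is counted once regardless of p_min. A full 2 x 7 x 5 grid would give 70 cells per model; collapsing the baseline gives the reported 66. The mode labels are hypothetical:

```python
from itertools import product

N_MAX = range(0, 7)                  # n_max in {0..6}
P_MIN = (0.0, 0.2, 0.5, 0.8, 1.0)
MODES = ("mtp", "chain")             # labels are illustrative
N_MODELS = 9
N_PROMPTS = 9


def cells_per_model() -> set:
    cells = set()
    for mode, n_max, p_min in product(MODES, N_MAX, P_MIN):
        if mode == "mtp" and n_max == 0:
            # Assumption: spec OFF ignores p_min, so it is a single cell.
            p_min = 0.0
        cells.add((mode, n_max, p_min))
    return cells


per_model = len(cells_per_model())
total_cells = per_model * N_MODELS
timing_rows = total_cells * N_PROMPTS
print(per_model, total_cells, timing_rows)  # 66 594 5346
```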

Best cell per model

model	best MTP-only	best chain	OFF baseline	speedup vs OFF
Qwen3.5-0.8B	n=0 p=0.0 -> 450.7 t/s	n=0 p=1.0 -> 503.7 t/s	450.7 t/s	1.12x
Qwen3.5-2B	n=0 p=0.0 -> 375.7 t/s	n=0 p=0.5 -> 368.7 t/s	375.7 t/s	1.00x
Qwen3.5-4B	n=2 p=0.0 -> 264.9 t/s	n=2 p=0.0 -> 259.5 t/s	240.8 t/s	1.10x
Qwen3.5-9B	n=2 p=0.2 -> 225.5 t/s	n=2 p=0.0 -> 228.1 t/s	202.4 t/s	1.13x
Qwen3.5-27B	n=2 p=0.0 -> 110.8 t/s	n=2 p=0.0 -> 110.1 t/s	78.9 t/s	1.41x
Qwen3.5-35B-A3B	n=2 p=0.0 -> 213.6 t/s	n=2 p=0.0 -> 211.8 t/s	191.3 t/s	1.12x
Qwen3.6-27B	n=4 p=0.2 -> 116.7 t/s	n=4 p=0.0 -> 113.9 t/s	79.1 t/s	1.48x
Qwen3.6-35B-A3B	n=3 p=0.0 -> 220.6 t/s	n=2 p=0.2 -> 216.6 t/s	191.8 t/s	1.15x
Qwen3.5-122B-A10B	n=2 p=0.0 -> 145.3 t/s	n=2 p=0.2 -> 143.1 t/s	115.6 t/s	1.26x

Key findings

1. Sub-3B models do not benefit from MTP. Qwen3.5-0.8B and Qwen3.5-2B both peak at n_max=0 (= spec OFF). On 0.8B, chain-mode n=0 with p_min=1.0 nudges to 1.12x but that is ngram-mod-only territory and within noise. Studio's existing Auto policy (ngram-mod fallback for sub-3B MTP) is validated by data.

2. Optimal n_max is 2 for 4B-122B, with two exceptions in the table above: Qwen3.6-27B prefers n=4, and Qwen3.6-35B-A3B MTP-only peaks at n=3. Qwen3.6-27B has the highest speedup (1.48x at n=4 p_min=0.2) of any model in the matrix. Beyond n=4, additional draft length costs more than it saves (more drafts get rejected per generated token).

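3. p_min=0.0 is the best universal default. A few best cells in the table land on p=0.2 (e.g. Qwen3.5-9B and Qwen3.6-27B MTP-only), but the earlier Qwen3.6-27B sweep puts 0.0 through 0.2 within ~0.2% of each other on MTP-only, so this is not a reason to move the default. Above 0.2, throughput degrades monotonically on every model. The only exception is the smallest models (0.8B, 2B) where the curve flattens or inverts at high p_min on n>0 cells, because rejecting drafts saves wasted work when the small model rarely produces accepted drafts in the first place. Studio's None -> upstream 0.0 default is optimal.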

4. p_min=1.0 is the worst-of-both setting on every model. Pays draft cost for zero acceptance. Drops 30-50% on every model (122B chain: 78.6 vs 121.8 at n=6).

5. Chain mode is never the clear winner. Chain matches or trails MTP-only on every model except 0.8B (which doesn't want MTP anyway); its only other lead is ~1% on Qwen3.5-9B. In the best-cell table the 27B+ tier shows MTP-only ~0.6-2.5% ahead of chain, consistent with the Auto policy's GPU branch (MTP-only on GPU, chain on CPU).
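
The MTP-only vs chain comparison in finding 5 can be checked against the best-cell table. A short sketch using the rounded best-cell t/s values copied from that table; the dict layout and function name are illustrative:

```python
# model -> (best MTP-only t/s, best chain t/s), copied from the best-cell table.
BEST = {
    "Qwen3.5-0.8B": (450.7, 503.7),
    "Qwen3.5-2B": (375.7, 368.7),
    "Qwen3.5-4B": (264.9, 259.5),
    "Qwen3.5-9B": (225.5, 228.1),
    "Qwen3.5-27B": (110.8, 110.1),
    "Qwen3.5-35B-A3B": (213.6, 211.8),
    "Qwen3.6-27B": (116.7, 113.9),
    "Qwen3.6-35B-A3B": (220.6, 216.6),
    "Qwen3.5-122B-A10B": (145.3, 143.1),
}


def mtp_lead_pct(mtp: float, chain: float) -> float:
    """How far MTP-only is ahead of chain, in percent (negative = chain ahead)."""
    return (mtp / chain - 1.0) * 100.0


for model, (mtp, chain) in BEST.items():
    print(f"{model:18s} {mtp_lead_pct(mtp, chain):+5.1f}%")

chain_ahead = [m for m, (mtp, chain) in BEST.items() if chain > mtp]
print("chain ahead on:", chain_ahead)
```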

Implications for Studio defaults

Auto policy holds: sub-3B -> ngram-mod, larger MTP models -> draft-mtp on GPU. No change.
n_max default 2 stays correct for almost every model; Qwen3.6-27B users who want the extra 3-5% can bump to 4 via the new UI knob. No global default change recommended.
p_min default None (-> upstream 0.0) is optimal. The UI knob is genuinely useful for users who want quality-vs-speed tradeoffs but the default should remain None.

CSV: temp/pmin_cartesian.csv (5346 rows). Full per-(model, mode) heatmaps in temp/pmin_cart_summary.md (271 lines, 18 tables).


danielhanchen requested a review from rolandtannous as a code owner May 19, 2026 14:57

chatgpt-codex-connector Bot reviewed May 19, 2026


gemini-code-assist Bot reviewed May 19, 2026


melroy89 mentioned this pull request May 20, 2026

[Feature] Draft model / speculative decoding #4753

Closed

danielhanchen and others added 2 commits May 22, 2026 01:10

[pre-commit.ci] auto fixes from pre-commit.com hooks

d8ed88e

for more information, see https://pre-commit.ci

danielhanchen force-pushed the studio-spec-pmin-and-ngram-map branch from d4e3d05 to d8ed88e Compare May 22, 2026 01:11
