studio: spec-draft-p-min knob + ngram-map-k/k4v power-user wire values#5623

Open
danielhanchen wants to merge 2 commits into main from studio-spec-pmin-and-ngram-map

@danielhanchen

Member

Summary

Builds on PR #5582's 5-mode Speculative Decoding dropdown by adding two upstream-driven knobs that landed in llama.cpp #23269 (MTP clean-up):

  • spec_draft_p_min — minimum draft probability for MTP speculative decoding (--spec-draft-p-min). Drafts below this probability are rejected. Non-functional pre-#23269; now defaults to 0.0 upstream. Surfaced in Chat Settings as a Draft p-min numeric input below Draft Tokens, visible only when the dropdown is MTP or MTP+Ngram.
  • ngram-map-k / ngram-map-k4v — new ngram spec types added alongside ngram-mod. Each carries its own knob triplet (--spec-{variant}-size-n/m/min-hits). NOT in the dropdown (kept lean); accepted via the load API for power users. The resolver emits the correct knob triplet when the binary's probe advertises support.
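
A minimal sketch of how the per-variant knob triplet could be assembled. It assumes the shorthand `--spec-{variant}-size-n/m/min-hits` expands to three separate flags, and that the probe keys `supports_ngram_map_k` / `supports_ngram_map_k4v` gate the two variants. The function name, its parameters, and the values passed in are hypothetical and are not taken from the PR's code.

```python
from typing import List, Mapping

# Capability keys named in this PR; the variant-to-key mapping is an assumption.
_CAP_KEY = {
    "ngram-map-k": "supports_ngram_map_k",
    "ngram-map-k4v": "supports_ngram_map_k4v",
}


def build_ngram_map_flags(
    caps: Mapping[str, bool],
    variant: str,
    size_n: int,
    size_m: int,
    min_hits: int,
) -> List[str]:
    """Hypothetical helper: return the knob triplet for `variant`, or [] if unsupported."""
    if not caps.get(_CAP_KEY.get(variant, ""), False):
        return []  # the binary does not advertise this variant, so emit nothing
    prefix = f"--spec-{variant}"  # assumed expansion of the PR's shorthand
    return [
        f"{prefix}-size-n", str(size_n),
        f"{prefix}-size-m", str(size_m),
        f"{prefix}-min-hits", str(min_hits),
    ]
```

Returning an empty list for an unsupported or unknown variant mirrors the "no-emit when the binary doesn't advertise support" behaviour described in the Tests section.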

Backend

  • _canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v.
  • New helper _build_ngram_map_k_flags(caps, variant=...) emits the knob triplet only when the binary advertises the knobs as real flags (not removal stubs).
  • _build_speculative_flags grows two new branches (one per variant) and an inline _maybe_emit_p_min helper that flows p_min through MTP paths only. Auto on an MTP GGUF still gets p_min applied because the resolved emission is draft-mtp.
  • LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded through routes/inference.py at the four wire sites and _request_matches_loaded_settings.
  • _already_in_target_state accepts spec_draft_p_min so a changed p_min bounces a reload even on the Auto-promoted path.
  • probe_server_capabilities now reports spec_draft_p_min_flag, supports_ngram_map_k, supports_ngram_map_k4v.
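
The p_min gating described in the list above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's `_maybe_emit_p_min`: the function name, the set of MTP-engaging mode strings, and the truthiness check on `spec_draft_p_min_flag` are all guesses about how the pieces fit together.

```python
from typing import List, Mapping, Optional

# Assumed set of resolved modes in which MTP actually engages.
MTP_EMISSIONS = {"draft-mtp", "mtp", "mtp+ngram"}


def maybe_emit_p_min(
    resolved_mode: str,
    p_min: Optional[float],
    caps: Mapping[str, object],
) -> List[str]:
    """Hypothetical sketch: emit --spec-draft-p-min only on MTP paths."""
    if p_min is None:
        return []  # field unset: leave the upstream default in place
    if resolved_mode not in MTP_EMISSIONS:
        return []  # ngram-only / off: the knob has no effect
    if not caps.get("spec_draft_p_min_flag"):
        return []  # older binary without the flag: degrade gracefully
    if not 0.0 <= p_min <= 1.0:
        raise ValueError("spec_draft_p_min must be within 0..1")
    return ["--spec-draft-p-min", str(p_min)]
```

Keying the check on the resolved mode rather than the requested one is what lets an Auto request that resolves to draft-mtp still receive the flag.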

Frontend

  • chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter.
  • use-chat-model-runtime: hydrates p_min from /api/inference/status and the load response; resets p_min alongside spec mode + n_max when switching to a different model.
  • chat-settings-sheet: new Draft p-min number input (range 0..1, step 0.05), visible when speculativeType is mtp or mtp+ngram. Wired into the Reset and dirty-state machinery.

Tests

12 new assertions in test_llama_cpp_mtp_detection.py:

  • p_min emission matrix: MTP modes only; never for auto/ngram/off; auto-promoted draft-mtp still gets p_min; graceful degrade when binary lacks --spec-draft-p-min.
  • ngram-map-k / ngram-map-k4v emission with the correct knob triplet; no-emit when the binary doesn't advertise support.
  • Canonicalise recognises both ngram-map variants (incl. mixed-case).
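
For illustration, the mixed-case canonicalisation case might look like the sketch below. The function name, the set of accepted wire values, and the choice to return `None` for unknown input are assumptions; the PR's `_canonicalize_spec_mode` may differ.

```python
from typing import Optional

# Assumed wire values; only the two ngram-map names are confirmed by this PR.
KNOWN_MODES = {
    "auto", "off", "mtp", "ngram", "mtp+ngram",
    "ngram-map-k", "ngram-map-k4v",
}


def canonicalize_spec_mode(raw: Optional[str]) -> Optional[str]:
    """Hypothetical sketch: lower-case and trim, then accept only known modes."""
    if raw is None:
        return None
    mode = raw.strip().lower()
    return mode if mode in KNOWN_MODES else None


assert canonicalize_spec_mode("NGram-Map-K4V") == "ngram-map-k4v"
assert canonicalize_spec_mode(" Ngram-Map-K ") == "ngram-map-k"
assert canonicalize_spec_mode("bogus") is None
```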

373 backend tests pass (was 361 before).

Test plan

  • python -m pytest backend/tests/test_llama_cpp_mtp_detection.py backend/tests/test_llama_server_args.py backend/tests/test_gguf_reload_inheritance.py backend/tests/test_kv_cache_estimation.py -q -> 373 passed
  • npx tsc --noEmit -p tsconfig.app.json -> 0 errors
  • Manual probe against live Studio with a fresh llama.cpp build (post-#23269) confirming Draft p-min input emits --spec-draft-p-min and ngram-map-k4v via API emits the knob triplet.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d4e3d05d81

Comment on lines +510 to +514
    backend_mode in ("mtp", "mtp+ngram")
    and request.spec_draft_p_min is not None
    and abs(
        float(request.spec_draft_p_min) - (llama_backend.spec_draft_p_min or 0.0)
    )

P2: Compare spec_draft_p_min when auto resolves to MTP

This check gates spec_draft_p_min comparison on backend_mode in ("mtp", "mtp+ngram"), but for auto-promoted MTP loads requested_spec_mode stays "auto". In that case a follow-up /load with a different spec_draft_p_min is treated as "already loaded" and skips reload, so the new threshold is silently ignored. This affects API/power-user flows that send speculative_type="auto" with spec_draft_p_min on MTP GGUFs.
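
One way to close the gap described in this comment is to compare p_min against the resolved emission rather than the requested mode. The sketch below is only an illustration: `p_min_needs_reload`, its parameters, the mode set, and the tolerance are hypothetical and are not the PR's code.

```python
from typing import Optional

MTP_MODES = {"mtp", "mtp+ngram", "draft-mtp"}  # assumed MTP-engaging modes
EPS = 1e-9  # assumed tolerance; the original comparison threshold is not shown


def p_min_needs_reload(
    requested_mode: str,
    resolved_mode: str,
    requested_p_min: Optional[float],
    loaded_p_min: Optional[float],
) -> bool:
    """Hypothetical check: True when a changed p_min should force a reload."""
    if requested_p_min is None:
        return False  # caller did not ask for a specific threshold
    # For "auto", look at what the backend actually resolved to.
    effective = resolved_mode if requested_mode == "auto" else requested_mode
    if effective not in MTP_MODES:
        return False  # p_min has no effect outside MTP paths
    return abs(float(requested_p_min) - (loaded_p_min or 0.0)) > EPS
```

With this shape, an `auto` request on an MTP GGUF (resolved to `draft-mtp`) with a new threshold reports a difference, while the same request on a model that resolved to an ngram-only mode does not.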

@danielhanchen

Member Author

Bench: post-#23269 chain vs MTP-only on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)

Re-ran the clean-sweep harness (each prompt once after two unrelated warmups; n_predict=128; temp=0; top_k=1) against a llama.cpp build that includes #23269 (MTP clean-up + chain accept-token propagation fix).

The chain (ngram+mtp) is no longer the laggard it was in PR #5582's bench. n=4 chain jumped from 90.4 -> 113.7 t/s (+25.8%); n=5 chain jumped from 84.5 -> 108.5 t/s (+28.5%). MTP-only is essentially unchanged, so the regression was confined to the chain path.

| config | pre (#22822) | post (#23269) | delta | x vs OFF |
| --- | --- | --- | --- | --- |
| OFF | 78.8 t/s | 78.9 t/s | +0.1% | 1.00x |
| MTP n=2 | 112.1 t/s | 114.9 t/s | +2.4% | 1.46x |
| MTP n=4 | 115.4 t/s | 116.6 t/s | +1.1% | 1.48x |
| MTP n=5 | 109.1 t/s | 109.3 t/s | +0.2% | 1.39x |
| ngram-only | 76.1 t/s | 76.1 t/s | -0.3% | 0.96x |
| ngram+mtp n=2 | 99.2 t/s | 111.0 t/s | +11.9% | 1.41x |
| ngram+mtp n=4 | 90.4 t/s | 113.7 t/s | +25.8% | 1.44x |
| ngram+mtp n=5 | 84.5 t/s | 108.5 t/s | +28.5% | 1.37x |

Implications for Studio's Auto policy on GPU

MTP-only n=2/4 (114.9 / 116.6 t/s) still beats the chain n=2/4 (111.0 / 113.7 t/s) by ~2.5-3.5%. So Auto's current "MTP-only on GPU, chain on CPU" policy stays correct; no policy flip needed. The chain is now a viable user-forced option (via the new MTP+Ngram dropdown choice) that stays within a few percent of MTP-only.

n_max=2 stays the safe universal default. n=4 squeezes ~1-2% more on most prompts but loses 15-20% on essay (low-acceptance, long-form). The conservative n=2 covers the worst-case better.

Raw CSV: temp/bench_post_23269_27b.csv.

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements support for ngram-map-k and ngram-map-k4v speculative decoding modes and adds a spec_draft_p_min parameter for MTP decoding. These changes involve backend capability detection, server flag construction, and frontend UI updates to allow users to configure the new parameter. A review comment suggests refactoring the capability key lookup in llama_cpp.py to use an f-string for improved conciseness.

Comment on lines +3517 to +3521
    supported = map_caps.get(
        "supports_ngram_map_k4v"
        if effective_mode == "ngram-map-k4v"
        else "supports_ngram_map_k"
    )

medium

This logic for determining the capability key can be simplified using an f-string, which would make the code more concise and easier to read.

            supported = map_caps.get(f"supports_{effective_mode.replace('-', '_')}")
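
A quick check of the suggested rewrite, as a sketch: the two helper names below are hypothetical wrappers around the original conditional and the proposed f-string. They agree for both ngram-map variants but diverge for any other mode string, because the conditional falls back to `supports_ngram_map_k` while the f-string derives a key from whatever it is given.

```python
def cap_key_conditional(effective_mode: str) -> str:
    """Key lookup as written in the PR's diff."""
    return (
        "supports_ngram_map_k4v"
        if effective_mode == "ngram-map-k4v"
        else "supports_ngram_map_k"
    )


def cap_key_fstring(effective_mode: str) -> str:
    """Key lookup as proposed in the review suggestion."""
    return f"supports_{effective_mode.replace('-', '_')}"


# Identical for the two modes this branch is meant to handle.
for mode in ("ngram-map-k", "ngram-map-k4v"):
    assert cap_key_conditional(mode) == cap_key_fstring(mode)

# Different for anything else, e.g. "ngram-mod".
assert cap_key_conditional("ngram-mod") == "supports_ngram_map_k"
assert cap_key_fstring("ngram-mod") == "supports_ngram_mod"
```

So the f-string form is equivalent only if the surrounding branch guarantees that `effective_mode` is one of the two ngram-map variants.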

@danielhanchen

Member Author

Bench: --spec-draft-p-min sweep 0.0..1.0 on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)

Re-ran the clean-sweep harness 22 times (one llama-server per cell) on the post-#23269 binary, fixing n_max=4 and stepping --spec-draft-p-min from 0.0 to 1.0 in 0.1 increments. Same methodology as the earlier bench: each prompt once after two unrelated warmups; n_predict=128; temp=0; top_k=1.

Mean throughput per (mode, p_min)

| p_min | MTP-only n=4 | delta-MTP | MTP+Ngram chain n=4 | delta-chain |
| --- | --- | --- | --- | --- |
| 0.0 | 116.6 t/s | +0.0% | 111.9 t/s | +0.0% |
| 0.1 | 116.8 t/s | +0.2% | 113.1 t/s | +1.1% |
| 0.2 | 116.6 t/s | +0.1% | 113.5 t/s | +1.4% |
| 0.3 | 114.0 t/s | -2.2% | 110.1 t/s | -1.6% |
| 0.4 | 107.1 t/s | -8.1% | 107.1 t/s | -4.3% |
| 0.5 | 104.6 t/s | -10.2% | 104.1 t/s | -7.0% |
| 0.6 | 105.1 t/s | -9.8% | 102.6 t/s | -8.3% |
| 0.7 | 99.0 t/s | -15.0% | 99.0 t/s | -11.5% |
| 0.8 | 96.9 t/s | -16.9% | 95.5 t/s | -14.7% |
| 0.9 | 90.5 t/s | -22.4% | 91.3 t/s | -18.4% |
| 1.0 | 60.2 t/s | -48.3% | 60.1 t/s | -46.3% |

Takeaways

  • Above ~0.2, p_min has a clear and nearly monotonic negative effect on throughput (the one exception is a small MTP-only uptick from 0.5 to 0.6). A stricter threshold rejects more drafts, so the speedup amortizes over fewer accepted tokens.
  • Safe band staying within 2% of the p_min=0.0 baseline: MTP-only [0.0, 0.2], chain [0.0, 0.3]. Chain is slightly more robust because even when MTP drafts are rejected the ngram-mod path still contributes.
  • p_min=1.0 collapses to ~60 t/s on both modes, roughly 24% below spec OFF (~78.9 t/s on the same model + binary in the earlier bench). At p_min=1.0 the server still does the draft work but rejects all of it, so it pays the draft cost for zero acceptance. Effectively the worst-of-both setting.
  • Upstream's old default (0.75) lands in the -15% range on Qwen3.6-27B; #23269 dropped the default to 0.0, which lands in the optimal zone for this model.
  • The Studio UI knob is still useful for users who care about quality-vs-speed tradeoffs (raising p_min keeps only high-confidence drafts), but the throughput-optimal value is 0.0. That matches the upstream default post-#23269 and is also Studio's effective default, since the field stays None until the user sets it. No policy change suggested.
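
The "safe band" figures above can be re-derived from the mean-throughput table. The snippet below is a small sketch using only the numbers in that table; the `safe_band` helper and its 2% tolerance parameter are illustrative names, not part of the bench harness.

```python
from typing import List, Tuple

P_MIN = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
# Mean t/s per p_min, copied from the table above (n_max=4).
MTP_ONLY = [116.6, 116.8, 116.6, 114.0, 107.1, 104.6, 105.1, 99.0, 96.9, 90.5, 60.2]
CHAIN = [111.9, 113.1, 113.5, 110.1, 107.1, 104.1, 102.6, 99.0, 95.5, 91.3, 60.1]


def safe_band(tps: List[float], tolerance: float = 0.02) -> Tuple[float, float]:
    """Contiguous p_min range, starting at 0.0, that stays within `tolerance` of the p_min=0.0 baseline."""
    floor = tps[0] * (1.0 - tolerance)
    band = []
    for p, t in zip(P_MIN, tps):
        if t < floor:
            break  # first value that falls out of the band ends it
        band.append(p)
    return band[0], band[-1]


print(safe_band(MTP_ONLY))  # (0.0, 0.2)
print(safe_band(CHAIN))     # (0.0, 0.3)
```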

CSV: temp/pmin_sweep.csv (22 cells x 9 prompts = 198 rows).

@danielhanchen

Member Author

Bench: full cartesian p_min x n_max x {MTP-only, MTP+Ngram chain} across all 9 MTP GGUF models (B200, post-#23269)

594 cells total (66 per model x 9 models), 5346 timing rows. Each cell = mean t/s across 9 prompts (essay/code/story/math/science/json/summary/translate/reasoning), run once after 2 unrelated warmups, fresh llama-server per cell. n_max in {0..6}, p_min in {0.0, 0.2, 0.5, 0.8, 1.0}. (mode=mtp, n_max=0) is the spec-OFF baseline.

Best cell per model

| model | best MTP-only | best chain | OFF baseline | speedup vs OFF |
| --- | --- | --- | --- | --- |
| Qwen3.5-0.8B | n=0 p=0.0 -> 450.7 t/s | n=0 p=1.0 -> 503.7 t/s | 450.7 t/s | 1.12x |
| Qwen3.5-2B | n=0 p=0.0 -> 375.7 t/s | n=0 p=0.5 -> 368.7 t/s | 375.7 t/s | 1.00x |
| Qwen3.5-4B | n=2 p=0.0 -> 264.9 t/s | n=2 p=0.0 -> 259.5 t/s | 240.8 t/s | 1.10x |
| Qwen3.5-9B | n=2 p=0.2 -> 225.5 t/s | n=2 p=0.0 -> 228.1 t/s | 202.4 t/s | 1.13x |
| Qwen3.5-27B | n=2 p=0.0 -> 110.8 t/s | n=2 p=0.0 -> 110.1 t/s | 78.9 t/s | 1.41x |
| Qwen3.5-35B-A3B | n=2 p=0.0 -> 213.6 t/s | n=2 p=0.0 -> 211.8 t/s | 191.3 t/s | 1.12x |
| Qwen3.6-27B | n=4 p=0.2 -> 116.7 t/s | n=4 p=0.0 -> 113.9 t/s | 79.1 t/s | 1.48x |
| Qwen3.6-35B-A3B | n=3 p=0.0 -> 220.6 t/s | n=2 p=0.2 -> 216.6 t/s | 191.8 t/s | 1.15x |
| Qwen3.5-122B-A10B | n=2 p=0.0 -> 145.3 t/s | n=2 p=0.2 -> 143.1 t/s | 115.6 t/s | 1.26x |

Key findings

1. Sub-3B models do not benefit from MTP. Qwen3.5-0.8B and Qwen3.5-2B both peak at n_max=0 (= spec OFF). On 0.8B, chain-mode n=0 with p_min=1.0 nudges to 1.12x but that is ngram-mod-only territory and within noise. Studio's existing Auto policy (ngram-mod fallback for sub-3B MTP) is validated by data.

2. Optimal n_max is 2 for most of the 4B-122B range. The exceptions are Qwen3.6-27B, which prefers n=4, and Qwen3.6-35B-A3B, whose best MTP-only cell is n=3. Qwen3.6-27B has the highest speedup (1.48x at n=4 p_min=0.2) of any model in the matrix. Beyond n=4, additional draft length costs more than it saves (more drafts get rejected per generated token).

3. p_min=0.0 is the universally-best default. Above 0.2, throughput degrades monotonically on every model. The only exception is the smallest models (0.8B, 2B) where the curve flattens or inverts at high p_min on n>0 cells, because rejecting drafts saves wasted work when the small model rarely produces accepted drafts in the first place. Studio's None -> upstream 0.0 default is optimal.

4. p_min=1.0 is the worst-of-both setting on every model. Pays draft cost for zero acceptance. Drops 30-50% on every model (122B chain: 78.6 vs 121.8 at n=6).

5. Chain mode is never the clear winner. Chain matches or trails MTP-only on every model except 0.8B (which doesn't want MTP anyway); its one nominal lead, on Qwen3.5-9B, is about 1%. In the best-cell table the 27B+ tier shows MTP-only roughly 0.6-2.5% ahead of chain, consistent with the Auto policy's GPU branch (MTP-only on GPU, chain on CPU).

Implications for Studio defaults

  • Auto policy holds: sub-3B -> ngram-mod, MTP+ -> draft-mtp on GPU. No change.
  • n_max default 2 stays correct for almost every model; Qwen3.6-27B users who want the extra 3-5% can bump to 4 via the new UI knob. No global default change recommended.
  • p_min default None (-> upstream 0.0) is optimal. The UI knob is genuinely useful for users who want quality-vs-speed tradeoffs but the default should remain None.
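
The Auto policy summarised above can be written down as a small decision function. This is a sketch of the policy as described in this thread, not Studio's resolver: the function name, the parameter-count input, the exact 3B cut-off, and the returned mode strings are assumptions, and it only covers GGUFs that ship an MTP head.

```python
def auto_mode_for_mtp_gguf(params_billion: float, on_gpu: bool) -> str:
    """Hypothetical restatement of the Auto policy for MTP-capable GGUFs."""
    if params_billion < 3.0:
        return "ngram-mod"  # sub-3B: MTP did not beat spec OFF in the bench
    if on_gpu:
        return "draft-mtp"  # MTP-only on GPU
    return "mtp+ngram"      # chain on CPU (label assumed)


assert auto_mode_for_mtp_gguf(0.8, on_gpu=True) == "ngram-mod"
assert auto_mode_for_mtp_gguf(27.0, on_gpu=True) == "draft-mtp"
assert auto_mode_for_mtp_gguf(27.0, on_gpu=False) == "mtp+ngram"
```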

CSV: temp/pmin_cartesian.csv (5346 rows). Full per-(model, mode) heatmaps in temp/pmin_cart_summary.md (271 lines, 18 tables).

danielhanchen and others added 2 commits May 22, 2026 01:10
Builds on the 5-mode Speculative Decoding dropdown (PR #5582) by adding
two upstream-driven knobs that landed in llama.cpp #23269 (MTP
clean-up):

1. spec_draft_p_min: minimum draft probability threshold for MTP
   speculative decoding (--spec-draft-p-min). Drafts below this
   probability are rejected. Was non-functional pre-#23269; now
   defaults to 0.0 upstream. Studio exposes it as a "Draft p-min"
   numeric input below "Draft Tokens", visible only when the dropdown
   is MTP or MTP+Ngram (the only modes where MTP actually engages and
   the knob has effect).

2. ngram-map-k / ngram-map-k4v: new spec types added alongside
   ngram-mod. Each carries its own knob triplet
   (--spec-{variant}-size-n/m/min-hits). They are NOT in the
   dropdown -- power-user-only -- but the load API accepts them and
   the resolver emits the correct flag set when probed support is
   present.

Backend
- _canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v.
- New helper _build_ngram_map_k_flags(caps, variant=...) emits the
  knob triplet only when the binary advertises the knobs as real
  flags (not removal stubs).
- _build_speculative_flags grows two branches and an inline
  _maybe_emit_p_min helper that flows p_min through the MTP path
  only. Auto on an MTP GGUF still gets p_min applied because the
  resolved emission is MTP.
- LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded
  through routes/inference.py at the four wire sites and the
  _request_matches_loaded_settings comparator.
- _already_in_target_state takes spec_draft_p_min so a changed p_min
  bounces a reload even on the Auto-promoted path.
- probe_server_capabilities now reports spec_draft_p_min_flag,
  supports_ngram_map_k, and supports_ngram_map_k4v.

Frontend
- chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter.
- use-chat-model-runtime: hydrate p_min from /api/inference/status
  and the load response. Reset p_min alongside spec mode and n_max
  when the user switches to a different model.
- chat-settings-sheet: new "Draft p-min" number input
  (min 0, max 1, step 0.05), visible when speculativeType is mtp
  or mtp+ngram. Wired into the Reset and dirty-state machinery.

Tests
- 12 new assertions in test_llama_cpp_mtp_detection.py: p_min emission
  matrix (MTP modes only; never for auto/ngram/off; auto-promoted
  draft-mtp still gets p_min; graceful degrade when binary lacks
  --spec-draft-p-min), ngram-map-k / ngram-map-k4v emission with the
  right knob triplet, no-emit-when-unsupported, canonicalize
  recognition. 373 total backend tests pass (was 361 before).
@danielhanchen danielhanchen force-pushed the studio-spec-pmin-and-ngram-map branch from d4e3d05 to d8ed88e Compare May 22, 2026 01:11