studio: spec-draft-p-min knob + ngram-map-k/k4v power-user wire values#5623
studio: spec-draft-p-min knob + ngram-map-k/k4v power-user wire values#5623danielhanchen wants to merge 2 commits into
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d4e3d05d81
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| backend_mode in ("mtp", "mtp+ngram") | ||
| and request.spec_draft_p_min is not None | ||
| and abs( | ||
| float(request.spec_draft_p_min) - (llama_backend.spec_draft_p_min or 0.0) | ||
| ) |
There was a problem hiding this comment.
Compare spec_draft_p_min when auto resolves to MTP
This check gates spec_draft_p_min comparison on backend_mode in ("mtp", "mtp+ngram"), but for auto-promoted MTP loads requested_spec_mode stays "auto". In that case a follow-up /load with a different spec_draft_p_min is treated as "already loaded" and skips reload, so the new threshold is silently ignored. This affects API/power-user flows that send speculative_type="auto" with spec_draft_p_min on MTP GGUFs.
Useful? React with 👍 / 👎.
Bench: post-#23269 chain vs MTP-only on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)Re-ran the clean-sweep harness (each prompt once after two unrelated warmups; n_predict=128; temp=0; top_k=1) against a llama.cpp build that includes #23269 (MTP clean-up + chain accept-token propagation fix). The chain (
Implications for Studio's Auto policy on GPUMTP-only n=2/4 (114.9 / 116.6 t/s) still beats the chain n=2/4 (111.0 / 113.7) by ~2-3%. So Auto's current "MTP-only on GPU, chain on CPU" policy stays correct; no policy flip needed. The chain is now a viable user-forced option (via the new MTP+Ngram dropdown choice) without regressing perf below MTP-only. n_max=2 stays the safe universal default. n=4 squeezes ~1-2% more on most prompts but loses 15-20% on essay (low-acceptance, long-form). The conservative n=2 covers the worst-case better. Raw CSV: |
There was a problem hiding this comment.
Code Review
This pull request implements support for ngram-map-k and ngram-map-k4v speculative decoding modes and adds a spec_draft_p_min parameter for MTP decoding. These changes involve backend capability detection, server flag construction, and frontend UI updates to allow users to configure the new parameter. A review comment suggests refactoring the capability key lookup in llama_cpp.py to use an f-string for improved conciseness.
| supported = map_caps.get( | ||
| "supports_ngram_map_k4v" | ||
| if effective_mode == "ngram-map-k4v" | ||
| else "supports_ngram_map_k" | ||
| ) |
Bench: --spec-draft-p-min sweep 0.0..1.0 on Qwen3.6-27B-MTP UD-Q4_K_XL (B200)Re-ran the clean-sweep harness 22 times (one llama-server per cell) on the post-#23269 binary, fixing n_max=4 and stepping Mean throughput per (mode, p_min)
Takeaways
CSV: |
Bench: full cartesian p_min x n_max x {MTP-only, MTP+Ngram chain} across all 9 MTP GGUF models (B200, post-#23269)594 cells total (66 per model x 9 models), 5346 timing rows. Each cell = mean t/s across 9 prompts (essay/code/story/math/science/json/summary/translate/reasoning), run once after 2 unrelated warmups, fresh llama-server per cell. n_max in {0..6}, p_min in {0.0, 0.2, 0.5, 0.8, 1.0}. (mode=mtp, n_max=0) is the spec-OFF baseline. Best cell per model
Key findings1. Sub-3B models do not benefit from MTP. Qwen3.5-0.8B and Qwen3.5-2B both peak at n_max=0 (= spec OFF). On 0.8B, chain-mode n=0 with p_min=1.0 nudges to 1.12x but that is ngram-mod-only territory and within noise. Studio's existing Auto policy (ngram-mod fallback for sub-3B MTP) is validated by data. 2. Optimal n_max is 2 for 4B-122B, except Qwen3.6-27B which prefers n=4. Qwen3.6-27B has the highest speedup (1.48x at n=4 p_min=0.2) of any model in the matrix. Beyond n=4, additional draft length costs more than it saves (more drafts get rejected per generated token). 3. p_min=0.0 is the universally-best default. Above 0.2, throughput degrades monotonically on every model. The only exception is the smallest models (0.8B, 2B) where the curve flattens or inverts at high p_min on n>0 cells, because rejecting drafts saves wasted work when the small model rarely produces accepted drafts in the first place. Studio's None -> upstream 0.0 default is optimal. 4. p_min=1.0 is the worst-of-both setting on every model. Pays draft cost for zero acceptance. Drops 30-50% on every model (122B chain: 78.6 vs 121.8 at n=6). 5. Chain mode is never the clear winner. Chain matches or trails MTP-only on every model except 0.8B (which doesn't want MTP anyway). The 27B+ tier shows MTP-only ~2-3% ahead of chain, consistent with the Auto policy's GPU branch (MTP-only on GPU, chain on CPU). Implications for Studio defaults
CSV: |
Builds on the 5-mode Speculative Decoding dropdown (PR #5582) by adding two upstream-driven knobs that landed in llama.cpp #23269 (MTP clean-up): 1. spec_draft_p_min: minimum draft probability threshold for MTP speculative decoding (--spec-draft-p-min). Drafts below this probability are rejected. Was non-functional pre-#23269; now defaults to 0.0 upstream. Studio exposes it as a "Draft p-min" numeric input below "Draft Tokens", visible only when the dropdown is MTP or MTP+Ngram (the only modes where MTP actually engages and the knob has effect). 2. ngram-map-k / ngram-map-k4v: new spec types added alongside ngram-mod. Each carries its own knob triplet (--spec-{variant}-size-n/m/min-hits). They are NOT in the dropdown -- power-user-only -- but the load API accepts them and the resolver emits the correct flag set when probed support is present. Backend - _canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v. - New helper _build_ngram_map_k_flags(caps, variant=...) emits the knob triplet only when the binary advertises the knobs as real flags (not removal stubs). - _build_speculative_flags grows two branches and an inline _maybe_emit_p_min helper that flows p_min through the MTP path only. Auto on an MTP GGUF still gets p_min applied because the resolved emission is MTP. - LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded through routes/inference.py at the four wire sites and the _request_matches_loaded_settings comparator. - _already_in_target_state takes spec_draft_p_min so a changed p_min bounces a reload even on the Auto-promoted path. - probe_server_capabilities now reports spec_draft_p_min_flag, supports_ngram_map_k, and supports_ngram_map_k4v. Frontend - chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter. - use-chat-model-runtime: hydrate p_min from /api/inference/status and the load response. Reset p_min alongside spec mode and n_max when the user switches to a different model. - chat-settings-sheet: new "Draft p-min" number input (min 0, max 1, step 0.05), visible when speculativeType is mtp or mtp+ngram. Wired into the Reset and dirty-state machinery. Tests - 12 new assertions in test_llama_cpp_mtp_detection.py: p_min emission matrix (MTP modes only; never for auto/ngram/off; auto-promoted draft-mtp still gets p_min; graceful degrade when binary lacks --spec-draft-p-min), ngram-map-k / ngram-map-k4v emission with the right knob triplet, no-emit-when-unsupported, canonicalize recognition. 373 total backend tests pass (was 361 before).
for more information, see https://pre-commit.ci
d4e3d05 to
d8ed88e
Compare
Summary
Builds on PR #5582's 5-mode Speculative Decoding dropdown by adding two upstream-driven knobs that landed in llama.cpp #23269 (MTP clean-up):
spec_draft_p_min— minimum draft probability for MTP speculative decoding (--spec-draft-p-min). Drafts below this probability are rejected. Non-functional pre-#23269; now defaults to0.0upstream. Surfaced in Chat Settings as a Draft p-min numeric input below Draft Tokens, visible only when the dropdown is MTP or MTP+Ngram.ngram-map-k/ngram-map-k4v— new ngram spec types added alongsidengram-mod. Each carries its own knob triplet (--spec-{variant}-size-n/m/min-hits). NOT in the dropdown (kept lean); accepted via the load API for power users. The resolver emits the correct knob triplet when the binary's probe advertises support.Backend
_canonicalize_spec_moderecognisesngram-map-k/ngram-map-k4v._build_ngram_map_k_flags(caps, variant=...)emits the knob triplet only when the binary advertises the knobs as real flags (not removal stubs)._build_speculative_flagsgrows two new branches (one per variant) and an inline_maybe_emit_p_minhelper that flowsp_minthrough MTP paths only. Auto on an MTP GGUF still getsp_minapplied because the resolved emission isdraft-mtp.LoadRequest.spec_draft_p_min(Optional[float], 0..1). Threaded throughroutes/inference.pyat the four wire sites and_request_matches_loaded_settings._already_in_target_stateacceptsspec_draft_p_minso a changedp_minbounces a reload even on the Auto-promoted path.probe_server_capabilitiesnow reportsspec_draft_p_min_flag,supports_ngram_map_k,supports_ngram_map_k4v.Frontend
chat-runtime-store:specDraftPMin/loadedSpecDraftPMin/ setter.use-chat-model-runtime: hydratesp_minfrom/api/inference/statusand the load response; resetsp_minalongside spec mode +n_maxwhen switching to a different model.chat-settings-sheet: new Draft p-min number input (range 0..1, step 0.05), visible whenspeculativeTypeismtpormtp+ngram. Wired into the Reset and dirty-state machinery.Tests
12 new assertions in
test_llama_cpp_mtp_detection.py:p_minemission matrix: MTP modes only; never for auto/ngram/off; auto-promoted draft-mtp still getsp_min; graceful degrade when binary lacks--spec-draft-p-min.ngram-map-k/ngram-map-k4vemission with the correct knob triplet; no-emit when the binary doesn't advertise support.373 backend tests pass (was 361 before).
Test plan
python -m pytest backend/tests/test_llama_cpp_mtp_detection.py backend/tests/test_llama_server_args.py backend/tests/test_gguf_reload_inheritance.py backend/tests/test_kv_cache_estimation.py -q-> 373 passednpx tsc --noEmit -p tsconfig.app.json-> 0 errors--spec-draft-p-minandngram-map-k4vvia API emits the knob triplet.