studio: add --spec-draft-n-max toggle for MTP speculative decoding by danielhanchen · Pull Request #5582 · unslothai/unsloth

danielhanchen · 2026-05-18T17:46:49Z

Summary

Cumulative MTP-on-Studio work. Stacks on top of PR #5575 (single comma-chained --spec-type on CPU/Mac). Eight threads, all on the same branch since they share fixtures and touch the same load-model code path.

1. User-visible `--spec-draft-n-max` toggle

Added a Draft Tokens numeric input to the chat settings sheet, gated on Speculative Decoding being on. Defaults to 2 on GPU and 3 on CPU/Mac when unset. Plumbed spec_draft_n_max: Optional[int] (range 1-16) through LoadRequest / LoadResponse / InferenceStatusResponse, the backend _already_in_target_state reload-skip check, and the frontend chat runtime store. Added --spec-draft-p-min and --spec-draft-p-split to the spec-strip set so the inheritance logic does not silently drop them across reloads.
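
The default rule in this item can be sketched as follows. This is an illustration only, not the PR's code: `resolve_spec_draft_n_max` and the constants are hypothetical names, and the only facts taken from the description are the 1-16 range and the 2 (GPU) / 3 (CPU/Mac) defaults.

```python
from typing import Optional

# Values taken from the PR description; the names are hypothetical.
SPEC_DRAFT_N_MAX_MIN = 1
SPEC_DRAFT_N_MAX_MAX = 16
GPU_DEFAULT_N_MAX = 2
CPU_DEFAULT_N_MAX = 3


def resolve_spec_draft_n_max(requested: Optional[int], has_gpu: bool) -> int:
    """Return the value that would be emitted as --spec-draft-n-max."""
    if requested is None:
        # Unset: fall back to the platform default.
        return GPU_DEFAULT_N_MAX if has_gpu else CPU_DEFAULT_N_MAX
    value = int(requested)
    if not SPEC_DRAFT_N_MAX_MIN <= value <= SPEC_DRAFT_N_MAX_MAX:
        raise ValueError(
            f"spec_draft_n_max must be in "
            f"[{SPEC_DRAFT_N_MAX_MIN}, {SPEC_DRAFT_N_MAX_MAX}], got {value}"
        )
    return value
```

Keeping `None` distinct from any integer lets the status response report "platform default in effect" separately from an explicit override.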

2. Standalone ngram-mod branch uses current knob names

The non-MTP --spec-type ngram-mod branch was emitting the removed --spec-ngram-size-n / --draft-min / --draft-max flags; replaced with the current --spec-ngram-mod-n-match / -n-min / -n-max. Fixed --spec-ngram-mod-n-max from 6 (smaller than n-min=48) to llama.cpp's documented default of 64.

3. GPU default `--spec-draft-n-max` lowered from 6 to 2

Bench on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200 at temp=0, top_k=1, n_predict=192, 9 prompts:

n_max=6 regressed below n_max=2 on this quant; n_max=2 is universally the
sweet spot for dense Qwen3.5 / 3.6 Q4_K_XL on B200 across the 4-prompt sweep.
CPU/Mac default stays at 3 (smaller batch helps where draft-decode overhead
dominates).

4. Sub-3B MTP falls back to ngram-mod, not off

Sub-3B dense MTP regresses vs spec-off because the draft head's per-token cost exceeds the acceptance savings. Clean-methodology bench (each of 9 distinct prompts run once after two unrelated warmup prompts so the ngram-mod hash pool is realistically populated but never holds the exact deterministic output being measured):

Q4_K_XL on B200:
  0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
  2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
  4B    OFF=240  draft-mtp n=2=260 (1.08x)
  9B    OFF=202  draft-mtp n=2=226 (1.12x)
  27B   OFF= 79  draft-mtp n=2=110 (1.40x)
  27B 3.6  OFF= 79  draft-mtp n=4=115 (1.46x)
  35B-A3B   OFF=192  draft-mtp n=2=212 (1.11x)
  35B-A3B 3.6  OFF=192  draft-mtp n=2=218 (1.13x)
  122B-A10B Q2_K_XL  OFF=116  draft-mtp n=2=143 (1.24x)

Q4_K_XL on x86 48 cores:
  0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
  2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
  4B    OFF= 31  chained n=2= 41 (1.33x)
  9B    OFF= 24  chained n=2= 26 (1.08x)
  27B 3.6  OFF=  9  chained n=4= 12 (1.35x)

Change:

Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
When skipping the MTP head, fall back to --spec-type ngram-mod via the probe-driven _build_ngram_mod_flags helper. Works on both post-rename and pre-rename llama-server builds.
If the binary advertises neither ngram-mod flavor, fall back to spec-off.
Mirror the same fallback in _already_in_target_state.
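
A minimal sketch of the fallback order in the change list above, simplified to the GPU path (the CPU/Mac comma-chain is left out). The function name and signature are hypothetical. Treating an unknown size as "not too small" is an assumption based on the review snippet that uses `_mtp_size_b is not None and _mtp_size_b < ...`.

```python
from typing import List, Optional

MTP_SKIP_THRESHOLD_B = 3.0  # threshold stated in the change list


def choose_auto_spec_flags(
    size_b: Optional[float],
    ngram_mod_flags: List[str],
    draft_n_max: int,
) -> List[str]:
    """Pick llama-server spec flags for an MTP GGUF in auto mode.

    ngram_mod_flags is whatever the probe-driven helper returned;
    an empty list means the binary supports neither ngram-mod flavor.
    """
    too_small = size_b is not None and size_b < MTP_SKIP_THRESHOLD_B
    if not too_small:
        return ["--spec-type", "draft-mtp", "--spec-draft-n-max", str(draft_n_max)]
    if ngram_mod_flags:
        # Sub-3B: skip the MTP head, fall back to ngram-mod.
        return ["--spec-type", "ngram-mod", *ngram_mod_flags]
    # No ngram-mod support at all: speculative decoding stays off.
    return []
```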

5. Legacy ngram-mod flag fallback (probed)

llama.cpp upstream renamed the ngram-mod tuning knobs:

--draft-max         -> --spec-ngram-mod-n-max  (and --spec-draft-n-max)
--draft-min         -> --spec-ngram-mod-n-min  (and --spec-draft-n-min)
--spec-ngram-size-n -> --spec-ngram-mod-n-match

On post-rename builds the new names are real flags, while the legacy names survive only as removal stubs (description: argument has been removed). Pre-rename binaries carry only the legacy names as real flags. Studio was emitting the new names unconditionally, so a user on a pre-rename llama-server (older prebuilt, hand-installed binary) would see unknown argument errors.

Extended probe_server_capabilities to parse the help text into per-flag description blocks and tell real flags apart from stubs. Added ngram_mod_flavor (new / legacy / None), supports_ngram_mod, and spec_draft_n_max_flag. Cached by (path, mtime) like the existing mtp_token probe.
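
The stub-versus-real distinction could be sketched roughly like this. It is not the PR's parser: the function names are hypothetical, the sample help text is made up for illustration rather than copied from llama-server, and it assumes flags and descriptions are separated by two or more spaces.

```python
import re
from typing import Dict, List, Optional

REMOVED_MARKER = "argument has been removed"
_LONG_FLAG = re.compile(r"--[a-z0-9][a-z0-9-]*")


def parse_flag_blocks(help_text: str) -> Dict[str, str]:
    """Map each long flag to the full text of its help block."""
    blocks: Dict[str, str] = {}
    current: List[str] = []
    for line in help_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            # Only look at the part before the description so flags
            # mentioned inside a description are not picked up.
            head = re.split(r"\s{2,}", stripped, maxsplit=1)[0]
            current = _LONG_FLAG.findall(head)
            for flag in current:
                blocks[flag] = stripped
        elif stripped:
            # Continuation line of the current block.
            for flag in current:
                blocks[flag] += " " + stripped
        else:
            current = []
    return blocks


def ngram_mod_flavor(blocks: Dict[str, str]) -> Optional[str]:
    """Return "new", "legacy", or None depending on which knobs are real."""
    def is_real(flag: str) -> bool:
        desc = blocks.get(flag)
        return desc is not None and REMOVED_MARKER not in desc

    if is_real("--spec-ngram-mod-n-match"):
        return "new"
    if is_real("--spec-ngram-size-n"):
        return "legacy"
    return None


# Illustrative help text only.
POST_RENAME_HELP = """
--spec-ngram-mod-n-match N    ngram-mod match length
                              (default: 24)
--spec-ngram-size-n N         argument has been removed
"""
LEGACY_HELP = """
--spec-ngram-size-n N         ngram size
--draft-max N                 number of tokens to draft
"""
```

Checking only for the flag name would misclassify a post-rename build as supporting the legacy knobs, which is why the description text matters here.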

Added _build_ngram_mod_flags(caps, ...) which picks the right flag set or returns [] when neither is usable so callers can drop ngram chaining entirely on minimal binaries.

Wired both call sites (CPU/Mac MTP comma-chain + standalone --spec-type ngram-mod). If neither flavor is detected, the path degrades to MTP-only with a warning rather than failing the load.
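
For illustration, a flag builder in the spirit of `_build_ngram_mod_flags` might look like the sketch below. The real helper takes a capabilities object; this hypothetical version takes the flavor string directly. The 24 / 48 / 64 defaults mirror the values quoted elsewhere in this description, and whether the real helper also emits `--spec-type` is not stated, so this one returns only the tuning knobs.

```python
from typing import List, Optional


def build_ngram_mod_flags(
    flavor: Optional[str],
    n_match: int = 24,
    n_min: int = 48,
    n_max: int = 64,
) -> List[str]:
    """Return ngram-mod tuning flags for the detected flag flavor."""
    if flavor == "new":
        return [
            "--spec-ngram-mod-n-match", str(n_match),
            "--spec-ngram-mod-n-min", str(n_min),
            "--spec-ngram-mod-n-max", str(n_max),
        ]
    if flavor == "legacy":
        return [
            "--spec-ngram-size-n", str(n_match),
            "--draft-min", str(n_min),
            "--draft-max", str(n_max),
        ]
    # Neither flavor usable: the caller drops ngram chaining entirely.
    return []
```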

Verified against three real binaries (Studio bundled 726704a, my build of 45b455e HEAD, MTP merge 2555826), all of which correctly report ngram_mod_flavor=new. Also verified against a freshly-built pre-rename llama-server at 516e8d7a8: the probe correctly reports ngram_mod_flavor=legacy and spec_draft_n_max_flag=--draft-max, and the legacy --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 argv brings the server up cleanly with common_speculative_init: initialized ngram_mod with n=24, size=4194304 (16.000 MB) in the log.

6. Chat-completions usage backfill from llama-server timings

llama-server's final SSE chunk emits both an OpenAI-style usage block and a custom timings block. timings.predicted_n is always populated, but usage.completion_tokens can be zero on some server builds. The Studio chat UI computes generation t/s from meta.usage.completion_tokens / totalStreamTime, so a zero completion_tokens makes the UI fall back to wall-clock time (including SSE / proxy / template overhead) and dilutes MTP gains so on / off look indistinguishable in the UI even though llama-server's predicted_per_second shows a real speedup.

Added _backfill_usage_from_timings: when usage.completion_tokens is missing or zero AND timings has predicted_n / prompt_n, synthesize a complete usage dict. Applied at the streaming metadata yield in generate_chat_completion and at the three accumulator / yield sites in generate_chat_completion_with_tools so per-iteration counts are not lost across tool calls.
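
A minimal sketch of the backfill rule, assuming `usage` and `timings` are plain dicts parsed from the final SSE chunk. The function below is a hypothetical stand-in for `_backfill_usage_from_timings`; reusing an existing `prompt_tokens` when `prompt_n` is absent is an assumption, not something the description states.

```python
from typing import Any, Dict, Optional


def backfill_usage_from_timings(
    usage: Optional[Dict[str, Any]],
    timings: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Synthesize an OpenAI-style usage dict when the server left it empty."""
    if usage and usage.get("completion_tokens"):
        return usage  # real usage is preserved untouched
    if not timings or timings.get("predicted_n") is None:
        return usage  # nothing to backfill from: pass through
    completion_tokens = int(timings["predicted_n"])
    prompt_tokens = int(
        timings.get("prompt_n") or (usage or {}).get("prompt_tokens") or 0
    )
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }
```

With a non-zero `completion_tokens` the UI can divide by stream time again instead of falling back to wall-clock estimates.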

7. Windows CI: skip bash-stub MTP probe tests on Windows

Four MTP capability-probe tests rely on a bash stub llama-server. Marked them with pytest.mark.skipif(sys.platform == "win32") so the Windows leg of staging CI stays green.

8. Bisect verdict on llama.cpp MTP runtime

Between MTP merge (2555826) and master HEAD (45b455e), only 3e12fbd (#23198 "avoid copying logits during prompt decode in MTP") and 49c21f9 (#23256 "initialize pre-norm embedding mask flag") touch MTP/spec runtime. Bench on Qwen3.6-27B-MTP-GGUF Q4_K_XL on B200 at all three commits (with and without --flash-attn): OFF is dead flat (0.2% spread); n_max=2 is dead flat (0.5% spread, MTP merge vs HEAD); n_max=3 shows a small 2.8% regression at HEAD vs MTP merge, introduced at 3e12fbd. Below the 5% confirmed-regression threshold but consistent and real. Not user-facing; no Studio code change needed.

Methodology note

All speedup numbers use a clean methodology that avoids ngram-mod-cache contamination: each prompt is generated exactly once per server lifetime, after a brief unrelated warmup. An earlier "warm steady-state" claim of 6x-on-27B was an artifact: at temp=0/top_k=1 generation is deterministic and the ngram-mod hash pool persists across requests, so a second run of the same prompt hit the pool at a ~100% rate and ran at near-line-rate. Real-world steady state (where each user message is new content) shows modest 1.08x-1.46x speedups for 4B+.

Tests

studio/backend/tests/test_llama_cpp_mtp_detection.py: 67 passed including the new sub-3B-fallback tests (monkeypatch on probe_server_capabilities for deterministic gate behavior), the legacy/new ngram-mod probe tests, and the chat-usage-backfill tests.
studio/backend/tests/test_llama_server_args.py: 72 passed.
studio/backend/tests/test_kv_cache_estimation.py: 159 passed including the 2 new _fit_context_to_vram(mtp_engaged=True) tests.
python -m pytest studio/backend/tests/test_llama_*.py studio/backend/tests/test_kv_cache_estimation.py -> 298 passed.

Test plan

python -m pytest studio/backend/tests/test_llama_*.py studio/backend/tests/test_kv_cache_estimation.py green (298 passed)
Manual verification on B200: Qwen3.6-27B-MTP-GGUF Q4_K_XL with the new defaults gives the documented 1.46x decode speedup (115 vs 79 t/s)
Manual verification: 0.8B MTP load auto-falls-back to --spec-type ngram-mod (no draft-mtp); log line printed; can still force MTP via the UI toggle
Manual verification on pre-rename llama-server (commit 516e8d7a8): probe reports legacy, server accepts the legacy ngram-mod flag set, common_speculative_init confirms ngram-mod engages
Staging fork CI on Linux/macOS/Windows for the plumbing tests
Staging fork prebuilt probe: b9204 advertises draft-mtp + --spec-draft-n-max + --spec-draft-p-min

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 359f852ea8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-18T17:49:01Z

+    if backend_spec == "draft-mtp" and request.spec_draft_n_max is not None:
+        if int(request.spec_draft_n_max) != (llama_backend.spec_draft_n_max or 0):


Reload when clearing the draft-token override

When an MTP model is already loaded with an explicit spec_draft_n_max (for example 3), clearing the UI field or sending spec_draft_n_max: null should reload llama-server with the platform default. This check treats None as matching any active backend value, so _request_matches_loaded_settings can return true and skip the reload; the backend _already_in_target_state mirrors the same logic, leaving the old --spec-draft-n-max in effect instead of clearing the override.
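
To make the review point concrete, here is a small sketch contrasting the check as the review describes it with a stricter comparison. Both function names are hypothetical and the strict variant is only one possible fix; it assumes the backend stores `None` when the platform default is in effect.

```python
from typing import Optional


def lenient_match(requested: Optional[int], loaded: Optional[int]) -> bool:
    """Behaviour described in the review: None matches any loaded value."""
    if requested is not None:
        return int(requested) == (loaded or 0)
    return True


def strict_match(requested: Optional[int], loaded: Optional[int]) -> bool:
    """Treat None as "platform default", equal only to another None."""
    return requested == loaded
```

With the strict form, clearing an override of 3 back to `None` no longer counts as "already in target state", so the reload-skip path is not taken.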

danielhanchen · 2026-05-18T18:14:25Z

Verification run since the original PR description:

Prebuilt parity. Built llama-server from a clean ggml-org/llama.cpp HEAD (commit 45b455e) and compared against the Unsloth b9204 prebuilt that Studio bundles. Identical advertisement for the spec-decoding family:

--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
--spec-draft-n-max N    (default: 16)
--spec-draft-n-min N
--spec-draft-p-min P    (default: 0.75)
--spec-ngram-mod-n-match N    (default: 24)
--spec-ngram-mod-n-min N
--spec-ngram-mod-n-max N

Same flag names, same aliases, same defaults. No prebuilt regression vs upstream HEAD.

Cross-OS coverage. Opened a staging-fork PR (danielhanchen#132) that runs the MTP plumbing suite on ubuntu-latest, macos-14, and windows-latest, plus a Linux-only direct-download probe of the b9204 binary. After marking the three bash-stub-based capability-probe tests skipif on Windows, all 141 tests pass on all three OSes. The prebuilt probe confirms the binary advertises the draft-mtp alias and the four critical knobs (--spec-draft-n-max, --spec-draft-p-min, --spec-ngram-mod-n-match, --spec-ngram-mod-n-max). That Windows-skipif change is the latest commit on this PR.

Auto-context interaction (out of scope for this PR). The reported "auto context length halves throughput on RTX 5090 with MTP" symptom is most consistent with a VRAM-pressure interaction at the 32 GB tier: the draft KV cache plus the target KV cache compete for the budget that _fit_context_to_vram allocates, and llama-server falls back to a slower path when the combined footprint overshoots. On a B200 with 183 GB the cliff does not manifest. This is worth a follow-up that reserves an explicit draft-KV slice in the auto-fit budget when MTP is going to engage; happy to send that as a separate PR.

The previous overlay copied the vram-headroom branch's llama_cpp.py verbatim and lost the spec_draft_n_max plumbing that the new _already_in_target_state regression tests need. Combine both PR diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect TestTransformersIntrospection in the workflow filter; those tests need a live transformers config that is GPU-only on free runners.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7c96eb9f3

chatgpt-codex-connector · 2026-05-19T01:21:04Z

              chat_template_override: effectiveChatTemplateOverride,
              cache_type_kv: kvCacheDtype,
              speculative_type: speculativeType,
+              spec_draft_n_max: specDraftNMax,


Preserve inherited spec extras when draft tokens are unset

When the settings panel reloads any GGUF model, this always sends spec_draft_n_max: null even if the user never touched Draft Tokens. The route then sees "spec_draft_n_max" in fields_set and strips inherited --spec-* flags, so a model initially loaded with pass-through extras such as --spec-type ngram-mod loses those flags after an unrelated UI Apply (context/KV/template). Omit this field when it is not actually being changed, or only send null when clearing a previously loaded override.

chatgpt-codex-connector · 2026-05-19T01:21:04Z

+                                draft_n_max = int(spec_draft_n_max)
+                                self._spec_draft_n_max = draft_n_max
+                            else:
+                                draft_n_max = 2 if gpus else 3


Keep omitted GPU draft limit at previous default

For GPU MTP loads where the caller leaves Draft Tokens on auto, this changes the emitted --spec-draft-n-max from the previous hard-coded 6 to 2. Because the backend still reports spec_draft_n_max = None for this path, users who never touch the new field silently get a different llama-server configuration, and the PR's default/verification text says GPU auto should remain 6. Use the old GPU default here unless this is an intentional behavior change that is surfaced consistently.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27ada74f51

chatgpt-codex-connector · 2026-05-19T03:52:34Z

+                _mtp_size_b = _extract_model_size_b(model_identifier)
+                _mtp_too_small = _mtp_size_b is not None and _mtp_size_b < 2.0


Read the MTP size from the GGUF file path

When a local GGUF is selected from a directory whose name does not include the parameter count (for example /models/current/model.gguf with the actual file Qwen3.5-0.8B-MTP-Q4_K_M.gguf), this new sub-2B gate extracts the size only from model_identifier, while the MTP detector just above also consults model_path. In the local GGUF flow, ModelConfig.identifier remains the requested directory/path and config.gguf_file is passed as model_path, so _mtp_size_b becomes None and the new skip path is bypassed, auto-enabling MTP for the exact tiny models this change is trying to keep off. The reload mirror at _already_in_target_state should use the same source as well.
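
One way to address this, sketched under assumptions: try the identifier first, then the GGUF file path. The regex and both helper names are hypothetical and do not reproduce how `_extract_model_size_b` actually parses names.

```python
import re
from typing import Optional

# A number followed by "B", not glued to other letters, digits, or dots,
# so "3.5" in "Qwen3.5" and "3B" in "A3B" are not mistaken for the size.
_SIZE_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])")


def extract_size_b(name: Optional[str]) -> Optional[float]:
    """Return the parameter count in billions parsed from a name, if any."""
    if not name:
        return None
    match = _SIZE_RE.search(name)
    return float(match.group(1)) if match else None


def mtp_size_b(model_identifier: Optional[str], model_path: Optional[str]) -> Optional[float]:
    """Consult the identifier first, then fall back to the GGUF file path."""
    for candidate in (model_identifier, model_path):
        size = extract_size_b(candidate)
        if size is not None:
            return size
    return None
```

For the example in the review, the identifier `/models/current/model.gguf` yields nothing, and the fallback to the file path recovers 0.8.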

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 159ecff5eb

chatgpt-codex-connector · 2026-05-19T04:09:27Z

+                elif (
+                    is_mtp_model
+                    and _mtp_too_small
+                    and not user_owns_spec_type
+                    and normalized_spec in (None, "", "default")
+                ):


Honor explicit MTP enable requests on tiny models

When a sub-2B MTP model is loaded and the user turns the Studio Speculative Decoding switch on, the frontend sends speculative_type: "default" (chat-settings-sheet.tsx lines 1002-1004), but this branch treats both omitted and explicit "default" as auto mode and leaves MTP disabled. That makes the toggle unable to override the new size gate despite the log message saying the Studio toggle can force it; only raw --spec-type extras work. Distinguish an explicit enable from an omitted/default auto request before applying the sub-2B skip.

Surface llama-server's --spec-draft-n-max as a first-class LoadRequest field so users can tune the MTP draft tree size from the chat settings panel. Default behaviour is unchanged: when the caller omits spec_draft_n_max, the existing platform defaults still apply (6 on GPU, 3 on CPU/Mac).

Why this matters: on context-constrained loads the draft KV cache competes with the target model's KV cache for VRAM. Lowering spec_draft_n_max reduces that pressure, lets a larger user context fit, and recovers throughput; raising it pays off when draft acceptance is high enough to amortise the extra cache.

Backend
- LoadRequest gains an optional spec_draft_n_max: int (1..16).
- LlamaCppBackend.load_model accepts and persists the override on self._spec_draft_n_max, used in place of the hardcoded 6/3 in the MTP emit branch.
- LoadResponse and InferenceStatusResponse echo the active value (None when the platform default is in effect) so the UI can hydrate the input on refresh.
- _already_in_target_state and _request_matches_loaded_settings compare spec_draft_n_max alongside speculative_type so a value change triggers a reload rather than no-op'ing.
- strip_shadowing_flags now strips inherited --spec-* extras when either speculative_type or spec_draft_n_max is in fields_set, so an inherited --spec-draft-n-max cannot last-wins-override a fresh request's first-class field.

Frontend
- LoadModelRequest, LoadModelResponse, InferenceStatusResponse TypeScript shapes get spec_draft_n_max.
- chat-runtime-store gains specDraftNMax / loadedSpecDraftNMax and a setter, hydrated from /v1/status and /v1/load.
- chat-settings-sheet renders a "Draft Tokens" numeric input directly under the Speculative Decoding switch when that switch is on. Toggling the switch off clears the override; the Reset button restores the loaded value.

Tests
- Four new regression tests cover _already_in_target_state with matching / mismatching / non-MTP / unset spec_draft_n_max.
- Existing test_llama_server_args.py and test_llama_cpp_mtp_detection.py green: 141 passed locally.

… set llama.cpp server documents --spec-draft-p-min (default 0.75, min draft acceptance probability) and --spec-draft-p-split (default 0.10). Both are first-class spec-decoding knobs that should travel with the rest of the --spec-* family when an Apply re-sets speculative_type, so an inherited override doesn't leak across a fresh load.

The four probe_server_capabilities tests use a bash stub written to tmp_path/llama-server, which Windows' subprocess can't execute directly (no shebang resolution, .bat / .cmd would be needed). Mark them skipif sys.platform == 'win32' so the rest of the MTP plumbing suite stays green on Windows CI. Unix coverage is unchanged.

Bench on B200 / Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL across five prompt types (essay, code, story, math, science) with greedy temp=0 (t/s):

prompt     OFF    n=1    n=2    n=3    n=6
essay     79.1   93.4   93.8   84.7   64.6
code      79.1  104.4  116.6  113.5  103.0
story     79.1   99.2  105.7  101.8   88.9
math      79.1  100.8  110.8  111.8   98.2
science   79.1  100.1  110.8  110.8  102.9

The previous hardcoded GPU default of 6 was 18% slower than spec-off on the essay prompt (64.6 vs 79.1 t/s) and 7-16% slower than n=2 on the rest. n=2 wins or ties on 4/5 prompts with a 1.18x-1.47x speedup vs OFF; n=3 wins on the math prompt by a hair. n=6 collapses once acceptance rate drops past n=3 -- wasted draft decode dominates the per-step budget. Matches the dataset README ("n_max=2 is the sweet spot for 36 of 42 quants").

Keeps CPU/Mac default at 3, which empirically tracks the narrower ngram+MTP chained budget on those platforms. Users who want the old behaviour can pass spec_draft_n_max in LoadRequest (the toggle this PR also adds) or --spec-draft-n-max via llama_extra_args.
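
The per-prompt comparison in this commit message can be reproduced with a few lines. The numbers are copied from the bench above; the helper names are hypothetical, and ties are resolved toward the smaller draft size.

```python
from typing import Dict, Union

Row = Dict[Union[str, int], float]

# Tokens per second, copied from the commit message.
BENCH_TPS: Dict[str, Row] = {
    "essay":   {"OFF": 79.1, 1: 93.4,  2: 93.8,  3: 84.7,  6: 64.6},
    "code":    {"OFF": 79.1, 1: 104.4, 2: 116.6, 3: 113.5, 6: 103.0},
    "story":   {"OFF": 79.1, 1: 99.2,  2: 105.7, 3: 101.8, 6: 88.9},
    "math":    {"OFF": 79.1, 1: 100.8, 2: 110.8, 3: 111.8, 6: 98.2},
    "science": {"OFF": 79.1, 1: 100.1, 2: 110.8, 3: 110.8, 6: 102.9},
}


def best_n_max(row: Row) -> int:
    """Draft size with the highest throughput (smallest n wins a tie)."""
    candidates = {n: tps for n, tps in row.items() if n != "OFF"}
    return max(candidates, key=candidates.get)


def speedup(row: Row, n: int) -> float:
    """Throughput at draft size n relative to speculative decoding off."""
    return row[n] / row["OFF"]


for prompt, row in BENCH_TPS.items():
    n = best_n_max(row)
    print(f"{prompt:8s} best n={n} ({speedup(row, n):.2f}x), n=6 at {speedup(row, 6):.2f}x")
```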

Two MTP-visibility fixes uncovered while bisecting llama.cpp post-#22673 on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200.

Size gate. Direct llama-server bench (no Studio measurement loop) at n_predict=192 across 9 prompts shows MTP regresses vs spec-off on sub-2B dense models because draft cost exceeds savings:

Qwen3.5-0.8B Q4_K_XL   GPU: 452.0 OFF -> 283.4 t/s n=2 (0.63x)
                       CPU:  84.5 OFF ->  64.9 t/s n=3 (0.77x)
Qwen3.5-4B Q4_K_XL     GPU: 241.0 OFF -> 258.2 t/s n=2 (1.07x)
Qwen3.5-9B Q4_K_XL     GPU: 201.6 OFF -> 228.9 t/s n=2 (1.14x)
Qwen3.5-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2 (1.44x)
Qwen3.6-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2 (1.44x)
Qwen3.6-35B-A3B Q4     GPU: 192.3 OFF -> 223.2 t/s n=2 (1.16x)

The 2B inflection is sharp. Skip auto-promote to draft-mtp when the identifier reports <2.0B params; users can still force via --spec-type or the Speculative Decoding toggle. Mirror the gate in the reload-skip check so a sub-2B reload-with-default does not bounce a spec-off backend.

Chat-completions usage. llama-server's final SSE chunk emits both an OpenAI-style usage block and a custom timings block. timings.predicted_n is always populated, but usage.completion_tokens is zero on some server builds. The Studio chat UI computes generation t/s from meta.usage.completion_tokens / totalStreamTime, so a zero completion_tokens makes the UI fall back to wall-clock time (including SSE / proxy / template overhead) which dilutes MTP gains and makes ON look the same as OFF.

Add _backfill_usage_from_timings: if usage.completion_tokens is missing or zero AND timings has predicted_n/prompt_n, synthesize a complete usage dict. Apply at the streaming metadata yield in generate_chat_completion and at the three accumulator/yield sites in generate_chat_completion_with_tools so per-iteration counts are not silently lost across tool calls.

Tests cover both the gate (sub-2B skips, 2B+ promotes) and the backfill (zero usage filled, real usage preserved, empty timings passthrough).

for more information, see https://pre-commit.ci

llama.cpp upstream renamed the ngram-mod tuning knobs:

--draft-max         -> --spec-ngram-mod-n-max  (and --spec-draft-n-max)
--draft-min         -> --spec-ngram-mod-n-min  (and --spec-draft-n-min)
--spec-ngram-size-n -> --spec-ngram-mod-n-match

On post-rename builds the new names are real flags, while the legacy names survive only as removal stubs (with description "argument has been removed"). Pre-rename builds only carry the legacy names as real flags. Studio was emitting the new names unconditionally, so a user running a pre-rename llama-server (e.g. an older prebuilt or a hand-installed binary) would see "unknown argument" errors when the ngram-mod path engages, or silent drop of the ngram knobs.

Extend `probe_server_capabilities` to parse the help text into per-flag description blocks and tell real flags apart from removal stubs by the "argument has been removed" marker. Add three new probe fields: `ngram_mod_flavor` ("new" / "legacy" / None), `supports_ngram_mod`, and `spec_draft_n_max_flag` (the actual n_max flag the binary accepts). Cached by (path, mtime) the same way as `mtp_token`.

Add `_build_ngram_mod_flags(caps, ...)` that picks the right flag set, returning [] when neither is usable so callers can drop ngram chaining entirely on minimal binaries.

Wire both call sites to use the probe-driven flag set:
- CPU/Mac MTP comma-chain (--spec-type ngram-mod,draft-mtp) emits legacy or new knobs as appropriate. If neither set is available, degrade to MTP-only (warn but still engage spec).
- Standalone --spec-type ngram-mod branch uses the same helper.

Tests cover post-rename detection, legacy detection, removal-stub discrimination, minimal-binary case, and all three branches of `_build_ngram_mod_flags` plus custom n_match/n_min/n_max values.

Verified against three real binaries (Studio bundled 726704a, my build of 45b455e HEAD, and the MTP merge baseline 2555826), all correctly reporting ngram_mod_flavor=new.

Earlier sub-2B gate disabled speculative decoding entirely for tiny dense MTP models because the MTP draft head's per-token cost exceeds the acceptance savings at that scale. The "fully off" fallback was conservative -- ngram-mod has near-zero idle cost on diverse content and consistently outperforms both off and draft-mtp at sub-3B.

Clean-methodology bench (each of 9 distinct prompts run once after two unrelated warmup prompts so the ngram-mod hash pool is realistically populated but never holds the exact deterministic output we're about to measure):

Q4_K_XL on B200:
  0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
  2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
  4B    OFF=240  draft-mtp n=2=260 (1.08x)  -- 4B+ wins with MTP

Q4_K_XL on x86 48 cores:
  0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
  2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
  4B    OFF= 31  chained n=2= 41 (1.33x)

Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via the probe-driven _build_ngram_mod_flags helper. Works on both post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B reload-with-default does not bounce a ngram-mod backend.

Tests updated: monkeypatch probe_server_capabilities so the gate behavior is deterministic regardless of which llama-server happens to be on the host. +1 new test for the "binary has no ngram-mod support" branch; renamed prior 2B/0.8B tests to reflect new semantics.

This generalizes the size gate to be probe-driven instead of a hard "disable spec" branch.

…restore mtp_engaged)

Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream).

Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All MTP detection + fit_context + fit_mtp tests pass (161 passed).
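
The effect of the `budget_frac` switch can be illustrated with a small sketch. It assumes the fraction is applied to free VRAM in bytes, which the commit message does not spell out, and `vram_budget_bytes` is a hypothetical name rather than part of `_fit_context_to_vram`.

```python
def vram_budget_bytes(free_bytes: int, mtp_engaged: bool = False) -> int:
    """Usable VRAM budget for the auto-fit, leaving headroom for the draft path."""
    # Fractions taken from the commit message above.
    budget_frac = 0.85 if mtp_engaged else 0.90
    return int(free_bytes * budget_frac)


# Example: on a 32 GiB card, engaging MTP reserves roughly 1.6 GiB more headroom.
free = 32 * 1024**3
extra_headroom = vram_budget_bytes(free) - vram_budget_bytes(free, mtp_engaged=True)
print(f"extra headroom with MTP: {extra_headroom / 1024**3:.2f} GiB")
```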

…P+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs, ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input (canonical, legacy "default" / "draft-mtp" / "ngram-mod" / "ngram-simple", or comma-chained "ngram-mod,draft-mtp") onto one of auto / mtp / ngram / mtp+ngram / off / ngram-simple / None. Lets external callers and old persisted UI state round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode property storing the canonical UI mode the user requested. Status responses round-trip this instead of the resolved internal flag, so the dropdown restores the picked value after reload / refresh (Auto on a 27B MTP GGUF resolves to draft-mtp internally but the dropdown stays on "Auto").
- The resolver block in load_model is extracted into a unit-testable _build_speculative_flags method. Forced MTP / MTP+Ngram on a sub-3B or non-MTP GGUF logs a warning and engages anyway (user override > the Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings now compare canonical-requested mode, dropping the old auto-promotion mirror. spec_draft_n_max still gates on the resolved spec so Auto + a changed n_max still bounces a reload.

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend canonicaliser so persisted "default" / "draft-mtp" / "ngram-mod" / chain values hydrate to the right dropdown option.
- types/api.ts: docs the canonical wire vocabulary.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full _canonicalize_spec_mode table, a 23-row resolver matrix across (requested mode) x (GPU/CPU) x (model size class), plus n_max override, user-extra-args precedence, requested-mode round-trip, and graceful degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests still green. 218 total in the MTP / server-args / reload-inheritance suite.
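
A rough sketch of what a canonicaliser along these lines might look like. The mapping table follows the vocabulary listed above, but the function is a hypothetical stand-in for `_canonicalize_spec_mode`. Two behaviours are assumptions rather than facts from this commit: `None` or an empty string maps to "auto", and unrecognised input maps to `None`.

```python
from typing import Optional

# Canonical modes plus the legacy spellings named in the commit message.
_ALIASES = {
    "auto": "auto",
    "default": "auto",
    "mtp": "mtp",
    "draft-mtp": "mtp",
    "ngram": "ngram",
    "ngram-mod": "ngram",
    "mtp+ngram": "mtp+ngram",
    "off": "off",
    "ngram-simple": "ngram-simple",
}


def canonicalize_spec_mode(value: Optional[str]) -> Optional[str]:
    """Map a requested speculative mode onto the canonical UI vocabulary."""
    if value is None or not value.strip():
        return "auto"  # assumption: an omitted mode means Auto
    text = value.strip().lower()
    if "," in text:
        # Comma-chained llama-server form, e.g. "ngram-mod,draft-mtp".
        parts = {part.strip() for part in text.split(",") if part.strip()}
        return "mtp+ngram" if parts == {"ngram-mod", "draft-mtp"} else None
    return _ALIASES.get(text)
```

Accepting both the canonical and the legacy spellings is what lets old persisted UI state and external callers keep working after the switch to a five-option select.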

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d994c3fdc

chatgpt-codex-connector · 2026-05-19T12:02:28Z

    if llama_backend.is_vision:
-        req_spec = "off"
+        req_mode = "off"


Compare the recorded vision spec mode

For vision GGUFs loaded with the default Auto setting, llama_cpp.py no longer suppresses speculative decoding for vision models (it records requested_spec_mode = "auto" and explicitly says there is no vision gate), but this route still rewrites every vision request to "off". In that common scenario _request_matches_loaded_settings always returns false (off != auto), so re-applying or reloading the same already-loaded vision model needlessly restarts llama-server instead of taking the fast path.

When the user switches from model A to a different model B, clear the runtime store's speculativeType + specDraftNMax (and their loaded* shadows). The new load request then carries null, the backend canonicalises that to "auto", and its platform-aware resolver runs fresh for the new model. Without this, a non-MTP model loaded with "Off" carried the Off choice into a subsequent MTP load, suppressing MTP auto-promotion (and the sub-3B ngram-mod fallback) until the user manually opened settings and flipped the dropdown back to Auto. The clean-sweep deep probe caught it as anomaly A-1. The reset only fires when currentCheckpoint != modelId, so a same-model reapply or forceReload still honours the user's current spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) -> Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status returns speculative_type=auto, studio.log shows the Auto sub-3B fallback emitted --spec-type ngram-mod.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9dbcb2d0f

chatgpt-codex-connector · 2026-05-19T13:12:59Z

+    if backend_mode in ("mtp", "mtp+ngram") and request.spec_draft_n_max is not None:
+        if int(request.spec_draft_n_max) != (llama_backend.spec_draft_n_max or 0):
+            return False


Reload Auto-promoted MTP when draft tokens change

When the same MTP GGUF is already loaded in auto, changing spec_draft_n_max through the load API is skipped here because backend_mode remains "auto", so the draft-token comparison never runs. _build_speculative_flags() does apply spec_draft_n_max when Auto resolves to MTP, so a request like speculative_type: "auto", spec_draft_n_max: 5 should restart llama-server instead of reusing the existing default --spec-draft-n-max.

After PR #5582 introduced the 5-mode Speculative Decoding dropdown plus _canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:

* forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
* auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
* auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
* ngram / ngram-simple / off -> never engage MTP
* user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the 0.85 budget when the user explicitly picked Ngram (no MTP) or when Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so the auto-fit budget and the actual emission agree on whether MTP is running. All 361 backend tests pass.
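The predicate above can be sketched as a small pure function. This is an illustrative sketch only: the function names, parameters, and the handling of an unknown model size are assumptions made for the example, not the actual `_mtp_will_engage` implementation. Only the five rules and the 0.85 / 0.90 budget fractions come from the commit message.

```python
from typing import Optional

MTP_BUDGET = 0.85       # VRAM fraction when MTP will engage (from the commit message)
DEFAULT_BUDGET = 0.90   # VRAM fraction otherwise (from the commit message)
MTP_MIN_SIZE_B = 3.0    # Auto falls back to ngram-mod below this size


def mtp_will_engage(
    mode: str,
    is_mtp_gguf: bool,
    size_b: Optional[float],
    user_spec_type_in_extra_args: bool = False,
) -> bool:
    """Hypothetical mirror of the five rules listed above."""
    if user_spec_type_in_extra_args:
        return False  # resolver suppressed; no headroom reserved
    if mode in ("mtp", "mtp+ngram"):
        return True   # forced modes always engage MTP
    if mode == "auto":
        # Assumption: an unknown size (None) is treated as "do not engage".
        return is_mtp_gguf and size_b is not None and size_b >= MTP_MIN_SIZE_B
    return False      # ngram / ngram-simple / off never engage MTP


def vram_budget_fraction(engaged: bool) -> float:
    """Pick the auto-fit budget fraction for this load."""
    return MTP_BUDGET if engaged else DEFAULT_BUDGET
```

With this shape, an explicit Ngram choice or an Auto load of a sub-3B MTP GGUF keeps the full 0.90 budget, which is the over-reservation the commit removes.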

* studio: reserve VRAM headroom for the MTP draft cache in auto-fit

When MTP is going to engage on this load, _fit_context_to_vram now budgets 0.85 of available VRAM instead of 0.90, leaving room for llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL at native auto-context running roughly half the speed of the same model with a slightly smaller context. The most parsimonious explanation is a VRAM cliff: at native context the target's KV already eats the 90% budget, then llama-server allocates the draft cache + draft graph on top and spills into a slower partial-offload path. Reducing the budget by 5% on MTP loads avoids the spill without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.) the fit is unchanged because the requested context already fits in the tighter budget too.

MTP detection mirrors the auto-promotion logic in load_model: the GGUF advertises nextn_predict_layers, or the model identifier / local path matches the -MTP marker, and the user has not explicitly opted out via speculative_type="off" or --spec-type extra args.

Tests: two new cases in test_kv_cache_estimation.py verify that mtp_engaged=True yields a context less-than-or-equal-to the non-MTP path on a tight budget, and that kv_on_gpu=False still short-circuits regardless of mtp_engaged.

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

* studio: gate _mtp_will_engage on canonical-mode resolver

After PR #5582 introduced the 5-mode Speculative Decoding dropdown plus _canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:

* forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
* auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
* auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
* ngram / ngram-simple / off -> never engage MTP
* user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the 0.85 budget when the user explicitly picked Ngram (no MTP) or when Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so the auto-fit budget and the actual emission agree on whether MTP is running. All 361 backend tests pass.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

danielhanchen · 2026-05-21T13:03:34Z

Reviewed end-to-end on pr-5582. Looks good — all four claimed fixes confirmed and tests pass — with one subtle latent issue worth a follow-up.

Confirmed fixes

(a) Stale/renamed ngram-mod args. _build_ngram_mod_flags + probe_server_capabilities now return ngram_mod_flavor ("new"/"legacy"/None) and spec_draft_n_max_flag. Pre-rename llama-server gets --spec-ngram-size-n / --draft-min / --draft-max; post-rename gets --spec-ngram-mod-n-*. Help parser correctly distinguishes real flags from "argument has been removed" stubs.
(b) Bad sub-3B MTP auto-selection. _mtp_too_small = size < 3.0 falls back to _emit_ngram_mod(); sub-2B skips MTP entirely (commit 41ac2027).
(c) Reload-state mismatches. _already_in_target_state and _request_matches_loaded_settings now compare on canonical requested_spec_mode + spec_draft_n_max; _requested_spec_mode / _spec_draft_n_max are cleared on unload; the hook resets store to Auto on model switch (hooks/use-chat-model-runtime.ts:563-571).
(d) Zero completion-token usage. _backfill_usage_from_timings (llama_cpp.py:~595) backfills from timings.predicted_n / prompt_n and is wired into the metadata yield (~4197) and both tool-loop accumulators (~4649, ~4664, ~4757, ~4794).
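For fix (d), a minimal sketch of the backfill idea follows. The function name and the exact shape of the synthesized dict are assumptions for illustration (the real `_backfill_usage_from_timings` may differ); the sketch also assumes both `predicted_n` and `prompt_n` must be present before anything is synthesized.

```python
from typing import Any, Dict, Optional


def backfill_usage_from_timings(
    usage: Optional[Dict[str, Any]],
    timings: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Hypothetical sketch: fill a zero/missing usage block from timings."""
    # A non-zero completion_tokens means the server reported real usage: keep it.
    if usage and usage.get("completion_tokens"):
        return usage
    # No timings to draw from: pass the original usage through unchanged.
    if not timings:
        return usage
    predicted = timings.get("predicted_n")
    prompt = timings.get("prompt_n")
    if predicted is None or prompt is None:
        return usage
    # Synthesize an OpenAI-style usage dict from the timings counters.
    return {
        "prompt_tokens": int(prompt),
        "completion_tokens": int(predicted),
        "total_tokens": int(prompt) + int(predicted),
    }
```

The three branches correspond to the cases the tests are described as covering: zero usage filled, real usage preserved, and empty timings passed through.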

Tests: pytest studio/backend/tests/test_llama_cpp_mtp_detection.py studio/backend/tests/test_gguf_reload_inheritance.py -q → 128 passed in 51.02s on Linux. Pinned cross-OS in danielhanchen#137.

Public API surface is additive only: new optional kwarg on LlamaCppBackend.load_model, new keys on probe_server_capabilities (ngram_mod_flavor, supports_ngram_mod, spec_draft_n_max_flag), new optional spec_draft_n_max on LoadRequest/LoadResponse/InferenceStatusResponse. Nothing renamed or removed. Backward compatible.

Latent issue worth a quick follow-up: clearing a previously-set spec_draft_n_max override via Apply may not actually trigger a reload.

routes/inference.py:554-556 (and the mirrored check in llama_cpp.py:~3477-3482) guards the comparison with request.spec_draft_n_max is not None, so if the loaded backend has e.g. n_max=4 and the user clears the field (None), _request_matches_loaded_settings returns True and the route short-circuits to the cached LoadResponse (routes/inference.py:668-697). The cleared override is silently dropped and the server keeps n_max=4 until something else forces a reload.
The Apply button's forceReload: true is frontend-only — there's no force_reload field on the backend LoadRequest.
Suggest tightening the guard to also bounce when backend.spec_draft_n_max is not None and request.spec_draft_n_max is None (clear-on-set asymmetry).
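The suggested tightening can be sketched as a symmetric comparison. This is a sketch under assumptions: `n_max_matches` and its parameters are hypothetical names, and treating a comma-chained spec containing `draft-mtp` as MTP is an assumption, not something the review confirms.

```python
from typing import Optional


def n_max_matches(
    requested: Optional[int],
    loaded: Optional[int],
    resolved_spec: Optional[str],
) -> bool:
    """Hypothetical guard: True means the reload can be skipped for n_max."""
    # Assumption: n_max only matters when the resolved spec includes draft-mtp.
    if "draft-mtp" not in (resolved_spec or "").split(","):
        return True
    # Plain equality treats None as a value, not a wildcard:
    # (None, None) and (N, N) match; (None, N) and (N, None) do not.
    return requested == loaded
```

Comparing with `==` rather than guarding on `requested is not None` is what makes clearing an override (explicit value to None) bounce a reload in the same way as setting one.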

Side note: PR description still says "defaults unchanged (6 GPU / 3 CPU/Mac)" but _resolved_draft_n_max returns 2 if gpus else 3 (intentional per commit 4bed8ae0). Not a regression — just worth updating the description.

danielhanchen · 2026-05-21T13:09:43Z

Cross-OS validation finished in danielhanchen#137.

test_llama_cpp_mtp_detection.py + test_gguf_reload_inheritance.py:

| Runner | Result |
| --- | --- |
| `ubuntu-latest` | 128 passed in 1.14s |
| `macos-14` | 128 passed in 0.26s |
| `windows-latest` | 120 passed, 8 skipped in 0.46s |

The 8 Windows skips are expected (the existing WINDOWS_SKIP markers in those modules). All non-skipped behaviour parity-passes on every platform.

Builds on the 5-mode Speculative Decoding dropdown (PR #5582) by adding two upstream-driven knobs that landed in llama.cpp #23269 (MTP clean-up):

1. spec_draft_p_min: minimum draft probability threshold for MTP speculative decoding (--spec-draft-p-min). Drafts below this probability are rejected. Was non-functional pre-#23269; now defaults to 0.0 upstream. Studio exposes it as a "Draft p-min" numeric input below "Draft Tokens", visible only when the dropdown is MTP or MTP+Ngram (the only modes where MTP actually engages and the knob has effect).
2. ngram-map-k / ngram-map-k4v: new spec types added alongside ngram-mod. Each carries its own knob triplet (--spec-{variant}-size-n/m/min-hits). They are NOT in the dropdown -- power-user-only -- but the load API accepts them and the resolver emits the correct flag set when probed support is present.

Backend
- _canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v.
- New helper _build_ngram_map_k_flags(caps, variant=...) emits the knob triplet only when the binary advertises the knobs as real flags (not removal stubs).
- _build_speculative_flags grows two branches and an inline _maybe_emit_p_min helper that flows p_min through the MTP path only. Auto on an MTP GGUF still gets p_min applied because the resolved emission is MTP.
- LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded through routes/inference.py at the four wire sites and the _request_matches_loaded_settings comparator.
- _already_in_target_state takes spec_draft_p_min so a changed p_min bounces a reload even on the Auto-promoted path.
- probe_server_capabilities now reports spec_draft_p_min_flag, supports_ngram_map_k, and supports_ngram_map_k4v.

Frontend
- chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter.
- use-chat-model-runtime: hydrate p_min from /api/inference/status and the load response. Reset p_min alongside spec mode and n_max when the user switches to a different model.
- chat-settings-sheet: new "Draft p-min" number input (min 0, max 1, step 0.05), visible when speculativeType is mtp or mtp+ngram. Wired into the Reset and dirty-state machinery.

Tests
- 12 new assertions in test_llama_cpp_mtp_detection.py: p_min emission matrix (MTP modes only; never for auto/ngram/off; auto-promoted draft-mtp still gets p_min; graceful degrade when binary lacks --spec-draft-p-min), ngram-map-k / ngram-map-k4v emission with the right knob triplet, no-emit-when-unsupported, canonicalize recognition. 373 total backend tests pass (was 361 before).

Each PR ran the same staged source files before, which went stale when the upstream PR commits advanced. Refactor to one job per PR with an actions/checkout of that PR's head ref, so cross-OS validation always uses the latest commit:

- PR unslothai#5603 sandbox -> studio-sandbox-hardening
- PR unslothai#5620 parser parity -> studio-tools-multi-format-v2
- PR unslothai#5696 mtp reload guards -> followup-mtp-reload-guards (unslothai#5582 followup)
- PR unslothai#5695 lockfile audit -> followup-lockfile-audit-regressions (unslothai#5604 followup)

4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap). cancel-in-progress per (workflow, ref) keeps iteration cheap. All tests stay CPU-only and rely on the CUDA spoof harness in tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU is required on any runner.

…thai#5582

Five issues surfaced after unslothai#5582 merged. All addressed with matching pytest coverage (15 new tests, 147 total green).

Bug A -- route guard compared against the requested UI mode rather than the backend's resolved spec mode. A user request setting ``spec_draft_n_max=2`` against a backend that was auto-promoted from ``auto`` -> ``draft-mtp`` saw ``requested_spec_mode == "auto"`` (not in ``("mtp", "mtp+ngram")``) and skipped the comparison, returning ``already_loaded`` with the stale value still active. Now mirrors the backend-side guard's check against ``speculative_type == "draft-mtp"``.

Bug B -- both reload guards short-circuited the n_max comparison when the request value was ``None``, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to the platform default without swapping the model. Both guards now treat the ``(None vs explicit)`` flip as a difference: clear-to-default and set-from-default both bounce a reload, while ``(None == None)`` and ``(N == N)`` continue to match.

Bug C -- chained MTP+ngram on a legacy llama-server (pre arg-rename) emitted ``--draft-max`` twice: once for MTP's draft length (e.g. 2), once for ngram-mod's size-N max (e.g. 64). llama-server's last-wins parsing clobbered the MTP value with 64, defeating the ``--spec-draft-n-max`` slider. ``_build_ngram_mod_flags`` now takes a ``chain_with_mtp`` kwarg that suppresses ``--draft-max`` on the legacy flavor when MTP will emit it; the post-rename flavor uses distinct ``--spec-ngram-mod-*`` names that cannot collide.

Bug D -- a forced ``speculative_type="ngram"`` request emitted ``--spec-type ngram-mod`` even on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checked ``supports_ngram_mod`` before emitting; the forced path now mirrors that check and loads without spec (with a warning that matches the MTP-token-missing path).

Bug E -- ``speculative_type="none"`` is llama.cpp's own explicit-disable spelling, and external API callers commonly use ``"disable"`` / ``"disabled"``. None of these were in the canonical spec mode set or the legacy alias map, so they fell through to ``"auto"`` and silently re-enabled MTP -- the opposite of the user's intent. Added all three to ``_LEGACY_SPEC_MODE_MAP`` as aliases for ``"off"``.

Tests
-----
- test_canonicalize_spec_mode_none_aliases_map_to_off (6 cases via parametrize): "none"/"None"/"NONE"/" none "/"disable"/"Disabled" all canonicalise to "off".
- test_build_ngram_mod_flags_legacy_chained_omits_draft_max + test_build_ngram_mod_flags_legacy_standalone_keeps_draft_max + test_build_ngram_mod_flags_new_flavor_always_emits_distinct_names: the chain_with_mtp kwarg suppresses only the legacy flavor's --draft-max, never the new-flavor knobs.
- test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max: end-to-end check that the assembled spec block has exactly one --draft-max carrying the MTP draft length.
- test_build_speculative_flags_forced_ngram_without_support_skips_spec + test_build_speculative_flags_forced_ngram_with_support_emits_spec: forced ngram refuses on a binary lacking ngram-mod support; still emits cleanly on a supporting binary.
- test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload: backend-side guard covers both clear-to-default and set-from-default.
- test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*}: route guard now compares against resolved spec mode and handles the None flip symmetrically.
- test_route_guard_ignores_n_max_when_resolved_spec_is_not_mtp: non-MTP resolved spec (e.g. ngram-mod) still ignores n_max.

147/147 spec/reload test suites green.
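Bug C's fix can be illustrated with a small flag builder. This is a sketch only: the real `_build_ngram_mod_flags` takes a probed `caps` object, whereas this hypothetical version takes the flavor string directly, and the `n_match` value used in the usage note is a placeholder rather than a documented default.

```python
from typing import List, Optional


def build_ngram_mod_flags(
    flavor: Optional[str],
    n_match: int,
    n_min: int,
    n_max: int,
    chain_with_mtp: bool = False,
) -> List[str]:
    """Hypothetical sketch of flavor-aware ngram-mod flag emission."""
    if flavor == "new":
        # Post-rename names are distinct from MTP's flags, so nothing collides.
        return [
            "--spec-ngram-mod-n-match", str(n_match),
            "--spec-ngram-mod-n-min", str(n_min),
            "--spec-ngram-mod-n-max", str(n_max),
        ]
    if flavor == "legacy":
        flags = ["--spec-ngram-size-n", str(n_match), "--draft-min", str(n_min)]
        # On legacy builds MTP also uses --draft-max; with last-wins parsing a
        # second copy would clobber the MTP draft length, so skip it when chained.
        if not chain_with_mtp:
            flags += ["--draft-max", str(n_max)]
        return flags
    return []  # binary supports neither flavor: caller drops ngram chaining
```

For example, `build_ngram_mod_flags("legacy", 24, 48, 64, chain_with_mtp=True)` yields no `--draft-max`, leaving the MTP branch as the only emitter of that flag.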

…nslothai#5582)

* studio: add --spec-draft-n-max toggle for MTP speculative decoding

Surface llama-server's --spec-draft-n-max as a first-class LoadRequest field so users can tune the MTP draft tree size from the chat settings panel. Default behaviour is unchanged: when the caller omits spec_draft_n_max, the existing platform defaults still apply (6 on GPU, 3 on CPU/Mac).

Why this matters: on context-constrained loads the draft KV cache competes with the target model's KV cache for VRAM. Lowering spec_draft_n_max reduces that pressure, lets a larger user context fit, and recovers throughput; raising it pays off when draft acceptance is high enough to amortise the extra cache.

Backend
- LoadRequest gains an optional spec_draft_n_max: int (1..16).
- LlamaCppBackend.load_model accepts and persists the override on self._spec_draft_n_max, used in place of the hardcoded 6/3 in the MTP emit branch.
- LoadResponse and InferenceStatusResponse echo the active value (None when the platform default is in effect) so the UI can hydrate the input on refresh.
- _already_in_target_state and _request_matches_loaded_settings compare spec_draft_n_max alongside speculative_type so a value change triggers a reload rather than no-op'ing.
- strip_shadowing_flags now strips inherited --spec-* extras when either speculative_type or spec_draft_n_max is in fields_set, so an inherited --spec-draft-n-max cannot last-wins-override a fresh request's first-class field.

Frontend
- LoadModelRequest, LoadModelResponse, InferenceStatusResponse TypeScript shapes get spec_draft_n_max.
- chat-runtime-store gains specDraftNMax / loadedSpecDraftNMax and a setter, hydrated from /v1/status and /v1/load.
- chat-settings-sheet renders a "Draft Tokens" numeric input directly under the Speculative Decoding switch when that switch is on. Toggling the switch off clears the override; the Reset button restores the loaded value.
Tests
- Four new regression tests cover _already_in_target_state with matching / mismatching / non-MTP / unset spec_draft_n_max.
- Existing test_llama_server_args.py and test_llama_cpp_mtp_detection.py green: 141 passed locally.

* studio: add --spec-draft-p-min and --spec-draft-p-split to spec strip set

llama.cpp server documents --spec-draft-p-min (default 0.75, min draft acceptance probability) and --spec-draft-p-split (default 0.10). Both are first-class spec-decoding knobs that should travel with the rest of the --spec-* family when an Apply re-sets speculative_type, so an inherited override doesn't leak across a fresh load.

* studio/tests: skip MTP capability-probe tests on Windows

The four probe_server_capabilities tests use a bash stub written to tmp_path/llama-server, which Windows' subprocess can't execute directly (no shebang resolution, .bat / .cmd would be needed). Mark them skipif sys.platform == 'win32' so the rest of the MTP plumbing suite stays green on Windows CI. Unix coverage is unchanged.

* studio: lower MTP GPU default --spec-draft-n-max from 6 to 2

Bench on B200 / Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL across five prompt types (essay, code, story, math, science) with greedy temp=0:

| prompt | OFF | n=1 | n=2 | n=3 | n=6 |
| --- | --- | --- | --- | --- | --- |
| essay | 79.1 | 93.4 | 93.8 | 84.7 | 64.6 |
| code | 79.1 | 104.4 | 116.6 | 113.5 | 103.0 |
| story | 79.1 | 99.2 | 105.7 | 101.8 | 88.9 |
| math | 79.1 | 100.8 | 110.8 | 111.8 | 98.2 |
| science | 79.1 | 100.1 | 110.8 | 110.8 | 102.9 |

The previous hardcoded GPU default of 6 was 17% SLOWER than spec-off on the essay prompt (64.6 vs 79.1 t/s) and 11-50% slower than n=2 on the rest. n=2 wins on 4/5 prompts with a 1.18x-1.47x speedup vs OFF; n=3 wins on the math prompt by a hair. n=6 collapses once acceptance rate drops past n=3 -- wasted draft decode dominates the per-step budget. Matches the dataset README ("n_max=2 is the sweet spot for 36 of 42 quants").

Keeps CPU/Mac default at 3, which empirically tracks the narrower ngram+MTP chained budget on those platforms.
Users who want the old behaviour can pass spec_draft_n_max in LoadRequest (the toggle this PR also adds) or --spec-draft-n-max via llama_extra_args.

* studio: skip MTP auto-promote on sub-2B models, backfill chat usage

Two MTP-visibility fixes uncovered while bisecting llama.cpp post-#22673 on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200.

Size gate. Direct llama-server bench (no Studio measurement loop) at n_predict=192 across 9 prompts shows MTP regresses vs spec-off on sub-2B dense models because draft cost exceeds savings:

| Model | Device | OFF (t/s) | MTP (t/s) | n | Ratio |
| --- | --- | --- | --- | --- | --- |
| Qwen3.5-0.8B Q4_K_XL | GPU | 452.0 | 283.4 | 2 | 0.63x |
| Qwen3.5-0.8B Q4_K_XL | CPU | 84.5 | 64.9 | 3 | 0.77x |
| Qwen3.5-4B Q4_K_XL | GPU | 241.0 | 258.2 | 2 | 1.07x |
| Qwen3.5-9B Q4_K_XL | GPU | 201.6 | 228.9 | 2 | 1.14x |
| Qwen3.5-27B Q4_K_XL | GPU | 78.8 | 113.6 | 2 | 1.44x |
| Qwen3.6-27B Q4_K_XL | GPU | 78.8 | 113.6 | 2 | 1.44x |
| Qwen3.6-35B-A3B Q4 | GPU | 192.3 | 223.2 | 2 | 1.16x |

The 2B inflection is sharp. Skip auto-promote to draft-mtp when the identifier reports <2.0B params; users can still force via --spec-type or the Speculative Decoding toggle. Mirror the gate in the reload-skip check so a sub-2B reload-with-default does not bounce a spec-off backend.

Chat-completions usage. llama-server's final SSE chunk emits both an OpenAI-style usage block and a custom timings block. timings.predicted_n is always populated, but usage.completion_tokens is zero on some server builds. The Studio chat UI computes generation t/s from meta.usage.completion_tokens / totalStreamTime, so a zero completion_tokens makes the UI fall back to wall-clock time (including SSE / proxy / template overhead) which dilutes MTP gains and makes ON look the same as OFF.

Add _backfill_usage_from_timings: if usage.completion_tokens is missing or zero AND timings has predicted_n/prompt_n, synthesize a complete usage dict.
Apply at the streaming metadata yield in generate_chat_completion and at the three accumulator/yield sites in generate_chat_completion_with_tools so per-iteration counts are not silently lost across tool calls.

Tests cover both the gate (sub-2B skips, 2B+ promotes) and the backfill (zero usage filled, real usage preserved, empty timings passthrough).

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

* studio: probe + emit legacy ngram-mod flags for pre-rename llama-server

llama.cpp upstream renamed the ngram-mod tuning knobs:

- --draft-max -> --spec-ngram-mod-n-max (and --spec-draft-n-max)
- --draft-min -> --spec-ngram-mod-n-min (and --spec-draft-n-min)
- --spec-ngram-size-n -> --spec-ngram-mod-n-match

The new names are real flags on post-rename builds, while the legacy names survive on those same builds only as removal stubs (with description "argument has been removed"). Pre-rename builds only carry the legacy names as real flags. Studio was emitting the new names unconditionally, so a user running a pre-rename llama-server (e.g. an older prebuilt or a hand-installed binary) would see "unknown argument" errors when the ngram-mod path engages, or silent drop of the ngram knobs.

Extend `probe_server_capabilities` to parse the help text into per-flag description blocks and tell real flags apart from removal stubs by the "argument has been removed" marker. Add three new probe fields: `ngram_mod_flavor` ("new" / "legacy" / None), `supports_ngram_mod`, and `spec_draft_n_max_flag` (the actual n_max flag the binary accepts). Cached by (path, mtime) the same way as `mtp_token`.

Add `_build_ngram_mod_flags(caps, ...)` that picks the right flag set, returning [] when neither is usable so callers can drop ngram chaining entirely on minimal binaries.

Wire both call sites to use the probe-driven flag set:
- CPU/Mac MTP comma-chain (--spec-type ngram-mod,draft-mtp) emits legacy or new knobs as appropriate. If neither set is available, degrade to MTP-only (warn but still engage spec).
- Standalone --spec-type ngram-mod branch uses the same helper.

Tests cover post-rename detection, legacy detection, removal-stub discrimination, minimal-binary case, and all three branches of `_build_ngram_mod_flags` plus custom n_match/n_min/n_max values. Verified against three real binaries (Studio bundled 726704a, my build of 45b455e HEAD, and the MTP merge baseline 2555826) all correctly reporting ngram_mod_flavor=new.

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

* studio: sub-3B MTP falls back to ngram-mod, not off

Earlier sub-2B gate disabled speculative decoding entirely for tiny dense MTP models because the MTP draft head's per-token cost exceeds the acceptance savings at that scale. The "fully off" fallback was conservative -- ngram-mod has near-zero idle cost on diverse content and consistently outperforms both off and draft-mtp at sub-3B.

Clean-methodology bench (each of 9 distinct prompts run once after two unrelated warmup prompts so the ngram-mod hash pool is realistically populated but never holds the exact deterministic output we're about to measure):

Q4_K_XL on B200:

| Size | OFF | draft-mtp n=2 | ngram-only |
| --- | --- | --- | --- |
| 0.8B | 451 | 263 (0.58x) | 498 (1.10x) |
| 2B | 377 | 308 (0.82x) | 369 (1.00x) |
| 4B | 240 | 260 (1.08x) | -- 4B+ wins with MTP |

Q4_K_XL on x86 48 cores:

| Size | OFF | chained n=2 | ngram-only |
| --- | --- | --- | --- |
| 0.8B | 80 | 69 (0.86x) | 95 (1.19x) |
| 2B | 62 | 51 (0.83x) | 63 (1.01x) |
| 4B | 31 | 41 (1.33x) | |

Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via the probe-driven _build_ngram_mod_flags helper. Works on both post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B reload-with-default does not bounce a ngram-mod backend.

Tests updated: monkeypatch probe_server_capabilities so the gate behavior is deterministic regardless of which llama-server happens to be on the host. +1 new test for the "binary has no ngram-mod support" branch; renamed prior 2B/0.8B tests to reflect new semantics. This generalizes the size gate to be probe-driven instead of a hard "disable spec" branch.

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

* studio: 5-mode Speculative Decoding dropdown (Auto / MTP / Ngram / MTP+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs, ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input (canonical, legacy "default" / "draft-mtp" / "ngram-mod" / "ngram-simple", or comma-chained "ngram-mod,draft-mtp") onto one of auto / mtp / ngram / mtp+ngram / off / ngram-simple / None. Lets external callers and old persisted UI state round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode property storing the canonical UI mode the user requested. Status responses round-trip this instead of the resolved internal flag, so the dropdown restores the picked value after reload / refresh (Auto on a 27B MTP GGUF resolves to draft-mtp internally but the dropdown stays on "Auto").
- The resolver block in load_model is extracted into a unit-testable _build_speculative_flags method. Forced MTP / MTP+Ngram on a sub-3B or non-MTP GGUF logs a warning and engages anyway (user override > the Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings now compare canonical-requested mode, dropping the old auto-promotion mirror. spec_draft_n_max still gates on the resolved spec so Auto + a changed n_max still bounces a reload.

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend canonicaliser so persisted "default" / "draft-mtp" / "ngram-mod" / chain values hydrate to the right dropdown option.
- types/api.ts: docs the canonical wire vocabulary.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full _canonicalize_spec_mode table, a 23-row resolver matrix across (requested mode) x (GPU/CPU) x (model size class), plus n_max override, user-extra-args precedence, requested-mode round-trip, and graceful degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests still green. 218 total in the MTP / server-args / reload-inheritance suite.

* [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci

* studio: reset Speculative Decoding to Auto on model switch

When the user switches from model A to a different model B, clear the runtime store's speculativeType + specDraftNMax (and their loaded* shadows). The new load request then carries null, the backend canonicalises that to "auto", and its platform-aware resolver runs fresh for the new model.

Without this, a non-MTP model loaded with "Off" carried the Off choice into a subsequent MTP load, suppressing MTP auto-promotion (and the sub-3B ngram-mod fallback) until the user manually opened settings and flipped the dropdown back to Auto. The clean-sweep deep probe caught it as anomaly A-1. The reset only fires when currentCheckpoint != modelId, so a same-model reapply or forceReload still honours the user's current spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) -> Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status returns speculative_type=auto, studio.log shows the Auto sub-3B fallback emitted --spec-type ngram-mod.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
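The canonicalisation described in these commit messages can be sketched as a lookup with a chain check. This is a simplified illustration, not the actual `_canonicalize_spec_mode`: the legacy-to-canonical pairings ("default" to auto, "draft-mtp" to mtp, "ngram-mod" to ngram) are inferred from the text, the sketch returns "auto" for a null input where the real helper is described as able to return None, and the "none" / "disable" / "disabled" aliases come from the follow-up fix quoted earlier on this page.

```python
from typing import Optional

CANONICAL = {"auto", "mtp", "ngram", "mtp+ngram", "off", "ngram-simple"}

# Assumed alias table; the real _LEGACY_SPEC_MODE_MAP may hold more entries.
LEGACY_ALIASES = {
    "default": "auto",
    "draft-mtp": "mtp",
    "ngram-mod": "ngram",
    "none": "off",
    "disable": "off",
    "disabled": "off",
}


def canonicalize_spec_mode(value: Optional[str]) -> str:
    """Hypothetical sketch: map any accepted wire value to a canonical mode."""
    if value is None:
        return "auto"  # a null request lets the platform-aware resolver decide
    key = value.strip().lower()
    if key in CANONICAL:
        return key
    if key in LEGACY_ALIASES:
        return LEGACY_ALIASES[key]
    # Comma-chained legacy form, in either order.
    parts = {p.strip() for p in key.split(",") if p.strip()}
    if parts == {"ngram-mod", "draft-mtp"}:
        return "mtp+ngram"
    return "auto"  # unrecognised values fall through to Auto
```

Normalising case and whitespace before the lookup is what lets spellings such as " none " or "Disabled" resolve to "off" instead of falling through to Auto.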

danielhanchen requested a review from rolandtannous as a code owner May 18, 2026 17:46

chatgpt-codex-connector Bot reviewed May 18, 2026

danielhanchen force-pushed the studio-spec-draft-n-max-toggle branch from 359f852 to 2b89040 Compare May 18, 2026 17:55

danielhanchen mentioned this pull request May 18, 2026

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582 danielhanchen/unsloth-staging-2#132

Closed

danielhanchen mentioned this pull request May 18, 2026

studio: reserve VRAM headroom for the MTP draft cache in auto-fit #5585

Merged

3 tasks

chatgpt-codex-connector Bot reviewed May 19, 2026

danielhanchen mentioned this pull request May 19, 2026

studio: emit one comma-chained --spec-type for CPU/Mac MTP path #5575

Merged

2 tasks

danielhanchen and others added 9 commits May 19, 2026 10:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

54ef1bf

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

fb49e50

for more information, see https://pre-commit.ci

danielhanchen force-pushed the studio-spec-draft-n-max-toggle branch from 87b8615 to d58038c Compare May 19, 2026 10:22

[pre-commit.ci] auto fixes from pre-commit.com hooks

f6a5cbc

for more information, see https://pre-commit.ci

danielhanchen and others added 2 commits May 19, 2026 11:58

[pre-commit.ci] auto fixes from pre-commit.com hooks

9d994c3

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed May 19, 2026

danielhanchen merged commit 27d4ace into main May 19, 2026
20 of 33 checks passed

danielhanchen deleted the studio-spec-draft-n-max-toggle branch May 19, 2026 13:17

This was referenced May 19, 2026

studio: spec-draft-p-min knob + ngram-map-k/k4v power-user wire values #5623

Open

validate-may21-prs: cross-OS CI for May 18-21 PR cohort danielhanchen/unsloth-staging-2#137

Open

danielhanchen mentioned this pull request May 22, 2026

studio: tighten MTP reload guards and asymmetric spec flags for #5582 #5696

Open

1 task

Imagineer99 mentioned this pull request Jun 11, 2026

[Feature] Draft model / speculative decoding #4753

Closed

		if backend_spec == "draft-mtp" and request.spec_draft_n_max is not None:
		if int(request.spec_draft_n_max) != (llama_backend.spec_draft_n_max or 0):
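The two quoted lines appear to belong to the reload-skip check: when the running server uses the `draft-mtp` backend and the request carries an explicit draft depth, a mismatch means the server is not already in the target state. A minimal sketch of that comparison, assuming the branch body returns "not matching" (the body is not shown in the excerpt) and using a hypothetical standalone function in place of the request and backend objects:

```python
from typing import Optional


def draft_n_max_matches(
    backend_spec: Optional[str],
    requested_n_max: Optional[int],
    loaded_n_max: Optional[int],
) -> bool:
    """Return True when the loaded server already satisfies the requested draft depth.

    Hypothetical helper; mirrors the two quoted conditions only.
    """
    if backend_spec == "draft-mtp" and requested_n_max is not None:
        # An unset loaded value is treated as 0, so any explicit request differs from it.
        if int(requested_n_max) != (loaded_n_max or 0):
            return False
    return True
```

A request that leaves `spec_draft_n_max` unset never forces a reload on this axis, which matches the `is not None` guard in the quoted condition.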

		_mtp_size_b = _extract_model_size_b(model_identifier)
		_mtp_too_small = _mtp_size_b is not None and _mtp_size_b < 2.0
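The excerpt does not show how `_extract_model_size_b` works. One plausible approach, sketched here purely as an assumption, is to parse a parameter-count token such as `27B` out of the model identifier with a regular expression. The function names and the regex below are hypothetical, and the cutoff is left as a required parameter because the quoted line compares against 2.0 while the summary text speaks of sub-3B models.

```python
import re
from typing import Optional

# A number followed by "B" that is not glued to a preceding letter, digit, or dot,
# so the "3.6" in "Qwen3.6" is skipped and "27B" is picked up.
_SIZE_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])")


def extract_model_size_b(model_identifier: str) -> Optional[float]:
    """Return the parameter count in billions, or None if the name carries no size."""
    match = _SIZE_RE.search(model_identifier)
    return float(match.group(1)) if match else None


def mtp_too_small(model_identifier: str, cutoff_b: float) -> bool:
    """True only when a size is known and falls below the cutoff."""
    size_b = extract_model_size_b(model_identifier)
    return size_b is not None and size_b < cutoff_b
```

As in the quoted line, an identifier with no recognisable size is never classified as too small, so unknown models keep whatever the resolver would otherwise choose.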

Conversation

danielhanchen commented May 18, 2026 • edited

Summary

1. User-visible --spec-draft-n-max toggle

2. Standalone ngram-mod branch uses current knob names

3. GPU default --spec-draft-n-max lowered from 6 to 2

4. Sub-3B MTP falls back to ngram-mod, not off

5. Legacy ngram-mod flag fallback (probed)

6. Chat-completions usage backfill from llama-server timings

7. Windows CI: skip bash-stub MTP probe tests on Windows

8. Bisect verdict on llama.cpp MTP runtime

Methodology note

Tests

Test plan

gemini-code-assist Bot commented May 18, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 18, 2026

danielhanchen commented May 18, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 19, 2026

chatgpt-codex-connector Bot May 19, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 19, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 19, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 19, 2026

chatgpt-codex-connector Bot left a comment

💡 Codex Review

chatgpt-codex-connector Bot May 19, 2026

danielhanchen commented May 21, 2026

danielhanchen commented May 21, 2026

1 participant
