studio: add --spec-draft-n-max toggle for MTP speculative decoding #5582

Merged: danielhanchen merged 13 commits into main from studio-spec-draft-n-max-toggle on May 19, 2026

@danielhanchen danielhanchen commented May 18, 2026

Member

Summary

Cumulative MTP-on-Studio work. Stacks on top of PR #5575 (single comma-chained --spec-type on CPU/Mac). Eight threads, all on the same branch since they share fixtures and touch the same load-model code path.

1. User-visible --spec-draft-n-max toggle

Added a Draft Tokens numeric input to the chat settings sheet, gated on Speculative Decoding being on. Defaults to 2 on GPU and 3 on CPU/Mac when unset. Plumbed spec_draft_n_max: Optional[int] (range 1-16) through LoadRequest / LoadResponse / InferenceStatusResponse, the backend _already_in_target_state reload-skip check, and the frontend chat runtime store. Added --spec-draft-p-min and --spec-draft-p-split to the spec-strip set so the inheritance logic does not silently drop them across reloads.
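
The defaulting rule above can be sketched roughly as follows. This is a minimal illustration, not Studio's actual code: `resolve_spec_draft_n_max` and its `has_gpu` parameter are hypothetical names, and only the 1-16 range and the 2 / 3 defaults come from the description.

```python
from typing import Optional

GPU_DEFAULT_N_MAX = 2  # default stated in this PR when a GPU is present
CPU_DEFAULT_N_MAX = 3  # default stated in this PR on CPU/Mac


def resolve_spec_draft_n_max(requested: Optional[int], has_gpu: bool) -> int:
    """Pick the value for --spec-draft-n-max (hypothetical helper)."""
    if requested is None:
        # Unset: fall back to the platform default.
        return GPU_DEFAULT_N_MAX if has_gpu else CPU_DEFAULT_N_MAX
    value = int(requested)
    if not 1 <= value <= 16:
        raise ValueError(f"spec_draft_n_max must be in 1..16, got {value}")
    return value
```

Returning the resolved default separately from the stored override would let the status response keep reporting None while the platform default is in effect.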

2. Standalone ngram-mod branch uses current knob names

The non-MTP --spec-type ngram-mod branch was emitting the removed --spec-ngram-size-n / --draft-min / --draft-max flags; replaced with the current --spec-ngram-mod-n-match / -n-min / -n-max. Fixed --spec-ngram-mod-n-max from 6 (smaller than n-min=48) to llama.cpp's documented default of 64.

3. GPU default --spec-draft-n-max lowered from 6 to 2

Bench on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200 at temp=0, top_k=1, n_predict=192, 9 prompts:

n_max=6 regressed below n_max=2 on this quant; n_max=2 is universally the
sweet spot for dense Qwen3.5 / 3.6 Q4_K_XL on B200 across the 4-prompt sweep.
CPU/Mac default stays at 3 (smaller batch helps where draft-decode overhead
dominates).

4. Sub-3B MTP falls back to ngram-mod, not off

Sub-3B dense MTP regresses vs spec-off because the draft head's per-token cost exceeds the acceptance savings. Clean-methodology bench (each of 9 distinct prompts run once after two unrelated warmup prompts so the ngram-mod hash pool is realistically populated but never holds the exact deterministic output being measured):

Q4_K_XL on B200 (t/s; speedup vs OFF in parentheses):
  0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
  2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
  4B    OFF=240  draft-mtp n=2=260 (1.08x)
  9B    OFF=202  draft-mtp n=2=226 (1.12x)
  27B   OFF= 79  draft-mtp n=2=110 (1.40x)
  27B 3.6  OFF= 79  draft-mtp n=4=115 (1.46x)
  35B-A3B   OFF=192  draft-mtp n=2=212 (1.11x)
  35B-A3B 3.6  OFF=192  draft-mtp n=2=218 (1.13x)
  122B-A10B Q2_K_XL  OFF=116  draft-mtp n=2=143 (1.24x)

Q4_K_XL on x86 48 cores (t/s; speedup vs OFF in parentheses):
  0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
  2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
  4B    OFF= 31  chained n=2= 41 (1.33x)
  9B    OFF= 24  chained n=2= 26 (1.08x)
  27B 3.6  OFF=  9  chained n=4= 12 (1.35x)

Change:

  • MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
  • When skipping the MTP head, fall back to --spec-type ngram-mod via the probe-driven _build_ngram_mod_flags helper. Works on both post-rename and pre-rename llama-server builds.
  • If the binary advertises neither ngram-mod flavor, fall back to spec-off.
  • Mirror the same fallback in _already_in_target_state.
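
The Auto-path decision in the list above can be sketched as below. It assumes the ngram-mod flag list has already been built for the probed binary (empty when neither flavor is available); `pick_auto_spec_flags` and `MTP_MIN_SIZE_B` are hypothetical names used only for illustration.

```python
from typing import List, Optional

MTP_MIN_SIZE_B = 3.0  # threshold from this PR; smaller models skip the MTP head


def pick_auto_spec_flags(size_b: Optional[float], ngram_flags: List[str]) -> List[str]:
    """Auto-mode choice for an MTP GGUF (hypothetical helper).

    ngram_flags is whatever the ngram-mod flag builder returned for this
    binary; an empty list means the binary supports neither flavor.
    """
    # An unknown size is not treated as "too small", so MTP still engages.
    too_small = size_b is not None and size_b < MTP_MIN_SIZE_B
    if not too_small:
        return ["--spec-type", "draft-mtp"]
    if ngram_flags:
        return ["--spec-type", "ngram-mod", *ngram_flags]
    return []  # spec-off
```

Keeping the same function on both the load path and the reload-skip check is one way to keep the two from drifting apart.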

5. Legacy ngram-mod flag fallback (probed)

llama.cpp upstream renamed the ngram-mod tuning knobs:

--draft-max         -> --spec-ngram-mod-n-max  (and --spec-draft-n-max)
--draft-min         -> --spec-ngram-mod-n-min  (and --spec-draft-n-min)
--spec-ngram-size-n -> --spec-ngram-mod-n-match

On post-rename builds the new names are real flags, while the legacy names survive only as removal stubs (description: argument has been removed). Pre-rename binaries carry only the legacy names as real flags. Studio was emitting the new names unconditionally, so a user on a pre-rename llama-server (older prebuilt, hand-installed binary) would see unknown argument errors.

Extended probe_server_capabilities to parse the help text into per-flag description blocks and tell real flags apart from stubs. Added ngram_mod_flavor (new / legacy / None), supports_ngram_mod, and spec_draft_n_max_flag. Cached by (path, mtime) like the existing mtp_token probe.
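
A rough sketch of that stub-aware parsing is shown below. It assumes a help layout where each entry starts with a dash, flag names are separated from the description by two or more spaces, and wrapped description lines do not start with a dash. The helper names are hypothetical and the real probe may differ.

```python
import re
from typing import Dict, List, Optional

REMOVED_MARKER = "argument has been removed"
NEW_FLAGS = ("--spec-ngram-mod-n-match", "--spec-ngram-mod-n-min", "--spec-ngram-mod-n-max")
LEGACY_FLAGS = ("--spec-ngram-size-n", "--draft-min", "--draft-max")


def parse_help_blocks(help_text: str) -> Dict[str, str]:
    """Map each flag name to the description block it belongs to."""
    blocks: Dict[str, List[str]] = {}
    current: Optional[List[str]] = None
    for line in help_text.splitlines():
        stripped = line.strip()
        if stripped.startswith("-"):
            # New entry: flag names first, description after a wide gap.
            head, _, desc = stripped.partition("  ")
            current = [desc.strip()]
            for name in re.findall(r"--?[A-Za-z][\w-]*", head):
                blocks[name] = current  # aliases share one block
        elif stripped and current is not None:
            current.append(stripped)  # wrapped description line
    return {name: " ".join(parts).strip() for name, parts in blocks.items()}


def is_real_flag(blocks: Dict[str, str], name: str) -> bool:
    """A flag is real if it is listed and is not a removal stub."""
    desc = blocks.get(name)
    return desc is not None and REMOVED_MARKER not in desc.lower()


def ngram_mod_flavor(blocks: Dict[str, str]) -> Optional[str]:
    if all(is_real_flag(blocks, f) for f in NEW_FLAGS):
        return "new"
    if all(is_real_flag(blocks, f) for f in LEGACY_FLAGS):
        return "legacy"
    return None
```

Checking the new names first means a post-rename binary is never misread as legacy just because the old names still appear in its help output.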

Added _build_ngram_mod_flags(caps, ...) which picks the right flag set or returns [] when neither is usable so callers can drop ngram chaining entirely on minimal binaries.

Wired both call sites (CPU/Mac MTP comma-chain + standalone --spec-type ngram-mod). If neither flavor is detected, the path degrades to MTP-only with a warning rather than failing the load.
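
As an illustration of the flag selection and the MTP-only degrade path, here is a minimal sketch. The function names and signatures are hypothetical; the flag names and the 24 / 48 / 64 values are the ones quoted in this description.

```python
from typing import List, Optional


def build_ngram_mod_flags(flavor: Optional[str], n_match: int = 24,
                          n_min: int = 48, n_max: int = 64) -> List[str]:
    """Return the ngram-mod tuning knobs for the probed flavor, or []."""
    if flavor == "new":
        return ["--spec-ngram-mod-n-match", str(n_match),
                "--spec-ngram-mod-n-min", str(n_min),
                "--spec-ngram-mod-n-max", str(n_max)]
    if flavor == "legacy":
        return ["--spec-ngram-size-n", str(n_match),
                "--draft-min", str(n_min),
                "--draft-max", str(n_max)]
    return []  # neither flavor usable on this binary


def cpu_mtp_spec_args(flavor: Optional[str]) -> List[str]:
    """CPU/Mac chain: ngram-mod,draft-mtp when possible, else MTP only."""
    knobs = build_ngram_mod_flags(flavor)
    if not knobs:
        return ["--spec-type", "draft-mtp"]
    return ["--spec-type", "ngram-mod,draft-mtp", *knobs]
```

Returning an empty list rather than raising lets each caller decide how to degrade.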

Verified against three real binaries (Studio bundled 726704a, my build of 45b455e HEAD, MTP merge 2555826); all correctly report ngram_mod_flavor=new. Also verified against a freshly built pre-rename llama-server at 516e8d7a8: the probe correctly reports ngram_mod_flavor=legacy and spec_draft_n_max_flag=--draft-max, and the legacy argv --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 brings the server up cleanly, with common_speculative_init: initialized ngram_mod with n=24, size=4194304 (16.000 MB) in the log.

6. Chat-completions usage backfill from llama-server timings

llama-server's final SSE chunk emits both an OpenAI-style usage block and a custom timings block. timings.predicted_n is always populated, but usage.completion_tokens can be zero on some server builds. The Studio chat UI computes generation t/s from meta.usage.completion_tokens / totalStreamTime, so a zero completion_tokens makes the UI fall back to wall-clock time (including SSE / proxy / template overhead). That dilutes MTP gains: on and off look indistinguishable in the UI even though llama-server's predicted_per_second shows a real speedup.

Added _backfill_usage_from_timings: when usage.completion_tokens is missing or zero AND timings has predicted_n / prompt_n, synthesize a complete usage dict. Applied at the streaming metadata yield in generate_chat_completion and at the three accumulator / yield sites in generate_chat_completion_with_tools so per-iteration counts are not lost across tool calls.
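
A minimal sketch of the backfill rule, assuming both predicted_n and prompt_n must be present before a usage dict is synthesized. The function below is illustrative only and is not the code added in this PR.

```python
from typing import Any, Dict, Optional


def backfill_usage_from_timings(
    usage: Optional[Dict[str, Any]],
    timings: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Fill in OpenAI-style usage from llama-server timings when needed."""
    if usage and usage.get("completion_tokens"):
        return usage  # real usage is preserved untouched
    if not timings:
        return usage  # nothing to backfill from
    predicted = timings.get("predicted_n")
    prompt = timings.get("prompt_n")
    if predicted is None or prompt is None:
        return usage
    return {
        "prompt_tokens": int(prompt),
        "completion_tokens": int(predicted),
        "total_tokens": int(prompt) + int(predicted),
    }
```

For example, a usage of `{"completion_tokens": 0}` with timings `{"prompt_n": 12, "predicted_n": 192}` would yield 192 completion tokens and 204 total tokens.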

7. Windows CI: skip bash-stub MTP probe tests on Windows

Four MTP capability-probe tests rely on a bash stub llama-server. Marked them pytest.mark.skipif(sys.platform == "win32") so the Windows leg of staging CI stays green.

8. Bisect verdict on llama.cpp MTP runtime

Between MTP merge (2555826) and master HEAD (45b455e), only 3e12fbd (#23198 "avoid copying logits during prompt decode in MTP") and 49c21f9 (#23256 "initialize pre-norm embedding mask flag") touch MTP/spec runtime. Bench on Qwen3.6-27B-MTP-GGUF Q4_K_XL on B200 at all three commits (with and without --flash-attn): OFF is dead flat (0.2% spread); n_max=2 is dead flat (0.5% spread, MTP merge vs HEAD); n_max=3 shows a small 2.8% regression at HEAD vs MTP merge, introduced at 3e12fbd. Below the 5% confirmed-regression threshold but consistent and real. Not user-facing; no Studio code change needed.

Methodology note

All speedup numbers use a clean methodology that avoids ngram-mod cache contamination: each prompt is generated exactly once per server lifetime, after a brief unrelated warmup. An earlier "warm steady-state" claim of 6x on 27B was an artifact: at temp=0/top_k=1 generation is deterministic and the ngram-mod hash pool persists across requests, so a second run of the same prompt hit the pool at a ~100% rate and ran at near line rate. Real-world steady state (where each user message is new content) shows modest 1.08x-1.46x speedups for 4B+.

Tests

  • studio/backend/tests/test_llama_cpp_mtp_detection.py: 67 passed including the new sub-3B-fallback tests (monkeypatch on probe_server_capabilities for deterministic gate behavior), the legacy/new ngram-mod probe tests, and the chat-usage-backfill tests.
  • studio/backend/tests/test_llama_server_args.py: 72 passed.
  • studio/backend/tests/test_kv_cache_estimation.py: 159 passed including the 2 new _fit_context_to_vram(mtp_engaged=True) tests.
  • python -m pytest studio/backend/tests/test_llama_*.py studio/backend/tests/test_kv_cache_estimation.py -> 298 passed.

Test plan

  • python -m pytest studio/backend/tests/test_llama_*.py studio/backend/tests/test_kv_cache_estimation.py green (298 passed)
  • Manual verification on B200: Qwen3.6-27B-MTP-GGUF Q4_K_XL with the new defaults gives the documented 1.46x decode speedup (115 vs 79 t/s)
  • Manual verification: 0.8B MTP load auto-falls-back to --spec-type ngram-mod (no draft-mtp); log line printed; can still force MTP via the UI toggle
  • Manual verification on pre-rename llama-server (commit 516e8d7a8): probe reports legacy, server accepts the legacy ngram-mod flag set, common_speculative_init confirms ngram-mod engages
  • Staging fork CI on Linux/macOS/Windows for the plumbing tests
  • Staging fork prebuilt probe: b9204 advertises draft-mtp + --spec-draft-n-max + --spec-draft-p-min

@gemini-code-assist

Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 359f852ea8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +452 to +453
if backend_spec == "draft-mtp" and request.spec_draft_n_max is not None:
    if int(request.spec_draft_n_max) != (llama_backend.spec_draft_n_max or 0):

P2 Badge Reload when clearing the draft-token override

When an MTP model is already loaded with an explicit spec_draft_n_max (for example 3), clearing the UI field or sending spec_draft_n_max: null should reload llama-server with the platform default. This check treats None as matching any active backend value, so _request_matches_loaded_settings can return true and skip the reload; the backend _already_in_target_state mirrors the same logic, leaving the old --spec-draft-n-max in effect instead of clearing the override.
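
The difference between the flagged check and a stricter one can be sketched as follows. Both functions are hypothetical illustrations; the only assumption taken from the PR is that a loaded value of None means the platform default is in effect.

```python
from typing import Optional


def matches_flagged(requested: Optional[int], loaded: Optional[int]) -> bool:
    """Mirrors the reviewed check: a None request is never compared."""
    if requested is not None:
        return int(requested) == (loaded or 0)
    return True  # None matches any loaded override, so no reload happens


def matches_strict(requested: Optional[int], loaded: Optional[int]) -> bool:
    """None only matches None, so clearing an override forces a reload."""
    return requested == loaded
```

With an override of 3 loaded, `matches_flagged(None, 3)` reports a match while `matches_strict(None, 3)` does not.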

@danielhanchen

Member Author

Verification run since the original PR description:

Prebuilt parity. Built llama-server from a clean ggml-org/llama.cpp HEAD (commit 45b455e) and compared against the Unsloth b9204 prebuilt that Studio bundles. Identical advertisement for the spec-decoding family:

--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache
--spec-draft-n-max N    (default: 16)
--spec-draft-n-min N
--spec-draft-p-min P    (default: 0.75)
--spec-ngram-mod-n-match N    (default: 24)
--spec-ngram-mod-n-min N
--spec-ngram-mod-n-max N

Same flag names, same aliases, same defaults. No prebuilt regression vs upstream HEAD.

Cross-OS coverage. Opened a staging-fork PR (danielhanchen#132) that runs the MTP plumbing suite on ubuntu-latest, macos-14, and windows-latest, plus a Linux-only direct-download probe of the b9204 binary. After marking the four bash-stub-based capability-probe tests skipif on Windows, all 141 tests pass on all three OSes. The prebuilt probe confirms the binary advertises the draft-mtp alias and the four critical knobs (--spec-draft-n-max, --spec-draft-p-min, --spec-ngram-mod-n-match, --spec-ngram-mod-n-max). That Windows-skipif change is the latest commit on this PR.

Auto-context interaction (out of scope for this PR). The reported "auto context length halves throughput on RTX 5090 with MTP" symptom is most consistent with a VRAM-pressure interaction at the 32 GB tier: the draft KV cache plus the target KV cache compete for the budget that _fit_context_to_vram allocates, and llama-server falls back to a slower path when the combined footprint overshoots. On a B200 with 183 GB the cliff does not manifest. This is worth a follow-up that reserves an explicit draft-KV slice in the auto-fit budget when MTP is going to engage; happy to send that as a separate PR.

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 18, 2026
The previous overlay copied the vram-headroom branch's llama_cpp.py
verbatim and lost the spec_draft_n_max plumbing that the new
_already_in_target_state regression tests need. Combine both PR
diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect
TestTransformersIntrospection in the workflow filter; those tests
need a live transformers config that is GPU-only on free runners.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d7c96eb9f3

chat_template_override: effectiveChatTemplateOverride,
cache_type_kv: kvCacheDtype,
speculative_type: speculativeType,
spec_draft_n_max: specDraftNMax,

P2 Badge Preserve inherited spec extras when draft tokens are unset

When the settings panel reloads any GGUF model, this always sends spec_draft_n_max: null even if the user never touched Draft Tokens. The route then sees "spec_draft_n_max" in fields_set and strips inherited --spec-* flags, so a model initially loaded with pass-through extras such as --spec-type ngram-mod loses those flags after an unrelated UI Apply (context/KV/template). Omit this field when it is not actually being changed, or only send null when clearing a previously loaded override.

    draft_n_max = int(spec_draft_n_max)
    self._spec_draft_n_max = draft_n_max
else:
    draft_n_max = 2 if gpus else 3

P2 Badge Keep omitted GPU draft limit at previous default

For GPU MTP loads where the caller leaves Draft Tokens on auto, this changes the emitted --spec-draft-n-max from the previous hard-coded 6 to 2. Because the backend still reports spec_draft_n_max = None for this path, users who never touch the new field silently get a different llama-server configuration, and the PR's default/verification text says GPU auto should remain 6. Use the old GPU default here unless this is an intentional behavior change that is surfaced consistently.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 27ada74f51

Comment on lines +2699 to +2700
_mtp_size_b = _extract_model_size_b(model_identifier)
_mtp_too_small = _mtp_size_b is not None and _mtp_size_b < 2.0

P2 Badge Read the MTP size from the GGUF file path

When a local GGUF is selected from a directory whose name does not include the parameter count (for example /models/current/model.gguf with the actual file Qwen3.5-0.8B-MTP-Q4_K_M.gguf), this new sub-2B gate extracts the size only from model_identifier, while the MTP detector just above also consults model_path. In the local GGUF flow, ModelConfig.identifier remains the requested directory/path and config.gguf_file is passed as model_path, so _mtp_size_b becomes None and the new skip path is bypassed, auto-enabling MTP for the exact tiny models this change is trying to keep off. The reload mirror at _already_in_target_state should use the same source as well.
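
One way to consult both sources is sketched below. The regex and both helper names are assumptions for illustration, not the repository's `_extract_model_size_b`; the pattern looks for a number followed by B that is not part of a longer token, so A3B in a name like 35B-A3B is ignored.

```python
import re
from typing import Optional

# A number followed by "B", not glued to a preceding letter, digit, or dot.
_SIZE_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])")


def extract_model_size_b(text: Optional[str]) -> Optional[float]:
    """Parse a parameter count in billions from a name, if present."""
    if not text:
        return None
    match = _SIZE_RE.search(text)
    return float(match.group(1)) if match else None


def mtp_size_b(model_identifier: Optional[str], model_path: Optional[str]) -> Optional[float]:
    """Try the identifier first, then fall back to the GGUF file path."""
    for candidate in (model_identifier, model_path):
        size = extract_model_size_b(candidate)
        if size is not None:
            return size
    return None
```

With the example from this comment, the identifier /models/current/model.gguf yields nothing, and the file name Qwen3.5-0.8B-MTP-Q4_K_M.gguf then supplies 0.8.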

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 159ecff5eb

Comment on lines +2846 to +2851
elif (
    is_mtp_model
    and _mtp_too_small
    and not user_owns_spec_type
    and normalized_spec in (None, "", "default")
):

P2 Badge Honor explicit MTP enable requests on tiny models

When a sub-2B MTP model is loaded and the user turns the Studio Speculative Decoding switch on, the frontend sends speculative_type: "default" (chat-settings-sheet.tsx lines 1002-1004), but this branch treats both omitted and explicit "default" as auto mode and leaves MTP disabled. That makes the toggle unable to override the new size gate despite the log message saying the Studio toggle can force it; only raw --spec-type extras work. Distinguish an explicit enable from an omitted/default auto request before applying the sub-2B skip.

danielhanchen and others added 9 commits May 19, 2026 10:19
Surface llama-server's --spec-draft-n-max as a first-class
LoadRequest field so users can tune the MTP draft tree size from
the chat settings panel. Default behaviour is unchanged: when the
caller omits spec_draft_n_max, the existing platform defaults still
apply (6 on GPU, 3 on CPU/Mac).

Why this matters: on context-constrained loads the draft KV cache
competes with the target model's KV cache for VRAM. Lowering
spec_draft_n_max reduces that pressure, lets a larger user context
fit, and recovers throughput; raising it pays off when draft
acceptance is high enough to amortise the extra cache.

Backend
- LoadRequest gains an optional spec_draft_n_max: int (1..16).
- LlamaCppBackend.load_model accepts and persists the override on
  self._spec_draft_n_max, used in place of the hardcoded 6/3 in the
  MTP emit branch.
- LoadResponse and InferenceStatusResponse echo the active value
  (None when the platform default is in effect) so the UI can
  hydrate the input on refresh.
- _already_in_target_state and _request_matches_loaded_settings
  compare spec_draft_n_max alongside speculative_type so a value
  change triggers a reload rather than no-op'ing.
- strip_shadowing_flags now strips inherited --spec-* extras when
  either speculative_type or spec_draft_n_max is in fields_set, so
  an inherited --spec-draft-n-max cannot last-wins-override a fresh
  request's first-class field.
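
The stripping rule in the last Backend bullet might look roughly like this. It is a simplified sketch: it matches on the --spec- prefix and treats a following token that does not start with a dash as the flag's value, whereas the real strip_shadowing_flags may use an explicit flag set. The function name is hypothetical.

```python
from typing import Iterable, List, Set

SPEC_FIELDS = {"speculative_type", "spec_draft_n_max"}


def strip_inherited_spec_flags(extra_args: Iterable[str], fields_set: Set[str]) -> List[str]:
    """Drop inherited --spec-* extras when a spec field was set explicitly."""
    args = list(extra_args)
    if not (SPEC_FIELDS & set(fields_set)):
        return args  # request did not touch spec settings; keep extras
    kept: List[str] = []
    i = 0
    while i < len(args):
        arg = args[i]
        if arg.startswith("--spec-"):
            # "--flag=value" carries its value inline; otherwise also skip
            # a following token that does not look like another flag.
            if "=" not in arg and i + 1 < len(args) and not args[i + 1].startswith("-"):
                i += 1
            i += 1
            continue
        kept.append(arg)
        i += 1
    return kept
```

Because llama-server takes the last occurrence of a repeated flag, removing the inherited copy is what lets the first-class field win.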

Frontend
- LoadModelRequest, LoadModelResponse, InferenceStatusResponse
  TypeScript shapes get spec_draft_n_max.
- chat-runtime-store gains specDraftNMax / loadedSpecDraftNMax and
  a setter, hydrated from /v1/status and /v1/load.
- chat-settings-sheet renders a "Draft Tokens" numeric input
  directly under the Speculative Decoding switch when that switch
  is on. Toggling the switch off clears the override; the Reset
  button restores the loaded value.

Tests
- Four new regression tests cover _already_in_target_state with
  matching / mismatching / non-MTP / unset spec_draft_n_max.
- Existing test_llama_server_args.py and test_llama_cpp_mtp_detection.py
  green: 141 passed locally.
… set

llama.cpp server documents --spec-draft-p-min (default 0.75, min draft
acceptance probability) and --spec-draft-p-split (default 0.10). Both
are first-class spec-decoding knobs that should travel with the rest
of the --spec-* family when an Apply re-sets speculative_type, so an
inherited override doesn't leak across a fresh load.
The four probe_server_capabilities tests use a bash stub written to
tmp_path/llama-server, which Windows' subprocess can't execute
directly (no shebang resolution, .bat / .cmd would be needed). Mark
them skipif sys.platform == 'win32' so the rest of the MTP plumbing
suite stays green on Windows CI. Unix coverage is unchanged.
Bench on B200 / Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL across five prompt
types (essay, code, story, math, science) with greedy temp=0:

  prompt    OFF    n=1    n=2    n=3    n=6
  essay    79.1   93.4   93.8   84.7   64.6
  code     79.1  104.4  116.6  113.5  103.0
  story    79.1   99.2  105.7  101.8   88.9
  math     79.1  100.8  110.8  111.8   98.2
  science  79.1  100.1  110.8  110.8  102.9

The previous hardcoded GPU default of 6 was 18% SLOWER than spec-off
on the essay prompt (64.6 vs 79.1 t/s) and 7-16% slower than n=2 on
the rest. n=2 wins on 4/5 prompts with a 1.18x-1.47x speedup vs OFF;
n=3 wins on the math prompt by a hair. n=6 collapses once acceptance
rate drops past n=3 -- wasted draft decode dominates the per-step
budget.

Matches the dataset README ("n_max=2 is the sweet spot for 36 of 42
quants"). Keeps CPU/Mac default at 3, which empirically tracks the
narrower ngram+MTP chained budget on those platforms.

Users who want the old behaviour can pass spec_draft_n_max in
LoadRequest (the toggle this PR also adds) or --spec-draft-n-max via
llama_extra_args.
Two MTP-visibility fixes uncovered while bisecting llama.cpp post-#22673
on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200.

Size gate. Direct llama-server bench (no Studio measurement loop) at
n_predict=192 across 9 prompts shows MTP regresses vs spec-off on
sub-2B dense models because draft cost exceeds savings:

  Qwen3.5-0.8B Q4_K_XL   GPU: 452.0 OFF -> 283.4 t/s n=2  (0.63x)
                         CPU: 84.5  OFF -> 64.9  t/s n=3  (0.77x)
  Qwen3.5-4B  Q4_K_XL    GPU: 241.0 OFF -> 258.2 t/s n=2  (1.07x)
  Qwen3.5-9B  Q4_K_XL    GPU: 201.6 OFF -> 228.9 t/s n=2  (1.14x)
  Qwen3.5-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2  (1.44x)
  Qwen3.6-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2  (1.44x)
  Qwen3.6-35B-A3B Q4     GPU: 192.3 OFF -> 223.2 t/s n=2  (1.16x)

The 2B inflection is sharp. Skip auto-promote to draft-mtp when the
identifier reports <2.0B params; users can still force via --spec-type
or the Speculative Decoding toggle. Mirror the gate in the
reload-skip check so a sub-2B reload-with-default does not bounce a
spec-off backend.

Chat-completions usage. llama-server's final SSE chunk emits both an
OpenAI-style usage block and a custom timings block. timings.predicted_n
is always populated, but usage.completion_tokens is zero on some
server builds. The Studio chat UI computes generation t/s from
meta.usage.completion_tokens / totalStreamTime, so a zero
completion_tokens makes the UI fall back to wall-clock time
(including SSE / proxy / template overhead) which dilutes MTP gains and
makes ON look the same as OFF.

Add _backfill_usage_from_timings: if usage.completion_tokens is missing
or zero AND timings has predicted_n/prompt_n, synthesize a complete
usage dict. Apply at the streaming metadata yield in
generate_chat_completion and at the three accumulator/yield sites in
generate_chat_completion_with_tools so per-iteration counts are not
silently lost across tool calls.

Tests cover both the gate (sub-2B skips, 2B+ promotes) and the
backfill (zero usage filled, real usage preserved, empty timings
passthrough).
llama.cpp upstream renamed the ngram-mod tuning knobs:

  --draft-max         -> --spec-ngram-mod-n-max  (and --spec-draft-n-max)
  --draft-min         -> --spec-ngram-mod-n-min  (and --spec-draft-n-min)
  --spec-ngram-size-n -> --spec-ngram-mod-n-match

On post-rename builds the new names are real flags and the legacy
names are stub removal entries (with description "argument has been
removed"). Pre-rename builds only carry the legacy names as real
flags. Studio was emitting the new names unconditionally, so a user
running a pre-rename llama-server (e.g. an older prebuilt or a
hand-installed binary) would see "unknown argument" errors when the
ngram-mod path engages, or silent drop of the ngram knobs.

Extend `probe_server_capabilities` to parse the help text into
per-flag description blocks and tell real flags apart from removal
stubs by the "argument has been removed" marker. Add three new probe
fields: `ngram_mod_flavor` ("new" / "legacy" / None),
`supports_ngram_mod`, and `spec_draft_n_max_flag` (the actual n_max
flag the binary accepts). Cached by (path, mtime) the same way as
`mtp_token`.

Add `_build_ngram_mod_flags(caps, ...)` that picks the right flag
set, returning [] when neither is usable so callers can drop ngram
chaining entirely on minimal binaries.

Wire both call sites to use the probe-driven flag set:
- CPU/Mac MTP comma-chain (--spec-type ngram-mod,draft-mtp) emits
  legacy or new knobs as appropriate. If neither set is available,
  degrade to MTP-only (warn but still engage spec).
- Standalone --spec-type ngram-mod branch uses the same helper.

Tests cover post-rename detection, legacy detection, removal-stub
discrimination, minimal-binary case, and all three branches of
`_build_ngram_mod_flags` plus custom n_match/n_min/n_max values.

Verified against three real binaries (Studio bundled 726704a, my
build of 45b455e HEAD, and the MTP merge baseline 2555826) all
correctly reporting ngram_mod_flavor=new.
Earlier sub-2B gate disabled speculative decoding entirely for tiny
dense MTP models because the MTP draft head's per-token cost exceeds
the acceptance savings at that scale. The "fully off" fallback was
conservative -- ngram-mod has near-zero idle cost on diverse content
and consistently outperforms both off and draft-mtp at sub-3B.

Clean-methodology bench (each of 9 distinct prompts run once after
two unrelated warmup prompts so the ngram-mod hash pool is
realistically populated but never holds the exact deterministic
output we're about to measure):

  Q4_K_XL on B200:
    0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
    2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
    4B    OFF=240  draft-mtp n=2=260 (1.08x)  -- 4B+ wins with MTP

  Q4_K_XL on x86 48 cores:
    0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
    2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
    4B    OFF= 31  chained n=2= 41 (1.33x)

Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via
  the probe-driven _build_ngram_mod_flags helper. Works on both
  post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to
  spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B
  reload-with-default does not bounce a ngram-mod backend.

Tests updated: monkeypatch probe_server_capabilities so the gate
behavior is deterministic regardless of which llama-server happens
to be on the host. +1 new test for the "binary has no ngram-mod
support" branch; renamed prior 2B/0.8B tests to reflect new semantics.

This generalizes the size gate to be probe-driven instead of a hard
"disable spec" branch.
@danielhanchen danielhanchen force-pushed the studio-spec-draft-n-max-toggle branch from 87b8615 to d58038c Compare May 19, 2026 10:22
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
…restore mtp_engaged)

Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping
the two already-merged commits and renumbering the remaining nine. The
staging fork was sitting on the pre-rebase llama_cpp.py + tests; this
commit replays the rebased file content while preserving PR unslothai#5585's
mtp_engaged auto-fit headroom (staging-only patch absent upstream).

Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did
  not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All MTP detection + fit_context + fit_mtp tests pass (161 passed).
danielhanchen and others added 2 commits May 19, 2026 11:58
…P+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option
Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs,
ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes
force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no
ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the
ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off
state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input
  (canonical, legacy "default" / "draft-mtp" / "ngram-mod" / "ngram-simple",
  or comma-chained "ngram-mod,draft-mtp") onto one of auto / mtp / ngram /
  mtp+ngram / off / ngram-simple / None. Lets external callers and old
  persisted UI state round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode
  property storing the canonical UI mode the user requested. Status
  responses round-trip this instead of the resolved internal flag, so the
  dropdown restores the picked value after reload / refresh (Auto on a 27B
  MTP GGUF resolves to draft-mtp internally but the dropdown stays on
  "Auto").
- The resolver block in load_model is extracted into a unit-testable
  _build_speculative_flags method. Forced MTP / MTP+Ngram on a sub-3B or
  non-MTP GGUF logs a warning and engages anyway (user override > the
  Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings
  now compare canonical-requested mode, dropping the old auto-promotion
  mirror. spec_draft_n_max still gates on the resolved spec so Auto + a
  changed n_max still bounces a reload.
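
A minimal sketch of the canonicalisation table described in the first Backend bullet. The mapping follows the values listed there; returning None for missing or unrecognised input is an assumption, and the function is illustrative rather than the actual `_canonicalize_spec_mode`.

```python
from typing import Optional

_CANONICAL = {"auto", "mtp", "ngram", "mtp+ngram", "off", "ngram-simple"}
_LEGACY = {"default": "auto", "draft-mtp": "mtp", "ngram-mod": "ngram"}


def canonicalize_spec_mode(value: Optional[str]) -> Optional[str]:
    """Map canonical, legacy, or comma-chained input to a canonical mode."""
    if value is None:
        return None
    text = value.strip().lower()
    if text in _CANONICAL:
        return text
    if text in _LEGACY:
        return _LEGACY[text]
    # Comma-chained form, accepted in either order.
    parts = {part.strip() for part in text.split(",")}
    if parts == {"ngram-mod", "draft-mtp"}:
        return "mtp+ngram"
    return None  # assumption: unknown input is treated as unset
```

Storing this canonical value separately from the resolved llama-server flag is what lets the dropdown keep showing Auto after a reload.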

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV
  Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft
  Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend
  canonicaliser so persisted "default" / "draft-mtp" / "ngram-mod" / chain
  values hydrate to the right dropdown option.
- types/api.ts: documents the canonical wire vocabulary.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full
  _canonicalize_spec_mode table, a 23-row resolver matrix across
  (requested mode) x (GPU/CPU) x (model size class), plus n_max override,
  user-extra-args precedence, requested-mode round-trip, and graceful
  degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests still green. 218 total in the MTP /
  server-args / reload-inheritance suite.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9d994c3fdc

Comment on lines 445 to +446
if llama_backend.is_vision:
    req_spec = "off"
    req_mode = "off"

P2: Compare the recorded vision spec mode

For vision GGUFs loaded with the default Auto setting, llama_cpp.py no longer suppresses speculative decoding for vision models (it records requested_spec_mode = "auto" and explicitly says there is no vision gate), but this route still rewrites every vision request to "off". In that common scenario _request_matches_loaded_settings always returns false (off != auto), so re-applying or reloading the same already-loaded vision model needlessly restarts llama-server instead of taking the fast path.

When the user switches from model A to a different model B, clear the
runtime store's speculativeType + specDraftNMax (and their loaded*
shadows). The new load request then carries null, the backend
canonicalises that to "auto", and its platform-aware resolver runs
fresh for the new model.

Without this, a non-MTP model loaded with "Off" carried the Off choice
into a subsequent MTP load, suppressing MTP auto-promotion (and the
sub-3B ngram-mod fallback) until the user manually opened settings and
flipped the dropdown back to Auto. The clean-sweep deep probe caught
it as anomaly A-1.

The reset only fires when currentCheckpoint != modelId, so a
same-model reapply or forceReload still honours the user's current
spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) ->
Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status
returns speculative_type=auto, studio.log shows the Auto sub-3B
fallback emitted --spec-type ngram-mod.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b9dbcb2d0f

Comment on lines +454 to +456
if backend_mode in ("mtp", "mtp+ngram") and request.spec_draft_n_max is not None:
    if int(request.spec_draft_n_max) != (llama_backend.spec_draft_n_max or 0):
        return False

P2: Reload Auto-promoted MTP when draft tokens change

When the same MTP GGUF is already loaded in auto, changing spec_draft_n_max through the load API is skipped here because backend_mode remains "auto", so the draft-token comparison never runs. _build_speculative_flags() does apply spec_draft_n_max when Auto resolves to MTP, so a request like speculative_type: "auto", spec_draft_n_max: 5 should restart llama-server instead of reusing the existing default --spec-draft-n-max.

@danielhanchen danielhanchen merged commit 27d4ace into main May 19, 2026
20 of 33 checks passed
@danielhanchen danielhanchen deleted the studio-spec-draft-n-max-toggle branch May 19, 2026 13:17
danielhanchen added a commit that referenced this pull request May 19, 2026
After PR #5582 introduced the 5-mode Speculative Decoding dropdown plus
_canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:
  * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
  * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
  * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
  * ngram / ngram-simple / off -> never engage MTP
  * user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the
0.85 budget when the user explicitly picked Ngram (no MTP) or when
Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom
cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so
the auto-fit budget and the actual emission agree on whether MTP is
running.
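
The five cases listed above map directly onto a boolean predicate. This is a minimal sketch, assuming hypothetical function and parameter names; only the case table, the 3B threshold and the 0.85 / 0.90 budget fractions are taken from the commit messages in this thread.

```python
def mtp_will_engage(
    mode: str,
    *,
    is_mtp_gguf: bool,
    size_b: float,
    user_spec_type_in_extra_args: bool,
) -> bool:
    """Hypothetical predicate: will this load actually run the MTP draft head?"""
    if user_spec_type_in_extra_args:
        return False  # resolver suppressed, so no extra headroom is reserved
    if mode in ("mtp", "mtp+ngram"):
        return True  # forced modes always engage MTP
    if mode == "auto":
        # Auto promotes only MTP GGUFs at or above the 3B threshold;
        # smaller MTP models fall back to ngram-mod.
        return is_mtp_gguf and size_b >= 3.0
    return False  # ngram / ngram-simple / off


def vram_budget_fraction(mtp_engaged: bool) -> float:
    """Fraction of available VRAM the auto-fit may spend."""
    return 0.85 if mtp_engaged else 0.90
```

Keeping the predicate separate from the budget lookup makes it easy to assert that an explicit Ngram choice or a sub-3B Auto fallback keeps the full 0.90 budget.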

All 361 backend tests pass.
danielhanchen added a commit that referenced this pull request May 19, 2026
)

* studio: reserve VRAM headroom for the MTP draft cache in auto-fit

When MTP is going to engage on this load, _fit_context_to_vram now
budgets 0.85 of available VRAM instead of 0.90, leaving room for
llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF
UD-Q4_K_XL at native auto-context running roughly half the speed of
the same model with a slightly smaller context. The most parsimonious
explanation is a VRAM cliff: at native context the target's KV
already eats the 90% budget, then llama-server allocates the draft
cache + draft graph on top and spills into a slower partial-offload
path. Reducing the budget by 5% on MTP loads avoids the spill without
penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.)
the fit is unchanged because the requested context already fits in
the tighter budget too.

MTP detection mirrors the auto-promotion logic in load_model: the
GGUF advertises nextn_predict_layers, or the model identifier /
local path matches the -MTP marker, and the user has not explicitly
opted out via speculative_type="off" or --spec-type extra args.

Tests: two new cases in test_kv_cache_estimation.py verify that
mtp_engaged=True yields a context less-than-or-equal-to the
non-MTP path on a tight budget, and that kv_on_gpu=False still
short-circuits regardless of mtp_engaged.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: gate _mtp_will_engage on canonical-mode resolver

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@danielhanchen

Member Author

Reviewed end-to-end on pr-5582. Looks good — all four claimed fixes confirmed and tests pass — with one subtle latent issue worth a follow-up.

Confirmed fixes

  • (a) Stale/renamed ngram-mod args. _build_ngram_mod_flags + probe_server_capabilities now return ngram_mod_flavor ("new"/"legacy"/None) and spec_draft_n_max_flag. Pre-rename llama-server gets --spec-ngram-size-n / --draft-min / --draft-max; post-rename gets --spec-ngram-mod-n-*. Help parser correctly distinguishes real flags from "argument has been removed" stubs.
  • (b) Bad sub-3B MTP auto-selection. _mtp_too_small = size < 3.0 falls back to _emit_ngram_mod(); sub-2B skips MTP entirely (commit 41ac2027).
  • (c) Reload-state mismatches. _already_in_target_state and _request_matches_loaded_settings now compare on canonical requested_spec_mode + spec_draft_n_max; _requested_spec_mode / _spec_draft_n_max are cleared on unload; the hook resets store to Auto on model switch (hooks/use-chat-model-runtime.ts:563-571).
  • (d) Zero completion-token usage. _backfill_usage_from_timings (llama_cpp.py:~595) backfills from timings.predicted_n / prompt_n and is wired into the metadata yield (~4197) and both tool-loop accumulators (~4649, ~4664, ~4757, ~4794).

Tests: pytest studio/backend/tests/test_llama_cpp_mtp_detection.py studio/backend/tests/test_gguf_reload_inheritance.py -q gave 128 passed in 51.02s on Linux. Pinned cross-OS in danielhanchen#137.

Public API surface is additive only: new optional kwarg on LlamaCppBackend.load_model, new keys on probe_server_capabilities (ngram_mod_flavor, supports_ngram_mod, spec_draft_n_max_flag), new optional spec_draft_n_max on LoadRequest/LoadResponse/InferenceStatusResponse. Nothing renamed or removed. Backward compatible.

Latent issue worth a quick follow-up: clearing a previously-set spec_draft_n_max override via Apply may not actually trigger a reload.

  • routes/inference.py:554-556 (and the mirrored check in llama_cpp.py:~3477-3482) guards the comparison with request.spec_draft_n_max is not None, so if the loaded backend has e.g. n_max=4 and the user clears the field (None), _request_matches_loaded_settings returns True and the route short-circuits to the cached LoadResponse (routes/inference.py:668-697). The cleared override is silently dropped and the server keeps n_max=4 until something else forces a reload.
  • The Apply button's forceReload: true is frontend-only — there's no force_reload field on the backend LoadRequest.
  • Suggest tightening the guard to also bounce when backend.spec_draft_n_max is not None and request.spec_draft_n_max is None (clear-on-set asymmetry).

Side note: PR description still says "defaults unchanged (6 GPU / 3 CPU/Mac)" but _resolved_draft_n_max returns 2 if gpus else 3 (intentional per commit 4bed8ae0). Not a regression — just worth updating the description.

@danielhanchen

Member Author

Cross-OS validation finished in danielhanchen#137.

test_llama_cpp_mtp_detection.py + test_gguf_reload_inheritance.py:

Runner           Result
ubuntu-latest    128 passed in 1.14s
macos-14         128 passed in 0.26s
windows-latest   120 passed, 8 skipped in 0.46s

The 8 Windows skips are expected (the existing WINDOWS_SKIP markers in those modules). All non-skipped behaviour parity-passes on every platform.

danielhanchen added a commit that referenced this pull request May 22, 2026
Builds on the 5-mode Speculative Decoding dropdown (PR #5582) by adding
two upstream-driven knobs that landed in llama.cpp #23269 (MTP
clean-up):

1. spec_draft_p_min: minimum draft probability threshold for MTP
   speculative decoding (--spec-draft-p-min). Drafts below this
   probability are rejected. Was non-functional pre-#23269; now
   defaults to 0.0 upstream. Studio exposes it as a "Draft p-min"
   numeric input below "Draft Tokens", visible only when the dropdown
   is MTP or MTP+Ngram (the only modes where MTP actually engages and
   the knob has effect).

2. ngram-map-k / ngram-map-k4v: new spec types added alongside
   ngram-mod. Each carries its own knob triplet
   (--spec-{variant}-size-n/m/min-hits). They are NOT in the
   dropdown -- power-user-only -- but the load API accepts them and
   the resolver emits the correct flag set when probed support is
   present.

Backend
- _canonicalize_spec_mode recognises ngram-map-k / ngram-map-k4v.
- New helper _build_ngram_map_k_flags(caps, variant=...) emits the
  knob triplet only when the binary advertises the knobs as real
  flags (not removal stubs).
- _build_speculative_flags grows two branches and an inline
  _maybe_emit_p_min helper that flows p_min through the MTP path
  only. Auto on an MTP GGUF still gets p_min applied because the
  resolved emission is MTP.
- LoadRequest.spec_draft_p_min (Optional[float], 0..1). Threaded
  through routes/inference.py at the four wire sites and the
  _request_matches_loaded_settings comparator.
- _already_in_target_state takes spec_draft_p_min so a changed p_min
  bounces a reload even on the Auto-promoted path.
- probe_server_capabilities now reports spec_draft_p_min_flag,
  supports_ngram_map_k, and supports_ngram_map_k4v.

Frontend
- chat-runtime-store: specDraftPMin / loadedSpecDraftPMin / setter.
- use-chat-model-runtime: hydrate p_min from /api/inference/status
  and the load response. Reset p_min alongside spec mode and n_max
  when the user switches to a different model.
- chat-settings-sheet: new "Draft p-min" number input
  (min 0, max 1, step 0.05), visible when speculativeType is mtp
  or mtp+ngram. Wired into the Reset and dirty-state machinery.

Tests
- 12 new assertions in test_llama_cpp_mtp_detection.py: p_min emission
  matrix (MTP modes only; never for auto/ngram/off; auto-promoted
  draft-mtp still gets p_min; graceful degrade when binary lacks
  --spec-draft-p-min), ngram-map-k / ngram-map-k4v emission with the
  right knob triplet, no-emit-when-unsupported, canonicalize
  recognition. 373 total backend tests pass (was 361 before).
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 22, 2026
Each PR ran the same staged source files before, which went stale when
the upstream PR commits advanced. Refactor to one job per PR with an
actions/checkout of that PR's head ref, so cross-OS validation
always uses the latest commit:

  - PR unslothai#5603 sandbox            -> studio-sandbox-hardening
  - PR unslothai#5620 parser parity      -> studio-tools-multi-format-v2
  - PR unslothai#5696 mtp reload guards  -> followup-mtp-reload-guards (unslothai#5582 followup)
  - PR unslothai#5695 lockfile audit     -> followup-lockfile-audit-regressions (unslothai#5604 followup)

4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap).
cancel-in-progress per (workflow, ref) keeps iteration cheap.

All tests stay CPU-only and rely on the CUDA spoof harness in
tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU
is required on any runner.
rhsCZ pushed a commit to rhsCZ/unsloth that referenced this pull request May 23, 2026
…thai#5582

Five issues surfaced after unslothai#5582 merged. All addressed with matching
pytest coverage (15 new tests, 147 total green).

Bug A -- route guard compared against the requested UI mode rather than
the backend's resolved spec mode. A user request setting
``spec_draft_n_max=2`` against a backend that was auto-promoted from
``auto`` -> ``draft-mtp`` saw ``requested_spec_mode == "auto"`` (not in
``("mtp", "mtp+ngram")``) and skipped the comparison, returning
``already_loaded`` with the stale value still active. Now mirrors the
backend-side guard's check against ``speculative_type == "draft-mtp"``.

Bug B -- both reload guards short-circuited the n_max comparison when
the request value was ``None``, treating it as a wildcard. A backend
loaded with an explicit override of 8 could never be cleared back to
the platform default without swapping the model. Both guards now treat
the ``(None vs explicit)`` flip as a difference: clear-to-default and
set-from-default both bounce a reload, while ``(None == None)`` and
``(N == N)`` continue to match.
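
Bugs A and B can be reproduced side by side with a small model of the guard. This is a sketch only: `old_guard` is adapted from the three-line snippet quoted in the Codex review, `n_max_matches` is a hypothetical name for the fixed comparison, and `resolved_is_mtp` stands in for whatever check the backend uses to decide that MTP is actually running.

```python
from typing import Optional


def old_guard(requested_mode: str, loaded: Optional[int], requested: Optional[int]) -> bool:
    """Pre-fix behaviour: keyed on the requested UI mode, None acts as a wildcard."""
    if requested_mode in ("mtp", "mtp+ngram") and requested is not None:
        if int(requested) != (loaded or 0):
            return False
    return True


def n_max_matches(resolved_is_mtp: bool, loaded: Optional[int], requested: Optional[int]) -> bool:
    """Post-fix behaviour: keyed on the resolved spec, None is a real value."""
    if not resolved_is_mtp:
        return True  # n_max has no effect when MTP is not running
    # Plain equality covers all four cases: None == None and N == N match,
    # while clear-to-default and set-from-default both differ.
    return loaded == requested
```

With these definitions, `old_guard("auto", None, 5)` returns `True` (Bug A: the auto-promoted backend is never compared) and `old_guard("mtp", 8, None)` returns `True` (Bug B: the override cannot be cleared), whereas `n_max_matches(True, None, 5)` and `n_max_matches(True, 8, None)` both return `False` and so bounce a reload.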

Bug C -- chained MTP+ngram on a legacy llama-server (pre arg-rename)
emitted ``--draft-max`` twice: once for MTP's draft length (e.g. 2),
once for ngram-mod's size-N max (e.g. 64). llama-server's last-wins
parsing clobbered the MTP value with 64, defeating the
``--spec-draft-n-max`` slider. ``_build_ngram_mod_flags`` now takes a
``chain_with_mtp`` kwarg that suppresses ``--draft-max`` on the legacy
flavor when MTP will emit it; the post-rename flavor uses distinct
``--spec-ngram-mod-*`` names that cannot collide.
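
A rough illustration of the flag selection follows. It is a sketch rather than the real `_build_ngram_mod_flags`: the signature is hypothetical (the real helper takes the probed capabilities), and the caller is assumed to pass the tuning values explicitly. The flag names for both flavors are the ones listed in this thread.

```python
from typing import List, Optional


def build_ngram_mod_flags(
    flavor: Optional[str],
    *,
    n_match: int,
    n_min: int,
    n_max: int,
    chain_with_mtp: bool = False,
) -> List[str]:
    """Hypothetical helper: pick ngram-mod knobs for the probed flavor."""
    if flavor == "new":
        # Post-rename names are distinct from MTP's flags, so nothing can collide.
        return [
            "--spec-ngram-mod-n-match", str(n_match),
            "--spec-ngram-mod-n-min", str(n_min),
            "--spec-ngram-mod-n-max", str(n_max),
        ]
    if flavor == "legacy":
        flags = ["--spec-ngram-size-n", str(n_match), "--draft-min", str(n_min)]
        if not chain_with_mtp:
            # Standalone ngram-mod owns --draft-max; in a chain, MTP emits it instead.
            flags += ["--draft-max", str(n_max)]
        return flags
    return []  # binary supports neither flavor
```

In the usage below, 48 and 64 are the n-min and n-max values mentioned in the PR summary, while the `n_match` value of 24 is purely a placeholder: `build_ngram_mod_flags("legacy", n_match=24, n_min=48, n_max=64, chain_with_mtp=True)` yields no `--draft-max`, leaving MTP's value as the only one on the command line.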

Bug D -- a forced ``speculative_type="ngram"`` request emitted
``--spec-type ngram-mod`` even on binaries that did not advertise
ngram-mod support, causing llama-server to refuse to start. The auto
path already checked ``supports_ngram_mod`` before emitting; the
forced path now mirrors that check and loads without spec (with a
warning that matches the MTP-token-missing path).

Bug E -- ``speculative_type="none"`` is llama.cpp's own explicit-disable
spelling, and external API callers commonly use ``"disable"`` /
``"disabled"``. None of these were in the canonical spec mode set or
the legacy alias map, so they fell through to ``"auto"`` and silently
re-enabled MTP -- the opposite of the user's intent. Added all three
to ``_LEGACY_SPEC_MODE_MAP`` as aliases for ``"off"``.
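
To make the alias handling concrete, here is a minimal canonicaliser sketch. It assumes the legacy spellings map as the dropdown commit describes ("default" to Auto, "draft-mtp" to MTP, "ngram-mod" to Ngram, the comma chain to MTP+Ngram); the table names and the choice to return `None` for unrecognised input are assumptions, not the actual `_canonicalize_spec_mode` implementation.

```python
from typing import Optional

CANONICAL_MODES = {"auto", "mtp", "ngram", "mtp+ngram", "off", "ngram-simple"}

# Hypothetical alias table; the last three entries are the Bug E additions.
LEGACY_ALIASES = {
    "default": "auto",
    "draft-mtp": "mtp",
    "ngram-mod": "ngram",
    "ngram-mod,draft-mtp": "mtp+ngram",
    "none": "off",
    "disable": "off",
    "disabled": "off",
}


def canonicalize_spec_mode(value: Optional[str]) -> Optional[str]:
    """Map any accepted spelling onto a canonical mode, or None if unknown."""
    if value is None:
        return None
    key = value.strip().lower()
    if key in CANONICAL_MODES:
        return key
    return LEGACY_ALIASES.get(key)


def effective_mode(value: Optional[str]) -> str:
    """Callers treat missing or unknown input as Auto."""
    return canonicalize_spec_mode(value) or "auto"
```

Without the three "off" aliases, `effective_mode("none")` would fall through to `"auto"`, which is exactly the silent re-enable described in Bug E.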

Tests
-----
- test_canonicalize_spec_mode_none_aliases_map_to_off (6 cases via
  parametrize): "none"/"None"/"NONE"/"  none  "/"disable"/"Disabled"
  all canonicalise to "off".
- test_build_ngram_mod_flags_legacy_chained_omits_draft_max +
  test_build_ngram_mod_flags_legacy_standalone_keeps_draft_max +
  test_build_ngram_mod_flags_new_flavor_always_emits_distinct_names:
  the chain_with_mtp kwarg suppresses only the legacy flavor's
  --draft-max, never the new-flavor knobs.
- test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max:
  end-to-end check that the assembled spec block has exactly one
  --draft-max carrying the MTP draft length.
- test_build_speculative_flags_forced_ngram_without_support_skips_spec
  + test_build_speculative_flags_forced_ngram_with_support_emits_spec:
  forced ngram refuses on a binary lacking ngram-mod support; still
  emits cleanly on a supporting binary.
- test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload:
  backend-side guard covers both clear-to-default and set-from-default.
- test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*}: route
  guard now compares against resolved spec mode and handles the None
  flip symmetrically.
- test_route_guard_ignores_n_max_when_resolved_spec_is_not_mtp:
  non-MTP resolved spec (e.g. ngram-mod) still ignores n_max.

147/147 spec/reload test suites green.
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…nslothai#5582)

* studio: add --spec-draft-n-max toggle for MTP speculative decoding

Surface llama-server's --spec-draft-n-max as a first-class
LoadRequest field so users can tune the MTP draft tree size from
the chat settings panel. Default behaviour is unchanged: when the
caller omits spec_draft_n_max, the existing platform defaults still
apply (6 on GPU, 3 on CPU/Mac).

Why this matters: on context-constrained loads the draft KV cache
competes with the target model's KV cache for VRAM. Lowering
spec_draft_n_max reduces that pressure, lets a larger user context
fit, and recovers throughput; raising it pays off when draft
acceptance is high enough to amortise the extra cache.

Backend
- LoadRequest gains an optional spec_draft_n_max: int (1..16).
- LlamaCppBackend.load_model accepts and persists the override on
  self._spec_draft_n_max, used in place of the hardcoded 6/3 in the
  MTP emit branch.
- LoadResponse and InferenceStatusResponse echo the active value
  (None when the platform default is in effect) so the UI can
  hydrate the input on refresh.
- _already_in_target_state and _request_matches_loaded_settings
  compare spec_draft_n_max alongside speculative_type so a value
  change triggers a reload rather than no-op'ing.
- strip_shadowing_flags now strips inherited --spec-* extras when
  either speculative_type or spec_draft_n_max is in fields_set, so
  an inherited --spec-draft-n-max cannot last-wins-override a fresh
  request's first-class field.

Frontend
- LoadModelRequest, LoadModelResponse, InferenceStatusResponse
  TypeScript shapes get spec_draft_n_max.
- chat-runtime-store gains specDraftNMax / loadedSpecDraftNMax and
  a setter, hydrated from /v1/status and /v1/load.
- chat-settings-sheet renders a "Draft Tokens" numeric input
  directly under the Speculative Decoding switch when that switch
  is on. Toggling the switch off clears the override; the Reset
  button restores the loaded value.

Tests
- Four new regression tests cover _already_in_target_state with
  matching / mismatching / non-MTP / unset spec_draft_n_max.
- Existing test_llama_server_args.py and test_llama_cpp_mtp_detection.py
  green: 141 passed locally.

* studio: add --spec-draft-p-min and --spec-draft-p-split to spec strip set

llama.cpp server documents --spec-draft-p-min (default 0.75, min draft
acceptance probability) and --spec-draft-p-split (default 0.10). Both
are first-class spec-decoding knobs that should travel with the rest
of the --spec-* family when an Apply re-sets speculative_type, so an
inherited override doesn't leak across a fresh load.

* studio/tests: skip MTP capability-probe tests on Windows

The four probe_server_capabilities tests use a bash stub written to
tmp_path/llama-server, which Windows' subprocess can't execute
directly (no shebang resolution, .bat / .cmd would be needed). Mark
them skipif sys.platform == 'win32' so the rest of the MTP plumbing
suite stays green on Windows CI. Unix coverage is unchanged.

* studio: lower MTP GPU default --spec-draft-n-max from 6 to 2

Bench on B200 / Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL across five prompt
types (essay, code, story, math, science) with greedy temp=0 (all figures in t/s):

  prompt    OFF    n=1    n=2    n=3    n=6
  essay    79.1   93.4   93.8   84.7   64.6
  code     79.1  104.4  116.6  113.5  103.0
  story    79.1   99.2  105.7  101.8   88.9
  math     79.1  100.8  110.8  111.8   98.2
  science  79.1  100.1  110.8  110.8  102.9

The previous hardcoded GPU default of 6 was about 18% SLOWER than spec-off
on the essay prompt (64.6 vs 79.1 t/s) and roughly 7-16% slower than n=2 on
the rest. n=2 wins or ties on 4/5 prompts with a 1.18x-1.47x speedup vs OFF;
n=3 wins on the math prompt by a hair. n=6 collapses once acceptance
rate drops past n=3 -- wasted draft decode dominates the per-step
budget.
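
The ratios discussed above can be recomputed directly from the table. The snippet below is only a convenience for checking the arithmetic; the dictionary layout and helper names are illustrative, and the numbers are copied verbatim from the benchmark table in this commit message.

```python
# Throughput in tokens/s, copied from the table above.
BENCH = {
    "essay":   {"off": 79.1, "n=1": 93.4,  "n=2": 93.8,  "n=3": 84.7,  "n=6": 64.6},
    "code":    {"off": 79.1, "n=1": 104.4, "n=2": 116.6, "n=3": 113.5, "n=6": 103.0},
    "story":   {"off": 79.1, "n=1": 99.2,  "n=2": 105.7, "n=3": 101.8, "n=6": 88.9},
    "math":    {"off": 79.1, "n=1": 100.8, "n=2": 110.8, "n=3": 111.8, "n=6": 98.2},
    "science": {"off": 79.1, "n=1": 100.1, "n=2": 110.8, "n=3": 110.8, "n=6": 102.9},
}


def speedup(prompt: str, setting: str, baseline: str = "off") -> float:
    """Throughput ratio of `setting` over `baseline` for one prompt."""
    row = BENCH[prompt]
    return row[setting] / row[baseline]


# Prompts where n=2 is the best (or tied-best) setting.
n2_wins = [p for p, row in BENCH.items() if row["n=2"] == max(row.values())]

# Range of n=2 speedups over spec-off across all prompts.
n2_speedups = [speedup(p, "n=2") for p in BENCH]

# Relative slowdown of n=6 against n=2, per prompt.
n6_vs_n2_slowdown = {p: 1.0 - speedup(p, "n=6", baseline="n=2") for p in BENCH}

if __name__ == "__main__":
    print("n=2 best on:", n2_wins)
    print("n=2 speedup range: %.2fx to %.2fx" % (min(n2_speedups), max(n2_speedups)))
    for prompt, loss in n6_vs_n2_slowdown.items():
        print(f"n=6 vs n=2 on {prompt}: {loss:.1%} slower")
```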

Matches the dataset README ("n_max=2 is the sweet spot for 36 of 42
quants"). Keeps CPU/Mac default at 3, which empirically tracks the
narrower ngram+MTP chained budget on those platforms.

Users who want the old behaviour can pass spec_draft_n_max in
LoadRequest (the toggle this PR also adds) or --spec-draft-n-max via
llama_extra_args.

* studio: skip MTP auto-promote on sub-2B models, backfill chat usage

Two MTP-visibility fixes uncovered while bisecting llama.cpp post-#22673
on Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL on B200.

Size gate. Direct llama-server bench (no Studio measurement loop) at
n_predict=192 across 9 prompts shows MTP regresses vs spec-off on
sub-2B dense models because draft cost exceeds savings:

  Qwen3.5-0.8B Q4_K_XL   GPU: 452.0 OFF -> 283.4 t/s n=2  (0.63x)
                         CPU: 84.5  OFF -> 64.9  t/s n=3  (0.77x)
  Qwen3.5-4B  Q4_K_XL    GPU: 241.0 OFF -> 258.2 t/s n=2  (1.07x)
  Qwen3.5-9B  Q4_K_XL    GPU: 201.6 OFF -> 228.9 t/s n=2  (1.14x)
  Qwen3.5-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2  (1.44x)
  Qwen3.6-27B Q4_K_XL    GPU:  78.8 OFF -> 113.6 t/s n=2  (1.44x)
  Qwen3.6-35B-A3B Q4     GPU: 192.3 OFF -> 223.2 t/s n=2  (1.16x)

The 2B inflection is sharp. Skip auto-promote to draft-mtp when the
identifier reports <2.0B params; users can still force via --spec-type
or the Speculative Decoding toggle. Mirror the gate in the
reload-skip check so a sub-2B reload-with-default does not bounce a
spec-off backend.
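
One way to implement an identifier-based size gate is a regular expression over the model name. This is a sketch under assumptions: the regex, the function names and the decision to promote when no size can be parsed are all hypothetical, and Studio's real parser may differ. Only the 2.0B threshold comes from this commit.

```python
import re
from typing import Optional

# A number followed by "B" that is not glued to a preceding letter, digit or dot,
# so "Qwen3.5" and the "A3B" active-parameter suffix are both ignored.
_SIZE_RE = re.compile(r"(?<![A-Za-z0-9.])(\d+(?:\.\d+)?)[Bb](?![A-Za-z0-9])")


def params_billions(identifier: str) -> Optional[float]:
    """Parse the first parameter-count token (e.g. '0.8B', '27B') from a model id."""
    match = _SIZE_RE.search(identifier)
    return float(match.group(1)) if match else None


def should_auto_promote_mtp(identifier: str, threshold_b: float = 2.0) -> bool:
    """Skip MTP auto-promotion below the threshold.

    Assumption: an unparseable size is treated as large enough to promote.
    """
    size = params_billions(identifier)
    return size is None or size >= threshold_b
```

The negative lookbehind is what keeps the version number in names such as `Qwen3.5-0.8B-MTP` from being mistaken for a parameter count.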

Chat-completions usage. llama-server's final SSE chunk emits both an
OpenAI-style usage block and a custom timings block. timings.predicted_n
is always populated, but usage.completion_tokens is zero on some
server builds. The Studio chat UI computes generation t/s from
meta.usage.completion_tokens / totalStreamTime, so a zero
completion_tokens makes the UI fall back to wall-clock time
(including SSE / proxy / template overhead) which dilutes MTP gains and
makes ON look the same as OFF.

Add _backfill_usage_from_timings: if usage.completion_tokens is missing
or zero AND timings has predicted_n/prompt_n, synthesize a complete
usage dict. Apply at the streaming metadata yield in
generate_chat_completion and at the three accumulator/yield sites in
generate_chat_completion_with_tools so per-iteration counts are not
silently lost across tool calls.

Tests cover both the gate (sub-2B skips, 2B+ promotes) and the
backfill (zero usage filled, real usage preserved, empty timings
passthrough).
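
The backfill rule described above is small enough to sketch in full. The function below is an illustrative stand-in for `_backfill_usage_from_timings`, not a copy of it: the signature and the exact shape of the synthesized dict (OpenAI-style `prompt_tokens` / `completion_tokens` / `total_tokens`) are assumptions.

```python
from typing import Any, Dict, Optional


def backfill_usage_from_timings(
    usage: Optional[Dict[str, Any]],
    timings: Optional[Dict[str, Any]],
) -> Optional[Dict[str, Any]]:
    """Fill a missing or zero completion count from llama-server's timings block."""
    if usage and usage.get("completion_tokens"):
        return usage  # real usage is preserved untouched
    if not timings or "predicted_n" not in timings:
        return usage  # nothing to backfill from: pass through unchanged

    completion = int(timings["predicted_n"])
    # Prefer timings.prompt_n; fall back to whatever usage already carried.
    prompt = int(timings.get("prompt_n") or (usage or {}).get("prompt_tokens") or 0)
    return {
        **(usage or {}),
        "prompt_tokens": prompt,
        "completion_tokens": completion,
        "total_tokens": prompt + completion,
    }
```

Applied to a final chunk carrying `usage.completion_tokens == 0` and `timings == {"predicted_n": 192, "prompt_n": 17}`, the helper returns a usage dict with 192 completion tokens, so a tokens-per-second figure derived from it reflects the tokens actually generated.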

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: probe + emit legacy ngram-mod flags for pre-rename llama-server

llama.cpp upstream renamed the ngram-mod tuning knobs:

  --draft-max         -> --spec-ngram-mod-n-max  (and --spec-draft-n-max)
  --draft-min         -> --spec-ngram-mod-n-min  (and --spec-draft-n-min)
  --spec-ngram-size-n -> --spec-ngram-mod-n-match

On post-rename builds the new names are real flags and the legacy
names survive only as removal stubs (with description "argument has
been removed"). Pre-rename builds only carry the legacy names as real
flags. Studio was emitting the new names unconditionally, so a user
running a pre-rename llama-server (e.g. an older prebuilt or a
hand-installed binary) would see "unknown argument" errors when the
ngram-mod path engages, or silent drop of the ngram knobs.

Extend `probe_server_capabilities` to parse the help text into
per-flag description blocks and tell real flags apart from removal
stubs by the "argument has been removed" marker. Add three new probe
fields: `ngram_mod_flavor` ("new" / "legacy" / None),
`supports_ngram_mod`, and `spec_draft_n_max_flag` (the actual n_max
flag the binary accepts). Cached by (path, mtime) the same way as
`mtp_token`.
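
The stub-versus-real distinction can be sketched as a small help-text parser. Everything here is an assumption made for illustration: the help layout (flags first, then at least two spaces, then the description, with indented continuation lines), the function names, and the choice of `--spec-ngram-mod-n-max` and `--spec-ngram-size-n` as the marker flags for each flavor. It is not the real `probe_server_capabilities`.

```python
import re
from typing import Dict, List, Optional

REMOVED_MARKER = "argument has been removed"


def parse_help_flags(help_text: str) -> Dict[str, str]:
    """Map each long flag to its (possibly multi-line) description block."""
    descriptions: Dict[str, str] = {}
    current: List[str] = []  # flags owning the block currently being read
    for raw in help_text.splitlines():
        line = raw.strip()
        if not line:
            current = []
            continue
        if line.startswith("-"):
            # Assumed layout: "<flags and arg>  <description>".
            parts = re.split(r"\s{2,}", line, maxsplit=1)
            current = re.findall(r"--[a-z0-9][a-z0-9-]*", parts[0])
            for flag in current:
                descriptions[flag] = parts[1] if len(parts) > 1 else ""
        else:
            # Continuation line: append to every flag of the open block.
            for flag in current:
                descriptions[flag] = (descriptions[flag] + " " + line).strip()
    return descriptions


def is_real_flag(descriptions: Dict[str, str], flag: str) -> bool:
    """A flag is real if it is listed and is not a removal stub."""
    return flag in descriptions and REMOVED_MARKER not in descriptions[flag]


def ngram_mod_flavor(descriptions: Dict[str, str]) -> Optional[str]:
    if is_real_flag(descriptions, "--spec-ngram-mod-n-max"):
        return "new"
    if is_real_flag(descriptions, "--spec-ngram-size-n"):
        return "legacy"
    return None
```

Checking the description rather than mere presence of the flag name is the important part: a post-rename binary still lists the legacy names, so a plain substring search over the help text would misreport it as legacy-capable.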

Add `_build_ngram_mod_flags(caps, ...)` that picks the right flag
set, returning [] when neither is usable so callers can drop ngram
chaining entirely on minimal binaries.

Wire both call sites to use the probe-driven flag set:
- CPU/Mac MTP comma-chain (--spec-type ngram-mod,draft-mtp) emits
  legacy or new knobs as appropriate. If neither set is available,
  degrade to MTP-only (warn but still engage spec).
- Standalone --spec-type ngram-mod branch uses the same helper.

Tests cover post-rename detection, legacy detection, removal-stub
discrimination, minimal-binary case, and all three branches of
`_build_ngram_mod_flags` plus custom n_match/n_min/n_max values.

Verified against three real binaries (Studio bundled 726704a, my
build of 45b455e HEAD, and the MTP merge baseline 2555826) all
correctly reporting ngram_mod_flavor=new.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: sub-3B MTP falls back to ngram-mod, not off

Earlier sub-2B gate disabled speculative decoding entirely for tiny
dense MTP models because the MTP draft head's per-token cost exceeds
the acceptance savings at that scale. The "fully off" fallback was
conservative -- ngram-mod has near-zero idle cost on diverse content
and consistently outperforms both off and draft-mtp at sub-3B.

Clean-methodology bench (each of 9 distinct prompts run once after
two unrelated warmup prompts so the ngram-mod hash pool is
realistically populated but never holds the exact deterministic
output we're about to measure):

  Q4_K_XL on B200:
    0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
    2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
    4B    OFF=240  draft-mtp n=2=260 (1.08x)  -- 4B+ wins with MTP

  Q4_K_XL on x86 48 cores:
    0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
    2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
    4B    OFF= 31  chained n=2= 41 (1.33x)

Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via
  the probe-driven _build_ngram_mod_flags helper. Works on both
  post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to
  spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B
  reload-with-default does not bounce a ngram-mod backend.

Tests updated: monkeypatch probe_server_capabilities so the gate
behavior is deterministic regardless of which llama-server happens
to be on the host. +1 new test for the "binary has no ngram-mod
support" branch; renamed prior 2B/0.8B tests to reflect new semantics.

This generalizes the size gate to be probe-driven instead of a hard
"disable spec" branch.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: 5-mode Speculative Decoding dropdown (Auto / MTP / Ngram / MTP+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option
Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs,
ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes
force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no
ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the
ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off
state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input
  (canonical, legacy "default" / "draft-mtp" / "ngram-mod" / "ngram-simple",
  or comma-chained "ngram-mod,draft-mtp") onto one of auto / mtp / ngram /
  mtp+ngram / off / ngram-simple / None. Lets external callers and old
  persisted UI state round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode
  property storing the canonical UI mode the user requested. Status
  responses round-trip this instead of the resolved internal flag, so the
  dropdown restores the picked value after reload / refresh (Auto on a 27B
  MTP GGUF resolves to draft-mtp internally but the dropdown stays on
  "Auto").
- The resolver block in load_model is extracted into a unit-testable
  _build_speculative_flags method. Forced MTP / MTP+Ngram on a sub-3B or
  non-MTP GGUF logs a warning and engages anyway (user override > the
  Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings
  now compare canonical-requested mode, dropping the old auto-promotion
  mirror. spec_draft_n_max still gates on the resolved spec so Auto + a
  changed n_max still bounces a reload.

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV
  Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft
  Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend
  canonicaliser so persisted "default" / "draft-mtp" / "ngram-mod" / chain
  values hydrate to the right dropdown option.
- types/api.ts: documents the canonical wire vocabulary.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full
  _canonicalize_spec_mode table, a 23-row resolver matrix across
  (requested mode) x (GPU/CPU) x (model size class), plus n_max override,
  user-extra-args precedence, requested-mode round-trip, and graceful
  degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests still green. 218 total in the MTP /
  server-args / reload-inheritance suite.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: reset Speculative Decoding to Auto on model switch

When the user switches from model A to a different model B, clear the
runtime store's speculativeType + specDraftNMax (and their loaded*
shadows). The new load request then carries null, the backend
canonicalises that to "auto", and its platform-aware resolver runs
fresh for the new model.

Without this, a non-MTP model loaded with "Off" carried the Off choice
into a subsequent MTP load, suppressing MTP auto-promotion (and the
sub-3B ngram-mod fallback) until the user manually opened settings and
flipped the dropdown back to Auto. The clean-sweep deep probe caught
it as anomaly A-1.

The reset only fires when currentCheckpoint != modelId, so a
same-model reapply or forceReload still honours the user's current
spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) ->
Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status
returns speculative_type=auto, studio.log shows the Auto sub-3B
fallback emitted --spec-type ngram-mod.
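
The reset rule can be modelled in a few lines. The real change lives in the TypeScript runtime store; the Python below is only a language-neutral sketch of the behaviour, with hypothetical class, field and function names. How the very first load (no current checkpoint) is handled is not covered by the commit message, so this sketch simply treats it like a switch.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Union


@dataclass
class RuntimeStoreSketch:
    """Hypothetical model of the spec-related fields in the runtime store."""
    current_checkpoint: Optional[str] = None
    speculative_type: Optional[str] = None
    spec_draft_n_max: Optional[int] = None
    loaded_speculative_type: Optional[str] = None
    loaded_spec_draft_n_max: Optional[int] = None


def spec_fields_for_load(
    store: RuntimeStoreSketch, model_id: str
) -> Dict[str, Union[str, int, None]]:
    """Clear the spec choice only when the target model differs."""
    if store.current_checkpoint != model_id:
        store.speculative_type = None
        store.spec_draft_n_max = None
        store.loaded_speculative_type = None
        store.loaded_spec_draft_n_max = None
    # A same-model reapply or force reload keeps the user's current choice.
    return {
        "speculative_type": store.speculative_type,
        "spec_draft_n_max": store.spec_draft_n_max,
    }
```

With a store holding `speculative_type="off"` for model A, a request for model B yields null spec fields, while a request for model A again still carries "off".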

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…slothai#5585)

* studio: reserve VRAM headroom for the MTP draft cache in auto-fit

When MTP is going to engage on this load, _fit_context_to_vram now
budgets 0.85 of available VRAM instead of 0.90, leaving room for
llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF
UD-Q4_K_XL at native auto-context running roughly half the speed of
the same model with a slightly smaller context. The most parsimonious
explanation is a VRAM cliff: at native context the target's KV
already eats the 90% budget, then llama-server allocates the draft
cache + draft graph on top and spills into a slower partial-offload
path. Reducing the budget by 5 percentage points on MTP loads avoids the spill
without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.)
the fit is unchanged because the requested context already fits in
the tighter budget too.
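
To make the budget arithmetic concrete, here is a minimal sketch of a context fit with the two fractions. It is not `_fit_context_to_vram` itself: the function name, the parameters, the flat "weights plus per-token KV" cost model, and returning the requested context unchanged when the KV cache is off-GPU are all assumptions for illustration.

```python
def fit_context_sketch(
    available_vram_bytes: int,
    weights_bytes: int,
    kv_bytes_per_token: int,
    requested_ctx: int,
    *,
    mtp_engaged: bool,
    kv_on_gpu: bool = True,
) -> int:
    """Hypothetical sketch: largest context that fits the VRAM budget."""
    if not kv_on_gpu:
        # KV cache lives in system RAM, so the VRAM budget does not bound it.
        return requested_ctx
    # 0.85 when MTP will engage (room for the draft cache), 0.90 otherwise.
    fraction = 0.85 if mtp_engaged else 0.90
    kv_budget = int(available_vram_bytes * fraction) - weights_bytes
    if kv_budget <= 0:
        return 0
    return min(requested_ctx, kv_budget // kv_bytes_per_token)
```

On a tight card the MTP path returns a smaller context than the non-MTP path; once the requested context fits inside the 0.85 budget, both paths return the same value, which matches the "fit is unchanged" remark for hardware with abundant VRAM.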

MTP detection mirrors the auto-promotion logic in load_model: the
GGUF advertises nextn_predict_layers, or the model identifier /
local path matches the -MTP marker, and the user has not explicitly
opted out via speculative_type="off" or --spec-type extra args.
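
A hedged sketch of that detection rule is shown below. The helper names are hypothetical, the metadata is assumed to be a plain dict whose key ends in `nextn_predict_layers`, and the case-insensitive `-MTP` match is an assumption about how the marker is tested.

```python
import re
from typing import Mapping, Optional, Sequence

# Assumption: "-MTP" counts as a marker when not followed by another alphanumeric.
_MTP_MARKER = re.compile(r"-MTP(?![A-Za-z0-9])", re.IGNORECASE)


def looks_like_mtp(metadata: Mapping[str, object], identifier: str) -> bool:
    """True if GGUF metadata or the model name suggests MTP layers."""
    for key, value in metadata.items():
        if key.endswith("nextn_predict_layers") and isinstance(value, int) and value > 0:
            return True
    return bool(_MTP_MARKER.search(identifier))


def wants_mtp_headroom(
    metadata: Mapping[str, object],
    identifier: str,
    speculative_type: Optional[str],
    extra_args: Sequence[str],
) -> bool:
    """Detection plus the two opt-outs named in the commit message."""
    if speculative_type == "off":
        return False
    if any(a == "--spec-type" or a.startswith("--spec-type=") for a in extra_args):
        return False  # the user drives speculative decoding manually
    return looks_like_mtp(metadata, identifier)
```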

Tests: two new cases in test_kv_cache_estimation.py verify that
mtp_engaged=True yields a context less-than-or-equal-to the
non-MTP path on a tight budget, and that kv_on_gpu=False still
short-circuits regardless of mtp_engaged.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: gate _mtp_will_engage on canonical-mode resolver

After PR unslothai#5582 introduced the 5-mode Speculative Decoding dropdown plus
_canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:
  * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
  * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
  * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
  * ngram / ngram-simple / off -> never engage MTP
  * user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the
0.85 budget when the user explicitly picked Ngram (no MTP) or when
Auto fell back to ngram-mod on a sub-3B MTP model. The 5-point headroom
cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so
the auto-fit budget and the actual emission agree on whether MTP is
running.
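
The predicate table above translates almost directly into code. The sketch below uses a hypothetical function name and boolean inputs; it assumes the mode has already been canonicalised and that the size class is known.

```python
def mtp_will_engage_sketch(
    mode: str,
    *,
    is_mtp_gguf: bool,
    sub_3b: bool,
    user_spec_type_in_extra_args: bool,
) -> bool:
    """Hypothetical sketch: should auto-fit reserve headroom for MTP?"""
    if user_spec_type_in_extra_args:
        return False  # resolver suppressed, so no headroom is reserved
    if mode in ("mtp", "mtp+ngram"):
        return True  # forced modes always engage MTP
    if mode == "auto":
        # Auto promotes to MTP only on MTP GGUFs of 3B and above;
        # sub-3B falls back to ngram-mod, which needs no extra VRAM.
        return is_mtp_gguf and not sub_3b
    return False  # ngram / ngram-simple / off
```

Checking the extra-args override first means a user-supplied `--spec-type` wins even when the dropdown is forced to MTP, matching the last row of the list.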

All 361 backend tests pass.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>