studio: reserve VRAM headroom for the MTP draft cache in auto-fit#5585
Conversation
|
Warning You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again! |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b0591a3912
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| or _is_mtp_model_name(model_identifier, model_path) | ||
| ) | ||
| and _normalized_spec not in {"off"} | ||
| and not _extra_args_set_spec_type(extra_args) |
There was a problem hiding this comment.
Reserve MTP headroom for explicit spec-type overrides
When an MTP GGUF is loaded with llama_extra_args explicitly enabling MTP, e.g. ['--spec-type', 'draft-mtp'] or --spec-type=mtp, this predicate forces _mtp_will_engage to false because _extra_args_set_spec_type(extra_args) is true. Those args are still appended to the final llama-server command later, so the draft MTP cache will be allocated, but auto-fit used the old 90% budget instead of the new 85% headroom and can hit the same VRAM cliff this change is meant to avoid.
Useful? React with 👍 / 👎.
|
Stacking note: this PR sits on top of #5575 and #5582. Order of merge does not matter; they rebase cleanly either way. The auto-fit headroom change here is independent of the comma-chain and draft-tokens UI changes. Cross-OS coverage for the broader MTP plumbing is at danielhanchen#132 (ubuntu-latest / macos-14 / windows-latest). The kv-cache fit tests added here are picked up by that workflow. |
The previous overlay copied the vram-headroom branch's llama_cpp.py verbatim and lost the spec_draft_n_max plumbing that the new _already_in_target_state regression tests need. Combine both PR diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect TestTransformersIntrospection in the workflow filter; those tests need a live transformers config that is GPU-only on free runners.
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch, which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom changes. That overwrote the mtp_engaged-aware budget path on staging and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and test_fit_mtp_engaged_unchanged_when_kv_off_gpu. Restore: - mtp_engaged: bool = False param on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage computation before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All 299 staging backend tests pass.
…restore mtp_engaged) Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream). Restored on top of the new file: - mtp_engaged: bool = False on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All MTP detection + fit_context + fit_mtp tests pass (161 passed).
…P+Ngram / Off) Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs, ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off state, kept so the Switch's "disable" capability isn't lost. Backend - New module-level _canonicalize_spec_mode(value) maps any accepted input (canonical, legacy "default"/"draft-mtp"/"ngram-mod"/"ngram-simple", or comma-chained "ngram-mod,draft-mtp") onto one of auto/mtp/ngram/mtp+ngram/ off/ngram-simple/None. Lets external callers + old persisted UI state round-trip without breaking. - LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode property storing the canonical UI mode the user requested. Status responses round-trip this instead of the resolved internal flag, so the dropdown restores the picked value after reload / refresh (Auto on a 27B MTP GGUF resolves to draft-mtp internally but the dropdown stays on "Auto"). - The resolver block in load_model is extracted into a unit-testable _build_speculative_flags method. Forced MTP/MTP+Ngram on a sub-3B or non-MTP GGUF logs a warning and engages anyway (user override > the Auto-path sub-3B fallback). - _already_in_target_state and routes/inference._request_matches_loaded_settings now compare canonical-requested mode, dropping the old auto-promotion mirror. spec_draft_n_max still gates on the resolved spec (so Auto + changed n_max still bounces a reload). Frontend - chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft Tokens input only visible when speculativeType is "mtp" or "mtp+ngram". - chat-runtime-store.ts: initial value flips from "default" to "auto". - use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend canonicalizer so persisted "default"/"draft-mtp"/"ngram-mod"/chain values hydrate to the right dropdown option. - types/api.ts: docs the canonical wire vocabulary. Staging-only carry-over (PR unslothai#5585): mtp_engaged restored on _fit_context_to_vram, with the auto-fit gate widened to recognise forced MTP modes in addition to Auto's promotion path. Tests - 53 new assertions in test_llama_cpp_mtp_detection.py: full _canonicalize_spec_mode table, a 23-row resolver matrix across (requested mode) x (GPU/CPU) x (model size class), plus n_max override, user-extra-args precedence, requested-mode round-trip, and graceful degrade on an outdated llama-server without an MTP token. - 165 existing backend tests + 3 staging fit-mtp-engaged tests still green. 218 total in the MTP/server-args/reload-inheritance suite.
When MTP is going to engage on this load, _fit_context_to_vram now budgets 0.85 of available VRAM instead of 0.90, leaving room for llama.cpp's secondary MTP draft KV cache + compute graph buffers. Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL at native auto-context running roughly half the speed of the same model with a slightly smaller context. The most parsimonious explanation is a VRAM cliff: at native context the target's KV already eats the 90% budget, then llama-server allocates the draft cache + draft graph on top and spills into a slower partial-offload path. Reducing the budget by 5% on MTP loads avoids the spill without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.) the fit is unchanged because the requested context already fits in the tighter budget too. MTP detection mirrors the auto-promotion logic in load_model: the GGUF advertises nextn_predict_layers, or the model identifier / local path matches the -MTP marker, and the user has not explicitly opted out via speculative_type="off" or --spec-type extra args. Tests: two new cases in test_kv_cache_estimation.py verify that mtp_engaged=True yields a context less-than-or-equal-to the non-MTP path on a tight budget, and that kv_on_gpu=False still short-circuits regardless of mtp_engaged.
for more information, see https://pre-commit.ci
After PR #5582 introduced the 5-mode Speculative Decoding dropdown plus _canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes: * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed) * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM) * ngram / ngram-simple / off -> never engage MTP * user --spec-type in extra_args -> resolver suppressed; no headroom The old gate triggered on "anything but off", so it over-reserved the 0.85 budget when the user explicitly picked Ngram (no MTP) or when Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom cost was minor but unnecessary. Mirrors the same logic already encoded in _build_speculative_flags so the auto-fit budget and the actual emission agree on whether MTP is running. All 361 backend tests pass.
b0591a3 to
acbb2fb
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: acbb2fb5a2
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| model_size, | ||
| cache_type_kv, | ||
| n_parallel = n_parallel, | ||
| mtp_engaged = _mtp_will_engage, |
There was a problem hiding this comment.
Enforce MTP headroom after fitting
When this MTP-aware fit is used in the auto-context path, _fit_context_to_vram can still return requested_ctx or the 4096 minimum even though model_size + kv is above the new 85% budget (for example when weights alone exceed the 85% budget, or when the 4096 floor does not fit). The caller then accepts that returned value against _GPU_PIN_VRAM_FRACTION (95%) and sets use_fit=False, so MTP loads with weights+target-KV between 85% and 95% of free VRAM still pin without --fit and leave no room for the draft cache—the spill this change is meant to avoid remains possible in that auto-context path.
Useful? React with 👍 / 👎.
…slothai#5585) * studio: reserve VRAM headroom for the MTP draft cache in auto-fit When MTP is going to engage on this load, _fit_context_to_vram now budgets 0.85 of available VRAM instead of 0.90, leaving room for llama.cpp's secondary MTP draft KV cache + compute graph buffers. Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL at native auto-context running roughly half the speed of the same model with a slightly smaller context. The most parsimonious explanation is a VRAM cliff: at native context the target's KV already eats the 90% budget, then llama-server allocates the draft cache + draft graph on top and spills into a slower partial-offload path. Reducing the budget by 5% on MTP loads avoids the spill without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.) the fit is unchanged because the requested context already fits in the tighter budget too. MTP detection mirrors the auto-promotion logic in load_model: the GGUF advertises nextn_predict_layers, or the model identifier / local path matches the -MTP marker, and the user has not explicitly opted out via speculative_type="off" or --spec-type extra args. Tests: two new cases in test_kv_cache_estimation.py verify that mtp_engaged=True yields a context less-than-or-equal-to the non-MTP path on a tight budget, and that kv_on_gpu=False still short-circuits regardless of mtp_engaged. * [pre-commit.ci] auto fixes from pre-commit.com hooks for more information, see https://pre-commit.ci * studio: gate _mtp_will_engage on canonical-mode resolver After PR unslothai#5582 introduced the 5-mode Speculative Decoding dropdown plus _canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes: * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed) * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM) * ngram / ngram-simple / off -> never engage MTP * user --spec-type in extra_args -> resolver suppressed; no headroom The old gate triggered on "anything but off", so it over-reserved the 0.85 budget when the user explicitly picked Ngram (no MTP) or when Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom cost was minor but unnecessary. Mirrors the same logic already encoded in _build_speculative_flags so the auto-fit budget and the actual emission agree on whether MTP is running. All 361 backend tests pass. --------- Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Summary
When MTP is going to engage on this load,
_fit_context_to_vramnow budgets 0.85 of available VRAM instead of 0.90, leaving room for llama.cpp's secondary MTP draft KV cache + compute graph buffers.Detection mirrors the auto-promotion logic already in
load_model: the GGUF advertisesnextn_predict_layers, or the model identifier / local path matches the-MTPmarker, and the user has not explicitly opted out viaspeculative_type="off"or--spec-typeinllama_extra_args.Motivation
A user reported on RTX 5090 (32 GB) that
unsloth/Qwen3.6-27B-MTP-GGUFUD-Q4_K_XL at native auto-context ran roughly half the speed of the same model with a slightly smaller context. The most parsimonious explanation is a VRAM cliff at the 32 GB tier: at native context the target's KV already eats the 90% budget, then llama-server allocates the draft cache + draft graph on top and spills into a slower partial-offload path.Dropping the budget by 5% on MTP-engaged loads avoids the spill without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.) the fit is unchanged because the requested context already fits in the tighter budget too.
Why 0.85 and not "subtract an explicit draft-cache estimate"
The exact draft footprint depends on the model's
nextn_predict_layerscount, hidden dim, n_parallel, cache_type_kv and the live llama-server draft graph buffers. A constant 5% headroom is conservative enough to cover the common case (Qwen3.6 *-MTP with 1 predict layer, hidden 5120, ctx 131072 -> roughly 2 GB on f16 KV) and falls below the user-visible context loss on hardware where the cliff does not exist. A more sophisticated explicit-estimate path is reasonable as a follow-up.Tests
test_fit_mtp_engaged_returns_smaller_or_equal_context- tight budget, MTP path returns <= non-MTP path.test_fit_mtp_engaged_unchanged_when_kv_off_gpu---kv-offloadshort-circuit is preserved regardless ofmtp_engaged.test_llama_cpp_mtp_detection.py,test_llama_server_args.py,test_llama_cpp_context_fit.py,test_native_context_length.py,test_kv_cache_estimation.py,test_llama_cpp_max_context_threshold.py.Test plan