Skip to content

studio: reserve VRAM headroom for the MTP draft cache in auto-fit#5585

Merged
danielhanchen merged 3 commits into
mainfrom
studio-mtp-vram-headroom
May 19, 2026
Merged

studio: reserve VRAM headroom for the MTP draft cache in auto-fit#5585
danielhanchen merged 3 commits into
mainfrom
studio-mtp-vram-headroom

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

When MTP is going to engage on this load, _fit_context_to_vram now budgets 0.85 of available VRAM instead of 0.90, leaving room for llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Detection mirrors the auto-promotion logic already in load_model: the GGUF advertises nextn_predict_layers, or the model identifier / local path matches the -MTP marker, and the user has not explicitly opted out via speculative_type="off" or --spec-type in llama_extra_args.

Motivation

A user reported on RTX 5090 (32 GB) that unsloth/Qwen3.6-27B-MTP-GGUF UD-Q4_K_XL at native auto-context ran roughly half the speed of the same model with a slightly smaller context. The most parsimonious explanation is a VRAM cliff at the 32 GB tier: at native context the target's KV already eats the 90% budget, then llama-server allocates the draft cache + draft graph on top and spills into a slower partial-offload path.

Dropping the budget by 5% on MTP-engaged loads avoids the spill without penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.) the fit is unchanged because the requested context already fits in the tighter budget too.

Why 0.85 and not "subtract an explicit draft-cache estimate"

The exact draft footprint depends on the model's nextn_predict_layers count, hidden dim, n_parallel, cache_type_kv and the live llama-server draft graph buffers. A constant 5% headroom is conservative enough to cover the common case (Qwen3.6 *-MTP with 1 predict layer, hidden 5120, ctx 131072 -> roughly 2 GB on f16 KV) and falls below the user-visible context loss on hardware where the cliff does not exist. A more sophisticated explicit-estimate path is reasonable as a follow-up.

Tests

  • New: test_fit_mtp_engaged_returns_smaller_or_equal_context - tight budget, MTP path returns <= non-MTP path.
  • New: test_fit_mtp_engaged_unchanged_when_kv_off_gpu - --kv-offload short-circuit is preserved regardless of mtp_engaged.
  • Existing suite: 348 passed locally across test_llama_cpp_mtp_detection.py, test_llama_server_args.py, test_llama_cpp_context_fit.py, test_native_context_length.py, test_kv_cache_estimation.py, test_llama_cpp_max_context_threshold.py.

Test plan

  • New regression tests pass
  • Existing fit-context + MTP suites stay green
  • Maintainer review on a 32 GB / 24 GB GPU to confirm the cliff disappears

@gemini-code-assist

Copy link
Copy Markdown
Contributor

Warning

You have reached your daily quota limit. Please wait up to 24 hours and I will start processing your requests again!

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 18, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b0591a3912

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

or _is_mtp_model_name(model_identifier, model_path)
)
and _normalized_spec not in {"off"}
and not _extra_args_set_spec_type(extra_args)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Reserve MTP headroom for explicit spec-type overrides

When an MTP GGUF is loaded with llama_extra_args explicitly enabling MTP, e.g. ['--spec-type', 'draft-mtp'] or --spec-type=mtp, this predicate forces _mtp_will_engage to false because _extra_args_set_spec_type(extra_args) is true. Those args are still appended to the final llama-server command later, so the draft MTP cache will be allocated, but auto-fit used the old 90% budget instead of the new 85% headroom and can hit the same VRAM cliff this change is meant to avoid.

Useful? React with 👍 / 👎.

@danielhanchen

Copy link
Copy Markdown
Member Author

Stacking note: this PR sits on top of #5575 and #5582. Order of merge does not matter; they rebase cleanly either way. The auto-fit headroom change here is independent of the comma-chain and draft-tokens UI changes.

Cross-OS coverage for the broader MTP plumbing is at danielhanchen#132 (ubuntu-latest / macos-14 / windows-latest). The kv-cache fit tests added here are picked up by that workflow.

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 18, 2026
The previous overlay copied the vram-headroom branch's llama_cpp.py
verbatim and lost the spec_draft_n_max plumbing that the new
_already_in_target_state regression tests need. Combine both PR
diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect
TestTransformersIntrospection in the workflow filter; those tests
need a live transformers config that is GPU-only on free runners.
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod
fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch,
which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom
changes. That overwrote the mtp_engaged-aware budget path on staging
and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and
test_fit_mtp_engaged_unchanged_when_kv_off_gpu.

Restore:
- mtp_engaged: bool = False param on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage computation before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All 299 staging backend tests pass.
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
…restore mtp_engaged)

Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping
the two already-merged commits and renumbering the remaining nine. The
staging fork was sitting on the pre-rebase llama_cpp.py + tests; this
commit replays the rebased file content while preserving PR unslothai#5585's
mtp_engaged auto-fit headroom (staging-only patch absent upstream).

Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did
  not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All MTP detection + fit_context + fit_mtp tests pass (161 passed).
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
…P+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option
Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs,
ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes
force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no
ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the
ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off
state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input
  (canonical, legacy "default"/"draft-mtp"/"ngram-mod"/"ngram-simple", or
  comma-chained "ngram-mod,draft-mtp") onto one of auto/mtp/ngram/mtp+ngram/
  off/ngram-simple/None. Lets external callers + old persisted UI state
  round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode
  property storing the canonical UI mode the user requested. Status responses
  round-trip this instead of the resolved internal flag, so the dropdown
  restores the picked value after reload / refresh (Auto on a 27B MTP GGUF
  resolves to draft-mtp internally but the dropdown stays on "Auto").
- The resolver block in load_model is extracted into a unit-testable
  _build_speculative_flags method. Forced MTP/MTP+Ngram on a sub-3B or
  non-MTP GGUF logs a warning and engages anyway (user override > the
  Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings
  now compare canonical-requested mode, dropping the old auto-promotion
  mirror. spec_draft_n_max still gates on the resolved spec (so Auto +
  changed n_max still bounces a reload).

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV
  Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft
  Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend
  canonicalizer so persisted "default"/"draft-mtp"/"ngram-mod"/chain values
  hydrate to the right dropdown option.
- types/api.ts: docs the canonical wire vocabulary.

Staging-only carry-over (PR unslothai#5585): mtp_engaged restored on
_fit_context_to_vram, with the auto-fit gate widened to recognise forced
MTP modes in addition to Auto's promotion path.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full
  _canonicalize_spec_mode table, a 23-row resolver matrix across
  (requested mode) x (GPU/CPU) x (model size class), plus n_max override,
  user-extra-args precedence, requested-mode round-trip, and graceful
  degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests + 3 staging fit-mtp-engaged tests still
  green. 218 total in the MTP/server-args/reload-inheritance suite.
danielhanchen and others added 3 commits May 19, 2026 13:17
When MTP is going to engage on this load, _fit_context_to_vram now
budgets 0.85 of available VRAM instead of 0.90, leaving room for
llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF
UD-Q4_K_XL at native auto-context running roughly half the speed of
the same model with a slightly smaller context. The most parsimonious
explanation is a VRAM cliff: at native context the target's KV
already eats the 90% budget, then llama-server allocates the draft
cache + draft graph on top and spills into a slower partial-offload
path. Reducing the budget by 5% on MTP loads avoids the spill without
penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.)
the fit is unchanged because the requested context already fits in
the tighter budget too.

MTP detection mirrors the auto-promotion logic in load_model: the
GGUF advertises nextn_predict_layers, or the model identifier /
local path matches the -MTP marker, and the user has not explicitly
opted out via speculative_type="off" or --spec-type extra args.

Tests: two new cases in test_kv_cache_estimation.py verify that
mtp_engaged=True yields a context less-than-or-equal-to the
non-MTP path on a tight budget, and that kv_on_gpu=False still
short-circuits regardless of mtp_engaged.
After PR #5582 introduced the 5-mode Speculative Decoding dropdown plus
_canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:
  * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
  * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
  * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
  * ngram / ngram-simple / off -> never engage MTP
  * user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the
0.85 budget when the user explicitly picked Ngram (no MTP) or when
Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom
cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so
the auto-fit budget and the actual emission agree on whether MTP is
running.

All 361 backend tests pass.
@danielhanchen danielhanchen force-pushed the studio-mtp-vram-headroom branch from b0591a3 to acbb2fb Compare May 19, 2026 13:18
@danielhanchen danielhanchen merged commit bef6da5 into main May 19, 2026
10 of 26 checks passed
@danielhanchen danielhanchen deleted the studio-mtp-vram-headroom branch May 19, 2026 13:19

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: acbb2fb5a2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

model_size,
cache_type_kv,
n_parallel = n_parallel,
mtp_engaged = _mtp_will_engage,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Enforce MTP headroom after fitting

When this MTP-aware fit is used in the auto-context path, _fit_context_to_vram can still return requested_ctx or the 4096 minimum even though model_size + kv is above the new 85% budget (for example when weights alone exceed the 85% budget, or when the 4096 floor does not fit). The caller then accepts that returned value against _GPU_PIN_VRAM_FRACTION (95%) and sets use_fit=False, so MTP loads with weights+target-KV between 85% and 95% of free VRAM still pin without --fit and leave no room for the draft cache—the spill this change is meant to avoid remains possible in that auto-context path.

Useful? React with 👍 / 👎.

rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…slothai#5585)

* studio: reserve VRAM headroom for the MTP draft cache in auto-fit

When MTP is going to engage on this load, _fit_context_to_vram now
budgets 0.85 of available VRAM instead of 0.90, leaving room for
llama.cpp's secondary MTP draft KV cache + compute graph buffers.

Motivation: a user report on RTX 5090 (32 GB) showed Qwen3.6-27B-MTP-GGUF
UD-Q4_K_XL at native auto-context running roughly half the speed of
the same model with a slightly smaller context. The most parsimonious
explanation is a VRAM cliff: at native context the target's KV
already eats the 90% budget, then llama-server allocates the draft
cache + draft graph on top and spills into a slower partial-offload
path. Reducing the budget by 5% on MTP loads avoids the spill without
penalising non-MTP loads. On hardware with abundant VRAM (B200, etc.)
the fit is unchanged because the requested context already fits in
the tighter budget too.

MTP detection mirrors the auto-promotion logic in load_model: the
GGUF advertises nextn_predict_layers, or the model identifier /
local path matches the -MTP marker, and the user has not explicitly
opted out via speculative_type="off" or --spec-type extra args.

Tests: two new cases in test_kv_cache_estimation.py verify that
mtp_engaged=True yields a context less-than-or-equal-to the
non-MTP path on a tight budget, and that kv_on_gpu=False still
short-circuits regardless of mtp_engaged.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* studio: gate _mtp_will_engage on canonical-mode resolver

After PR unslothai#5582 introduced the 5-mode Speculative Decoding dropdown plus
_canonicalize_spec_mode, the auto-fit MTP-engaged predicate becomes:
  * forced mtp / mtp+ngram -> always engage MTP (extra VRAM needed)
  * auto + MTP GGUF (>= 3B) -> engages MTP via auto-promotion
  * auto + MTP GGUF (sub-3B) -> falls back to ngram-mod (no extra VRAM)
  * ngram / ngram-simple / off -> never engage MTP
  * user --spec-type in extra_args -> resolver suppressed; no headroom

The old gate triggered on "anything but off", so it over-reserved the
0.85 budget when the user explicitly picked Ngram (no MTP) or when
Auto fell back to ngram-mod on a sub-3B MTP model. The 5% headroom
cost was minor but unnecessary.

Mirrors the same logic already encoded in _build_speculative_flags so
the auto-fit budget and the actual emission agree on whether MTP is
running.

All 361 backend tests pass.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant