studio: tighten MTP reload guards and asymmetric spec flags for #5582 by danielhanchen · Pull Request #5696 · unslothai/unsloth

danielhanchen · 2026-05-22T13:10:39Z

Summary

Followup to #5582. Five issues surfaced after that PR landed; all addressed here with matching pytest coverage.

Bug A: route guard at routes/inference.py:_request_matches_loaded_settings compared request.spec_draft_n_max against the requested UI mode. An Auto-promoted backend has requested_spec_mode == "auto", so a user moving the Draft Tokens slider while running under auto-resolved MTP returned already_loaded and kept the stale value. Now mirrors the backend-side guard's check against the resolved speculative_type == "draft-mtp".
Bug B: both reload guards short-circuited the n_max comparison when the request value was None, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to default without a model swap. Both guards now treat the (None vs explicit) flip as a difference.
Bug C: chained MTP+ngram on a legacy llama-server emitted --draft-max twice (MTP draft length, then ngram-mod size-N max). Last-wins clobbered MTP's value with 64. _build_ngram_mod_flags now takes a chain_with_mtp kwarg that suppresses the duplicate emission on the legacy flavor; the post-rename flavor uses distinct --spec-ngram-mod-* names and is unaffected.
Bug D: forced speculative_type="ngram" emitted --spec-type ngram-mod even on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checked supports_ngram_mod before emitting; the forced path now mirrors that check.
Bug E: speculative_type="none" (the spelling llama.cpp itself uses for disable, plus "disable" / "disabled") fell through the canonicaliser to "auto" and silently re-enabled MTP. Added all three to _LEGACY_SPEC_MODE_MAP as aliases for "off".

Test plan

python -m pytest studio/backend/tests/test_llama_cpp_mtp_detection.py studio/backend/tests/test_gguf_reload_inheritance.py -q — 147/147 pass (15 new tests added):
- test_canonicalize_spec_mode_none_aliases_map_to_off (6 parametrize cases) — Bug E
- test_build_ngram_mod_flags_legacy_chained_omits_draft_max + standalone + new-flavor — Bug C unit
- test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max — Bug C end-to-end
- test_build_speculative_flags_forced_ngram_without_support_skips_spec + with-support — Bug D
- test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload — Bug B backend side
- test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*,ignores_*} — Bug A + Bug B route side

No new behaviour for users on default builds (post-rename flavor, no explicit slider use, no forced spec mode). Only the buggy edge cases change.

gemini-code-assist

Code Review

This pull request fixes several issues in the llama.cpp speculative decoding implementation. It prevents flag collisions in legacy builds by suppressing duplicate --draft-max flags when MTP and ngram-mod are chained. It also ensures that "none" and "disable" speculative mode aliases are correctly canonicalized to "off" and prevents server startup failures by verifying ngram-mod support before emitting flags. Furthermore, the reload logic was updated to accurately detect changes in spec_draft_n_max, including transitions to and from default values and within auto-promoted MTP sessions. A robust suite of regression tests has been added to cover these scenarios. I have no feedback to provide as there were no review comments.

Each PR ran the same staged source files before, which went stale when the upstream PR commits advanced. Refactor to one job per PR with an actions/checkout of that PR's head ref, so cross-OS validation always uses the latest commit: - PR unslothai#5603 sandbox -> studio-sandbox-hardening - PR unslothai#5620 parser parity -> studio-tools-multi-format-v2 - PR unslothai#5696 mtp reload guards -> followup-mtp-reload-guards (unslothai#5582 followup) - PR unslothai#5695 lockfile audit -> followup-lockfile-audit-regressions (unslothai#5604 followup) 4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap). cancel-in-progress per (workflow, ref) keeps iteration cheap. All tests stay CPU-only and rely on the CUDA spoof harness in tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU is required on any runner.

unslothai#5620 parser tests transitively import the safetensors loop, which needs the datasets package. unslothai#5696 route-guard tests import routes/inference.py, which transitively imports core/training (uses matplotlib). Add both, plus the auth deps (pyjwt/cryptography/ aiosqlite/python-multipart) needed for any test that touches the FastAPI route module so route-level imports resolve cleanly on all three OSes.

Five issues surfaced after #5582 merged. All addressed with matching pytest coverage (15 new tests, 147 total green). Bug A -- route guard compared against the requested UI mode rather than the backend's resolved spec mode. A user request setting ``spec_draft_n_max=2`` against a backend that was auto-promoted from ``auto`` -> ``draft-mtp`` saw ``requested_spec_mode == "auto"`` (not in ``("mtp", "mtp+ngram")``) and skipped the comparison, returning ``already_loaded`` with the stale value still active. Now mirrors the backend-side guard's check against ``speculative_type == "draft-mtp"``. Bug B -- both reload guards short-circuited the n_max comparison when the request value was ``None``, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to the platform default without swapping the model. Both guards now treat the ``(None vs explicit)`` flip as a difference: clear-to-default and set-from-default both bounce a reload, while ``(None == None)`` and ``(N == N)`` continue to match. Bug C -- chained MTP+ngram on a legacy llama-server (pre arg-rename) emitted ``--draft-max`` twice: once for MTP's draft length (e.g. 2), once for ngram-mod's size-N max (e.g. 64). llama-server's last-wins parsing clobbered the MTP value with 64, defeating the ``--spec-draft-n-max`` slider. ``_build_ngram_mod_flags`` now takes a ``chain_with_mtp`` kwarg that suppresses ``--draft-max`` on the legacy flavor when MTP will emit it; the post-rename flavor uses distinct ``--spec-ngram-mod-*`` names that cannot collide. Bug D -- a forced ``speculative_type="ngram"`` request emitted ``--spec-type ngram-mod`` even on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checked ``supports_ngram_mod`` before emitting; the forced path now mirrors that check and loads without spec (with a warning that matches the MTP-token-missing path). Bug E -- ``speculative_type="none"`` is llama.cpp's own explicit-disable spelling, and external API callers commonly use ``"disable"`` / ``"disabled"``. None of these were in the canonical spec mode set or the legacy alias map, so they fell through to ``"auto"`` and silently re-enabled MTP -- the opposite of the user's intent. Added all three to ``_LEGACY_SPEC_MODE_MAP`` as aliases for ``"off"``. Tests ----- - test_canonicalize_spec_mode_none_aliases_map_to_off (6 cases via parametrize): "none"/"None"/"NONE"/" none "/"disable"/"Disabled" all canonicalise to "off". - test_build_ngram_mod_flags_legacy_chained_omits_draft_max + test_build_ngram_mod_flags_legacy_standalone_keeps_draft_max + test_build_ngram_mod_flags_new_flavor_always_emits_distinct_names: the chain_with_mtp kwarg suppresses only the legacy flavor's --draft-max, never the new-flavor knobs. - test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max: end-to-end check that the assembled spec block has exactly one --draft-max carrying the MTP draft length. - test_build_speculative_flags_forced_ngram_without_support_skips_spec + test_build_speculative_flags_forced_ngram_with_support_emits_spec: forced ngram refuses on a binary lacking ngram-mod support; still emits cleanly on a supporting binary. - test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload: backend-side guard covers both clear-to-default and set-from-default. - test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*}: route guard now compares against resolved spec mode and handles the None flip symmetrically. - test_route_guard_ignores_n_max_when_resolved_spec_is_not_mtp: non-MTP resolved spec (e.g. ngram-mod) still ignores n_max. 147/147 spec/reload test suites green.

for more information, see https://pre-commit.ci

The unit-level tests in this file stub httpx so they don't pay the real-import cost. But the route-guard tests added in this PR import routes.inference -> huggingface_hub -> from httpx import Response, and the stub had no Response, so on any environment where real httpx IS installed (CI, staging) my test file was overriding it with the incomplete stub and breaking the import chain. Switch to a try: import httpx pattern: when real httpx is on the path use it; only fall back to the stub when httpx is genuinely missing. Also add Response/Request to the fallback stub as a defensive measure for future imports that might reach for them. Local: 139/139 pass.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a54396a675

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-23T15:36:56Z

            "--draft-min",
            str(n_min),
-            "--draft-max",
-            str(n_max),
        ]
+        if not chain_with_mtp:
+            # Only safe to set --draft-max here when MTP is NOT in the


Avoid emitting legacy draft-min without matching draft-max

In the legacy chained mtp+ngram path, this still emits --draft-min 48 while --draft-max is suppressed here and later reused for MTP (spec_draft_n_max, typically 2/3). That produces an inverted legacy ngram range (--draft-min 48 --draft-max 2/3) on affected binaries, which can break or effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram requests. Please gate --draft-min together with --draft-max (or otherwise ensure draft-min <= draft-max) when chain_with_mtp=True on legacy flavor.

Useful? React with 👍 / 👎.

Codex flagged that the legacy chained mtp+ngram path emits --draft-min 48 while --draft-max is suppressed and later reused for MTP (spec_draft_n_max, typically 2/3). That produces an inverted legacy ngram range (--draft-min 48 --draft-max 2/3) on affected binaries, which can break or effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram requests. Both --draft-min and --draft-max are generic flags on legacy llama-server builds, so either one would race with MTP's own values. Gate the pair together: when chain_with_mtp=True on the legacy flavor we drop both flags and rely on MTP's emission for the chained range. Standalone ngram still emits both, preserving a valid min<=max window. Updated test_build_ngram_mod_flags_legacy_chained_omits_draft_max (now omits_draft_min_and_max) and added a min<=max guard on the standalone case. Full suite (test_llama_cpp_mtp_detection + test_llama_server_args) passes locally: 229 / 229.

danielhanchen · 2026-05-23T16:42:57Z

Pushed 068e129 addressing the codex P1 on the legacy chained mtp+ngram path.

Root cause: previously only --draft-max was suppressed inside the chain_with_mtp=True branch of _build_ngram_mod_flags, while --draft-min 48 was still emitted. Because both flags are generic on legacy llama-server (not yet renamed to --spec-ngram-mod-*), the surviving --draft-min 48 then collided with MTP's own --draft-min / --draft-max (2/3 from spec_draft_n_max) further down the command line. The resulting effective range was inverted (--draft-min 48 --draft-max 2/3) which disabled ngram-mod on the affected legacy binaries for both auto CPU MTP loads and forced mtp+ngram requests.

Fix: gate --draft-min and --draft-max together inside the if not chain_with_mtp: branch. Standalone ngram on legacy still emits both flags so the size-N window remains valid; chained ngram drops both so MTP's emission wins cleanly. --spec-ngram-size-n is a distinct flag and stays unconditional so ngram-mod still has its lookup length when chained.

Updated test_build_ngram_mod_flags_legacy_chained_omits_draft_max (renamed to ..._omits_draft_min_and_max) and added a min <= max guard on the standalone-keeps test so the inverted-range regression cannot re-creep silently. Full local pass: 229 / 229 across test_llama_cpp_mtp_detection.py + test_llama_server_args.py.

danielhanchen requested a review from rolandtannous as a code owner May 22, 2026 13:10

gemini-code-assist Bot reviewed May 22, 2026

View reviewed changes

Daniel Han and others added 3 commits May 23, 2026 15:33

[pre-commit.ci] auto fixes from pre-commit.com hooks

fe61d67

for more information, see https://pre-commit.ci

danielhanchen force-pushed the followup-mtp-reload-guards branch from 28e70bb to a54396a Compare May 23, 2026 15:33

chatgpt-codex-connector Bot reviewed May 23, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696

studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696
danielhanchen wants to merge 4 commits into
mainfrom
followup-mtp-reload-guards

danielhanchen commented May 22, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

chatgpt-codex-connector Bot left a comment

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Uh oh!

danielhanchen commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

danielhanchen commented May 22, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

chatgpt-codex-connector Bot May 23, 2026

Choose a reason for hiding this comment

Uh oh!

danielhanchen commented May 23, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant