Skip to content

studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696

Open
danielhanchen wants to merge 4 commits into
mainfrom
followup-mtp-reload-guards
Open

studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696
danielhanchen wants to merge 4 commits into
mainfrom
followup-mtp-reload-guards

Conversation

@danielhanchen

Copy link
Copy Markdown
Member

Summary

Followup to #5582. Five issues surfaced after that PR landed; all addressed here with matching pytest coverage.

  • Bug A: route guard at routes/inference.py:_request_matches_loaded_settings compared request.spec_draft_n_max against the requested UI mode. An Auto-promoted backend has requested_spec_mode == "auto", so a user moving the Draft Tokens slider while running under auto-resolved MTP returned already_loaded and kept the stale value. Now mirrors the backend-side guard's check against the resolved speculative_type == "draft-mtp".
  • Bug B: both reload guards short-circuited the n_max comparison when the request value was None, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to default without a model swap. Both guards now treat the (None vs explicit) flip as a difference.
  • Bug C: chained MTP+ngram on a legacy llama-server emitted --draft-max twice (MTP draft length, then ngram-mod size-N max). Last-wins clobbered MTP's value with 64. _build_ngram_mod_flags now takes a chain_with_mtp kwarg that suppresses the duplicate emission on the legacy flavor; the post-rename flavor uses distinct --spec-ngram-mod-* names and is unaffected.
  • Bug D: forced speculative_type="ngram" emitted --spec-type ngram-mod even on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checked supports_ngram_mod before emitting; the forced path now mirrors that check.
  • Bug E: speculative_type="none" (the spelling llama.cpp itself uses for disable, plus "disable" / "disabled") fell through the canonicaliser to "auto" and silently re-enabled MTP. Added all three to _LEGACY_SPEC_MODE_MAP as aliases for "off".

Test plan

  • python -m pytest studio/backend/tests/test_llama_cpp_mtp_detection.py studio/backend/tests/test_gguf_reload_inheritance.py -q — 147/147 pass (15 new tests added):
    • test_canonicalize_spec_mode_none_aliases_map_to_off (6 parametrize cases) — Bug E
    • test_build_ngram_mod_flags_legacy_chained_omits_draft_max + standalone + new-flavor — Bug C unit
    • test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max — Bug C end-to-end
    • test_build_speculative_flags_forced_ngram_without_support_skips_spec + with-support — Bug D
    • test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload — Bug B backend side
    • test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*,ignores_*} — Bug A + Bug B route side

No new behaviour for users on default builds (post-rename flavor, no explicit slider use, no forced spec mode). Only the buggy edge cases change.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request fixes several issues in the llama.cpp speculative decoding implementation. It prevents flag collisions in legacy builds by suppressing duplicate --draft-max flags when MTP and ngram-mod are chained. It also ensures that "none" and "disable" speculative mode aliases are correctly canonicalized to "off" and prevents server startup failures by verifying ngram-mod support before emitting flags. Furthermore, the reload logic was updated to accurately detect changes in spec_draft_n_max, including transitions to and from default values and within auto-promoted MTP sessions. A robust suite of regression tests has been added to cover these scenarios. I have no feedback to provide as there were no review comments.

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 22, 2026
Each PR ran the same staged source files before, which went stale when
the upstream PR commits advanced. Refactor to one job per PR with an
actions/checkout of that PR's head ref, so cross-OS validation
always uses the latest commit:

  - PR unslothai#5603 sandbox            -> studio-sandbox-hardening
  - PR unslothai#5620 parser parity      -> studio-tools-multi-format-v2
  - PR unslothai#5696 mtp reload guards  -> followup-mtp-reload-guards (unslothai#5582 followup)
  - PR unslothai#5695 lockfile audit     -> followup-lockfile-audit-regressions (unslothai#5604 followup)

4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap).
cancel-in-progress per (workflow, ref) keeps iteration cheap.

All tests stay CPU-only and rely on the CUDA spoof harness in
tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU
is required on any runner.
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 22, 2026
unslothai#5620 parser tests transitively import the safetensors loop, which
needs the datasets package. unslothai#5696 route-guard tests import
routes/inference.py, which transitively imports core/training (uses
matplotlib). Add both, plus the auth deps (pyjwt/cryptography/
aiosqlite/python-multipart) needed for any test that touches the
FastAPI route module so route-level imports resolve cleanly on all
three OSes.
Daniel Han and others added 3 commits May 23, 2026 15:33
Five issues surfaced after #5582 merged. All addressed with matching
pytest coverage (15 new tests, 147 total green).

Bug A -- route guard compared against the requested UI mode rather than
the backend's resolved spec mode. A user request setting
``spec_draft_n_max=2`` against a backend that was auto-promoted from
``auto`` -> ``draft-mtp`` saw ``requested_spec_mode == "auto"`` (not in
``("mtp", "mtp+ngram")``) and skipped the comparison, returning
``already_loaded`` with the stale value still active. Now mirrors the
backend-side guard's check against ``speculative_type == "draft-mtp"``.

Bug B -- both reload guards short-circuited the n_max comparison when
the request value was ``None``, treating it as a wildcard. A backend
loaded with an explicit override of 8 could never be cleared back to
the platform default without swapping the model. Both guards now treat
the ``(None vs explicit)`` flip as a difference: clear-to-default and
set-from-default both bounce a reload, while ``(None == None)`` and
``(N == N)`` continue to match.

Bug C -- chained MTP+ngram on a legacy llama-server (pre arg-rename)
emitted ``--draft-max`` twice: once for MTP's draft length (e.g. 2),
once for ngram-mod's size-N max (e.g. 64). llama-server's last-wins
parsing clobbered the MTP value with 64, defeating the
``--spec-draft-n-max`` slider. ``_build_ngram_mod_flags`` now takes a
``chain_with_mtp`` kwarg that suppresses ``--draft-max`` on the legacy
flavor when MTP will emit it; the post-rename flavor uses distinct
``--spec-ngram-mod-*`` names that cannot collide.

Bug D -- a forced ``speculative_type="ngram"`` request emitted
``--spec-type ngram-mod`` even on binaries that did not advertise
ngram-mod support, causing llama-server to refuse to start. The auto
path already checked ``supports_ngram_mod`` before emitting; the
forced path now mirrors that check and loads without spec (with a
warning that matches the MTP-token-missing path).

Bug E -- ``speculative_type="none"`` is llama.cpp's own explicit-disable
spelling, and external API callers commonly use ``"disable"`` /
``"disabled"``. None of these were in the canonical spec mode set or
the legacy alias map, so they fell through to ``"auto"`` and silently
re-enabled MTP -- the opposite of the user's intent. Added all three
to ``_LEGACY_SPEC_MODE_MAP`` as aliases for ``"off"``.

Tests
-----
- test_canonicalize_spec_mode_none_aliases_map_to_off (6 cases via
  parametrize): "none"/"None"/"NONE"/"  none  "/"disable"/"Disabled"
  all canonicalise to "off".
- test_build_ngram_mod_flags_legacy_chained_omits_draft_max +
  test_build_ngram_mod_flags_legacy_standalone_keeps_draft_max +
  test_build_ngram_mod_flags_new_flavor_always_emits_distinct_names:
  the chain_with_mtp kwarg suppresses only the legacy flavor's
  --draft-max, never the new-flavor knobs.
- test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max:
  end-to-end check that the assembled spec block has exactly one
  --draft-max carrying the MTP draft length.
- test_build_speculative_flags_forced_ngram_without_support_skips_spec
  + test_build_speculative_flags_forced_ngram_with_support_emits_spec:
  forced ngram refuses on a binary lacking ngram-mod support; still
  emits cleanly on a supporting binary.
- test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload:
  backend-side guard covers both clear-to-default and set-from-default.
- test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*}: route
  guard now compares against resolved spec mode and handles the None
  flip symmetrically.
- test_route_guard_ignores_n_max_when_resolved_spec_is_not_mtp:
  non-MTP resolved spec (e.g. ngram-mod) still ignores n_max.

147/147 spec/reload test suites green.
The unit-level tests in this file stub httpx so they don't pay
the real-import cost. But the route-guard tests added in this PR
import routes.inference -> huggingface_hub -> from httpx
import Response, and the stub had no Response, so on any
environment where real httpx IS installed (CI, staging) my test
file was overriding it with the incomplete stub and breaking the
import chain.

Switch to a try: import httpx pattern: when real httpx is on the
path use it; only fall back to the stub when httpx is genuinely
missing. Also add Response/Request to the fallback stub as a
defensive measure for future imports that might reach for them.

Local: 139/139 pass.
@danielhanchen danielhanchen force-pushed the followup-mtp-reload-guards branch from 28e70bb to a54396a Compare May 23, 2026 15:33

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a54396a675

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +529 to +533
"--draft-min",
str(n_min),
"--draft-max",
str(n_max),
]
if not chain_with_mtp:
# Only safe to set --draft-max here when MTP is NOT in the

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Avoid emitting legacy draft-min without matching draft-max

In the legacy chained mtp+ngram path, this still emits --draft-min 48 while --draft-max is suppressed here and later reused for MTP (spec_draft_n_max, typically 2/3). That produces an inverted legacy ngram range (--draft-min 48 --draft-max 2/3) on affected binaries, which can break or effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram requests. Please gate --draft-min together with --draft-max (or otherwise ensure draft-min <= draft-max) when chain_with_mtp=True on legacy flavor.

Useful? React with 👍 / 👎.

Codex flagged that the legacy chained mtp+ngram path emits --draft-min 48
while --draft-max is suppressed and later reused for MTP (spec_draft_n_max,
typically 2/3). That produces an inverted legacy ngram range
(--draft-min 48 --draft-max 2/3) on affected binaries, which can break or
effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram
requests.

Both --draft-min and --draft-max are generic flags on legacy llama-server
builds, so either one would race with MTP's own values. Gate the pair
together: when chain_with_mtp=True on the legacy flavor we drop both flags
and rely on MTP's emission for the chained range. Standalone ngram still
emits both, preserving a valid min<=max window.

Updated test_build_ngram_mod_flags_legacy_chained_omits_draft_max (now
omits_draft_min_and_max) and added a min<=max guard on the standalone
case. Full suite (test_llama_cpp_mtp_detection + test_llama_server_args)
passes locally: 229 / 229.
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed 068e129 addressing the codex P1 on the legacy chained mtp+ngram path.

Root cause: previously only --draft-max was suppressed inside the chain_with_mtp=True branch of _build_ngram_mod_flags, while --draft-min 48 was still emitted. Because both flags are generic on legacy llama-server (not yet renamed to --spec-ngram-mod-*), the surviving --draft-min 48 then collided with MTP's own --draft-min / --draft-max (2/3 from spec_draft_n_max) further down the command line. The resulting effective range was inverted (--draft-min 48 --draft-max 2/3) which disabled ngram-mod on the affected legacy binaries for both auto CPU MTP loads and forced mtp+ngram requests.

Fix: gate --draft-min and --draft-max together inside the if not chain_with_mtp: branch. Standalone ngram on legacy still emits both flags so the size-N window remains valid; chained ngram drops both so MTP's emission wins cleanly. --spec-ngram-size-n is a distinct flag and stays unconditional so ngram-mod still has its lookup length when chained.

Updated test_build_ngram_mod_flags_legacy_chained_omits_draft_max (renamed to ..._omits_draft_min_and_max) and added a min <= max guard on the standalone-keeps test so the inverted-range regression cannot re-creep silently. Full local pass: 229 / 229 across test_llama_cpp_mtp_detection.py + test_llama_server_args.py.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant