studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696
studio: tighten MTP reload guards and asymmetric spec flags for #5582#5696danielhanchen wants to merge 4 commits into
Conversation
There was a problem hiding this comment.
Code Review
This pull request fixes several issues in the llama.cpp speculative decoding implementation. It prevents flag collisions in legacy builds by suppressing duplicate --draft-max flags when MTP and ngram-mod are chained. It also ensures that "none" and "disable" speculative mode aliases are correctly canonicalized to "off" and prevents server startup failures by verifying ngram-mod support before emitting flags. Furthermore, the reload logic was updated to accurately detect changes in spec_draft_n_max, including transitions to and from default values and within auto-promoted MTP sessions. A robust suite of regression tests has been added to cover these scenarios. I have no feedback to provide as there were no review comments.
Each PR ran the same staged source files before, which went stale when the upstream PR commits advanced. Refactor to one job per PR with an actions/checkout of that PR's head ref, so cross-OS validation always uses the latest commit: - PR unslothai#5603 sandbox -> studio-sandbox-hardening - PR unslothai#5620 parser parity -> studio-tools-multi-format-v2 - PR unslothai#5696 mtp reload guards -> followup-mtp-reload-guards (unslothai#5582 followup) - PR unslothai#5695 lockfile audit -> followup-lockfile-audit-regressions (unslothai#5604 followup) 4 jobs x 3 OSes = 12 runs; Windows = 4 (below the 5-concurrent cap). cancel-in-progress per (workflow, ref) keeps iteration cheap. All tests stay CPU-only and rely on the CUDA spoof harness in tests/conftest.py + tests/_zoo_aggressive_cuda_spoof.py, so no real GPU is required on any runner.
unslothai#5620 parser tests transitively import the safetensors loop, which needs the datasets package. unslothai#5696 route-guard tests import routes/inference.py, which transitively imports core/training (uses matplotlib). Add both, plus the auth deps (pyjwt/cryptography/ aiosqlite/python-multipart) needed for any test that touches the FastAPI route module so route-level imports resolve cleanly on all three OSes.
Five issues surfaced after #5582 merged. All addressed with matching pytest coverage (15 new tests, 147 total green). Bug A -- route guard compared against the requested UI mode rather than the backend's resolved spec mode. A user request setting ``spec_draft_n_max=2`` against a backend that was auto-promoted from ``auto`` -> ``draft-mtp`` saw ``requested_spec_mode == "auto"`` (not in ``("mtp", "mtp+ngram")``) and skipped the comparison, returning ``already_loaded`` with the stale value still active. Now mirrors the backend-side guard's check against ``speculative_type == "draft-mtp"``. Bug B -- both reload guards short-circuited the n_max comparison when the request value was ``None``, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to the platform default without swapping the model. Both guards now treat the ``(None vs explicit)`` flip as a difference: clear-to-default and set-from-default both bounce a reload, while ``(None == None)`` and ``(N == N)`` continue to match. Bug C -- chained MTP+ngram on a legacy llama-server (pre arg-rename) emitted ``--draft-max`` twice: once for MTP's draft length (e.g. 2), once for ngram-mod's size-N max (e.g. 64). llama-server's last-wins parsing clobbered the MTP value with 64, defeating the ``--spec-draft-n-max`` slider. ``_build_ngram_mod_flags`` now takes a ``chain_with_mtp`` kwarg that suppresses ``--draft-max`` on the legacy flavor when MTP will emit it; the post-rename flavor uses distinct ``--spec-ngram-mod-*`` names that cannot collide. Bug D -- a forced ``speculative_type="ngram"`` request emitted ``--spec-type ngram-mod`` even on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checked ``supports_ngram_mod`` before emitting; the forced path now mirrors that check and loads without spec (with a warning that matches the MTP-token-missing path). Bug E -- ``speculative_type="none"`` is llama.cpp's own explicit-disable spelling, and external API callers commonly use ``"disable"`` / ``"disabled"``. None of these were in the canonical spec mode set or the legacy alias map, so they fell through to ``"auto"`` and silently re-enabled MTP -- the opposite of the user's intent. Added all three to ``_LEGACY_SPEC_MODE_MAP`` as aliases for ``"off"``. Tests ----- - test_canonicalize_spec_mode_none_aliases_map_to_off (6 cases via parametrize): "none"/"None"/"NONE"/" none "/"disable"/"Disabled" all canonicalise to "off". - test_build_ngram_mod_flags_legacy_chained_omits_draft_max + test_build_ngram_mod_flags_legacy_standalone_keeps_draft_max + test_build_ngram_mod_flags_new_flavor_always_emits_distinct_names: the chain_with_mtp kwarg suppresses only the legacy flavor's --draft-max, never the new-flavor knobs. - test_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max: end-to-end check that the assembled spec block has exactly one --draft-max carrying the MTP draft length. - test_build_speculative_flags_forced_ngram_without_support_skips_spec + test_build_speculative_flags_forced_ngram_with_support_emits_spec: forced ngram refuses on a binary lacking ngram-mod support; still emits cleanly on a supporting binary. - test_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload: backend-side guard covers both clear-to-default and set-from-default. - test_route_guard_auto_promoted_mtp_{bounces,matches,clear_*}: route guard now compares against resolved spec mode and handles the None flip symmetrically. - test_route_guard_ignores_n_max_when_resolved_spec_is_not_mtp: non-MTP resolved spec (e.g. ngram-mod) still ignores n_max. 147/147 spec/reload test suites green.
for more information, see https://pre-commit.ci
The unit-level tests in this file stub httpx so they don't pay the real-import cost. But the route-guard tests added in this PR import routes.inference -> huggingface_hub -> from httpx import Response, and the stub had no Response, so on any environment where real httpx IS installed (CI, staging) my test file was overriding it with the incomplete stub and breaking the import chain. Switch to a try: import httpx pattern: when real httpx is on the path use it; only fall back to the stub when httpx is genuinely missing. Also add Response/Request to the fallback stub as a defensive measure for future imports that might reach for them. Local: 139/139 pass.
28e70bb to
a54396a
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a54396a675
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| "--draft-min", | ||
| str(n_min), | ||
| "--draft-max", | ||
| str(n_max), | ||
| ] | ||
| if not chain_with_mtp: | ||
| # Only safe to set --draft-max here when MTP is NOT in the |
There was a problem hiding this comment.
Avoid emitting legacy draft-min without matching draft-max
In the legacy chained mtp+ngram path, this still emits --draft-min 48 while --draft-max is suppressed here and later reused for MTP (spec_draft_n_max, typically 2/3). That produces an inverted legacy ngram range (--draft-min 48 --draft-max 2/3) on affected binaries, which can break or effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram requests. Please gate --draft-min together with --draft-max (or otherwise ensure draft-min <= draft-max) when chain_with_mtp=True on legacy flavor.
Useful? React with 👍 / 👎.
Codex flagged that the legacy chained mtp+ngram path emits --draft-min 48 while --draft-max is suppressed and later reused for MTP (spec_draft_n_max, typically 2/3). That produces an inverted legacy ngram range (--draft-min 48 --draft-max 2/3) on affected binaries, which can break or effectively disable ngram-mod for auto CPU MTP loads and forced mtp+ngram requests. Both --draft-min and --draft-max are generic flags on legacy llama-server builds, so either one would race with MTP's own values. Gate the pair together: when chain_with_mtp=True on the legacy flavor we drop both flags and rely on MTP's emission for the chained range. Standalone ngram still emits both, preserving a valid min<=max window. Updated test_build_ngram_mod_flags_legacy_chained_omits_draft_max (now omits_draft_min_and_max) and added a min<=max guard on the standalone case. Full suite (test_llama_cpp_mtp_detection + test_llama_server_args) passes locally: 229 / 229.
|
Pushed 068e129 addressing the codex P1 on the legacy chained mtp+ngram path. Root cause: previously only Fix: gate Updated |
Summary
Followup to #5582. Five issues surfaced after that PR landed; all addressed here with matching pytest coverage.
routes/inference.py:_request_matches_loaded_settingscomparedrequest.spec_draft_n_maxagainst the requested UI mode. An Auto-promoted backend hasrequested_spec_mode == "auto", so a user moving the Draft Tokens slider while running under auto-resolved MTP returnedalready_loadedand kept the stale value. Now mirrors the backend-side guard's check against the resolvedspeculative_type == "draft-mtp".None, treating it as a wildcard. A backend loaded with an explicit override of 8 could never be cleared back to default without a model swap. Both guards now treat the(None vs explicit)flip as a difference.--draft-maxtwice (MTP draft length, then ngram-mod size-N max). Last-wins clobbered MTP's value with 64._build_ngram_mod_flagsnow takes achain_with_mtpkwarg that suppresses the duplicate emission on the legacy flavor; the post-rename flavor uses distinct--spec-ngram-mod-*names and is unaffected.speculative_type="ngram"emitted--spec-type ngram-modeven on binaries that did not advertise ngram-mod support, causing llama-server to refuse to start. The auto path already checkedsupports_ngram_modbefore emitting; the forced path now mirrors that check.speculative_type="none"(the spelling llama.cpp itself uses for disable, plus"disable"/"disabled") fell through the canonicaliser to"auto"and silently re-enabled MTP. Added all three to_LEGACY_SPEC_MODE_MAPas aliases for"off".Test plan
python -m pytest studio/backend/tests/test_llama_cpp_mtp_detection.py studio/backend/tests/test_gguf_reload_inheritance.py -q— 147/147 pass (15 new tests added):test_canonicalize_spec_mode_none_aliases_map_to_off(6 parametrize cases) — Bug Etest_build_ngram_mod_flags_legacy_chained_omits_draft_max+ standalone + new-flavor — Bug C unittest_build_speculative_flags_chained_mtp_ngram_legacy_no_duplicate_draft_max— Bug C end-to-endtest_build_speculative_flags_forced_ngram_without_support_skips_spec+ with-support — Bug Dtest_already_in_target_state_{clear,set}_explicit_n_max_*_forces_reload— Bug B backend sidetest_route_guard_auto_promoted_mtp_{bounces,matches,clear_*,ignores_*}— Bug A + Bug B route sideNo new behaviour for users on default builds (post-rename flavor, no explicit slider use, no forced spec mode). Only the buggy edge cases change.