Skip to content

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132

Closed
danielhanchen wants to merge 14 commits into
mainfrom
studio-mtp-ci
Closed

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132
danielhanchen wants to merge 14 commits into
mainfrom
studio-mtp-ci

Conversation

@danielhanchen

Copy link
Copy Markdown
Owner

Staging PR to exercise PRs unslothai#5575 (comma-chained --spec-type for CPU/Mac) and unslothai#5582 (--spec-draft-n-max toggle) across the three OSes upstream has no GPU coverage for.

Workflows added

  • mtp-tests-ci.yml: pytest test_llama_cpp_mtp_detection.py + test_llama_server_args.py on ubuntu-latest, macos-14, windows-latest. Pure-Python, no llama-server, no GPU.
  • mtp-prebuilt-probe.yml: install Studio via install.sh / install.ps1 on each OS and assert the bundled llama-server advertises --spec-type with the draft-mtp / mtp alias plus --spec-draft-n-max.

Both pin concurrency.cancel-in-progress and only trigger on the studio-mtp-ci branch + matching paths.

What this catches

  • Platform-specific path / shell / encoding issues in the spec-decoding emit path before merging upstream.
  • A prebuilt llama-server regression that drops the draft-mtp / mtp alias on any one OS.

Reset

This branch also strips the prior staging-fork workflow bloat so the 5-concurrent-Windows-runner cap stays headroom-clear and unrelated path changes do not re-trigger.

Strip the existing workflow bloat (Mac/Windows smoke matrix, mlx,
notebooks, version-compat, wheel-smoke etc.) and add two targeted
workflows that exercise the MTP spec-decoding path the PRs
unslothai#5575 and unslothai#5582 touch:

1. mtp-tests-ci.yml runs the two pytest files that pin Studio's MTP
   argv emission and spec_draft_n_max plumbing on ubuntu-latest,
   macos-14, and windows-latest. Pure-Python, no llama-server, no
   GPU.

2. mtp-prebuilt-probe.yml installs Studio via install.sh /
   install.ps1 on each OS and asserts the bundled llama-server
   advertises --spec-type with the draft-mtp / mtp alias and the
   --spec-draft-n-max knob. This is what probe_server_capabilities
   does at runtime; running it on CI catches a prebuilt regression
   before users see it.

Both workflows pin concurrency.cancel-in-progress so each new
push supersedes the prior run, and they only trigger on the
studio-mtp-ci branch + the relevant paths.

Also overlays the two pending PRs (unslothai#5575 + unslothai#5582) onto this
staging branch so the workflows actually exercise the fixed
emission path rather than the stale main.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request introduces support for Multi-Token Prediction (MTP) speculative decoding, including auto-detection for MTP-capable models based on GGUF metadata and naming conventions. It adds a new 'Draft Tokens' configuration to the UI and improves the robustness of offline model loading by implementing a DNS probe and local Hugging Face cache resolution. Additionally, the backend now tees llama-server output to dedicated log files for better post-mortem analysis. Feedback identifies a high-severity issue regarding potential race conditions caused by the global modification of environment variables within the new offline-handling context manager.

Comment on lines +124 to +146
def _hf_offline_if_dns_dead():
"""Set HF_HUB_OFFLINE for the body of this block only when DNS to
huggingface.co fails. Restores the env on exit so a transient
resolver hiccup at the start of one load can't quarantine the whole
process. Respects an explicit user setting (no-op if already set)."""
if "HF_HUB_OFFLINE" in os.environ:
yield False
return
if not _probe_dns_dead():
yield False
return

transformers_was_set = "TRANSFORMERS_OFFLINE" in os.environ
os.environ["HF_HUB_OFFLINE"] = "1"
if not transformers_was_set:
os.environ["TRANSFORMERS_OFFLINE"] = "1"
logger.warning("huggingface.co unreachable; using local HF cache for this load.")
try:
yield True
finally:
os.environ.pop("HF_HUB_OFFLINE", None)
if not transformers_was_set:
os.environ.pop("TRANSFORMERS_OFFLINE", None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

high

The _hf_offline_if_dns_dead context manager modifies os.environ globally without synchronization. Since this context manager is used in routes/inference.py outside of the LlamaCppBackend lock, concurrent requests can interfere with each other's environment variables, leading to race conditions where HF_HUB_OFFLINE or TRANSFORMERS_OFFLINE are incorrectly set or unset. This should be protected by a thread-safe mechanism or handled without modifying global environment state.

shimmyshimmer and others added 8 commits May 18, 2026 18:53
The previous overlay copied the vram-headroom branch's llama_cpp.py
verbatim and lost the spec_draft_n_max plumbing that the new
_already_in_target_state regression tests need. Combine both PR
diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect
TestTransformersIntrospection in the workflow filter; those tests
need a live transformers config that is GPU-only on free runners.
Cherry-pick from upstream unsloth PR unslothai#5575:
- Skip MTP auto-promote for models <2B params; sub-2B dense regresses
  vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x).
- Backfill /v1/chat/completions usage from llama-server timings so the
  Studio UI t/s widget shows real decode throughput instead of falling
  back to wall-clock when usage.completion_tokens is zero.

Mirrors auto-promote gate in _already_in_target_state. Applies usage
backfill at all four metadata-yield sites in generate_chat_completion
and generate_chat_completion_with_tools.

Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.
Cherry-pick from upstream PR unslothai#5575:
- Probe llama-server --help to tell post-rename
  --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename
  --spec-ngram-size-n / --draft-min / --draft-max via the "argument
  has been removed" description on stub entries.
- Cached on (path, mtime), like the existing mtp_token probe.
- Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to
  emit the right knob set, or degrade gracefully (warn + skip ngram
  chain) on binaries that have neither.

Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod
fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch,
which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom
changes. That overwrote the mtp_engaged-aware budget path on staging
and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and
test_fit_mtp_engaged_unchanged_when_kv_off_gpu.

Restore:
- mtp_engaged: bool = False param on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage computation before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All 299 staging backend tests pass.
Earlier sub-2B gate disabled speculative decoding entirely for tiny
dense MTP models because the MTP draft head's per-token cost exceeds
the acceptance savings at that scale. The "fully off" fallback was
conservative -- ngram-mod has near-zero idle cost on diverse content
and consistently outperforms both off and draft-mtp at sub-3B.

Clean-methodology bench (each of 9 distinct prompts run once after
two unrelated warmup prompts so the ngram-mod hash pool is
realistically populated but never holds the exact deterministic
output we're about to measure):

  Q4_K_XL on B200:
    0.8B  OFF=451  draft-mtp n=2=263 (0.58x)  ngram-only=498 (1.10x)
    2B    OFF=377  draft-mtp n=2=308 (0.82x)  ngram-only=369 (1.00x)
    4B    OFF=240  draft-mtp n=2=260 (1.08x)  -- 4B+ wins with MTP

  Q4_K_XL on x86 48 cores:
    0.8B  OFF= 80  chained n=2= 69 (0.86x)  ngram-only= 95 (1.19x)
    2B    OFF= 62  chained n=2= 51 (0.83x)  ngram-only= 63 (1.01x)
    4B    OFF= 31  chained n=2= 41 (1.33x)

Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via
  the probe-driven _build_ngram_mod_flags helper. Works on both
  post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to
  spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B
  reload-with-default does not bounce a ngram-mod backend.

Tests updated: monkeypatch probe_server_capabilities so the gate
behavior is deterministic regardless of which llama-server happens
to be on the host. +1 new test for the "binary has no ngram-mod
support" branch; renamed prior 2B/0.8B tests to reflect new semantics.

This generalizes the size gate to be probe-driven instead of a hard
"disable spec" branch.
…restore mtp_engaged)

Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping
the two already-merged commits and renumbering the remaining nine. The
staging fork was sitting on the pre-rebase llama_cpp.py + tests; this
commit replays the rebased file content while preserving PR unslothai#5585's
mtp_engaged auto-fit headroom (staging-only patch absent upstream).

Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did
  not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All MTP detection + fit_context + fit_mtp tests pass (161 passed).
…P+Ngram / Off)

Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option
Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs,
ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes
force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no
ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the
ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off
state, kept so the Switch's "disable" capability isn't lost.

Backend
- New module-level _canonicalize_spec_mode(value) maps any accepted input
  (canonical, legacy "default"/"draft-mtp"/"ngram-mod"/"ngram-simple", or
  comma-chained "ngram-mod,draft-mtp") onto one of auto/mtp/ngram/mtp+ngram/
  off/ngram-simple/None. Lets external callers + old persisted UI state
  round-trip without breaking.
- LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode
  property storing the canonical UI mode the user requested. Status responses
  round-trip this instead of the resolved internal flag, so the dropdown
  restores the picked value after reload / refresh (Auto on a 27B MTP GGUF
  resolves to draft-mtp internally but the dropdown stays on "Auto").
- The resolver block in load_model is extracted into a unit-testable
  _build_speculative_flags method. Forced MTP/MTP+Ngram on a sub-3B or
  non-MTP GGUF logs a warning and engages anyway (user override > the
  Auto-path sub-3B fallback).
- _already_in_target_state and routes/inference._request_matches_loaded_settings
  now compare canonical-requested mode, dropping the old auto-promotion
  mirror. spec_draft_n_max still gates on the resolved spec (so Auto +
  changed n_max still bounces a reload).

Frontend
- chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV
  Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft
  Tokens input only visible when speculativeType is "mtp" or "mtp+ngram".
- chat-runtime-store.ts: initial value flips from "default" to "auto".
- use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend
  canonicalizer so persisted "default"/"draft-mtp"/"ngram-mod"/chain values
  hydrate to the right dropdown option.
- types/api.ts: docs the canonical wire vocabulary.

Staging-only carry-over (PR unslothai#5585): mtp_engaged restored on
_fit_context_to_vram, with the auto-fit gate widened to recognise forced
MTP modes in addition to Auto's promotion path.

Tests
- 53 new assertions in test_llama_cpp_mtp_detection.py: full
  _canonicalize_spec_mode table, a 23-row resolver matrix across
  (requested mode) x (GPU/CPU) x (model size class), plus n_max override,
  user-extra-args precedence, requested-mode round-trip, and graceful
  degrade on an outdated llama-server without an MTP token.
- 165 existing backend tests + 3 staging fit-mtp-engaged tests still
  green. 218 total in the MTP/server-args/reload-inheritance suite.
When the user switches from model A to a different model B, clear the
runtime store's speculativeType + specDraftNMax (and their loaded*
shadows). The new load request then carries null, the backend
canonicalises that to "auto", and its platform-aware resolver runs
fresh for the new model.

Without this, a non-MTP model loaded with "Off" carried the Off choice
into a subsequent MTP load, suppressing MTP auto-promotion (and the
sub-3B ngram-mod fallback) until the user manually opened settings and
flipped the dropdown back to Auto. The clean-sweep deep probe caught
it as anomaly A-1.

The reset only fires when currentCheckpoint !== modelId, so a same-model
reapply or forceReload still honours the user's current spec choice.
End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) -> Qwen3.5-0.8B-MTP
confirms: dropdown shows Auto, /api/inference/status returns
"speculative_type":"auto", studio.log shows the Auto sub-3B fallback
emitted --spec-type ngram-mod.
@danielhanchen danielhanchen deleted the studio-mtp-ci branch May 19, 2026 13:19
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants