ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582 by danielhanchen · Pull Request #132 · danielhanchen/unsloth-staging-2

danielhanchen · 2026-05-18T18:01:30Z

Staging PR to exercise PRs unslothai#5575 (comma-chained --spec-type for CPU/Mac) and unslothai#5582 (--spec-draft-n-max toggle) across the three OSes upstream has no GPU coverage for.

Workflows added

mtp-tests-ci.yml: pytest test_llama_cpp_mtp_detection.py + test_llama_server_args.py on ubuntu-latest, macos-14, windows-latest. Pure-Python, no llama-server, no GPU.
mtp-prebuilt-probe.yml: install Studio via install.sh / install.ps1 on each OS and assert the bundled llama-server advertises --spec-type with the draft-mtp / mtp alias plus --spec-draft-n-max.

Both pin concurrency.cancel-in-progress and only trigger on the studio-mtp-ci branch + matching paths.

What this catches

Platform-specific path / shell / encoding issues in the spec-decoding emit path before merging upstream.
A prebuilt llama-server regression that drops the draft-mtp / mtp alias on any one OS.

Reset

This branch also strips the prior staging-fork workflow bloat so the 5-concurrent-Windows-runner cap stays headroom-clear and unrelated path changes do not re-trigger.

Strip the existing workflow bloat (Mac/Windows smoke matrix, mlx, notebooks, version-compat, wheel-smoke etc.) and add two targeted workflows that exercise the MTP spec-decoding path the PRs unslothai#5575 and unslothai#5582 touch: 1. mtp-tests-ci.yml runs the two pytest files that pin Studio's MTP argv emission and spec_draft_n_max plumbing on ubuntu-latest, macos-14, and windows-latest. Pure-Python, no llama-server, no GPU. 2. mtp-prebuilt-probe.yml installs Studio via install.sh / install.ps1 on each OS and asserts the bundled llama-server advertises --spec-type with the draft-mtp / mtp alias and the --spec-draft-n-max knob. This is what probe_server_capabilities does at runtime; running it on CI catches a prebuilt regression before users see it. Both workflows pin concurrency.cancel-in-progress so each new push supersedes the prior run, and they only trigger on the studio-mtp-ci branch + the relevant paths. Also overlays the two pending PRs (unslothai#5575 + unslothai#5582) onto this staging branch so the workflows actually exercise the fixed emission path rather than the stale main.

gemini-code-assist

Code Review

This pull request introduces support for Multi-Token Prediction (MTP) speculative decoding, including auto-detection for MTP-capable models based on GGUF metadata and naming conventions. It adds a new 'Draft Tokens' configuration to the UI and improves the robustness of offline model loading by implementing a DNS probe and local Hugging Face cache resolution. Additionally, the backend now tees llama-server output to dedicated log files for better post-mortem analysis. Feedback identifies a high-severity issue regarding potential race conditions caused by the global modification of environment variables within the new offline-handling context manager.

gemini-code-assist · 2026-05-18T18:05:47Z

+def _hf_offline_if_dns_dead():
+    """Set HF_HUB_OFFLINE for the body of this block only when DNS to
+    huggingface.co fails. Restores the env on exit so a transient
+    resolver hiccup at the start of one load can't quarantine the whole
+    process. Respects an explicit user setting (no-op if already set)."""
+    if "HF_HUB_OFFLINE" in os.environ:
+        yield False
+        return
+    if not _probe_dns_dead():
+        yield False
+        return
+
+    transformers_was_set = "TRANSFORMERS_OFFLINE" in os.environ
+    os.environ["HF_HUB_OFFLINE"] = "1"
+    if not transformers_was_set:
+        os.environ["TRANSFORMERS_OFFLINE"] = "1"
+    logger.warning("huggingface.co unreachable; using local HF cache for this load.")
+    try:
+        yield True
+    finally:
+        os.environ.pop("HF_HUB_OFFLINE", None)
+        if not transformers_was_set:
+            os.environ.pop("TRANSFORMERS_OFFLINE", None)


The _hf_offline_if_dns_dead context manager modifies os.environ globally without synchronization. Since this context manager is used in routes/inference.py outside of the LlamaCppBackend lock, concurrent requests can interfere with each other's environment variables, leading to race conditions where HF_HUB_OFFLINE or TRANSFORMERS_OFFLINE are incorrectly set or unset. This should be protected by a thread-safe mechanism or handled without modifying global environment state.

…ity)

…hai#5585)

The previous overlay copied the vram-headroom branch's llama_cpp.py verbatim and lost the spec_draft_n_max plumbing that the new _already_in_target_state regression tests need. Combine both PR diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect TestTransformersIntrospection in the workflow filter; those tests need a live transformers config that is GPU-only on free runners.

Cherry-pick from upstream unsloth PR unslothai#5575: - Skip MTP auto-promote for models <2B params; sub-2B dense regresses vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x). - Backfill /v1/chat/completions usage from llama-server timings so the Studio UI t/s widget shows real decode throughput instead of falling back to wall-clock when usage.completion_tokens is zero. Mirrors auto-promote gate in _already_in_target_state. Applies usage backfill at all four metadata-yield sites in generate_chat_completion and generate_chat_completion_with_tools. Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.

Cherry-pick from upstream PR unslothai#5575: - Probe llama-server --help to tell post-rename --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename --spec-ngram-size-n / --draft-min / --draft-max via the "argument has been removed" description on stub entries. - Cached on (path, mtime), like the existing mtp_token probe. - Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to emit the right knob set, or degrade gracefully (warn + skip ngram chain) on binaries that have neither. Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).

The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch, which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom changes. That overwrote the mtp_engaged-aware budget path on staging and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and test_fit_mtp_engaged_unchanged_when_kv_off_gpu. Restore: - mtp_engaged: bool = False param on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage computation before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All 299 staging backend tests pass.

Earlier sub-2B gate disabled speculative decoding entirely for tiny dense MTP models because the MTP draft head's per-token cost exceeds the acceptance savings at that scale. The "fully off" fallback was conservative -- ngram-mod has near-zero idle cost on diverse content and consistently outperforms both off and draft-mtp at sub-3B. Clean-methodology bench (each of 9 distinct prompts run once after two unrelated warmup prompts so the ngram-mod hash pool is realistically populated but never holds the exact deterministic output we're about to measure): Q4_K_XL on B200: 0.8B OFF=451 draft-mtp n=2=263 (0.58x) ngram-only=498 (1.10x) 2B OFF=377 draft-mtp n=2=308 (0.82x) ngram-only=369 (1.00x) 4B OFF=240 draft-mtp n=2=260 (1.08x) -- 4B+ wins with MTP Q4_K_XL on x86 48 cores: 0.8B OFF= 80 chained n=2= 69 (0.86x) ngram-only= 95 (1.19x) 2B OFF= 62 chained n=2= 51 (0.83x) ngram-only= 63 (1.01x) 4B OFF= 31 chained n=2= 41 (1.33x) Change: - Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it). - When skipping the MTP head, fall back to --spec-type ngram-mod via the probe-driven _build_ngram_mod_flags helper. Works on both post-rename and pre-rename llama-server builds. - If the binary advertises neither ngram-mod flavor, fall back to spec-off (older binaries that don't support ngram-mod at all). - Mirror the same fallback in _already_in_target_state so a sub-3B reload-with-default does not bounce a ngram-mod backend. Tests updated: monkeypatch probe_server_capabilities so the gate behavior is deterministic regardless of which llama-server happens to be on the host. +1 new test for the "binary has no ngram-mod support" branch; renamed prior 2B/0.8B tests to reflect new semantics. This generalizes the size gate to be probe-driven instead of a hard "disable spec" branch.

…restore mtp_engaged) Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream). Restored on top of the new file: - mtp_engaged: bool = False on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All MTP detection + fit_context + fit_mtp tests pass (161 passed).

…P+Ngram / Off) Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs, ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off state, kept so the Switch's "disable" capability isn't lost. Backend - New module-level _canonicalize_spec_mode(value) maps any accepted input (canonical, legacy "default"/"draft-mtp"/"ngram-mod"/"ngram-simple", or comma-chained "ngram-mod,draft-mtp") onto one of auto/mtp/ngram/mtp+ngram/ off/ngram-simple/None. Lets external callers + old persisted UI state round-trip without breaking. - LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode property storing the canonical UI mode the user requested. Status responses round-trip this instead of the resolved internal flag, so the dropdown restores the picked value after reload / refresh (Auto on a 27B MTP GGUF resolves to draft-mtp internally but the dropdown stays on "Auto"). - The resolver block in load_model is extracted into a unit-testable _build_speculative_flags method. Forced MTP/MTP+Ngram on a sub-3B or non-MTP GGUF logs a warning and engages anyway (user override > the Auto-path sub-3B fallback). - _already_in_target_state and routes/inference._request_matches_loaded_settings now compare canonical-requested mode, dropping the old auto-promotion mirror. spec_draft_n_max still gates on the resolved spec (so Auto + changed n_max still bounces a reload). Frontend - chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft Tokens input only visible when speculativeType is "mtp" or "mtp+ngram". - chat-runtime-store.ts: initial value flips from "default" to "auto". - use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend canonicalizer so persisted "default"/"draft-mtp"/"ngram-mod"/chain values hydrate to the right dropdown option. - types/api.ts: docs the canonical wire vocabulary. Staging-only carry-over (PR unslothai#5585): mtp_engaged restored on _fit_context_to_vram, with the auto-fit gate widened to recognise forced MTP modes in addition to Auto's promotion path. Tests - 53 new assertions in test_llama_cpp_mtp_detection.py: full _canonicalize_spec_mode table, a 23-row resolver matrix across (requested mode) x (GPU/CPU) x (model size class), plus n_max override, user-extra-args precedence, requested-mode round-trip, and graceful degrade on an outdated llama-server without an MTP token. - 165 existing backend tests + 3 staging fit-mtp-engaged tests still green. 218 total in the MTP/server-args/reload-inheritance suite.

When the user switches from model A to a different model B, clear the runtime store's speculativeType + specDraftNMax (and their loaded* shadows). The new load request then carries null, the backend canonicalises that to "auto", and its platform-aware resolver runs fresh for the new model. Without this, a non-MTP model loaded with "Off" carried the Off choice into a subsequent MTP load, suppressing MTP auto-promotion (and the sub-3B ngram-mod fallback) until the user manually opened settings and flipped the dropdown back to Auto. The clean-sweep deep probe caught it as anomaly A-1. The reset only fires when currentCheckpoint !== modelId, so a same-model reapply or forceReload still honours the user's current spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) -> Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status returns "speculative_type":"auto", studio.log shows the Auto sub-3B fallback emitted --spec-type ngram-mod.

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

shimmyshimmer added 3 commits May 18, 2026 18:07

tests: skip MTP capability-probe tests on Windows (bash stub portabil…

71f4927

…ity)

ci: probe prebuilt llama-server via direct download (skip install.sh)

4b2d6f7

ci: drop macOS/Windows from prebuilt-probe (Linux-only release)

0ab226b

danielhanchen mentioned this pull request May 18, 2026

studio: add --spec-draft-n-max toggle for MTP speculative decoding unslothai/unsloth#5582

Merged

6 tasks

shimmyshimmer added 2 commits May 18, 2026 18:21

studio: reserve VRAM headroom for MTP draft cache in auto-fit (unslot…

806d332

…hai#5585)

ci: include kv-cache fit tests in MTP tests CI

4a9c2e9

danielhanchen mentioned this pull request May 18, 2026

studio: reserve VRAM headroom for the MTP draft cache in auto-fit unslothai/unsloth#5585

Merged

3 tasks

shimmyshimmer and others added 8 commits May 18, 2026 18:53

danielhanchen closed this May 19, 2026

danielhanchen deleted the studio-mtp-ci branch May 19, 2026 13:19

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132
danielhanchen wants to merge 14 commits into
mainfrom
studio-mtp-ci

danielhanchen commented May 18, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

gemini-code-assist Bot May 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

danielhanchen commented May 18, 2026

Workflows added

What this catches

Reset

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

gemini-code-assist Bot May 18, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants