ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132
ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582#132danielhanchen wants to merge 14 commits into
Conversation
Strip the existing workflow bloat (Mac/Windows smoke matrix, mlx, notebooks, version-compat, wheel-smoke etc.) and add two targeted workflows that exercise the MTP spec-decoding path the PRs unslothai#5575 and unslothai#5582 touch: 1. mtp-tests-ci.yml runs the two pytest files that pin Studio's MTP argv emission and spec_draft_n_max plumbing on ubuntu-latest, macos-14, and windows-latest. Pure-Python, no llama-server, no GPU. 2. mtp-prebuilt-probe.yml installs Studio via install.sh / install.ps1 on each OS and asserts the bundled llama-server advertises --spec-type with the draft-mtp / mtp alias and the --spec-draft-n-max knob. This is what probe_server_capabilities does at runtime; running it on CI catches a prebuilt regression before users see it. Both workflows pin concurrency.cancel-in-progress so each new push supersedes the prior run, and they only trigger on the studio-mtp-ci branch + the relevant paths. Also overlays the two pending PRs (unslothai#5575 + unslothai#5582) onto this staging branch so the workflows actually exercise the fixed emission path rather than the stale main.
There was a problem hiding this comment.
Code Review
This pull request introduces support for Multi-Token Prediction (MTP) speculative decoding, including auto-detection for MTP-capable models based on GGUF metadata and naming conventions. It adds a new 'Draft Tokens' configuration to the UI and improves the robustness of offline model loading by implementing a DNS probe and local Hugging Face cache resolution. Additionally, the backend now tees llama-server output to dedicated log files for better post-mortem analysis. Feedback identifies a high-severity issue regarding potential race conditions caused by the global modification of environment variables within the new offline-handling context manager.
| def _hf_offline_if_dns_dead(): | ||
| """Set HF_HUB_OFFLINE for the body of this block only when DNS to | ||
| huggingface.co fails. Restores the env on exit so a transient | ||
| resolver hiccup at the start of one load can't quarantine the whole | ||
| process. Respects an explicit user setting (no-op if already set).""" | ||
| if "HF_HUB_OFFLINE" in os.environ: | ||
| yield False | ||
| return | ||
| if not _probe_dns_dead(): | ||
| yield False | ||
| return | ||
|
|
||
| transformers_was_set = "TRANSFORMERS_OFFLINE" in os.environ | ||
| os.environ["HF_HUB_OFFLINE"] = "1" | ||
| if not transformers_was_set: | ||
| os.environ["TRANSFORMERS_OFFLINE"] = "1" | ||
| logger.warning("huggingface.co unreachable; using local HF cache for this load.") | ||
| try: | ||
| yield True | ||
| finally: | ||
| os.environ.pop("HF_HUB_OFFLINE", None) | ||
| if not transformers_was_set: | ||
| os.environ.pop("TRANSFORMERS_OFFLINE", None) |
There was a problem hiding this comment.
The _hf_offline_if_dns_dead context manager modifies os.environ globally without synchronization. Since this context manager is used in routes/inference.py outside of the LlamaCppBackend lock, concurrent requests can interfere with each other's environment variables, leading to race conditions where HF_HUB_OFFLINE or TRANSFORMERS_OFFLINE are incorrectly set or unset. This should be protected by a thread-safe mechanism or handled without modifying global environment state.
The previous overlay copied the vram-headroom branch's llama_cpp.py verbatim and lost the spec_draft_n_max plumbing that the new _already_in_target_state regression tests need. Combine both PR diffs (unslothai#5582 + unslothai#5585) and copy the merged file in. Also deselect TestTransformersIntrospection in the workflow filter; those tests need a live transformers config that is GPU-only on free runners.
Cherry-pick from upstream unsloth PR unslothai#5575: - Skip MTP auto-promote for models <2B params; sub-2B dense regresses vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x). - Backfill /v1/chat/completions usage from llama-server timings so the Studio UI t/s widget shows real decode throughput instead of falling back to wall-clock when usage.completion_tokens is zero. Mirrors auto-promote gate in _already_in_target_state. Applies usage backfill at all four metadata-yield sites in generate_chat_completion and generate_chat_completion_with_tools. Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.
Cherry-pick from upstream PR unslothai#5575: - Probe llama-server --help to tell post-rename --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename --spec-ngram-size-n / --draft-min / --draft-max via the "argument has been removed" description on stub entries. - Cached on (path, mtime), like the existing mtp_token probe. - Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to emit the right knob set, or degrade gracefully (warn + skip ngram chain) on binaries that have neither. Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch, which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom changes. That overwrote the mtp_engaged-aware budget path on staging and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and test_fit_mtp_engaged_unchanged_when_kv_off_gpu. Restore: - mtp_engaged: bool = False param on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage computation before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All 299 staging backend tests pass.
Earlier sub-2B gate disabled speculative decoding entirely for tiny
dense MTP models because the MTP draft head's per-token cost exceeds
the acceptance savings at that scale. The "fully off" fallback was
conservative -- ngram-mod has near-zero idle cost on diverse content
and consistently outperforms both off and draft-mtp at sub-3B.
Clean-methodology bench (each of 9 distinct prompts run once after
two unrelated warmup prompts so the ngram-mod hash pool is
realistically populated but never holds the exact deterministic
output we're about to measure):
Q4_K_XL on B200:
0.8B OFF=451 draft-mtp n=2=263 (0.58x) ngram-only=498 (1.10x)
2B OFF=377 draft-mtp n=2=308 (0.82x) ngram-only=369 (1.00x)
4B OFF=240 draft-mtp n=2=260 (1.08x) -- 4B+ wins with MTP
Q4_K_XL on x86 48 cores:
0.8B OFF= 80 chained n=2= 69 (0.86x) ngram-only= 95 (1.19x)
2B OFF= 62 chained n=2= 51 (0.83x) ngram-only= 63 (1.01x)
4B OFF= 31 chained n=2= 41 (1.33x)
Change:
- Raise the MTP-skip threshold from 2.0B to 3.0B (2B falls below it).
- When skipping the MTP head, fall back to --spec-type ngram-mod via
the probe-driven _build_ngram_mod_flags helper. Works on both
post-rename and pre-rename llama-server builds.
- If the binary advertises neither ngram-mod flavor, fall back to
spec-off (older binaries that don't support ngram-mod at all).
- Mirror the same fallback in _already_in_target_state so a sub-3B
reload-with-default does not bounce a ngram-mod backend.
Tests updated: monkeypatch probe_server_capabilities so the gate
behavior is deterministic regardless of which llama-server happens
to be on the host. +1 new test for the "binary has no ngram-mod
support" branch; renamed prior 2B/0.8B tests to reflect new semantics.
This generalizes the size gate to be probe-driven instead of a hard
"disable spec" branch.
…restore mtp_engaged) Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream). Restored on top of the new file: - mtp_engaged: bool = False on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All MTP detection + fit_context + fit_mtp tests pass (161 passed).
…P+Ngram / Off) Replace the Chat Settings Speculative Decoding on/off Switch with a 5-option Select. Auto preserves today's platform-aware resolver (MTP on MTP GGUFs, ngram-mod fallback for sub-3B, --spec-default for non-MTP). The other 3 modes force the user's choice on BOTH GPU and CPU: MTP emits draft-mtp only (no ngram chain on CPU), Ngram emits ngram-mod only, MTP+Ngram emits the ngram-mod,draft-mtp chain on both platforms. Off is the existing fully-off state, kept so the Switch's "disable" capability isn't lost. Backend - New module-level _canonicalize_spec_mode(value) maps any accepted input (canonical, legacy "default"/"draft-mtp"/"ngram-mod"/"ngram-simple", or comma-chained "ngram-mod,draft-mtp") onto one of auto/mtp/ngram/mtp+ngram/ off/ngram-simple/None. Lets external callers + old persisted UI state round-trip without breaking. - LlamaCppBackend grows a _requested_spec_mode field + requested_spec_mode property storing the canonical UI mode the user requested. Status responses round-trip this instead of the resolved internal flag, so the dropdown restores the picked value after reload / refresh (Auto on a 27B MTP GGUF resolves to draft-mtp internally but the dropdown stays on "Auto"). - The resolver block in load_model is extracted into a unit-testable _build_speculative_flags method. Forced MTP/MTP+Ngram on a sub-3B or non-MTP GGUF logs a warning and engages anyway (user override > the Auto-path sub-3B fallback). - _already_in_target_state and routes/inference._request_matches_loaded_settings now compare canonical-requested mode, dropping the old auto-promotion mirror. spec_draft_n_max still gates on the resolved spec (so Auto + changed n_max still bounces a reload). Frontend - chat-settings-sheet.tsx: Switch swapped for Select modeled on the KV Cache Dtype Select. Items: Auto / MTP / Ngram / MTP+Ngram / Off. Draft Tokens input only visible when speculativeType is "mtp" or "mtp+ngram". - chat-runtime-store.ts: initial value flips from "default" to "auto". - use-chat-model-runtime.ts normalizeSpeculativeType mirrors the backend canonicalizer so persisted "default"/"draft-mtp"/"ngram-mod"/chain values hydrate to the right dropdown option. - types/api.ts: docs the canonical wire vocabulary. Staging-only carry-over (PR unslothai#5585): mtp_engaged restored on _fit_context_to_vram, with the auto-fit gate widened to recognise forced MTP modes in addition to Auto's promotion path. Tests - 53 new assertions in test_llama_cpp_mtp_detection.py: full _canonicalize_spec_mode table, a 23-row resolver matrix across (requested mode) x (GPU/CPU) x (model size class), plus n_max override, user-extra-args precedence, requested-mode round-trip, and graceful degrade on an outdated llama-server without an MTP token. - 165 existing backend tests + 3 staging fit-mtp-engaged tests still green. 218 total in the MTP/server-args/reload-inheritance suite.
When the user switches from model A to a different model B, clear the runtime store's speculativeType + specDraftNMax (and their loaded* shadows). The new load request then carries null, the backend canonicalises that to "auto", and its platform-aware resolver runs fresh for the new model. Without this, a non-MTP model loaded with "Off" carried the Off choice into a subsequent MTP load, suppressing MTP auto-promotion (and the sub-3B ngram-mod fallback) until the user manually opened settings and flipped the dropdown back to Auto. The clean-sweep deep probe caught it as anomaly A-1. The reset only fires when currentCheckpoint !== modelId, so a same-model reapply or forceReload still honours the user's current spec choice. End-to-end probe on Qwen3.5-4B-GGUF (non-MTP, Off) -> Qwen3.5-0.8B-MTP confirms: dropdown shows Auto, /api/inference/status returns "speculative_type":"auto", studio.log shows the Auto sub-3B fallback emitted --spec-type ngram-mod.
Staging PR to exercise PRs unslothai#5575 (comma-chained --spec-type for CPU/Mac) and unslothai#5582 (--spec-draft-n-max toggle) across the three OSes upstream has no GPU coverage for.
Workflows added
Both pin concurrency.cancel-in-progress and only trigger on the studio-mtp-ci branch + matching paths.
What this catches
Reset
This branch also strips the prior staging-fork workflow bloat so the 5-concurrent-Windows-runner cap stays headroom-clear and unrelated path changes do not re-trigger.