studio: emit one comma-chained --spec-type for CPU/Mac MTP path by danielhanchen · Pull Request #5575 · unslothai/unsloth

danielhanchen · 2026-05-18T17:11:55Z

Summary

The CPU/Mac MTP branch in LlamaCppBackend.load_model was emitting --spec-type twice in the same llama-server invocation (--spec-type {mtp_token} ... --spec-type ngram-mod ...). llama-server takes a single --spec-type, whose value may be comma-separated to chain implementations (e.g. ngram-mod,draft-mtp). The repeated form is not the documented chaining mechanism and silently drops one of the two specs depending on argv handling, so on Macs and CPU-only hosts the layered MTP plus ngram-mod path was not actually engaging both.
Collapse the pair to a single --spec-type ngram-mod,{mtp_token} while keeping the existing --spec-draft-n-max 3 and the three --spec-ngram-mod-n-* knobs unchanged.
Fix the now-incorrect docstring on _extra_args_set_spec_type that claimed llama-server accumulates repeated --spec-type. It does not; chaining is the comma-separated value form.
Update the matching pass-through fixture in test_llama_server_args.py so validate_extra_args round-trips the new emitted shape.

What changes in the emitted command

GPU branch is unchanged (single --spec-type {mtp_token} --spec-draft-n-max 6).

CPU/Mac branch before:

--spec-type {mtp_token} --spec-draft-n-max 3 \
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 6

CPU/Mac branch after:

--spec-type ngram-mod,{mtp_token} --spec-draft-n-max 3 \
--spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 6

mtp_token is still resolved from probe_server_capabilities so older prebuilts that advertise draft-mtp and newer ones that advertise mtp both work.

Tests

studio/backend/tests/test_llama_server_args.py: pass-through fixture updated. 72 passed.
studio/backend/tests/test_llama_cpp_mtp_detection.py: 65 passed, no changes needed.

Test plan

python -m pytest studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_llama_cpp_mtp_detection.py green (137 passed)
Manual verification on a Mac / CPU host that --spec-type ngram-mod,{mtp_token} engages both layers in llama-server logs

Note

Follow-up work (UI toggle for --spec-draft-n-max, sub-3B fallback to ngram-only, legacy llama-server probe, Studio chat-completions usage backfill, and the full speedup bench data) lives in PR #5582 which is stacked on top of this one.

llama-server takes a single --spec-type whose value may be comma-separated to chain implementations (e.g. ngram-mod,draft-mtp). The CPU/Mac MTP branch in LlamaCppBackend.load_model was passing --spec-type twice in the same invocation, which is not the documented chaining mechanism and silently drops one of the two specs depending on llama.cpp's argv handling. Collapse the pair to --spec-type ngram-mod,{mtp_token} and update the stale _extra_args_set_spec_type docstring that claimed llama-server accumulates repeated --spec-type. Update the matching pass-through fixture in test_llama_server_args.py.

gemini-code-assist

Code Review

This pull request modifies the llama-server command-line argument construction to use a single, comma-separated --spec-type flag instead of multiple repeated flags when combining ngram-mod and MTP tokens. It also updates the relevant docstrings and test cases to align with this change. I have no feedback to provide.

Two correctness fixes against the llama.cpp server README: 1. The CPU/Mac comma-chained branch was emitting --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48, which is nonsensical (min > max). Per the upstream default the value is 64. 2. The standalone ngram-mod branch was emitting --spec-ngram-size-n, --draft-min, --draft-max. llama.cpp removed those arg aliases for ngram-mod (they live only on the ngram-simple / map families now); the correct knobs are --spec-ngram-mod-n-match / n-min / n-max. Also refresh the inline comment block to point at the server README rather than the older docs/speculative.md draft- aliases.

danielhanchen · 2026-05-18T17:52:23Z

Pushed a follow-up commit (3d6a4f2) that fixes two upstream alignment bugs found while reading the llama.cpp server README:

The CPU/Mac comma-chained branch was emitting --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48. min > max is nonsensical and the upstream default for n-max is 64. Fixed.
The standalone ngram-mod branch was emitting --spec-ngram-size-n / --draft-min / --draft-max. llama.cpp removed those arg aliases for the ngram-mod path (they remain on ngram-simple / map families only). The correct knobs are --spec-ngram-mod-n-match / n-min / n-max. Fixed.

Also confirmed via the same README that --spec-type takes a single comma-separated value (--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache), so the comma-chain change in this PR is the upstream-blessed form.

tests/test_llama_server_args.py + tests/test_llama_cpp_mtp_detection.py: 137 passed.

Cherry-pick from upstream unsloth PR unslothai#5575: - Skip MTP auto-promote for models <2B params; sub-2B dense regresses vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x). - Backfill /v1/chat/completions usage from llama-server timings so the Studio UI t/s widget shows real decode throughput instead of falling back to wall-clock when usage.completion_tokens is zero. Mirrors auto-promote gate in _already_in_target_state. Applies usage backfill at all four metadata-yield sites in generate_chat_completion and generate_chat_completion_with_tools. Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.

Cherry-pick from upstream PR unslothai#5575: - Probe llama-server --help to tell post-rename --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename --spec-ngram-size-n / --draft-min / --draft-max via the "argument has been removed" description on stub entries. - Cached on (path, mtime), like the existing mtp_token probe. - Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to emit the right knob set, or degrade gracefully (warn + skip ngram chain) on binaries that have neither. Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).

The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch, which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom changes. That overwrote the mtp_engaged-aware budget path on staging and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and test_fit_mtp_engaged_unchanged_when_kv_off_gpu. Restore: - mtp_engaged: bool = False param on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage computation before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All 299 staging backend tests pass.

…restore mtp_engaged) Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream). Restored on top of the new file: - mtp_engaged: bool = False on _fit_context_to_vram - budget_frac = 0.85 if mtp_engaged else 0.90 - _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops - mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites All MTP detection + fit_context + fit_mtp tests pass (161 passed).

…othai#5575) * studio: emit one comma-chained --spec-type for CPU/Mac MTP path llama-server takes a single --spec-type whose value may be comma-separated to chain implementations (e.g. ngram-mod,draft-mtp). The CPU/Mac MTP branch in LlamaCppBackend.load_model was passing --spec-type twice in the same invocation, which is not the documented chaining mechanism and silently drops one of the two specs depending on llama.cpp's argv handling. Collapse the pair to --spec-type ngram-mod,{mtp_token} and update the stale _extra_args_set_spec_type docstring that claimed llama-server accumulates repeated --spec-type. Update the matching pass-through fixture in test_llama_server_args.py. * studio: align MTP ngram-mod knobs with llama.cpp upstream defaults Two correctness fixes against the llama.cpp server README: 1. The CPU/Mac comma-chained branch was emitting --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48, which is nonsensical (min > max). Per the upstream default the value is 64. 2. The standalone ngram-mod branch was emitting --spec-ngram-size-n, --draft-min, --draft-max. llama.cpp removed those arg aliases for ngram-mod (they live only on the ngram-simple / map families now); the correct knobs are --spec-ngram-mod-n-match / n-min / n-max. Also refresh the inline comment block to point at the server README rather than the older docs/speculative.md draft- aliases.

danielhanchen requested a review from rolandtannous as a code owner May 18, 2026 17:11

gemini-code-assist Bot reviewed May 18, 2026

View reviewed changes

danielhanchen mentioned this pull request May 18, 2026

studio: add --spec-draft-n-max toggle for MTP speculative decoding #5582

Merged

6 tasks

This was referenced May 18, 2026

ci: MTP staging workflows (linux/macOS/Windows) for #5575 + #5582 danielhanchen/unsloth-staging-2#132

Closed

studio: reserve VRAM headroom for the MTP draft cache in auto-fit #5585

Merged

danielhanchen merged commit 5ce4ab4 into main May 19, 2026
29 of 32 checks passed

danielhanchen deleted the studio-mtp-cpu-comma-spec-type branch May 19, 2026 10:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

studio: emit one comma-chained --spec-type for CPU/Mac MTP path#5575

studio: emit one comma-chained --spec-type for CPU/Mac MTP path#5575
danielhanchen merged 2 commits into
mainfrom
studio-mtp-cpu-comma-spec-type

danielhanchen commented May 18, 2026 •

edited

Loading

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

danielhanchen commented May 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

What changes in the emitted command

Tests

Test plan

Note

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

danielhanchen commented May 18, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

danielhanchen commented May 18, 2026 •

edited

Loading