Skip to content

studio: emit one comma-chained --spec-type for CPU/Mac MTP path#5575

Merged
danielhanchen merged 2 commits into
mainfrom
studio-mtp-cpu-comma-spec-type
May 19, 2026
Merged

studio: emit one comma-chained --spec-type for CPU/Mac MTP path#5575
danielhanchen merged 2 commits into
mainfrom
studio-mtp-cpu-comma-spec-type

Conversation

@danielhanchen

@danielhanchen danielhanchen commented May 18, 2026

Copy link
Copy Markdown
Member

Summary

  • The CPU/Mac MTP branch in LlamaCppBackend.load_model was emitting --spec-type twice in the same llama-server invocation (--spec-type {mtp_token} ... --spec-type ngram-mod ...). llama-server takes a single --spec-type, whose value may be comma-separated to chain implementations (e.g. ngram-mod,draft-mtp). The repeated form is not the documented chaining mechanism and silently drops one of the two specs depending on argv handling, so on Macs and CPU-only hosts the layered MTP plus ngram-mod path was not actually engaging both.
  • Collapse the pair to a single --spec-type ngram-mod,{mtp_token} while keeping the existing --spec-draft-n-max 3 and the three --spec-ngram-mod-n-* knobs unchanged.
  • Fix the now-incorrect docstring on _extra_args_set_spec_type that claimed llama-server accumulates repeated --spec-type. It does not; chaining is the comma-separated value form.
  • Update the matching pass-through fixture in test_llama_server_args.py so validate_extra_args round-trips the new emitted shape.

What changes in the emitted command

GPU branch is unchanged (single --spec-type {mtp_token} --spec-draft-n-max 6).

CPU/Mac branch before:

--spec-type {mtp_token} --spec-draft-n-max 3 \
--spec-type ngram-mod \
--spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 6

CPU/Mac branch after:

--spec-type ngram-mod,{mtp_token} --spec-draft-n-max 3 \
--spec-ngram-mod-n-match 24 --spec-ngram-mod-n-min 48 --spec-ngram-mod-n-max 6

mtp_token is still resolved from probe_server_capabilities so older prebuilts that advertise draft-mtp and newer ones that advertise mtp both work.

Tests

  • studio/backend/tests/test_llama_server_args.py: pass-through fixture updated. 72 passed.
  • studio/backend/tests/test_llama_cpp_mtp_detection.py: 65 passed, no changes needed.

Test plan

  • python -m pytest studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_llama_cpp_mtp_detection.py green (137 passed)
  • Manual verification on a Mac / CPU host that --spec-type ngram-mod,{mtp_token} engages both layers in llama-server logs

Note

Follow-up work (UI toggle for --spec-draft-n-max, sub-3B fallback to ngram-only, legacy llama-server probe, Studio chat-completions usage backfill, and the full speedup bench data) lives in PR #5582 which is stacked on top of this one.

llama-server takes a single --spec-type whose value may be
comma-separated to chain implementations (e.g. ngram-mod,draft-mtp).
The CPU/Mac MTP branch in LlamaCppBackend.load_model was passing
--spec-type twice in the same invocation, which is not the documented
chaining mechanism and silently drops one of the two specs depending
on llama.cpp's argv handling.

Collapse the pair to --spec-type ngram-mod,{mtp_token} and update the
stale _extra_args_set_spec_type docstring that claimed llama-server
accumulates repeated --spec-type. Update the matching pass-through
fixture in test_llama_server_args.py.

@gemini-code-assist gemini-code-assist Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request modifies the llama-server command-line argument construction to use a single, comma-separated --spec-type flag instead of multiple repeated flags when combining ngram-mod and MTP tokens. It also updates the relevant docstrings and test cases to align with this change. I have no feedback to provide.

Two correctness fixes against the llama.cpp server README:

1. The CPU/Mac comma-chained branch was emitting
   --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48, which is
   nonsensical (min > max). Per the upstream default the value is 64.

2. The standalone ngram-mod branch was emitting --spec-ngram-size-n,
   --draft-min, --draft-max. llama.cpp removed those arg aliases for
   ngram-mod (they live only on the ngram-simple / map families now);
   the correct knobs are --spec-ngram-mod-n-match / n-min / n-max.

Also refresh the inline comment block to point at the server README
rather than the older docs/speculative.md draft- aliases.
@danielhanchen

Copy link
Copy Markdown
Member Author

Pushed a follow-up commit (3d6a4f2) that fixes two upstream alignment bugs found while reading the llama.cpp server README:

  1. The CPU/Mac comma-chained branch was emitting --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48. min > max is nonsensical and the upstream default for n-max is 64. Fixed.
  2. The standalone ngram-mod branch was emitting --spec-ngram-size-n / --draft-min / --draft-max. llama.cpp removed those arg aliases for the ngram-mod path (they remain on ngram-simple / map families only). The correct knobs are --spec-ngram-mod-n-match / n-min / n-max. Fixed.

Also confirmed via the same README that --spec-type takes a single comma-separated value (--spec-type none,draft-simple,draft-eagle3,draft-mtp,ngram-simple,ngram-map-k,ngram-map-k4v,ngram-mod,ngram-cache), so the comma-chain change in this PR is the upstream-blessed form.

tests/test_llama_server_args.py + tests/test_llama_cpp_mtp_detection.py: 137 passed.

danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
Cherry-pick from upstream unsloth PR unslothai#5575:
- Skip MTP auto-promote for models <2B params; sub-2B dense regresses
  vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x).
- Backfill /v1/chat/completions usage from llama-server timings so the
  Studio UI t/s widget shows real decode throughput instead of falling
  back to wall-clock when usage.completion_tokens is zero.

Mirrors auto-promote gate in _already_in_target_state. Applies usage
backfill at all four metadata-yield sites in generate_chat_completion
and generate_chat_completion_with_tools.

Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
Cherry-pick from upstream PR unslothai#5575:
- Probe llama-server --help to tell post-rename
  --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename
  --spec-ngram-size-n / --draft-min / --draft-max via the "argument
  has been removed" description on stub entries.
- Cached on (path, mtime), like the existing mtp_token probe.
- Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to
  emit the right knob set, or degrade gracefully (warn + skip ngram
  chain) on binaries that have neither.

Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod
fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch,
which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom
changes. That overwrote the mtp_engaged-aware budget path on staging
and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and
test_fit_mtp_engaged_unchanged_when_kv_off_gpu.

Restore:
- mtp_engaged: bool = False param on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage computation before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All 299 staging backend tests pass.
@danielhanchen danielhanchen merged commit 5ce4ab4 into main May 19, 2026
29 of 32 checks passed
@danielhanchen danielhanchen deleted the studio-mtp-cpu-comma-spec-type branch May 19, 2026 10:16
danielhanchen pushed a commit to danielhanchen/unsloth-staging-2 that referenced this pull request May 19, 2026
…restore mtp_engaged)

Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping
the two already-merged commits and renumbering the remaining nine. The
staging fork was sitting on the pre-rebase llama_cpp.py + tests; this
commit replays the rebased file content while preserving PR unslothai#5585's
mtp_engaged auto-fit headroom (staging-only patch absent upstream).

Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did
  not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites

All MTP detection + fit_context + fit_mtp tests pass (161 passed).
rsd-darshan pushed a commit to rsd-darshan/unsloth that referenced this pull request Jun 3, 2026
…othai#5575)

* studio: emit one comma-chained --spec-type for CPU/Mac MTP path

llama-server takes a single --spec-type whose value may be
comma-separated to chain implementations (e.g. ngram-mod,draft-mtp).
The CPU/Mac MTP branch in LlamaCppBackend.load_model was passing
--spec-type twice in the same invocation, which is not the documented
chaining mechanism and silently drops one of the two specs depending
on llama.cpp's argv handling.

Collapse the pair to --spec-type ngram-mod,{mtp_token} and update the
stale _extra_args_set_spec_type docstring that claimed llama-server
accumulates repeated --spec-type. Update the matching pass-through
fixture in test_llama_server_args.py.

* studio: align MTP ngram-mod knobs with llama.cpp upstream defaults

Two correctness fixes against the llama.cpp server README:

1. The CPU/Mac comma-chained branch was emitting
   --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48, which is
   nonsensical (min > max). Per the upstream default the value is 64.

2. The standalone ngram-mod branch was emitting --spec-ngram-size-n,
   --draft-min, --draft-max. llama.cpp removed those arg aliases for
   ngram-mod (they live only on the ngram-simple / map families now);
   the correct knobs are --spec-ngram-mod-n-match / n-min / n-max.

Also refresh the inline comment block to point at the server README
rather than the older docs/speculative.md draft- aliases.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant