studio: emit one comma-chained --spec-type for CPU/Mac MTP path #5575
llama-server takes a single --spec-type whose value may be
comma-separated to chain implementations (e.g. ngram-mod,draft-mtp).
The CPU/Mac MTP branch in LlamaCppBackend.load_model was passing
--spec-type twice in the same invocation, which is not the documented
chaining mechanism and silently drops one of the two specs depending
on llama.cpp's argv handling.
Collapse the pair to --spec-type ngram-mod,{mtp_token} and update the
stale _extra_args_set_spec_type docstring that claimed llama-server
accumulates repeated --spec-type. Update the matching pass-through
fixture in test_llama_server_args.py.
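A minimal sketch of the argv shape described above, not the actual load_model code. The helper name build_spec_args and its on_gpu parameter are hypothetical; the flag names and the 6 / 3 draft values are the ones quoted in this PR's description, and the ngram-mod knobs are omitted for brevity.

```python
def build_spec_args(mtp_token: str, on_gpu: bool) -> list:
    """Return speculative-decoding flags for llama-server (illustrative only)."""
    if on_gpu:
        # GPU path: MTP alone, single --spec-type.
        return ["--spec-type", mtp_token, "--spec-draft-n-max", "6"]
    # CPU/Mac path: chain both implementations in ONE comma-separated value
    # instead of passing --spec-type twice.
    return ["--spec-type", f"ngram-mod,{mtp_token}", "--spec-draft-n-max", "3"]


args = build_spec_args("draft-mtp", on_gpu=False)
# The flag must appear exactly once, whatever the chain length.
assert args.count("--spec-type") == 1
print(" ".join(args))
```

Keeping the chain in a single value means the result does not depend on how llama.cpp treats a repeated flag.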
Code Review
This pull request modifies the llama-server command-line argument construction to use a single, comma-separated --spec-type flag instead of multiple repeated flags when combining ngram-mod and MTP tokens. It also updates the relevant docstrings and test cases to align with this change. I have no feedback to provide.
Two correctness fixes against the llama.cpp server README:
1. The CPU/Mac comma-chained branch was emitting --spec-ngram-mod-n-max 6 with --spec-ngram-mod-n-min 48, which is nonsensical (min > max). Per the upstream default, n-max should be 64.
2. The standalone ngram-mod branch was emitting --spec-ngram-size-n, --draft-min, --draft-max. llama.cpp removed those arg aliases for ngram-mod (they live only on the ngram-simple / map families now); the correct knobs are --spec-ngram-mod-n-match / n-min / n-max.
Also refresh the inline comment block to point at the server README rather than the older docs/speculative.md draft- aliases.
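The inverted range in the first fix is the kind of mistake a small guard would catch at build time. A sketch, assuming a hypothetical helper ngram_mod_knobs: the flag names and the 48 / 64 values are the ones cited in the commit message above, and n_match is a required argument because this page does not state the value Studio uses.

```python
def ngram_mod_knobs(n_match: int, n_min: int = 48, n_max: int = 64) -> list:
    """Build the post-rename ngram-mod knobs, rejecting an inverted range."""
    if n_min > n_max:
        # This is the shape of the original bug: n-min 48 with n-max 6.
        raise ValueError(f"n-min ({n_min}) must not exceed n-max ({n_max})")
    return [
        "--spec-ngram-mod-n-match", str(n_match),
        "--spec-ngram-mod-n-min", str(n_min),
        "--spec-ngram-mod-n-max", str(n_max),
    ]
```

Calling ngram_mod_knobs(n_match, n_max=6) with the default n_min raises instead of emitting a nonsensical pair.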
Pushed a follow-up commit (3d6a4f2) that fixes two upstream alignment bugs found while reading the llama.cpp server README:
Also confirmed via the same README that
Cherry-pick from upstream unsloth PR unslothai#5575:
- Skip MTP auto-promote for models <2B params; sub-2B dense regresses vs spec-off (0.8B Q4_K_XL: 452 -> 283 t/s, 0.63x).
- Backfill /v1/chat/completions usage from llama-server timings so the Studio UI t/s widget shows real decode throughput instead of falling back to wall-clock when usage.completion_tokens is zero.
Mirrors auto-promote gate in _already_in_target_state. Applies usage backfill at all four metadata-yield sites in generate_chat_completion and generate_chat_completion_with_tools.
Tests: +3 sub-2B gate tests, +4 backfill tests. 289 passed locally.
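The usage backfill could look roughly like the sketch below. It is not the code from the PR: backfill_usage is a hypothetical name, and the sketch assumes the llama-server response carries a timings object with predicted_n and prompt_n token counts.

```python
from typing import Optional


def backfill_usage(usage: Optional[dict], timings: Optional[dict]) -> dict:
    """Fill missing or zero token counts in `usage` from llama-server `timings`."""
    out = dict(usage or {})
    t = timings or {}
    # Only overwrite when usage is absent or zero; trust real values otherwise.
    if not out.get("completion_tokens") and t.get("predicted_n"):
        out["completion_tokens"] = int(t["predicted_n"])
    if not out.get("prompt_tokens") and t.get("prompt_n"):
        out["prompt_tokens"] = int(t["prompt_n"])
    out["total_tokens"] = out.get("prompt_tokens", 0) + out.get("completion_tokens", 0)
    return out
```

With a non-zero completion_tokens, a t/s widget can divide by decode time rather than falling back to wall-clock.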
Cherry-pick from upstream PR unslothai#5575:
- Probe llama-server --help to tell post-rename --spec-ngram-mod-n-{match,min,max} flags apart from pre-rename --spec-ngram-size-n / --draft-min / --draft-max via the "argument has been removed" description on stub entries.
- Cached on (path, mtime), like the existing mtp_token probe.
- Wire CPU/Mac MTP comma-chain and standalone ngram-mod branches to emit the right knob set, or degrade gracefully (warn + skip ngram chain) on binaries that have neither.
Tests: +8 new tests (4 probe-detection, 4 build_ngram_mod_flags).
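The probe's classification step might be sketched as follows. The function name detect_ngram_mod_flags and its return labels are hypothetical, and the sketch assumes each flag's description sits on the same --help line as the flag itself; the real layout may differ. Running the binary and caching on (path, mtime) are left out.

```python
import re


def detect_ngram_mod_flags(help_text: str) -> str:
    """Classify a llama-server --help dump as 'new', 'legacy', or 'none'."""

    def live(flag: str) -> bool:
        # A flag is live if some line names it exactly and that line is not
        # a stub carrying the "argument has been removed" description.
        pattern = re.compile(rf"(?<![\w-]){re.escape(flag)}(?![\w-])")
        for line in help_text.splitlines():
            if pattern.search(line) and "argument has been removed" not in line:
                return True
        return False

    if live("--spec-ngram-mod-n-match"):
        return "new"
    if live("--spec-ngram-size-n"):
        return "legacy"
    return "none"  # caller would warn and skip the ngram chain
```

Returning a label rather than a flag list keeps the probe separate from the argv builder, so the "neither" case can degrade to a warning.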
The previous two staging cherry-picks (sub-2B gate, legacy ngram-mod fallback) imported llama_cpp.py from the upstream PR unslothai#5575 branch, which does not include PR unslothai#5585's mtp_engaged auto-fit VRAM headroom changes. That overwrote the mtp_engaged-aware budget path on staging and broke test_fit_mtp_engaged_returns_smaller_or_equal_context and test_fit_mtp_engaged_unchanged_when_kv_off_gpu.
Restore:
- mtp_engaged: bool = False param on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage computation before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites
All 299 staging backend tests pass.
…restore mtp_engaged)
Upstream PR unslothai#5582 was rebased onto new main (PR unslothai#5575 merged), dropping the two already-merged commits and renumbering the remaining nine. The staging fork was sitting on the pre-rebase llama_cpp.py + tests; this commit replays the rebased file content while preserving PR unslothai#5585's mtp_engaged auto-fit headroom (staging-only patch absent upstream).
Restored on top of the new file:
- mtp_engaged: bool = False on _fit_context_to_vram
- budget_frac = 0.85 if mtp_engaged else 0.90
- _mtp_will_engage gate (MTP name and/or nextn_predict_layers, user did not force --spec-type) before the auto-fit GPU subset loops
- mtp_engaged = _mtp_will_engage at both _fit_context_to_vram callsites
All MTP detection + fit_context + fit_mtp tests pass (161 passed).
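As a rough illustration of the restored pieces, the sketch below pairs the gate with the budget fraction. Only the 0.85 / 0.90 values and the gate's inputs come from the commit message; the function names, the substring test on the model name, and the byte-budget helper are assumptions, and the real _fit_context_to_vram does more than this.

```python
def mtp_will_engage(model_name: str, nextn_predict_layers: int, extra_args: list) -> bool:
    """Guess whether MTP will be auto-promoted (hypothetical gate)."""
    # If the user forced --spec-type, auto-promote stays out of the way.
    user_forced = any(
        a == "--spec-type" or a.startswith("--spec-type=") for a in extra_args
    )
    # Assumption: an MTP model is recognised by name or by its nextn layers.
    looks_mtp = "mtp" in model_name.lower() or nextn_predict_layers > 0
    return looks_mtp and not user_forced


def vram_budget_bytes(free_vram_bytes: int, mtp_engaged: bool = False) -> int:
    """Leave extra headroom when MTP is engaged."""
    budget_frac = 0.85 if mtp_engaged else 0.90
    return int(free_vram_bytes * budget_frac)
```

The smaller fraction is why the fit with mtp_engaged should return a context that is smaller than or equal to the default fit.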
Summary
LlamaCppBackend.load_model was emitting --spec-type twice in the same llama-server invocation (--spec-type {mtp_token} ... --spec-type ngram-mod ...). llama-server takes a single --spec-type, whose value may be comma-separated to chain implementations (e.g. ngram-mod,draft-mtp). The repeated form is not the documented chaining mechanism and silently drops one of the two specs depending on argv handling, so on Macs and CPU-only hosts the layered MTP plus ngram-mod path was not actually engaging both.
- Collapse the pair to --spec-type ngram-mod,{mtp_token} while keeping the existing --spec-draft-n-max 3 and the three --spec-ngram-mod-n-* knobs unchanged.
- Update the stale docstring on _extra_args_set_spec_type that claimed llama-server accumulates repeated --spec-type. It does not; chaining is the comma-separated value form.
- Update the pass-through fixture in test_llama_server_args.py so validate_extra_args round-trips the new emitted shape.
What changes in the emitted command
GPU branch is unchanged (single --spec-type {mtp_token} --spec-draft-n-max 6).
CPU/Mac branch before:
CPU/Mac branch after:
mtp_token is still resolved from probe_server_capabilities so older prebuilts that advertise draft-mtp and newer ones that advertise mtp both work.
Tests
- studio/backend/tests/test_llama_server_args.py: pass-through fixture updated. 72 passed.
- studio/backend/tests/test_llama_cpp_mtp_detection.py: 65 passed, no changes needed.
Test plan
- python -m pytest studio/backend/tests/test_llama_server_args.py studio/backend/tests/test_llama_cpp_mtp_detection.py green (137 passed)
- --spec-type ngram-mod,{mtp_token} engages both layers in llama-server logs
Note
Follow-up work (UI toggle for --spec-draft-n-max, sub-3B fallback to ngram-only, legacy llama-server probe, Studio chat-completions usage backfill, and the full speedup bench data) lives in PR #5582, which is stacked on top of this one.