studio: add speculative decoding support (ngram-mod, on by default) #4836
Enable n-gram speculative decoding for GGUF models in Unsloth Studio. Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation with zero VRAM cost via a 4MB fixed hash table that auto-resets on low acceptance rates.

Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model() with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported in llama.cpp server-context.cpp)

Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect
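The backend bullets above (allowlist validation, extra llama-server flags, silent skip for vision models) can be sketched roughly as follows. This is a minimal illustration and not the code from the PR: the function name `build_spec_args` and its signature are hypothetical, and passing `--spec-type` alone for `ngram-simple` is an assumption. The flag names and values are the ones quoted in the description of the initial commit.

```python
from typing import List, Optional

# Allowlist as stated in the PR description; anything else is rejected.
ALLOWED_SPEC_TYPES = {"ngram-simple", "ngram-mod"}


def build_spec_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Return extra llama-server flags for the requested speculative mode.

    Hypothetical helper: the real load_model() may structure this differently.
    """
    if speculative_type is None or is_vision:
        # Feature is off, or the model is a vision model: skip silently.
        return []
    if speculative_type not in ALLOWED_SPEC_TYPES:
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")
    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        # Values from the initial commit; a later commit in this PR changes them.
        args += ["--spec-ngram-size-n", "16", "--draft-max", "24"]
    return args
```

Returning an empty list for vision models, instead of raising, mirrors the "silently skip" behaviour described above.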
for more information, see https://pre-commit.ci
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a77a8ac5a8
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
    if effective_spec_type is None and not config.is_vision:
        effective_spec_type = "ngram-mod"
Respect explicit disable of speculative decoding
When the client sends speculative_type: null to turn speculative decoding off (the new settings UI does this), this branch rewrites null to "ngram-mod" for all non-vision GGUF loads, so the model is always relaunched with speculation enabled. In practice, users cannot disable the feature via the new On/Off toggle because null is treated as both “unset” and “off”; this should distinguish explicit disable from missing input.
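One way to separate "field not sent" from "explicitly off" is a sentinel value, sketched below. This illustrates the idea in the comment above and is not the fix that was merged; `UNSET`, `_Unset`, and `resolve_spec_type` are hypothetical names introduced here.

```python
from typing import Optional, Union


class _Unset:
    """Sentinel type meaning 'the caller did not send this field'."""

    def __repr__(self) -> str:
        return "UNSET"


UNSET = _Unset()


def resolve_spec_type(
    requested: Union[str, None, _Unset], is_vision: bool
) -> Optional[str]:
    """Map the three possible inputs to the mode that should actually run."""
    if is_vision:
        # Speculative decoding is skipped for vision models regardless of input.
        return None
    if isinstance(requested, _Unset):
        # Field missing: fall back to the intended default.
        return "ngram-mod"
    # Explicit None (user chose Off) or an explicit mode: respect it as-is.
    return requested
```

With this shape, `None` keeps a single meaning (Off), and the default only applies when the caller said nothing.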
Code Review
This pull request implements speculative decoding support for GGUF models, specifically adding 'ngram-simple' and 'ngram-mod' modes to improve generation speed with zero VRAM cost. The changes span the backend llama-cpp integration, API models, and the frontend settings UI. A critical issue was identified in the backend routing logic where a hardcoded default prevents users from explicitly disabling speculative decoding, as the backend overrides the null value sent by the frontend when the feature is toggled off.
    effective_spec_type = request.speculative_type
    if effective_spec_type is None and not config.is_vision:
        effective_spec_type = "ngram-mod"
This logic forces ngram-mod whenever speculative_type is None. However, the frontend sends null (which becomes None in Python) when the user explicitly toggles speculative decoding to Off. This override makes it impossible for users to disable the feature for non-vision GGUF models. Since the frontend store already handles the default value (ngram-mod), this backend default is redundant and should be removed to respect the user's choice.
    - effective_spec_type = request.speculative_type
    - if effective_spec_type is None and not config.is_vision:
    -     effective_spec_type = "ngram-mod"
    + effective_spec_type = request.speculative_type
The backend was overriding speculative_type=None to "ngram-mod" for non-vision GGUF models, which prevented users from disabling spec decoding via the UI toggle. The frontend store already defaults to "ngram-mod", so the backend fallback was redundant and blocked the explicit "Off" setting.
Update speculative decoding params to match the recommended values from llama.cpp docs (docs/speculative.md): --spec-ngram-size-n 24 (was 16, docs say small n not recommended) --draft-min 48 (was 0) --draft-max 64 (was 24, docs note MoEs need long drafts) Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes), not 4 MB.
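The corrected memory figure follows from the arithmetic quoted in the commit message. A quick check, assuming "4M entries" means 4 × 2^20 (if it means 4,000,000, the result is 16 MB in decimal units instead):

```python
# Assumption: "4M entries" is read as 4 * 2**20 entries of 4 bytes each.
ENTRIES = 4 * 1024 * 1024
BYTES_PER_ENTRY = 4

table_bytes = ENTRIES * BYTES_PER_ENTRY
table_mib = table_bytes / (1024 * 1024)

print(f"{table_bytes} bytes = {table_mib:.0f} MiB")  # 16777216 bytes = 16 MiB
```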
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 67dae36881
    speculativeType: loadedSpec,
    loadedSpeculativeType: loadedSpec,
Preserve speculative default across non-GGUF loads
This assignment unconditionally overwrites the store with loadResponse.speculative_type, but non-GGUF (and vision GGUF) load responses return null/missing for that field. After loading one of those models, speculativeType becomes null, and the next GGUF load reuses that value in the request, so speculative decoding is silently disabled even though the feature is intended to default to On. Restrict this overwrite to GGUF models (or keep the previous/default value for non-GGUF responses).
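The suggested restriction can be sketched as a pure state-update function. The actual store is TypeScript; the Python below is only a stand-in for the logic, and `RuntimeState` and `apply_load_response` are hypothetical names, not identifiers from the PR.

```python
from dataclasses import dataclass, replace
from typing import Optional

DEFAULT_SPEC_TYPE = "ngram-mod"


@dataclass(frozen=True)
class RuntimeState:
    # User preference that will be sent with the next GGUF load.
    speculative_type: Optional[str] = DEFAULT_SPEC_TYPE
    # Mode the currently loaded model is actually running with.
    loaded_speculative_type: Optional[str] = None


def apply_load_response(
    state: RuntimeState,
    is_gguf: bool,
    is_vision: bool,
    response_spec_type: Optional[str],
) -> RuntimeState:
    """Update the store after a load, without clobbering the preference."""
    if is_gguf and not is_vision:
        # The response is meaningful here, so sync both fields to it.
        return replace(
            state,
            speculative_type=response_spec_type,
            loaded_speculative_type=response_spec_type,
        )
    # Non-GGUF or vision model: the field is null/missing in the response,
    # so only record that nothing is loaded and keep the preference.
    return replace(state, loaded_speculative_type=None)
```

Keeping the preference separate from the loaded value means a detour through a non-GGUF model does not change what the next GGUF load requests.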
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 66797d16ab
    trust_remote_code: paramsBeforeLoad.trustRemoteCode ?? false,
    chat_template_override: chatTemplateOverride,
    cache_type_kv: kvCacheDtype,
    speculative_type: speculativeType,
Propagate speculative setting in rollback load request
The new speculative_type is only attached to the primary loadModel request path, so when a model switch fails and performLoad() falls back to reloading the previous checkpoint, that rollback request still omits this field. In that failure scenario, GGUF rollback loads come back with speculative decoding disabled (null) even if the previous model had it enabled, so a transient load error silently changes runtime behavior after recovery.
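A common way to avoid this kind of drift is to build the primary and rollback requests with the same helper. The sketch below is a Python stand-in for the TypeScript load path. `build_load_request`, `load_with_rollback`, the `model_path` key, and the shape of `params` are all hypothetical; only `cache_type_kv` and `speculative_type` are field names taken from the PR.

```python
from typing import Any, Callable, Dict


def build_load_request(model_path: str, params: Dict[str, Any]) -> Dict[str, Any]:
    """Single place that maps saved settings onto a load request body."""
    return {
        "model_path": model_path,
        "cache_type_kv": params.get("kv_cache_dtype"),
        "speculative_type": params.get("speculative_type"),
    }


def load_with_rollback(
    load: Callable[[Dict[str, Any]], Dict[str, Any]],
    new_model: str,
    new_params: Dict[str, Any],
    prev_model: str,
    prev_params: Dict[str, Any],
) -> Dict[str, Any]:
    """Try the new model; on failure, reload the previous one."""
    try:
        return load(build_load_request(new_model, new_params))
    except RuntimeError:
        # The rollback goes through the same builder, so speculative_type
        # (and any field added later) is not dropped on recovery.
        return load(build_load_request(prev_model, prev_params))
```

Because both paths share one builder, adding a new request field cannot update one path and miss the other.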
Include speedup numbers from llama.cpp PRs #18471 and #19164 as an inline comment so future readers understand the expected gains.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1e8f4ac792
    n_ctx = request.max_seq_length,
    chat_template_override = request.chat_template_override,
    cache_type_kv = request.cache_type_kv,
    speculative_type = request.speculative_type,
Apply default speculative mode when field is omitted
Forwarding request.speculative_type directly means omitted input is treated as None (because LoadRequest.speculative_type defaults to None), so GGUF loads from callers that still omit this field (e.g. chat-adapter.ts auto-load paths and shared-composer.tsx compare loads) silently start with speculative decoding off. That makes behavior depend on which UI path triggered the load, despite this feature being intended to be on by default for GGUF models.
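At the wire level, an omitted field and an explicit `null` are still distinguishable before the body is parsed into a model with a `None` default. A minimal sketch of that check on the raw JSON, assuming a hypothetical helper `spec_type_from_body` (the PR itself does not show how, or whether, this was resolved):

```python
import json
from typing import Optional

DEFAULT_SPEC_TYPE = "ngram-mod"


def spec_type_from_body(raw_body: str, is_gguf: bool, is_vision: bool) -> Optional[str]:
    """Pick the speculative mode from a raw JSON load request body."""
    payload = json.loads(raw_body)
    if not is_gguf or is_vision:
        # Feature only applies to non-vision GGUF models.
        return None
    if "speculative_type" not in payload:
        # Older callers that never send the field get the default.
        return DEFAULT_SPEC_TYPE
    # Field present: null means Off, a string selects that mode.
    return payload["speculative_type"]
```

This keeps callers that omit the field on the default while still honouring an explicit Off from the settings toggle.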
@copilot can we support the ability to disable the Vision Encoder, so that we can have n-gram speculative decoding?
Summary
- ngram-mod mode (PR #19164): ~16 MB shared hash pool, zero VRAM cost, variable speedup depending on content repetition
- --spec-ngram-size-n 24 --draft-min 48 --draft-max 64

Performance (from llama.cpp PRs #18471, #19164)
Speculative decoding helps most when the model repeats existing text (code refactoring, summarization, reasoning that restates earlier thinking):
For general chat with low repetition, acceptance rate is near zero with negligible overhead (~5 ms per request).
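To see why repetition matters, here is a toy n-gram drafter: it looks for an earlier occurrence of the last `n` tokens and proposes whatever followed them. This is a deliberately simplified illustration and not llama.cpp's ngram-mod implementation (which, per the summary above, uses a shared hash pool). The function name and the linear scan are assumptions made for clarity, and the step where the target model verifies the draft is omitted.

```python
from typing import List, Sequence


def draft_from_history(tokens: Sequence[int], n: int, draft_max: int) -> List[int]:
    """Propose up to draft_max tokens by copying what followed the most
    recent earlier occurrence of the last n tokens."""
    if len(tokens) <= n:
        return []
    key = tuple(tokens[-n:])
    # Scan backwards over earlier start positions, excluding the suffix itself.
    for start in range(len(tokens) - n - 1, -1, -1):
        if tuple(tokens[start:start + n]) == key:
            follow = start + n
            return list(tokens[follow:follow + draft_max])
    # No earlier match: nothing to draft, so decoding proceeds normally.
    return []


# Repeated context yields a draft; novel context yields none.
print(draft_from_history([1, 2, 3, 4, 5, 1, 2, 3], n=3, draft_max=2))  # [4, 5]
print(draft_from_history([1, 2, 3, 4, 5, 6], n=3, draft_max=2))        # []
```

When the text restates earlier content, drafts like `[4, 5]` are often accepted in one step; in low-repetition chat the lookup mostly returns nothing, which matches the near-zero acceptance rate noted above.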
Changes
Backend (3 files):
- studio/backend/models/inference.py -- speculative_type field on LoadRequest, LoadResponse, InferenceStatusResponse
- studio/backend/core/inference/llama_cpp.py -- speculative_type param on load_model(), passes --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 to llama-server, silently skips for vision models
- studio/backend/routes/inference.py -- Passes speculative_type through to load/status responses

Frontend (4 files):
- studio/frontend/src/features/chat/types/api.ts -- speculative_type on request/response types
- studio/frontend/src/features/chat/stores/chat-runtime-store.ts -- speculativeType state (default "ngram-mod"), setter, reset
- studio/frontend/src/features/chat/chat-settings-sheet.tsx -- On/Off dropdown in Model section, GGUF only, hidden for vision
- studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts -- Wires speculative_type through load request/response and reconnect

Test plan
--spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64