studio: add speculative decoding support (ngram-mod, on by default) by danielhanchen · Pull Request #4836 · unslothai/unsloth

danielhanchen · 2026-04-03T19:14:33Z

Summary

Enables n-gram speculative decoding for all GGUF models in Unsloth Studio, on by default
Uses llama.cpp's ngram-mod mode (PR #19164): ~16 MB shared hash pool, zero VRAM cost, variable speedup depending on content repetition
llama.cpp author recommends enabling this by default: "Overall, I think this speculator can become enabled by default in llama-server"
Params from llama.cpp docs: --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Auto-disabled for vision models (unsupported in llama.cpp)
On/Off toggle in Chat Settings > Model section with Apply/Reset support

Performance (from llama.cpp PRs #18471, #19164)

Speculative decoding helps most when the model repeats existing text (code refactoring, summarization, reasoning that restates earlier thinking):

Scenario	Without	With ngram-mod	Speedup
gpt-oss-120b code refactor (#18471)	181 t/s	446 t/s	2.5x
Qwen3-235B offloaded (#18471)	11.8 t/s	21.0 t/s	1.8x
gpt-oss-120b code repeat, 92% acceptance (#19164)	181 t/s	814 t/s	4.5x

For general chat with low repetition, acceptance rate is near zero with negligible overhead (~5 ms per request).
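
The Speedup column is just the ratio of the two throughput figures. A quick check, using only the numbers quoted in the table above (the `speedup` helper name is made up for illustration):

```python
# (scenario, tokens/s without, tokens/s with ngram-mod), copied from the table above.
rows = [
    ("gpt-oss-120b code refactor (#18471)", 181.0, 446.0),
    ("Qwen3-235B offloaded (#18471)", 11.8, 21.0),
    ("gpt-oss-120b code repeat (#19164)", 181.0, 814.0),
]


def speedup(without_tps: float, with_tps: float) -> float:
    """Throughput ratio: tokens/s with speculation divided by tokens/s without."""
    return with_tps / without_tps


for name, base, spec in rows:
    print(f"{name}: {speedup(base, spec):.1f}x")
```

This prints 2.5x, 1.8x and 4.5x, which matches the table after rounding to one decimal place.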

Changes

Backend (3 files):

studio/backend/models/inference.py -- speculative_type field on LoadRequest, LoadResponse, InferenceStatusResponse
studio/backend/core/inference/llama_cpp.py -- speculative_type param on load_model(), passes --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 to llama-server, silently skips for vision models
studio/backend/routes/inference.py -- Passes speculative_type through to load/status responses
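
A minimal sketch of how the backend might turn `speculative_type` into llama-server arguments. The flag names and values are the ones quoted in this PR. The helper name `build_spec_args`, the constants, and the choice to pass only `--spec-type` for `ngram-simple` are assumptions for illustration, not the code in `llama_cpp.py`:

```python
from typing import List, Optional

# Allowlist named in the PR's first commit message.
ALLOWED_SPEC_TYPES = {"ngram-simple", "ngram-mod"}

# ngram-mod parameters quoted in the PR summary (taken from the llama.cpp docs).
NGRAM_MOD_ARGS = ["--spec-ngram-size-n", "24", "--draft-min", "48", "--draft-max", "64"]


def build_spec_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Return extra llama-server arguments for speculative decoding (hypothetical helper)."""
    if speculative_type is None or is_vision:
        # Off, or a vision model: add nothing. The PR skips vision models silently.
        return []
    if speculative_type not in ALLOWED_SPEC_TYPES:
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")
    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        # Assumption: the tuned parameters apply only to ngram-mod.
        args += NGRAM_MOD_ARGS
    return args
```

For example, `build_spec_args("ngram-mod", is_vision=False)` yields the same flag sequence the test plan expects to see in the llama-server logs, and any vision load yields an empty list.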

Frontend (4 files):

studio/frontend/src/features/chat/types/api.ts -- speculative_type on request/response types
studio/frontend/src/features/chat/stores/chat-runtime-store.ts -- speculativeType state (default "ngram-mod"), setter, reset
studio/frontend/src/features/chat/chat-settings-sheet.tsx -- On/Off dropdown in Model section, GGUF only, hidden for vision
studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts -- Wires speculative_type through load request/response and reconnect

Test plan

Load a GGUF model, confirm llama-server logs show --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
Send chat messages, confirm responses work correctly
Toggle speculative decoding Off in settings, click Apply, confirm logs show no spec flags
Toggle back On, click Apply, confirm spec flags reappear
Load a vision model (e.g. Gemma-3 with mmproj), confirm spec decoding toggle is hidden
Refresh the page with a GGUF model loaded, confirm spec decoding state is restored

Enable n-gram speculative decoding for GGUF models in Unsloth Studio. Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation with zero VRAM cost via a 4MB fixed hash table that auto-resets on low acceptance rates.

Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model() with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported in llama.cpp server-context.cpp)

Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect

for more information, see https://pre-commit.ci

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a77a8ac5a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-04-03T19:17:55Z

+            if effective_spec_type is None and not config.is_vision:
+                effective_spec_type = "ngram-mod"


Respect explicit disable of speculative decoding

When the client sends speculative_type: null to turn speculative decoding off (the new settings UI does this), this branch rewrites null to "ngram-mod" for all non-vision GGUF loads, so the model is always relaunched with speculation enabled. In practice, users cannot disable the feature via the new On/Off toggle because null is treated as both “unset” and “off”; this should distinguish explicit disable from missing input.

Useful? React with 👍 / 👎.
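
One way to separate "unset" from "off", as the comment above asks, is to reserve an explicit disable value so that `None` only ever means "not specified". This is a sketch of that idea, not what the PR merged: the `"off"` value and the `resolve_spec_type` helper are hypothetical.

```python
from typing import Optional

DEFAULT_SPEC_TYPE = "ngram-mod"
OFF = "off"  # hypothetical explicit-disable value, not part of the PR's API


def resolve_spec_type(requested: Optional[str], is_vision: bool) -> Optional[str]:
    """Map a requested value to the mode to launch with, or None for no speculation."""
    if is_vision:
        # Vision models never get speculative decoding.
        return None
    if requested == OFF:
        # Explicit disable from the client.
        return None
    if requested is None:
        # Field not specified: fall back to the default.
        return DEFAULT_SPEC_TYPE
    return requested
```

With this shape the settings toggle would send `"off"` rather than `null`, and callers that never set the field would still get the default.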

gemini-code-assist

Code Review

This pull request implements speculative decoding support for GGUF models, specifically adding 'ngram-simple' and 'ngram-mod' modes to improve generation speed with zero VRAM cost. The changes span the backend llama-cpp integration, API models, and the frontend settings UI. A critical issue was identified in the backend routing logic where a hardcoded default prevents users from explicitly disabling speculative decoding, as the backend overrides the null value sent by the frontend when the feature is toggled off.

gemini-code-assist · 2026-04-03T19:19:50Z

+            effective_spec_type = request.speculative_type
+            if effective_spec_type is None and not config.is_vision:
+                effective_spec_type = "ngram-mod"


This logic forces ngram-mod whenever speculative_type is None. However, the frontend sends null (which becomes None in Python) when the user explicitly toggles speculative decoding to Off. This override makes it impossible for users to disable the feature for non-vision GGUF models. Since the frontend store already handles the default value (ngram-mod), this backend default is redundant and should be removed to respect the user's choice.

Suggested change

-            effective_spec_type = request.speculative_type
-            if effective_spec_type is None and not config.is_vision:
-                effective_spec_type = "ngram-mod"
+            effective_spec_type = request.speculative_type

The backend was overriding speculative_type=None to "ngram-mod" for non-vision GGUF models, which prevented users from disabling spec decoding via the UI toggle. The frontend store already defaults to "ngram-mod", so the backend fallback was redundant and blocked the explicit "Off" setting.

Update speculative decoding params to match the recommended values from llama.cpp docs (docs/speculative.md):

--spec-ngram-size-n 24 (was 16, docs say small n not recommended)
--draft-min 48 (was 0)
--draft-max 64 (was 24, docs note MoEs need long drafts)

Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes), not 4 MB.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 67dae36881

chatgpt-codex-connector · 2026-04-03T20:43:01Z

+              speculativeType: loadedSpec,
+              loadedSpeculativeType: loadedSpec,


Preserve speculative default across non-GGUF loads

This assignment unconditionally overwrites the store with loadResponse.speculative_type, but non-GGUF (and vision GGUF) load responses return null/missing for that field. After loading one of those models, speculativeType becomes null, and the next GGUF load reuses that value in the request, so speculative decoding is silently disabled even though the feature is intended to default to On. Restrict this overwrite to GGUF models (or keep the previous/default value for non-GGUF responses).

Useful? React with 👍 / 👎.
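
The store update suggested above can be sketched as a small state transition. The real store is TypeScript (`chat-runtime-store.ts`). The version below is written in Python only for illustration, and `SpecState` and `after_load` are hypothetical names, not the PR's code:

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class SpecState:
    # User's setting, reused for the next GGUF load request.
    speculative_type: Optional[str] = "ngram-mod"
    # What the currently loaded model actually runs with.
    loaded_speculative_type: Optional[str] = None


def after_load(state: SpecState, loaded_spec: Optional[str],
               is_gguf: bool, is_vision: bool) -> SpecState:
    """Update the store after a load response without losing the user's setting."""
    if is_gguf and not is_vision:
        # The response is meaningful for this model: mirror it into both fields.
        return replace(state, speculative_type=loaded_spec,
                       loaded_speculative_type=loaded_spec)
    # Non-GGUF or vision load: record that nothing is active, keep the preference.
    return replace(state, loaded_speculative_type=None)
```

Loading a non-GGUF model then leaves `speculative_type` at its previous value, so the next GGUF load still requests `"ngram-mod"`.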

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66797d16ab

chatgpt-codex-connector · 2026-04-03T20:47:26Z

              trust_remote_code: paramsBeforeLoad.trustRemoteCode ?? false,
              chat_template_override: chatTemplateOverride,
              cache_type_kv: kvCacheDtype,
+              speculative_type: speculativeType,


Propagate speculative setting in rollback load request

The new speculative_type is only attached to the primary loadModel request path, so when a model switch fails and performLoad() falls back to reloading the previous checkpoint, that rollback request still omits this field. In that failure scenario, GGUF rollback loads come back with speculative decoding disabled (null) even if the previous model had it enabled, so a transient load error silently changes runtime behavior after recovery.

Useful? React with 👍 / 👎.

Include speedup numbers from llama.cpp PRs #18471 and #19164 as an inline comment so future readers understand the expected gains.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e8f4ac792

chatgpt-codex-connector · 2026-04-03T20:59:07Z

                    n_ctx = request.max_seq_length,
                    chat_template_override = request.chat_template_override,
                    cache_type_kv = request.cache_type_kv,
+                    speculative_type = request.speculative_type,


Apply default speculative mode when field is omitted

Forwarding request.speculative_type directly means omitted input is treated as None (because LoadRequest.speculative_type defaults to None), so GGUF loads from callers that still omit this field (e.g. chat-adapter.ts auto-load paths and shared-composer.tsx compare loads) silently start with speculative decoding off. That makes behavior depend on which UI path triggered the load, despite this feature being intended to be on by default for GGUF models.

Useful? React with 👍 / 👎.
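
One way to keep "omitted" and "explicitly null" apart on the server is to check whether the key is present in the request body at all. This is a minimal sketch under that assumption: `spec_type_from_body` is a hypothetical helper, not the route code in this PR.

```python
import json
from typing import Optional

DEFAULT_SPEC_TYPE = "ngram-mod"


def spec_type_from_body(raw_body: str, is_vision: bool) -> Optional[str]:
    """Resolve the speculative mode from a raw JSON load request (hypothetical helper)."""
    payload = json.loads(raw_body)
    if is_vision:
        # Vision models never get speculative decoding.
        return None
    if "speculative_type" not in payload:
        # Caller predates the field (e.g. an auto-load path): apply the default.
        return DEFAULT_SPEC_TYPE
    # Key present: honor it, including an explicit null meaning "Off".
    return payload["speculative_type"]
```

In a pydantic v2 model the analogous check would be whether `"speculative_type"` appears in the instance's `model_fields_set`. Whether that fits the Studio's `LoadRequest` is not something this thread confirms.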

NourEldin-Osama · 2026-04-16T20:47:04Z

@copilot can we support the ability to disable the Vision Encoder, so that we can have n-gram speculative decoding

danielhanchen requested review from Manan17, rolandtannous and wasimysaid as code owners April 3, 2026 19:14

[pre-commit.ci] auto fixes from pre-commit.com hooks

a77a8ac

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Apr 3, 2026

gemini-code-assist Bot reviewed Apr 3, 2026

danielhanchen added 2 commits April 3, 2026 20:39

chatgpt-codex-connector Bot reviewed Apr 3, 2026

[pre-commit.ci] auto fixes from pre-commit.com hooks

66797d1

for more information, see https://pre-commit.ci

chatgpt-codex-connector Bot reviewed Apr 3, 2026

add benchmark table and references to speculative decoding comment

1e8f4ac

Include speedup numbers from llama.cpp PRs #18471 and #19164 as an inline comment so future readers understand the expected gains.

danielhanchen merged commit a32b871 into main Apr 3, 2026
5 checks passed

danielhanchen deleted the feat/studio-speculative-decoding branch April 3, 2026 20:57

chatgpt-codex-connector Bot reviewed Apr 3, 2026

danielhanchen merged 6 commits into main from feat/studio-speculative-decoding
