studio: add speculative decoding support (ngram-mod, on by default) #4836

Merged
danielhanchen merged 6 commits into main from feat/studio-speculative-decoding on Apr 3, 2026
@danielhanchen danielhanchen commented Apr 3, 2026

Summary

  • Enables n-gram speculative decoding for all GGUF models in Unsloth Studio, on by default
  • Uses llama.cpp's ngram-mod mode (PR #19164): ~16 MB shared hash pool, zero VRAM cost, variable speedup depending on content repetition
  • llama.cpp author recommends enabling this by default: "Overall, I think this speculator can become enabled by default in llama-server"
  • Params from llama.cpp docs: --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
  • Auto-disabled for vision models (unsupported in llama.cpp)
  • On/Off toggle in Chat Settings > Model section with Apply/Reset support

Performance (from llama.cpp PRs #18471, #19164)

Speculative decoding helps most when the model repeats existing text (code refactoring, summarization, reasoning that restates earlier thinking):

| Scenario | Without (t/s) | With ngram-mod (t/s) | Speedup |
| --- | --- | --- | --- |
| gpt-oss-120b code refactor (#18471) | 181 | 446 | 2.5x |
| Qwen3-235B offloaded (#18471) | 11.8 | 21.0 | 1.8x |
| gpt-oss-120b code repeat, 92% acceptance (#19164) | 181 | 814 | 4.5x |

For general chat with low repetition, acceptance rate is near zero with negligible overhead (~5 ms per request).
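
The speedup column is simply the ratio of the two throughput figures. As a quick sanity check, the sketch below recomputes it from the numbers quoted above; `BENCHMARKS` and `speedup` are illustrative names only and are not part of this PR.

```python
# Throughput figures (tokens/s) copied from the benchmark numbers above.
# BENCHMARKS and speedup() are illustrative names, not Studio code.
BENCHMARKS = [
    ("gpt-oss-120b code refactor (#18471)", 181.0, 446.0),
    ("Qwen3-235B offloaded (#18471)", 11.8, 21.0),
    ("gpt-oss-120b code repeat (#19164)", 181.0, 814.0),
]


def speedup(without_tps: float, with_tps: float) -> float:
    """Return the throughput ratio with/without speculation."""
    return with_tps / without_tps


for name, base, spec in BENCHMARKS:
    print(f"{name}: {speedup(base, spec):.1f}x")
```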

Changes

Backend (3 files):

  • studio/backend/models/inference.py -- speculative_type field on LoadRequest, LoadResponse, InferenceStatusResponse
  • studio/backend/core/inference/llama_cpp.py -- speculative_type param on load_model(), passes --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64 to llama-server, silently skips for vision models
  • studio/backend/routes/inference.py -- Passes speculative_type through to load/status responses
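
A minimal sketch of how the backend change described above might map `speculative_type` to llama-server arguments. The helper name `build_spec_args`, its signature, raising on unknown values, and the handling of `ngram-simple` (type flag only, no tuning flags) are assumptions for illustration; only the flag names and values come from this PR.

```python
from typing import List, Optional

# Allowlist named in the PR's commit message.
ALLOWED_SPEC_TYPES = {"ngram-simple", "ngram-mod"}


def build_spec_args(speculative_type: Optional[str], is_vision: bool) -> List[str]:
    """Return extra llama-server arguments for speculative decoding.

    Hypothetical helper: the real load_model() may be structured differently.
    """
    # None means "off"; vision models are skipped silently.
    if speculative_type is None or is_vision:
        return []
    if speculative_type not in ALLOWED_SPEC_TYPES:
        # Assumption: reject unknown values rather than ignore them.
        raise ValueError(f"unsupported speculative_type: {speculative_type!r}")

    args = ["--spec-type", speculative_type]
    if speculative_type == "ngram-mod":
        # Values quoted in the PR summary.
        args += ["--spec-ngram-size-n", "24", "--draft-min", "48", "--draft-max", "64"]
    return args


print(build_spec_args("ngram-mod", is_vision=False))
```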

Frontend (4 files):

  • studio/frontend/src/features/chat/types/api.ts -- speculative_type on request/response types
  • studio/frontend/src/features/chat/stores/chat-runtime-store.ts -- speculativeType state (default "ngram-mod"), setter, reset
  • studio/frontend/src/features/chat/chat-settings-sheet.tsx -- On/Off dropdown in Model section, GGUF only, hidden for vision
  • studio/frontend/src/features/chat/hooks/use-chat-model-runtime.ts -- Wires speculative_type through load request/response and reconnect

Test plan

  • Load a GGUF model, confirm llama-server logs show --spec-type ngram-mod --spec-ngram-size-n 24 --draft-min 48 --draft-max 64
  • Send chat messages, confirm responses work correctly
  • Toggle speculative decoding Off in settings, click Apply, confirm logs show no spec flags
  • Toggle back On, click Apply, confirm spec flags reappear
  • Load a vision model (e.g. Gemma-3 with mmproj), confirm spec decoding toggle is hidden
  • Refresh the page with a GGUF model loaded, confirm spec decoding state is restored
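
The log checks in this plan could be scripted along the following lines. This is only a sketch: it assumes the llama-server command line can be pulled out of the logs as a single string, and `has_spec_flags` / `has_any_spec_flag` are hypothetical helpers, not part of Studio.

```python
import shlex

# Flag/value pairs the PR expects when ngram-mod is enabled.
EXPECTED_SPEC_FLAGS = {
    "--spec-type": "ngram-mod",
    "--spec-ngram-size-n": "24",
    "--draft-min": "48",
    "--draft-max": "64",
}


def has_spec_flags(command_line: str) -> bool:
    """True if every expected flag appears with its expected value."""
    tokens = shlex.split(command_line)
    for flag, value in EXPECTED_SPEC_FLAGS.items():
        if flag not in tokens:
            return False
        idx = tokens.index(flag)
        if idx + 1 >= len(tokens) or tokens[idx + 1] != value:
            return False
    return True


def has_any_spec_flag(command_line: str) -> bool:
    """True if any speculative-decoding flag is present (the 'Off' check)."""
    tokens = shlex.split(command_line)
    return any(flag in tokens for flag in EXPECTED_SPEC_FLAGS)
```

With the toggle On, `has_spec_flags` should return `True` for the logged command line; after switching Off and clicking Apply, `has_any_spec_flag` should return `False`.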

Enable n-gram speculative decoding for GGUF models in Unsloth Studio.
Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation
with zero VRAM cost via a 4MB fixed hash table that auto-resets on
low acceptance rates.

Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and
  InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model()
  with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags
  to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported
  in llama.cpp server-context.cpp)

Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store
  with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden
  for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a77a8ac5a8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +254 to +255
if effective_spec_type is None and not config.is_vision:
    effective_spec_type = "ngram-mod"

P1: Respect explicit disable of speculative decoding

When the client sends speculative_type: null to turn speculative decoding off (the new settings UI does this), this branch rewrites null to "ngram-mod" for all non-vision GGUF loads, so the model is always relaunched with speculation enabled. In practice, users cannot disable the feature via the new On/Off toggle because null is treated as both “unset” and “off”; this should distinguish explicit disable from missing input.
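
One way to separate "field omitted" from "explicitly off" is to check whether the key is present in the request body at all. The sketch below works on a plain dict to stay dependency-free; `resolve_spec_type` and `DEFAULT_SPEC_TYPE` are hypothetical names, and this is one possible approach rather than what the PR did.

```python
from typing import Any, Mapping, Optional

DEFAULT_SPEC_TYPE = "ngram-mod"  # default value named in the PR


def resolve_spec_type(payload: Mapping[str, Any], is_vision: bool) -> Optional[str]:
    """Resolve the effective speculative type for a load request.

    - key missing          -> apply the default
    - key present and None -> explicit "Off", keep None
    - key present, string  -> use it as sent
    Vision models always resolve to None (unsupported).
    """
    if is_vision:
        return None
    if "speculative_type" not in payload:
        return DEFAULT_SPEC_TYPE
    return payload["speculative_type"]


print(resolve_spec_type({}, is_vision=False))
print(resolve_spec_type({"speculative_type": None}, is_vision=False))
```

With a pydantic v2 request model, the same presence information should be available through the model's `model_fields_set` attribute, so a sentinel value is not strictly required.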


@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request implements speculative decoding support for GGUF models, specifically adding 'ngram-simple' and 'ngram-mod' modes to improve generation speed with zero VRAM cost. The changes span the backend llama-cpp integration, API models, and the frontend settings UI. A critical issue was identified in the backend routing logic where a hardcoded default prevents users from explicitly disabling speculative decoding, as the backend overrides the null value sent by the frontend when the feature is toggled off.

Comment thread studio/backend/routes/inference.py Outdated
Comment on lines +253 to +255
effective_spec_type = request.speculative_type
if effective_spec_type is None and not config.is_vision:
    effective_spec_type = "ngram-mod"

Priority: high

This logic forces ngram-mod whenever speculative_type is None. However, the frontend sends null (which becomes None in Python) when the user explicitly toggles speculative decoding to Off. This override makes it impossible for users to disable the feature for non-vision GGUF models. Since the frontend store already handles the default value (ngram-mod), this backend default is redundant and should be removed to respect the user's choice.

Suggested change
- effective_spec_type = request.speculative_type
- if effective_spec_type is None and not config.is_vision:
-     effective_spec_type = "ngram-mod"
+ effective_spec_type = request.speculative_type

fix: remove server-side speculative decoding override

The backend was overriding speculative_type=None to "ngram-mod" for
non-vision GGUF models, which prevented users from disabling spec
decoding via the UI toggle. The frontend store already defaults to
"ngram-mod", so the backend fallback was redundant and blocked the
explicit "Off" setting.

fix: use recommended ngram-mod params from llama.cpp docs

Update speculative decoding params to match the recommended values
from llama.cpp docs (docs/speculative.md):
  --spec-ngram-size-n 24 (was 16, docs say small n not recommended)
  --draft-min 48 (was 0)
  --draft-max 64 (was 24, docs note MoEs need long drafts)

Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes),
not 4 MB.
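
The corrected pool size follows directly from the two figures in the commit message. The snippet below assumes "4M entries" means 4 × 2^20; with 4 × 10^6 entries the result is still roughly 16 MB.

```python
# Assumption: "4M entries" is 4 * 2**20; 4 bytes per entry is from the commit message.
ENTRIES = 4 * 2**20
BYTES_PER_ENTRY = 4

pool_bytes = ENTRIES * BYTES_PER_ENTRY
pool_mib = pool_bytes / 2**20

print(f"{pool_bytes} bytes = {pool_mib:.0f} MiB")  # 16777216 bytes = 16 MiB
```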

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 67dae36881

Comment on lines +465 to +466
speculativeType: loadedSpec,
loadedSpeculativeType: loadedSpec,

P1: Preserve speculative default across non-GGUF loads

This assignment unconditionally overwrites the store with loadResponse.speculative_type, but non-GGUF (and vision GGUF) load responses return null/missing for that field. After loading one of those models, speculativeType becomes null, and the next GGUF load reuses that value in the request, so speculative decoding is silently disabled even though the feature is intended to default to On. Restrict this overwrite to GGUF models (or keep the previous/default value for non-GGUF responses).
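
The guard suggested here can be modelled independently of the store. The following Python sketch (the actual frontend code is TypeScript) keeps the current preference unless the response comes from a non-vision GGUF load; `next_speculative_type` and its parameters are illustrative names, not identifiers from the PR.

```python
from typing import Optional


def next_speculative_type(
    current: Optional[str],
    response_value: Optional[str],
    is_gguf: bool,
    is_vision: bool,
) -> Optional[str]:
    """Return the preference to keep in the store after a load response."""
    if is_gguf and not is_vision:
        # The backend's answer is authoritative for models that support it.
        return response_value
    # Non-GGUF and vision loads report null; do not let that clobber the preference.
    return current


pref = "ngram-mod"
# A non-GGUF load reports null, which must not overwrite the preference.
pref = next_speculative_type(pref, None, is_gguf=False, is_vision=False)
print(pref)  # still "ngram-mod"
```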


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 66797d16ab

trust_remote_code: paramsBeforeLoad.trustRemoteCode ?? false,
chat_template_override: chatTemplateOverride,
cache_type_kv: kvCacheDtype,
speculative_type: speculativeType,

P2: Propagate speculative setting in rollback load request

The new speculative_type is only attached to the primary loadModel request path, so when a model switch fails and performLoad() falls back to reloading the previous checkpoint, that rollback request still omits this field. In that failure scenario, GGUF rollback loads come back with speculative decoding disabled (null) even if the previous model had it enabled, so a transient load error silently changes runtime behavior after recovery.
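
A common way to avoid this kind of drift is to build every load request, primary or rollback, through one helper, so a new field cannot be added to only one path. The sketch below models that idea in Python for brevity (the frontend itself is TypeScript); `RuntimeSettings`, `build_load_request`, and the `model_path` key are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional


@dataclass(frozen=True)
class RuntimeSettings:
    """Settings that must accompany every load request (illustrative subset)."""
    cache_type_kv: Optional[str]
    speculative_type: Optional[str]


def build_load_request(model_path: str, settings: RuntimeSettings) -> Dict[str, Any]:
    """Single place where request fields are assembled."""
    return {
        "model_path": model_path,
        "cache_type_kv": settings.cache_type_kv,
        "speculative_type": settings.speculative_type,
    }


previous = RuntimeSettings(cache_type_kv=None, speculative_type="ngram-mod")
# On a failed switch, the rollback reuses the same builder with the
# previous model's settings, so speculative_type is carried along.
rollback_request = build_load_request("previous-model.gguf", previous)
print(rollback_request["speculative_type"])
```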

add benchmark table and references to speculative decoding comment

Include speedup numbers from llama.cpp PRs #18471 and #19164 as an
inline comment so future readers understand the expected gains.
@danielhanchen danielhanchen merged commit a32b871 into main Apr 3, 2026
5 checks passed
@danielhanchen danielhanchen deleted the feat/studio-speculative-decoding branch April 3, 2026 20:57

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1e8f4ac792

n_ctx = request.max_seq_length,
chat_template_override = request.chat_template_override,
cache_type_kv = request.cache_type_kv,
speculative_type = request.speculative_type,

P2: Apply default speculative mode when field is omitted

Forwarding request.speculative_type directly means omitted input is treated as None (because LoadRequest.speculative_type defaults to None), so GGUF loads from callers that still omit this field (e.g. chat-adapter.ts auto-load paths and shared-composer.tsx compare loads) silently start with speculative decoding off. That makes behavior depend on which UI path triggered the load, despite this feature being intended to be on by default for GGUF models.

shibizhao pushed a commit to shibizhao/unsloth-npu that referenced this pull request Apr 7, 2026
…nslothai#4836)

* studio: add speculative decoding support (ngram-mod, on by default)

Enable n-gram speculative decoding for GGUF models in Unsloth Studio.
Uses llama.cpp's ngram-mod mode which gives 10-40% faster generation
with zero VRAM cost via a 4MB fixed hash table that auto-resets on
low acceptance rates.

Backend:
- Add speculative_type field to LoadRequest, LoadResponse, and
  InferenceStatusResponse pydantic models
- Add speculative_type parameter to LlamaCppBackend.load_model()
  with allowlist validation (ngram-simple, ngram-mod)
- Pass --spec-type, --spec-ngram-size-n 16, --draft-max 24 flags
  to llama-server when ngram-mod is active
- Default to ngram-mod for non-vision GGUF models server-side
- Silently skip speculative decoding for vision models (unsupported
  in llama.cpp server-context.cpp)

Frontend:
- Add speculative_type to TS API types
- Add speculativeType/loadedSpeculativeType to chat runtime store
  with default value of "ngram-mod"
- Add On/Off toggle in Model settings section (GGUF only, hidden
  for vision models), included in dirty check for Apply/Reset
- Wire speculative_type through model load request and response
- Restore speculative type state on page refresh/reconnect

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* fix: remove server-side speculative decoding override

The backend was overriding speculative_type=None to "ngram-mod" for
non-vision GGUF models, which prevented users from disabling spec
decoding via the UI toggle. The frontend store already defaults to
"ngram-mod", so the backend fallback was redundant and blocked the
explicit "Off" setting.

* fix: use recommended ngram-mod params from llama.cpp docs

Update speculative decoding params to match the recommended values
from llama.cpp docs (docs/speculative.md):
  --spec-ngram-size-n 24 (was 16, docs say small n not recommended)
  --draft-min 48 (was 0)
  --draft-max 64 (was 24, docs note MoEs need long drafts)

Also fix comment: ngram-mod uses ~16 MB (4M entries * 4 bytes),
not 4 MB.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* add benchmark table and references to speculative decoding comment

Include speedup numbers from llama.cpp PRs #18471 and #19164 as an
inline comment so future readers understand the expected gains.

---------

Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
@NourEldin-Osama

@copilot can we support disabling the vision encoder, so that n-gram speculative decoding can be used with those models?
