
common : add common_speculative_is_compat() #19270

Merged

ggerganov merged 3 commits into master from gg/spec-disable-for-recurrent on Feb 6, 2026

Conversation

ggerganov (Member) commented Feb 2, 2026

fix #19267

Memory modules that do not support removing the last tokens from the context (such as recurrent modules) cannot perform speculative decoding. Add a new common_speculative_is_compat() to query this capability and use it in llama-server to disable speculative decoding for those contexts.
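The warn-and-disable behavior described above can be sketched roughly as follows. This is a hypothetical Python mock, not the actual llama.cpp code (which is C++): common_speculative_is_compat() is the real helper's name, but the dict-based "context" and the init_slot() function are invented here for illustration.

```python
# Hypothetical Python mock of the server-side gating described above.
# The real helper is C++ in llama.cpp; the dict-based "context" and
# init_slot() are invented stand-ins for illustration only.

def common_speculative_is_compat(ctx: dict) -> bool:
    # A context whose memory cannot remove trailing tokens
    # (e.g. a recurrent memory module) cannot do speculative decoding.
    return ctx["memory_can_rm_suffix"]

def init_slot(ctx: dict, spec_requested: bool) -> bool:
    """Return whether speculative decoding ends up enabled for this context."""
    if spec_requested and not common_speculative_is_compat(ctx):
        # Warn and continue serving, rather than erroring out.
        print("warning: speculative decoding not supported by this context, disabling")
        return False
    return spec_requested

kv_ctx  = {"memory_can_rm_suffix": True}   # e.g. a standard KV-cache model
rec_ctx = {"memory_can_rm_suffix": False}  # e.g. a recurrent model

print(init_slot(kv_ctx, True))   # speculative decoding stays enabled
print(init_slot(rec_ctx, True))  # disabled with a warning, serving continues
```

The key design point, discussed below in the thread, is that the incompatible case degrades to a warning instead of a hard error, so a shared server configuration can still route to models that lack this capability.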

ngxson (Collaborator) left a comment:

Just wondering if we can do this inside common_speculative_init instead.

For example, common_speculative_init can try to evaluate 2 tokens, then remove the last one. If llama_memory_seq_rm returns an error, then we throw an error saying the model is not compatible.

Btw, I think it's better to throw an error and exit, rather than a warning.
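The probe idea can be illustrated with a small Python mock. The actual implementation is C++ (llama_memory_seq_rm is the real llama.cpp function being probed), but MockMemory, its seq_rm method, and speculative_is_compat below are hypothetical stand-ins: evaluate two tokens, then attempt to remove the trailing one; if removal fails, the context cannot roll back rejected draft tokens.

```python
# Hypothetical Python mock of the probe; all names here are stand-ins,
# not the real llama.cpp API (which is C++: llama_memory_seq_rm, etc.).

class MockMemory:
    """Models a context memory; recurrent memories cannot remove a token suffix."""

    def __init__(self, recurrent: bool):
        self.recurrent = recurrent
        self.tokens: list[int] = []

    def decode(self, token: int) -> None:
        self.tokens.append(token)

    def seq_rm(self, p0: int, p1: int) -> bool:
        # Mirrors llama_memory_seq_rm returning an error: a recurrent
        # memory cannot roll back already-processed tokens.
        if self.recurrent:
            return False
        del self.tokens[p0:p1]
        return True

def speculative_is_compat(mem: MockMemory) -> bool:
    # Probe: evaluate 2 tokens, then try to remove the trailing one.
    # Speculative decoding needs exactly this rollback to discard
    # draft tokens the target model rejects.
    mem.decode(0)
    mem.decode(1)
    return mem.seq_rm(1, 2)

print(speculative_is_compat(MockMemory(recurrent=False)))  # True
print(speculative_is_compat(MockMemory(recurrent=True)))   # False
```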

coder543 commented Feb 3, 2026

I just ran into #19267, and it would be cool if there were a way to make this compatible rather than just disabling it, but disabling it is better than crashing. With Qwen3-Coder-Next, ngram-mod could provide large speedups during coding workflows.

ggerganov (Member, Author) commented

@ngxson Implemented this idea in a new common_speculative_is_compat() helper function.

> Btw, I think it's better to throw an error and exit, rather than a warning.

Do you have something specific in mind? In my server config, I want to set a default ngram-based spec decoding and have it applied to all routed models. When a routed model does not support it, the server still continues to work. So I think a warning is better.

ggerganov changed the title from "llama : add llama_memory_can_rm_suffix()" to "common : add common_speculative_is_compat()" on Feb 4, 2026
ggerganov requested a review from ngxson on February 5, 2026 08:08
ggerganov (Member, Author) commented

@ngxson Gentle ping

ngxson (Collaborator) left a comment:

Yeah sorry I missed the notif. LGTM!

Just wondering if we should also do the same check for the draft model.

ggerganov (Member, Author) commented

Yes, I think we can do that. Will follow up in a subsequent PR.

ggerganov merged commit dfde599 into master on Feb 6, 2026
70 of 78 checks passed
ggerganov deleted the gg/spec-disable-for-recurrent branch on February 6, 2026 14:47
liparetejas pushed a commit to liparetejas/llama.cpp that referenced this pull request Feb 23, 2026
* llama : add llama_memory_can_rm_suffix()

* Revert "llama : add llama_memory_can_rm_suffix()"

This reverts commit d30e59b.

* spec : check if the target context is compatible for spec decoding


Successfully merging this pull request may close these issues:

Eval bug: spec-type ngram-mod crash with Qwen3Next (#19267)

4 participants