common : add common_speculative_is_compat() #19270
Conversation
ngxson
left a comment
Just wondering if we can do this inside common_speculative_init instead.
For example, common_speculative_init can try to evaluate 2 tokens, then remove the first one. If llama_memory_seq_rm returns an error, we throw an error saying the model is not compatible.
Btw, I think it's better to throw an error and exit rather than just emit a warning.
I just ran into #19267, and it would be cool if there were a way to make this compatible rather than just disabling it, but disabling it is better than crashing. With Qwen3-Coder-Next, ngram-mod could provide large speedups during coding workflows.
@ngxson Implemented this idea in a new
Do you have something specific in mind? In my server config, I want to set a default ngram-based spec decoding and have it applied to all routed models. When a routed model does not support it, it should still continue to work. So I think a warning is better.
@ngxson Gentle ping
ngxson
left a comment
Yeah sorry I missed the notif. LGTM!
Just wondering if we should also do the same check for the draft model.
Yes, I think we can do that. Will follow up in the next PR.
* llama : add llama_memory_can_rm_suffix()
* Revert "llama : add llama_memory_can_rm_suffix()" (this reverts commit d30e59b)
* spec : check if the target context is compatible for spec decoding
fix #19267
Memory modules that do not support removing the last tokens from the context (such as recurrent modules) cannot perform speculative decoding. Add a new
`common_speculative_is_compat()` to query this capability, and use it in `llama-server` to disable speculative decoding for those contexts.