chore(server): quiet benign Gemma4-Assistant memory-probe warning on MTP load #94
Merged
The MTP draft memory-probe path (server-context.cpp, ~line 856) creates a throwaway llama_context with `cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP` to measure context + compute bytes for fit_params. For the Gemma4-Assistant arch this throws, because `cparams.ctx_other` is required and the target context doesn't exist yet at probe time. Upstream's own src/llama-context.cpp init explicitly notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed llama_exception so the warning can be skipped.

Until that upstream change lands and flows in via a master sync, scan the exception message for the self-identifying "normal during memory fitting" marker and downgrade SRV_WRN -> SRV_DBG for that specific case. Real failures (model load failed, etc.) still surface as SRV_WRN.

This eliminates the misleading "[spec] failed to measure draft model memory: failed to create llama_context from model" line that appears on every gemma-4-12b-qat-mtp pod start, even though the model loads and runs fine at ~110 tok/s (verified Phase 6 on titan, image unified-llm:mtp-pr23398-5e6dff22).
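The marker-based downgrade described above can be sketched roughly as follows. This is a minimal, standalone illustration and not the code from this PR: `probe_log_level`, `classify_probe_failure`, and `try_probe` are hypothetical names, and the enum only stands in for the server's SRV_WRN / SRV_DBG logging. The only detail taken from the description is the "normal during memory fitting" marker text.

```cpp
#include <cassert>
#include <cstring>
#include <stdexcept>

// Hypothetical stand-in for the two log levels involved
// (debug ~ SRV_DBG, warning ~ SRV_WRN).
enum class probe_log_level { debug, warning };

// Marker quoted in the PR description. The full upstream message wording
// is not reproduced here, only the self-identifying substring.
static const char * const k_benign_marker = "normal during memory fitting";

// Decide how loudly a failed memory probe should be reported:
// messages carrying the marker are benign, everything else is a real failure.
inline probe_log_level classify_probe_failure(const char * what) {
    return std::strstr(what, k_benign_marker) != nullptr
        ? probe_log_level::debug
        : probe_log_level::warning;
}

// Run a probe callable. Returns true on success; on failure returns false
// and stores the level at which the failure should be logged.
template <typename Probe>
bool try_probe(Probe && probe, probe_log_level & level_out) {
    try {
        probe();
        return true;
    } catch (const std::exception & e) {
        level_out = classify_probe_failure(e.what());
        return false;
    }
}
```

String matching on an exception message is brittle by nature, which is why the description treats it as a stopgap until a typed exception is available upstream.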
Summary
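- The MTP draft memory-probe path in `server-context.cpp` (~line 856) throws on Gemma4-Assistant because `cparams.ctx_other` is required and the target context doesn't exist yet at probe time.
- Upstream's `src/llama-context.cpp` notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed `llama_exception` so the warning can be skipped.
- Until that lands, scan the exception message for the "normal during memory fitting" marker and downgrade `SRV_WRN` -> `SRV_DBG` for that specific case. Real failures (model load failed, etc.) still surface as `SRV_WRN`.
Why
- Eliminates the misleading `[spec] failed to measure draft model memory: failed to create llama_context from model` line that appears on every gemma-4-12b-qat-mtp pod start despite the model loading and running fine at ~110 tok/s (verified Phase 6 on titan, image `unified-llm:mtp-pr23398-5e6dff22`).
Test plan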
Related
- `src/llama-context.cpp` near the `GEMMA4_ASSISTANT` `ctx_other` check (carries a TODO from am17an's upstream PR "llama : add Gemma4 MTP", ggml-org/llama.cpp#23398, to switch to a typed exception)
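For comparison, here is a rough sketch of what the typed-exception approach from that TODO could look like on the caller side. Everything here is an assumption for illustration: upstream has not defined this type, so `expected_probe_error`, `probe_outcome`, and `run_probe_typed` are hypothetical names and do not reflect any existing llama.cpp API.

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical exception type signalling an expected, benign probe failure.
// Upstream only carries a TODO for a typed exception; this shape is assumed.
struct expected_probe_error : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical result of a memory probe.
enum class probe_outcome { ok, benign, failed };

// Run a probe callable and classify the result by exception type rather
// than by scanning the message text.
template <typename Probe>
probe_outcome run_probe_typed(Probe && probe) {
    try {
        probe();
        return probe_outcome::ok;
    } catch (const expected_probe_error &) {
        return probe_outcome::benign;  // would be logged at debug level
    } catch (const std::exception &) {
        return probe_outcome::failed;  // would still be logged as a warning
    }
}
```

The more specific handler has to come first, since `expected_probe_error` derives from `std::runtime_error` and would otherwise be caught by the generic `std::exception` handler.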