chore(server): quiet benign Gemma4-Assistant memory-probe warning on MTP load by marksverdhei · Pull Request #94 · heiervang-technologies/ht-llama.cpp

marksverdhei · 2026-06-07T14:16:55Z

Summary

The MTP draft memory-probe in server-context.cpp (~line 856) throws on Gemma4-Assistant because cparams.ctx_other is required and the target context doesn't exist yet at probe time.
Upstream's own init explicitly notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed llama_exception so the warning can be skipped.
Until that upstream change lands and flows in via a master sync, scan the exception message for the self-identifying "normal during memory fitting" marker and downgrade SRV_WRN -> SRV_DBG for that specific case. Real failures (model load failed, etc.) still surface as SRV_WRN.
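
The marker check described above can be sketched roughly as follows. This is a minimal illustration and not the actual diff: `log_level`, `BENIGN_FIT_MARKER`, and `probe_failure_level` are hypothetical names standing in for the server's `SRV_WRN` / `SRV_DBG` macros and the catch block in server-context.cpp. Only the marker string itself is taken from the description.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for choosing between SRV_DBG and SRV_WRN.
enum class log_level { debug, warn };

// Marker text quoted in the PR description; upstream embeds it in the
// exception message thrown while creating the context.
static const char * const BENIGN_FIT_MARKER = "normal during memory fitting";

// Pick the log level for a failed draft memory probe (hypothetical helper).
// Only exceptions that self-identify as benign are downgraded; everything
// else keeps the warning level.
static log_level probe_failure_level(const std::exception & e) {
    const std::string msg = e.what();
    return msg.find(BENIGN_FIT_MARKER) != std::string::npos
        ? log_level::debug
        : log_level::warn;
}
```

Matching on the message text is brittle if upstream rewords the exception, which is why the description treats it as a stopgap until a typed exception exists.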

Why

Eliminates the misleading [spec] failed to measure draft model memory: failed to create llama_context from model line that appears on every gemma-4-12b-qat-mtp pod start despite the model loading + running fine at ~110 tok/s (verified Phase 6 on titan, image unified-llm:mtp-pr23398-5e6dff22).
The warning was non-blocking and we shipped γ around it, but it pollutes the logs every load and would lead to a future operator chasing a non-bug.

Test plan

cmake build of llama-server still clean (CUDA on, all targets)
No behavior change for real failures (only downgrade when the exception self-identifies as benign-during-fit)
Snoop verifies the warning is gone on the next titan pod restart (no roll/bake needed — just a logging change; we'll fold it into the next bake naturally)

Related

Source of the exception: src/llama-context.cpp near the GEMMA4_ASSISTANT ctx_other check (carries a TODO from am17an's upstream PR "llama : add Gemma4 MTP", ggml-org/llama.cpp#23398, to switch to a typed exception)
γ master sync PR that brought the warning in: "feat(sync): upstream master sync (42 commits) + Gemma4 MTP via PR #23398 vendor" (#93)
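
For comparison, the typed-exception approach that the upstream TODO points toward might look something like the sketch below. Everything here is an assumption for illustration: upstream has not defined this type, so the shape of `llama_exception`, the `llama_error_kind` enum, and the `probe_should_warn` helper are all hypothetical.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical error categories; not part of llama.cpp today.
enum class llama_error_kind { generic, missing_ctx_other };

// Hypothetical typed exception carrying a machine-readable kind.
class llama_exception : public std::runtime_error {
public:
    llama_exception(llama_error_kind kind, const std::string & msg)
        : std::runtime_error(msg), kind_(kind) {}
    llama_error_kind kind() const noexcept { return kind_; }
private:
    llama_error_kind kind_;
};

// Runs a probe callable and reports whether a warning should be logged.
// A missing ctx_other during memory fitting is treated as benign; any
// other failure still warrants a warning.
template <typename Probe>
bool probe_should_warn(Probe && probe) {
    try {
        probe();
        return false; // probe succeeded, nothing to report
    } catch (const llama_exception & e) {
        return e.kind() != llama_error_kind::missing_ctx_other;
    } catch (const std::exception &) {
        return true; // untyped failure: keep the warning
    }
}
```

With a type like this the caller would no longer depend on the wording of the message, so the string scan could be removed once such a change arrives via a master sync.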

The MTP draft memory-probe path (server-context.cpp ~line 856) creates a throwaway llama_context with `cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP` to measure context+compute bytes for fit_params. For the Gemma4-Assistant arch, this throws because `cparams.ctx_other` is required and the target context doesn't exist yet at probe time. Upstream's own src/llama-context.cpp init explicitly notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed llama_exception so the warning can be skipped.

Until that upstream change lands and flows in via a master sync, scan the exception message for the self-identifying "normal during memory fitting" marker and downgrade WRN -> DBG for that specific case. Real failures (model load failed, etc.) still surface as SRV_WRN.

Eliminates the misleading "[spec] failed to measure draft model memory: failed to create llama_context from model" line that appears on every gemma-4-12b-qat-mtp pod start despite the model loading + running fine at ~110 tok/s (verified Phase 6 on titan, image unified-llm:mtp-pr23398-5e6dff22).

marksverdhei merged commit f6feddb into ht Jun 7, 2026
1 of 8 checks passed

marksverdhei deleted the chore/quiet-spec-mtp-probe-warning branch June 7, 2026 14:49

marksverdhei merged 1 commit into ht from chore/quiet-spec-mtp-probe-warning
