chore(server): quiet benign Gemma4-Assistant memory-probe warning on MTP load#94

Merged: marksverdhei merged 1 commit into ht from chore/quiet-spec-mtp-probe-warning on Jun 7, 2026

@marksverdhei

Summary

  • The MTP draft memory-probe in server-context.cpp (~line 856) throws on Gemma4-Assistant because cparams.ctx_other is required and the target context doesn't exist yet at probe time.
  • Upstream's own init explicitly notes "this is normal during memory fitting" in the exception message and carries a TODO to switch to a typed llama_exception so the warning can be skipped.
  • Until that upstream change lands and flows in via a master sync, scan the exception message for the self-identifying "normal during memory fitting" marker and downgrade SRV_WRN -> SRV_DBG for that specific case. Real failures (model load failed, etc.) still surface as SRV_WRN.
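
A minimal sketch of the marker check described above, assuming a plain case-sensitive substring match on the exception text. The names `LogLevel`, `kBenignFitMarker`, `is_benign_fit_failure`, and `probe_failure_level` are hypothetical and are not taken from the actual change; only the marker string comes from the description.

```cpp
#include <string_view>

// Hypothetical stand-in for the SRV_DBG / SRV_WRN choice; not the real macros.
enum class LogLevel { Debug, Warning };

// Marker text that, per the PR description, upstream puts in the exception
// message for the expected probe failure.
inline constexpr std::string_view kBenignFitMarker = "normal during memory fitting";

// True when the exception text self-identifies as an expected probe failure.
constexpr bool is_benign_fit_failure(std::string_view what) noexcept {
    return what.find(kBenignFitMarker) != std::string_view::npos;
}

// Choose the log level for a failed draft-memory probe: Debug only for the
// self-identified benign case, Warning for everything else.
constexpr LogLevel probe_failure_level(std::string_view what) noexcept {
    return is_benign_fit_failure(what) ? LogLevel::Debug : LogLevel::Warning;
}
```

In a catch handler this would be called as `probe_failure_level(e.what())`. Because the match is an exact substring, any upstream rewording of the message would make the check miss and the warning would reappear, which is the safe direction to fail in.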

Why

  • Eliminates the misleading [spec] failed to measure draft model memory: failed to create llama_context from model line that appears on every gemma-4-12b-qat-mtp pod start despite the model loading + running fine at ~110 tok/s (verified Phase 6 on titan, image unified-llm:mtp-pr23398-5e6dff22).
  • The warning was non-blocking and we shipped γ around it, but it pollutes the logs on every load and could send a future operator chasing a non-bug.

Test plan

  • cmake build of llama-server still clean (CUDA on, all targets)
  • No behavior change for real failures (only downgrade when the exception self-identifies as benign-during-fit)
  • Snoop verifies the warning is gone on the next titan pod restart (no roll/bake needed — just a logging change; we'll fold it into the next bake naturally)

Related

The MTP draft memory-probe path (server-context.cpp ~line 856) creates
a throwaway llama_context with `cparams.ctx_type = LLAMA_CONTEXT_TYPE_MTP`
to measure context+compute bytes for fit_params. For the Gemma4-Assistant
arch, this throws because `cparams.ctx_other` is required and the target
context doesn't exist yet at probe time — upstream's own
src/llama-context.cpp init explicitly notes "this is normal during memory
fitting" in the exception message and carries a TODO to switch to a typed
llama_exception so the warning can be skipped.

Until that upstream change lands and flows in via a master sync, scan the
exception message for the self-identifying "normal during memory fitting"
marker and downgrade WRN -> DBG for that specific case. Real failures
(model load failed, etc.) still surface as SRV_WRN.

Eliminates the misleading "[spec] failed to measure draft model memory:
failed to create llama_context from model" line that appears on every
gemma-4-12b-qat-mtp pod start despite the model loading + running fine
at ~110 tok/s (verified Phase 6 on titan, image
unified-llm:mtp-pr23398-5e6dff22).
@marksverdhei marksverdhei merged commit f6feddb into ht Jun 7, 2026
1 of 8 checks passed
@marksverdhei marksverdhei deleted the chore/quiet-spec-mtp-probe-warning branch June 7, 2026 14:49