Skip to content

UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings#1167

Open
loci-dev wants to merge 6 commits into
mainfrom
loci/pr-18862-sampling-state
Open

UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings#1167
loci-dev wants to merge 6 commits into
mainfrom
loci/pr-18862-sampling-state

Conversation

@loci-dev
Copy link
Copy Markdown

Note

Source pull request: ggml-org/llama.cpp#18862

This commit removes the write/read of output ids, logits and embeddings from the llama context state.

This commit removes the write/read of output ids, logits and
embeddings from the llama context state.

Refs: ggml-org/llama.cpp#18862 (comment)
This commit updates the session handing in the completion tool to handle
the that logits are no longer stored in the session file. Instead, we
need to replay the last token to get the logits for sampling.
This commit adds a new function which is responsible for decoding prompt
and optionally handle the saving for session data.
This commit updates the save-load-state example to utilize the new
llama_state_load_file function for loading the model state from a file.
And it also replays the last token after loading since this state is now
stored before the last token is processed.

I'm not sure if this is acceptable or not, as it does change the example
to not directly use llama_state_get_data and llama_state_set_data for
loading which might have been the point of the example.
This commit updates the save-load-state example to set the n_seq_max
parameter to 2 when initializing the ctx3 context.

The motivation for this change is that using 1 as n_parallel/n_seq_max
the context only supports one sequence, but the test laster tries to
use a second sequence which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to only happen for recurrent/hybrid models.
This commit extracts the replay_last_token function from
save-load-state.cpp to common.h.

The motivation for this is to allow reuse of the function but also to
clarify the intent of code that replays the last token after loading
the session state.
@loci-review
Copy link
Copy Markdown

loci-review Bot commented Feb 12, 2026

Overview

Analysis of 115,793 functions across 15 binaries revealed 171 modified, 30 new, and 0 removed functions. The primary change is a session state management optimization delivering a 77.8% improvement in state loading time (25,546ns → 5,673ns, saving 19,873ns). Standard library functions show compiler-driven variations (±68-177%) with minimal absolute impact (20-200ns in initialization code). Performance-critical inference paths (matrix operations, attention, quantization) remain completely unmodified.

Power Consumption Changes:

  • libllama.so: -0.174% (core library improvement)
  • llama-cvector-generator: +0.045%
  • llama-gguf-split: +0.497%
  • llama-tokenize: +0.455%
  • llama-quantize: +0.449%
  • llama-bench: +0.301%
  • llama-tts: +0.047%
  • libmtmd.so: -0.0%
  • libggml.so, libggml-base.so, libggml-cpu.so: 0.0% (unchanged)
  • llama-qwen2vl-cli, llama-llava-cli, llama-minicpmv-cli, llama-gemma3-cli: 0.0% (unchanged)

Overall power consumption: -0.0054% (negligible improvement).

Function Analysis

Major Improvement: llama_context::state_read_data (libllama.so)

  • Response time: 25,546ns → 5,673ns (-77.8%, -19,873ns)
  • Throughput time: 1,084ns → 410ns (-62.2%, -674ns)
  • Source changes: Commit 24a085f removed 70 lines eliminating serialization of output IDs, logits, and embeddings buffers. New strategy saves state before last token and replays on load to regenerate outputs on-demand, reducing state file sizes by 50-90%.

STL Container Accessor Improvements (compiler-driven):

  • std::vector::end() and begin() variants across multiple binaries: 68-75% faster (243-265ns → 60-85ns)
  • Source changes: None. Improvements from better compiler inlining and optimization.

Allocator Function Variations:

  • std::vector<char>::_S_max_size (cvector-generator): +176.7% throughput (119ns → 328ns)
  • std::vector<pair<...>>::_S_max_size (cvector-generator): -62.6% throughput (333ns → 125ns)
  • Source changes: None. Compiler optimization variations in libstdc++ template instantiations.

Minor Regressions:

  • common_log_set_verbosity_thold (gguf-split): +124.8% (17ns → 38ns absolute)
  • jinja::value_bool_t::unique_hash (cvector-generator): +149.5% throughput (79ns → 197ns)
  • Regex NFA and HTTP functions (llama-tts): +110-160% throughput, but dead code (never executed)
  • Source changes: None. All are compiler optimization artifacts in initialization or unused code.

Additional Findings

Architectural Soundness: Changes appropriately eliminate transient output data (logits, embeddings) from state files while preserving essential state (KV cache). The replay mechanism maintains correctness through deterministic regeneration. GGML libraries show zero changes, confirming performance-critical tensor operations (70-90% of inference time) are unaffected. GPU backends (CUDA, Metal, HIP, Vulkan) remain unmodified. The optimization benefits production deployments with frequent state save/load operations (servers, multi-turn conversations) while having no impact on per-token inference performance.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 11 times, most recently from 10f8f26 to a6ecec6 Compare February 20, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 Compare February 28, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 2 times, most recently from ef246cc to 8c889a6 Compare March 2, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 12 times, most recently from 59f2b25 to d63964d Compare March 24, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from 8fec234 to 82160d6 Compare March 31, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from fd3ce9d to 1770118 Compare April 6, 2026 02:18
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants