UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings by loci-dev · Pull Request #1167 · auroralabs-loci/llama.cpp

loci-dev · 2026-02-12T11:20:36Z

Note

Source pull request: ggml-org/llama.cpp#18862

This commit removes the write/read of output ids, logits and embeddings from the llama context state.

This commit removes the write/read of output ids, logits and embeddings from the llama context state. Refs: ggml-org/llama.cpp#18862 (comment)

This commit updates the session handing in the completion tool to handle the that logits are no longer stored in the session file. Instead, we need to replay the last token to get the logits for sampling.

This commit adds a new function which is responsible for decoding prompt and optionally handle the saving for session data.

This commit updates the save-load-state example to utilize the new llama_state_load_file function for loading the model state from a file. And it also replays the last token after loading since this state is now stored before the last token is processed. I'm not sure if this is acceptable or not, as it does change the example to not directly use llama_state_get_data and llama_state_set_data for loading which might have been the point of the example.

This commit updates the save-load-state example to set the n_seq_max parameter to 2 when initializing the ctx3 context. The motivation for this change is that using 1 as n_parallel/n_seq_max the context only supports one sequence, but the test laster tries to use a second sequence which results in the following error: ```console main : loaded state with 4 tokens main : seq 0 copied, 225760 bytes main : kv cache cleared find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value state_read_meta: failed to find available cells in kv cache ``` This seems to only happen for recurrent/hybrid models.

This commit extracts the replay_last_token function from save-load-state.cpp to common.h. The motivation for this is to allow reuse of the function but also to clarify the intent of code that replays the last token after loading the session state.

loci-review · 2026-02-12T12:52:36Z

Overview

Analysis of 115,793 functions across 15 binaries revealed 171 modified, 30 new, and 0 removed functions. The primary change is a session state management optimization delivering a 77.8% improvement in state loading time (25,546ns → 5,673ns, saving 19,873ns). Standard library functions show compiler-driven variations (±68-177%) with minimal absolute impact (20-200ns in initialization code). Performance-critical inference paths (matrix operations, attention, quantization) remain completely unmodified.

Power Consumption Changes:

libllama.so: -0.174% (core library improvement)
llama-cvector-generator: +0.045%
llama-gguf-split: +0.497%
llama-tokenize: +0.455%
llama-quantize: +0.449%
llama-bench: +0.301%
llama-tts: +0.047%
libmtmd.so: -0.0%
libggml.so, libggml-base.so, libggml-cpu.so: 0.0% (unchanged)
llama-qwen2vl-cli, llama-llava-cli, llama-minicpmv-cli, llama-gemma3-cli: 0.0% (unchanged)

Overall power consumption: -0.0054% (negligible improvement).

Function Analysis

Major Improvement: llama_context::state_read_data (libllama.so)

Response time: 25,546ns → 5,673ns (-77.8%, -19,873ns)
Throughput time: 1,084ns → 410ns (-62.2%, -674ns)
Source changes: Commit 24a085f removed 70 lines eliminating serialization of output IDs, logits, and embeddings buffers. New strategy saves state before last token and replays on load to regenerate outputs on-demand, reducing state file sizes by 50-90%.

STL Container Accessor Improvements (compiler-driven):

std::vector::end() and begin() variants across multiple binaries: 68-75% faster (243-265ns → 60-85ns)
Source changes: None. Improvements from better compiler inlining and optimization.

Allocator Function Variations:

std::vector<char>::_S_max_size (cvector-generator): +176.7% throughput (119ns → 328ns)
std::vector<pair<...>>::_S_max_size (cvector-generator): -62.6% throughput (333ns → 125ns)
Source changes: None. Compiler optimization variations in libstdc++ template instantiations.

Minor Regressions:

common_log_set_verbosity_thold (gguf-split): +124.8% (17ns → 38ns absolute)
jinja::value_bool_t::unique_hash (cvector-generator): +149.5% throughput (79ns → 197ns)
Regex NFA and HTTP functions (llama-tts): +110-160% throughput, but dead code (never executed)
Source changes: None. All are compiler optimization artifacts in initialization or unused code.

Additional Findings

Architectural Soundness: Changes appropriately eliminate transient output data (logits, embeddings) from state files while preserving essential state (KV cache). The replay mechanism maintains correctness through deterministic regeneration. GGML libraries show zero changes, confirming performance-critical tensor operations (70-90% of inference time) are unaffected. GPU backends (CUDA, Metal, HIP, Vulkan) remain unmodified. The optimization benefits production deployments with frequent state save/load operations (servers, multi-turn conversations) while having no impact on per-token inference performance.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

danbev added 6 commits February 11, 2026 06:32

llama : remove write/read of output ids/logits/embeddings

24a085f

This commit removes the write/read of output ids, logits and embeddings from the llama context state. Refs: ggml-org/llama.cpp#18862 (comment)

completion : add replying of session state

03758ab

This commit updates the session handing in the completion tool to handle the that logits are no longer stored in the session file. Instead, we need to replay the last token to get the logits for sampling.

common : add common_prompt_batch_decode function

f6c2803

This commit adds a new function which is responsible for decoding prompt and optionally handle the saving for session data.

loci-dev temporarily deployed to PROD__AL_DEMO February 12, 2026 11:20 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 11 times, most recently from 10f8f26 to a6ecec6 Compare February 20, 2026 02:17

loci-dev force-pushed the main branch 9 times, most recently from 6495042 to 61b4303 Compare February 28, 2026 02:16

loci-dev force-pushed the main branch 2 times, most recently from ef246cc to 8c889a6 Compare March 2, 2026 02:17

loci-dev force-pushed the main branch 12 times, most recently from 59f2b25 to d63964d Compare March 24, 2026 02:17

loci-dev force-pushed the main branch 9 times, most recently from 8fec234 to 82160d6 Compare March 31, 2026 02:17

loci-dev force-pushed the main branch 9 times, most recently from fd3ce9d to 1770118 Compare April 6, 2026 02:18

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings#1167

UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings#1167
loci-dev wants to merge 6 commits into
mainfrom
loci/pr-18862-sampling-state

loci-dev commented Feb 12, 2026

Uh oh!

loci-review Bot commented Feb 12, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Feb 12, 2026

Uh oh!

loci-review Bot commented Feb 12, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants