UPSTREAM PR #18862: llama : remove write/read of output ids/logits/embeddings #1167
loci-dev wants to merge 6 commits into
This commit removes the write/read of output ids, logits and embeddings from the llama context state. Refs: ggml-org/llama.cpp#18862 (comment)
This commit updates the session handling in the completion tool to account for the fact that logits are no longer stored in the session file. Instead, the last token must be replayed to obtain the logits for sampling.
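The replay idea can be illustrated with a toy model. The following is a minimal sketch, not llama.cpp code: `ToyContext`, `ToySession`, `save_session` and `load_and_replay` are hypothetical names, and the "logits" are just a deterministic function of the cached tokens. It assumes the state is saved before the last prompt token is decoded, so that decoding that token again after loading regenerates the transient output.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for a model context; NOT the llama.cpp API.
struct ToyContext {
    std::vector<uint32_t> cache;   // persistent state (plays the role of the KV cache)
    std::vector<float>    logits;  // transient output of the most recent decode

    // "Decode" one token: extend the cache, then recompute the logits
    // deterministically from the whole cache.
    void decode(uint32_t token) {
        cache.push_back(token);
        logits.assign(4, 0.0f);
        for (std::size_t i = 0; i < cache.size(); ++i) {
            logits[(cache[i] + i) % 4] += 1.0f / static_cast<float>(i + 1);
        }
    }
};

// Hypothetical session record: persistent state plus the prompt tokens.
// Note that no logits are stored.
struct ToySession {
    std::vector<uint32_t> cache;   // state BEFORE the last token was decoded
    std::vector<uint32_t> tokens;  // full prompt, including the last token
};

// Decode everything except the last token, then snapshot the state.
ToySession save_session(const std::vector<uint32_t> & prompt) {
    assert(!prompt.empty());
    ToyContext ctx;
    for (std::size_t i = 0; i + 1 < prompt.size(); ++i) {
        ctx.decode(prompt[i]);
    }
    return ToySession{ctx.cache, prompt};
}

// Restore the persistent state, then replay the last token so that
// the logits needed for sampling exist again.
ToyContext load_and_replay(const ToySession & session) {
    assert(!session.tokens.empty());
    ToyContext ctx;
    ctx.cache = session.cache;          // logits are still empty at this point
    ctx.decode(session.tokens.back());  // regenerates the logits
    return ctx;
}
```

In this sketch, a context restored with `load_and_replay` ends up with the same cache and logits as one that decoded the whole prompt directly, which is the property the replay step relies on.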
This commit adds a new function that is responsible for decoding the prompt and, optionally, saving the session data.
This commit updates the save-load-state example to use the new llama_state_load_file function for loading the model state from a file. It also replays the last token after loading, since the state is now stored before the last token is processed. I'm not sure whether this is acceptable, as it changes the example so that it no longer uses llama_state_get_data and llama_state_set_data directly for loading, which might have been the point of the example.
This commit updates the save-load-state example to set the n_seq_max parameter to 2 when initializing the ctx3 context. The motivation for this change is that with n_parallel/n_seq_max set to 1 the context supports only one sequence, but the test later tries to use a second sequence, which results in the following error:
```console
main : loaded state with 4 tokens
main : seq 0 copied, 225760 bytes
main : kv cache cleared
find_slot: seq_id=1 >= n_seq_max=1 Try using a bigger --parallel value
state_read_meta: failed to find available cells in kv cache
```
This seems to happen only for recurrent/hybrid models.
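The failure mode above comes down to a bounds check on the sequence id. The following is a minimal sketch under that assumption; `ToySeqSlots` is a hypothetical miniature, not the llama.cpp implementation, and only mirrors the `seq_id >= n_seq_max` condition shown in the log.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-sequence slot table; NOT llama.cpp code.
struct ToySeqSlots {
    std::vector<bool> used;  // one entry per supported sequence

    explicit ToySeqSlots(uint32_t n_seq_max) : used(n_seq_max, false) {
        assert(n_seq_max > 0);
    }

    // Returns false when seq_id falls outside [0, n_seq_max), which is the
    // condition reported as "seq_id=1 >= n_seq_max=1" in the log above.
    bool find_slot(int32_t seq_id) {
        if (seq_id < 0 || static_cast<std::size_t>(seq_id) >= used.size()) {
            return false;
        }
        used[static_cast<std::size_t>(seq_id)] = true;
        return true;
    }
};
```

With one slot, `find_slot(1)` fails; constructing the table with two slots lets both sequence 0 and sequence 1 succeed, which corresponds to raising n_seq_max to 2 in the example.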
This commit extracts the replay_last_token function from save-load-state.cpp to common.h. The motivation for this is to allow reuse of the function but also to clarify the intent of code that replays the last token after loading the session state.
Overview
Analysis of 115,793 functions across 15 binaries revealed 171 modified, 30 new, and 0 removed functions. The primary change is a session state management optimization delivering a 77.8% improvement in state loading time (25,546ns → 5,673ns, saving 19,873ns). Standard library functions show compiler-driven variations (±68-177%) with minimal absolute impact (20-200ns in initialization code). Performance-critical inference paths (matrix operations, attention, quantization) remain completely unmodified.
Power Consumption Changes:
Overall power consumption: -0.0054% (negligible improvement).
Function Analysis
Major Improvement:
STL Container Accessor Improvements (compiler-driven):
Allocator Function Variations:
Minor Regressions:
Additional Findings
Architectural Soundness: Changes appropriately eliminate transient output data (logits, embeddings) from state files while preserving essential state (KV cache). The replay mechanism maintains correctness through deterministic regeneration. GGML libraries show zero changes, confirming performance-critical tensor operations (70-90% of inference time) are unaffected. GPU backends (CUDA, Metal, HIP, Vulkan) remain unmodified. The optimization benefits production deployments with frequent state save/load operations (servers, multi-turn conversations) while having no impact on per-token inference performance.
🔎 Full breakdown: Loci Inspector.
Note
Source pull request: ggml-org/llama.cpp#18862