
Conversation


@DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#17000

fix #16983

Perform the context-shift check only while the slot is generating tokens. It should not be performed during the start or prompt-processing phases.
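
In code terms, the fix narrows the guard around the shift. A minimal sketch, assuming the guard sits in the server's slot-update loop next to the usual context-size bounds check (the surrounding condition is paraphrased, not quoted from the diff):

```cpp
// before (paraphrased): any non-idle slot could be shifted, including one
// that was still decoding its prompt
// if (slot.is_processing() && slot.n_past + 1 >= slot.n_ctx) { /* context shift */ }

// after: only slots that are actively generating tokens are considered
if (slot.state == SLOT_STATE_GENERATING && slot.n_past + 1 >= slot.n_ctx) {
    // drop the oldest cells of the slot's KV cache so generation can continue
}
```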

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

A comparison of llama.cpp builds 34b5fa0f-964d-4b42-9a35-b6c74df16918 and a98c0b17-e20d-4b11-8978-6d6d10c53020 shows minimal performance impact, with no meaningful modifications to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: llm_graph_input_out_ids::can_reuse (+0.096%, +0.06 ns) in build.bin.libllama.so
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, -0.04 ns) in build.bin.llama-cvector-generator
  • Both changes are within measurement noise tolerance and represent negligible variations

Core Function Impact:
The identified functions with performance changes are not part of the critical inference pipeline. Core functions like llama_decode(), llama_encode(), and llama_tokenize() show no measurable changes, indicating no impact on tokens per second performance for inference workloads.

Power Consumption Analysis:
All 15 analyzed binaries show a 0.0% change in estimated energy consumption, with values ranging from 232.6 nJ to 322.8 μJ. No energy-efficiency regression or improvement was detected across the codebase.

Flame Graph and CFG Analysis:
The llm_graph_input_out_ids::can_reuse function shows identical assembly code between versions, with a simple, optimized execution profile (65 ns single-frame execution). The CFG comparison reveals byte-for-byte identical instruction sequences, confirming that the 0.06 ns timing difference is a measurement artifact rather than a functional change.

GitHub Code Review:
PR #80 introduces a targeted server-side change, switching the context-shift guard from slot.is_processing() to slot.state == SLOT_STATE_GENERATING. This restricts context shifting to active token generation and prevents it from running during the start or prompt-processing phases, potentially improving server responsiveness while prompts are being processed. The modification affects server-side slot management only and does not impact core inference performance.
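
To make the behavioral difference concrete, here is a small self-contained toy model, not the server's actual code; the enum values and the body of is_processing() are assumptions inferred from the PR description:

```cpp
// Toy model showing why the old predicate was too broad: is_processing() is
// assumed here to mean "not idle", so it is also true while a slot is still
// decoding its prompt.
#include <cstdio>

enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_GENERATING,
};

struct server_slot {
    slot_state state = SLOT_STATE_IDLE;
    bool is_processing() const { return state != SLOT_STATE_IDLE; } // assumption
};

int main() {
    server_slot slot;
    slot.state = SLOT_STATE_PROCESSING_PROMPT;

    // Old guard: also true during prompt processing.
    std::printf("is_processing():                %d\n", slot.is_processing());                   // prints 1
    // New guard: true only once the slot has started emitting tokens.
    std::printf("state == SLOT_STATE_GENERATING: %d\n", slot.state == SLOT_STATE_GENERATING);    // prints 0
    return 0;
}
```

During prompt processing the old predicate is already true, so the shift check could fire before any token had been generated; the new predicate only becomes true once the slot starts generating.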

Conclusion:
The version comparison reveals maintenance-level changes with no measurable impact on inference performance, memory efficiency, or energy consumption. All variations fall within measurement precision limits, indicating stable performance characteristics across the codebase.

@DajanaV force-pushed the main branch 27 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@DajanaV force-pushed the main branch 30 times, most recently from 8a26d77 to b1d9e01 on November 13, 2025 at 02:40