
Conversation


@DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#17000

fix #16983

Perform the context-shift check only while the slot is generating tokens. It should not be performed during the start or prompt-processing phases.
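
In code terms, the fix narrows the guard around the shift. A minimal sketch, assuming the guard sits in the server's slot-update loop next to the usual context-size bounds check (the surrounding condition is paraphrased, not quoted from the diff):

```cpp
// before (paraphrased): any non-idle slot could be shifted, including one
// that was still decoding its prompt
// if (slot.is_processing() && slot.n_past + 1 >= slot.n_ctx) { /* context shift */ }

// after: only slots that are actively generating tokens are considered
if (slot.state == SLOT_STATE_GENERATING && slot.n_past + 1 >= slot.n_ctx) {
    // drop the oldest cells of the slot's KV cache so generation can continue
}
```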

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

A comparison of llama.cpp builds 34b5fa0f-964d-4b42-9a35-b6c74df16918 and a98c0b17-e20d-4b11-8978-6d6d10c53020 shows minimal performance impact, with no meaningful modifications to core inference functions.

Key Findings

Performance Metrics:

  • Highest Response Time change: llm_graph_input_out_ids::can_reuse (+0.096%, +0.06 ns) in build.bin.libllama.so
  • Highest Throughput change: std::_Optional_base constructor (-0.170%, -0.04 ns) in build.bin.llama-cvector-generator
  • Both changes are within measurement noise tolerance and represent negligible variations

Core Function Impact:
The identified functions with performance changes are not part of the critical inference pipeline. Core functions like llama_decode(), llama_encode(), and llama_tokenize() show no measurable changes, indicating no impact on tokens per second performance for inference workloads.

Power Consumption Analysis:
All 15 analyzed binaries show a 0.0% change in estimated energy consumption, with values ranging from 232.6 nJ to 322.8 μJ. No energy-efficiency regression or improvement was detected across the codebase.

Flame Graph and CFG Analysis:
The llm_graph_input_out_ids::can_reuse function shows identical assembly code between versions, with a simple, optimized execution profile (65 ns single-frame execution). The CFG comparison reveals byte-for-byte identical instruction sequences, confirming that the 0.06 ns timing difference is a measurement artifact rather than a functional change.

GitHub Code Review:
PR #80 introduces a targeted server-side change, switching the context-shift guard from slot.is_processing() to slot.state == SLOT_STATE_GENERATING. This restricts context shifting to active token generation and prevents it from running during the start or prompt-processing phases, potentially improving server responsiveness while prompts are being processed. The modification affects server-side slot management only and does not impact core inference performance.
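
To make the behavioral difference concrete, here is a small self-contained toy model, not the server's actual code; the enum values and the body of is_processing() are assumptions inferred from the PR description:

```cpp
// Toy model showing why the old predicate was too broad: is_processing() is
// assumed here to mean "not idle", so it is also true while a slot is still
// decoding its prompt.
#include <cstdio>

enum slot_state {
    SLOT_STATE_IDLE,
    SLOT_STATE_PROCESSING_PROMPT,
    SLOT_STATE_GENERATING,
};

struct server_slot {
    slot_state state = SLOT_STATE_IDLE;
    bool is_processing() const { return state != SLOT_STATE_IDLE; } // assumption
};

int main() {
    server_slot slot;
    slot.state = SLOT_STATE_PROCESSING_PROMPT;

    // Old guard: also true during prompt processing.
    std::printf("is_processing():                %d\n", slot.is_processing());                   // prints 1
    // New guard: true only once the slot has started emitting tokens.
    std::printf("state == SLOT_STATE_GENERATING: %d\n", slot.state == SLOT_STATE_GENERATING);    // prints 0
    return 0;
}
```

During prompt processing the old predicate is already true, so the shift check could fire before any token had been generated; the new predicate only becomes true once the slot starts generating.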

Conclusion:
The version comparison reveals maintenance-level changes with no measurable impact on inference performance, memory efficiency, or energy consumption. All variations fall within measurement precision limits, indicating stable performance characteristics across the codebase.

@DajanaV force-pushed the main branch 27 times, most recently from 44faeaa to d7421a0 on November 8, 2025 at 09:08
@DajanaV force-pushed the main branch 30 times, most recently from 8a26d77 to b1d9e01 on November 13, 2025 at 02:40