UPSTREAM PR #18811: sampling : remove sampling branching in output_reserve by loci-dev · Pull Request #907 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-13T15:40:25Z

This commit updates output_reserve in llama-context.cpp to always allocate sampling buffers regardless of whether sampling is needed for the current batch.

The motivation for this is to avoid reallocations and branching based on the sampling requirements of the batch.

loci-review · 2026-01-13T17:09:06Z

Explore the complete analysis inside the Version Insights

Performance Review Report

Summary

Negligible performance impact from commit 728b00a. The single commit modified sampling buffer allocation logic, resulting in minor timing variations in C++ STL functions (183-222ns changes) with no meaningful impact on inference performance. Power consumption decreased by 0.012% in libllama.so.

Commit Context

Single commit: 728b00a by Daniel Bevenius - "sampling: remove sampling branching in output_reserve"
Files changed: 11 modified, 37 added, 3 deleted

The commit simplified conditional buffer allocation logic in llama-context.cpp by removing branching in the output_reserve() function, replacing conditional allocation with unconditional worst-case buffer sizing.

Performance Analysis

Changed Functions

All 10 functions with the largest metric changes are C++ standard library (STL) template instantiations, not llama.cpp application code:

Improvements (compiler optimizations):

std::_Rb_tree::find(): -91ns (10% improvement) - buffer type lookups during initialization
std::unique_ptr::operator=(): -75ns (9% improvement) - graph context ownership transfer
std::vector::_S_check_init_len(): -46ns (7.5% improvement) - tokenization vector validation
__gnu_cxx::__val_comp_iter(): -222ns (65% improvement) - PLAMO2 tokenizer comparator

Regressions (compiler code generation changes):

std::_Rb_tree::end(): +184ns (230% increase) - map iterator operations
std::vector<wchar_t>::cbegin(): +180ns (215% increase) - Unicode tokenization
std::vector<unique_ptr>::back(): +189ns (73% increase) - backend buffer access
std::vector<string>::empty(): +190ns (69% increase) - string vector checks

Root Cause

The performance variations stem from compiler code generation differences in STL template instantiations, not from the source code change itself. The commit modified buffer allocation patterns, which affected compiler optimization decisions for inlined STL functions. CFG analyses show structural changes (additional basic blocks, branch reorganization) in the generated assembly code.

Critical Path Impact

None of the affected functions are in the inference hot path. All execute during:

Model loading and initialization (map operations, buffer setup)
Tokenization preprocessing (regex iterators, vector operations)
One-time graph construction (unique_ptr assignments)

The absolute timing changes (46-222ns) are insignificant compared to typical inference operations (milliseconds) and tensor computations (microseconds).

Power Consumption

libllama.so shows 0.012% power consumption decrease (240,539 → 240,510 nanojoules), indicating neutral to slightly positive energy efficiency. All other binaries show zero change.

Conclusion

The commit successfully simplified sampling logic without performance degradation. STL function timing variations are compiler artifacts with no practical impact on llama.cpp's core inference performance.

This commit updates output_reserve in llama-context.cpp to always allocate sampling buffers regardless of whether sampling is needed for the current batch. The motivation for this is to avoid reallocations and branching based on the sampling requirements of the batch.

loci-review · 2026-01-15T07:49:01Z

Explore the complete analysis inside the Version Insights

Now I have all the necessary information. Let me create a comprehensive performance review report:

Performance Review Report

Commit: 9d9be8b by Daniel Bevenius
Change: "sampling : remove sampling branching in output_reserve"
Files Modified: 2 modified, 37 added, 3 deleted

Performance Impact: Negligible

This commit removes conditional branching in sampling buffer allocation logic, resulting in minimal performance impact. Only 5 standard library functions show measurable changes, all under 100 nanoseconds in absolute terms.

Analysis Summary

The observed performance differences stem from compiler code generation variations in STL template instantiations rather than the source code modifications themselves. All affected functions are compiler-generated standard library code with no direct source changes.

Key Findings:

std::unique_ptr assignment operator (+75ns): Increased from 765ns to 840ns due to an additional CFG basic block. This function handles ownership transfer when creating LLaDA-MoE graph builders and executes once per batch during graph construction, not in the per-token hot path.
std::future_error constructor (+25ns): Increased from 283ns to 308ns from reorganized stack canary verification. Affects only error handling paths that rarely execute during normal operation.
Red-black tree operations (-6ns and -15ns): Minor improvements in _S_key (113ns→107ns) and _M_insert (525ns→510ns) from compiler optimizations during KV cache initialization.
std::vector::back() (-5ns): Improved from 261ns to 255ns in UTF-8 candidate retrieval during grammar-constrained sampling.

Power Consumption

The libllama.so binary shows a 0.016% increase in estimated power consumption (240,794→240,832 nanojoules), which is effectively unmeasurable and within noise margins. All other binaries show zero change.

Context

None of the affected functions are in performance-critical inference paths. The changes reflect compiler optimization decisions (instruction scheduling, basic block organization, register allocation) rather than algorithmic regressions. The commit's intent to simplify branching logic in output_reserve() is sound; the minor STL-level variations are expected compiler behavior with no practical impact on inference performance.

loci-dev temporarily deployed to PROD__AL_DEMO January 13, 2026 15:40 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from 4d071b3 to 8e509d5 Compare January 13, 2026 16:13

loci-dev force-pushed the main branch 9 times, most recently from b3746c2 to 60b319a Compare January 15, 2026 02:47

loci-dev force-pushed the main branch from 60b319a to a8ba58c Compare January 15, 2026 06:14

loci-dev force-pushed the upstream-PR18811-branch_danbev-sampling-output-reserve branch from 728b00a to 9d9be8b Compare January 15, 2026 06:45

loci-dev temporarily deployed to PROD__AL_DEMO January 15, 2026 06:45 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 13 times, most recently from 27b0027 to 87e1a20 Compare January 17, 2026 14:07

loci-dev force-pushed the main branch 30 times, most recently from b12bb9f to 7a4df67 Compare January 24, 2026 22:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #18811: sampling : remove sampling branching in output_reserve#907

UPSTREAM PR #18811: sampling : remove sampling branching in output_reserve#907
loci-dev wants to merge 1 commit into
mainfrom
upstream-PR18811-branch_danbev-sampling-output-reserve

loci-dev commented Jan 13, 2026

Uh oh!

loci-review Bot commented Jan 13, 2026

Uh oh!

loci-review Bot commented Jan 15, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Jan 13, 2026

Uh oh!

loci-review Bot commented Jan 13, 2026

Performance Review Report

Summary

Commit Context

Performance Analysis

Changed Functions

Root Cause

Critical Path Impact

Power Consumption

Conclusion

Uh oh!

loci-review Bot commented Jan 15, 2026

Performance Review Report

Performance Impact: Negligible

Analysis Summary

Power Consumption

Context

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants