Skip to content

UPSTREAM PR #18811: sampling : remove sampling branching in output_reserve#907

Open
loci-dev wants to merge 1 commit into
mainfrom
upstream-PR18811-branch_danbev-sampling-output-reserve
Open

UPSTREAM PR #18811: sampling : remove sampling branching in output_reserve#907
loci-dev wants to merge 1 commit into
mainfrom
upstream-PR18811-branch_danbev-sampling-output-reserve

Conversation

@loci-dev
Copy link
Copy Markdown

Mirrored from ggml-org/llama.cpp#18811

This commit updates output_reserve in llama-context.cpp to always allocate sampling buffers regardless of whether sampling is needed for the current batch.

The motivation for this is to avoid reallocations and branching based on the sampling requirements of the batch.

@loci-review
Copy link
Copy Markdown

loci-review Bot commented Jan 13, 2026

Explore the complete analysis inside the Version Insights

Performance Review Report

Summary

Negligible performance impact from commit 728b00a. The single commit modified sampling buffer allocation logic, resulting in minor timing variations in C++ STL functions (183-222ns changes) with no meaningful impact on inference performance. Power consumption decreased by 0.012% in libllama.so.

Commit Context

Single commit: 728b00a by Daniel Bevenius - "sampling: remove sampling branching in output_reserve"
Files changed: 11 modified, 37 added, 3 deleted

The commit simplified conditional buffer allocation logic in llama-context.cpp by removing branching in the output_reserve() function, replacing conditional allocation with unconditional worst-case buffer sizing.

Performance Analysis

Changed Functions

All 10 functions with the largest metric changes are C++ standard library (STL) template instantiations, not llama.cpp application code:

Improvements (compiler optimizations):

  • std::_Rb_tree::find(): -91ns (10% improvement) - buffer type lookups during initialization
  • std::unique_ptr::operator=(): -75ns (9% improvement) - graph context ownership transfer
  • std::vector::_S_check_init_len(): -46ns (7.5% improvement) - tokenization vector validation
  • __gnu_cxx::__val_comp_iter(): -222ns (65% improvement) - PLAMO2 tokenizer comparator

Regressions (compiler code generation changes):

  • std::_Rb_tree::end(): +184ns (230% increase) - map iterator operations
  • std::vector<wchar_t>::cbegin(): +180ns (215% increase) - Unicode tokenization
  • std::vector<unique_ptr>::back(): +189ns (73% increase) - backend buffer access
  • std::vector<string>::empty(): +190ns (69% increase) - string vector checks

Root Cause

The performance variations stem from compiler code generation differences in STL template instantiations, not from the source code change itself. The commit modified buffer allocation patterns, which affected compiler optimization decisions for inlined STL functions. CFG analyses show structural changes (additional basic blocks, branch reorganization) in the generated assembly code.

Critical Path Impact

None of the affected functions are in the inference hot path. All execute during:

  • Model loading and initialization (map operations, buffer setup)
  • Tokenization preprocessing (regex iterators, vector operations)
  • One-time graph construction (unique_ptr assignments)

The absolute timing changes (46-222ns) are insignificant compared to typical inference operations (milliseconds) and tensor computations (microseconds).

Power Consumption

libllama.so shows 0.012% power consumption decrease (240,539 → 240,510 nanojoules), indicating neutral to slightly positive energy efficiency. All other binaries show zero change.

Conclusion

The commit successfully simplified sampling logic without performance degradation. STL function timing variations are compiler artifacts with no practical impact on llama.cpp's core inference performance.

@loci-dev loci-dev force-pushed the main branch 9 times, most recently from b3746c2 to 60b319a Compare January 15, 2026 02:47
This commit updates output_reserve in llama-context.cpp to always
allocate sampling buffers regardless of whether sampling is needed for
the current batch.

The motivation for this is to avoid reallocations and branching based on
the sampling requirements of the batch.
@loci-dev loci-dev force-pushed the upstream-PR18811-branch_danbev-sampling-output-reserve branch from 728b00a to 9d9be8b Compare January 15, 2026 06:45
@loci-review
Copy link
Copy Markdown

loci-review Bot commented Jan 15, 2026

Explore the complete analysis inside the Version Insights

Now I have all the necessary information. Let me create a comprehensive performance review report:


Performance Review Report

Commit: 9d9be8b by Daniel Bevenius
Change: "sampling : remove sampling branching in output_reserve"
Files Modified: 2 modified, 37 added, 3 deleted

Performance Impact: Negligible

This commit removes conditional branching in sampling buffer allocation logic, resulting in minimal performance impact. Only 5 standard library functions show measurable changes, all under 100 nanoseconds in absolute terms.

Analysis Summary

The observed performance differences stem from compiler code generation variations in STL template instantiations rather than the source code modifications themselves. All affected functions are compiler-generated standard library code with no direct source changes.

Key Findings:

  1. std::unique_ptr assignment operator (+75ns): Increased from 765ns to 840ns due to an additional CFG basic block. This function handles ownership transfer when creating LLaDA-MoE graph builders and executes once per batch during graph construction, not in the per-token hot path.

  2. std::future_error constructor (+25ns): Increased from 283ns to 308ns from reorganized stack canary verification. Affects only error handling paths that rarely execute during normal operation.

  3. Red-black tree operations (-6ns and -15ns): Minor improvements in _S_key (113ns→107ns) and _M_insert (525ns→510ns) from compiler optimizations during KV cache initialization.

  4. std::vector::back() (-5ns): Improved from 261ns to 255ns in UTF-8 candidate retrieval during grammar-constrained sampling.

Power Consumption

The libllama.so binary shows a 0.016% increase in estimated power consumption (240,794→240,832 nanojoules), which is effectively unmeasurable and within noise margins. All other binaries show zero change.

Context

None of the affected functions are in performance-critical inference paths. The changes reflect compiler optimization decisions (instruction scheduling, basic block organization, register allocation) rather than algorithmic regressions. The commit's intent to simplify branching logic in output_reserve() is sound; the minor STL-level variations are expected compiler behavior with no practical impact on inference performance.

@loci-dev loci-dev force-pushed the main branch 13 times, most recently from 27b0027 to 87e1a20 Compare January 17, 2026 14:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from b12bb9f to 7a4df67 Compare January 24, 2026 22:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants