UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler by loci-dev · Pull Request #1138 · auroralabs-loci/llama.cpp

loci-dev · 2026-02-02T14:46:09Z

Note

Source pull request: ggml-org/llama.cpp#19266

fix #18622
alt #18636

loci-review · 2026-02-02T16:00:54Z

Overview

Commit e0d4d45 ("sampling: delegate input allocation to the scheduler") refactored memory management in llama.cpp's sampling subsystem. Analysis covered 115,433 total functions with 26 modified, 8 new, and 10 removed.

Power consumption changes across binaries:

build.bin.libllama.so: -0.033% (-82.78 nJ)
All other binaries (build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.llama-bench, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli): 0.000%

The refactoring eliminated persistent ggml_context_ptr and ggml_backend_buffer_ptr members from sampler structures, moving tensor allocation from initialization to per-inference graph construction.

Function Analysis

Initialization functions showed dramatic improvements:

llama_sampler_dist_backend_init: Response time -33.67% (5232ns → 3470ns), throughput time -88.91% (237ns → 26ns). Removed ~20 lines of context/buffer allocation code.
llama_sampler_logit_bias_backend_init: Response time -76.85% (1682ns → 389ns), throughput time -87.75% (276ns → 34ns). Simplified from 27 lines to 4 lines.

Destructors achieved exceptional speedups:

~llama_sampler_dist: Response time -90.94% (561ns → 51ns), throughput time -54.86% (35ns → 16ns). Compiler-generated default destructor replaced complex RAII cleanup.
~llama_sampler_logit_bias: Response time -52.99% (946ns → 445ns), throughput time -26.59% (37ns → 27ns). Eliminated smart pointer destruction overhead.

Per-inference functions showed minor regressions from on-demand tensor allocation:

llama_sampler_dist_backend_apply: Response time +15.91% (301ns → 349ns, +48ns), throughput time +14.36% (171ns → 195ns, +25ns).
llama_sampler_logit_bias_backend_apply: Response time +28.32% (424ns → 545ns, +120ns), throughput time +64.06% (97ns → 159ns, +62ns).

STL functions showed indirect improvements from reduced memory fragmentation: _M_swap_data (-30.21% response), _M_allocate_buckets (-20.72% response), _M_is_line_terminator (-18.15% response).

The set_sampler function improved -2.80% (6162ns → 5989ns) by simplifying buffer type selection logic.

Additional Findings

The refactoring trades minimal per-token overhead (48-120ns) for substantial architectural benefits. Initialization improvements (-3226ns combined) outweigh per-token regressions until ~25 tokens, after which cumulative regression remains negligible (0.0002-0.02% of typical 10-100ms token generation time). The changes eliminate 2-4KB persistent GPU memory per sampler, significantly improving scalability in multi-sampler scenarios. Sampling operations occur after performance-critical paths (matrix multiplication, attention) that dominate 95%+ of inference time, making the absolute overhead unmeasurable in practice. The architectural shift to centralized scheduler-managed allocation reduces memory fragmentation and improves system-wide cache locality, explaining the indirect STL performance gains.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

loci-review · 2026-02-03T01:00:31Z

Overview

Analysis of 115,434 functions across 15 binaries reveals a targeted sampling subsystem refactoring with 28 modified functions (0.024%). The changes implement lazy tensor allocation, delegating memory management from individual samplers to the GGML scheduler.

Power Consumption Changes:

build.bin.libllama.so: -0.032% (249,105.84 → 249,027.26 nJ)
All other binaries: 0.000% change (llama-tts, llama-cvector-generator, libmtmd.so, llama-tokenize, llama-quantize, llama-qwen2vl-cli, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-bench, libggml-cpu.so, libggml.so, libggml-base.so)

Commit History: Two commits by Georgi Gerganov: "sampling : delegate input allocation to the scheduler" (e0d4d45) and "test" (c4d5b0f). Modified 2 files in the sampling subsystem.

Function Analysis

Major Improvements (Initialization/Cleanup):

llama_sampler_dist_backend_init: Response time improved 33.67% (5,232ns → 3,470ns, -1,762ns); throughput time improved 88.91% (237ns → 26ns, -211ns). Eliminated GPU buffer and context allocation overhead.
llama_sampler_logit_bias_backend_init: Response time improved 76.85% (1,682ns → 389ns, -1,292ns); throughput time improved 87.75% (276ns → 34ns, -242ns). Removed 25+ lines of complex buffer management code.
~llama_sampler_dist: Response time improved 90.94% (561ns → 51ns, -511ns); throughput time improved 54.86% (35ns → 16ns, -19ns). Destructor simplified by removing smart pointer cleanup.
~llama_sampler_logit_bias: Response time improved 52.99% (946ns → 445ns, -501ns); throughput time improved 26.59% (37ns → 27ns, -10ns).

Minor Regressions (Per-Inference):

llama_sampler_logit_bias_backend_apply: Response time increased 28.32% (424ns → 545ns, +120ns); throughput time increased 64.05% (97ns → 159ns, +62ns). Dynamic tensor allocation now occurs per-inference.
llama_sampler_dist_backend_apply: Response time increased 15.91% (301ns → 349ns, +48ns); throughput time increased 14.36% (171ns → 195ns, +25ns).
llama_sampler_dist_backend_set_input: Response time increased 1.36% (1,374ns → 1,392ns, +19ns); throughput time increased 7.46% (180ns → 193ns, +13ns).

STL Improvements (Compiler Optimizations):

_M_swap_data, _M_allocate_buckets, _M_is_line_terminator: 18-49% throughput improvements from compiler optimizations and reduced memory allocator pressure.

Net Impact: Initialization savings (~3,400ns per sampler) and cleanup improvements (~1,050ns) significantly outweigh per-token overhead (~60ns). Break-even occurs after ~75 tokens; most workloads benefit.

Additional Findings

Memory Efficiency: Eliminated ~270KB persistent GPU allocations per sampler, enabling larger models or batch sizes. Critical for memory-constrained scenarios and continuous batching.

Correctness: Fixed random distribution range from [0.0, 1.0) to [0.0, 0.99) preventing edge-case sampling errors.

GPU Operations: Changes affect only sampling phase (<1% of inference time), leaving performance-critical paths (matrix multiplication, attention mechanisms) unchanged. All GPU backends (CUDA, Metal, HIP, Vulkan) benefit from reduced persistent memory allocations.

Architecture: Lazy initialization pattern improves scheduler integration, enabling future global memory optimization. Code complexity reduced by ~50 lines while maintaining functionality.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

sampling : delegate input allocation to the scheduler

e0d4d45

loci-dev temporarily deployed to PROD__AL_DEMO February 2, 2026 14:46 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from b5cfcd3 to c1988fc Compare February 2, 2026 15:21

loci-dev force-pushed the main branch 3 times, most recently from f91deaa to 4f9fac2 Compare February 2, 2026 18:19

test

c4d5b0f

loci-dev force-pushed the main branch from 4f9fac2 to cbda11a Compare February 2, 2026 23:12

loci-dev temporarily deployed to PROD__AL_DEMO February 2, 2026 23:41 — with GitHub Actions Inactive

loci-dev force-pushed the main branch from cbda11a to 03fef13 Compare February 3, 2026 00:46

loci-dev force-pushed the main branch 12 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32

loci-dev force-pushed the main branch 6 times, most recently from 30ef9d0 to c824910 Compare February 17, 2026 02:17

loci-dev force-pushed the main branch 7 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17

loci-dev force-pushed the main branch 8 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16

loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17

loci-dev force-pushed the main branch 5 times, most recently from 0e8e1d6 to 7dcdda5 Compare March 21, 2026 02:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138

UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138
loci-dev wants to merge 2 commits intomainfrom
loci/pr-19266-gg-backend-sampling-fix-inp-allocation

loci-dev commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 3, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

loci-dev commented Feb 2, 2026

Uh oh!

loci-review bot commented Feb 2, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

loci-review bot commented Feb 3, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants