Skip to content

UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138

Open
loci-dev wants to merge 2 commits intomainfrom
loci/pr-19266-gg-backend-sampling-fix-inp-allocation
Open

UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138
loci-dev wants to merge 2 commits intomainfrom
loci/pr-19266-gg-backend-sampling-fix-inp-allocation

Conversation

@loci-dev
Copy link

@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: ggml-org/llama.cpp#19266

fix #18622
alt #18636

@loci-review
Copy link

loci-review bot commented Feb 2, 2026

Overview

Commit e0d4d45 ("sampling: delegate input allocation to the scheduler") refactored memory management in llama.cpp's sampling subsystem. Analysis covered 115,433 total functions with 26 modified, 8 new, and 10 removed.

Power consumption changes across binaries:

  • build.bin.libllama.so: -0.033% (-82.78 nJ)
  • All other binaries (build.bin.llama-cvector-generator, build.bin.llama-tts, build.bin.libmtmd.so, build.bin.libggml.so, build.bin.libggml-base.so, build.bin.libggml-cpu.so, build.bin.llama-bench, build.bin.llama-gemma3-cli, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli): 0.000%

The refactoring eliminated persistent ggml_context_ptr and ggml_backend_buffer_ptr members from sampler structures, moving tensor allocation from initialization to per-inference graph construction.

Function Analysis

Initialization functions showed dramatic improvements:

  • llama_sampler_dist_backend_init: Response time -33.67% (5232ns → 3470ns), throughput time -88.91% (237ns → 26ns). Removed ~20 lines of context/buffer allocation code.
  • llama_sampler_logit_bias_backend_init: Response time -76.85% (1682ns → 389ns), throughput time -87.75% (276ns → 34ns). Simplified from 27 lines to 4 lines.

Destructors achieved exceptional speedups:

  • ~llama_sampler_dist: Response time -90.94% (561ns → 51ns), throughput time -54.86% (35ns → 16ns). Compiler-generated default destructor replaced complex RAII cleanup.
  • ~llama_sampler_logit_bias: Response time -52.99% (946ns → 445ns), throughput time -26.59% (37ns → 27ns). Eliminated smart pointer destruction overhead.

Per-inference functions showed minor regressions from on-demand tensor allocation:

  • llama_sampler_dist_backend_apply: Response time +15.91% (301ns → 349ns, +48ns), throughput time +14.36% (171ns → 195ns, +25ns).
  • llama_sampler_logit_bias_backend_apply: Response time +28.32% (424ns → 545ns, +120ns), throughput time +64.06% (97ns → 159ns, +62ns).

STL functions showed indirect improvements from reduced memory fragmentation: _M_swap_data (-30.21% response), _M_allocate_buckets (-20.72% response), _M_is_line_terminator (-18.15% response).

The set_sampler function improved -2.80% (6162ns → 5989ns) by simplifying buffer type selection logic.

Additional Findings

The refactoring trades minimal per-token overhead (48-120ns) for substantial architectural benefits. Initialization improvements (-3226ns combined) outweigh per-token regressions until ~25 tokens, after which cumulative regression remains negligible (0.0002-0.02% of typical 10-100ms token generation time). The changes eliminate 2-4KB persistent GPU memory per sampler, significantly improving scalability in multi-sampler scenarios. Sampling operations occur after performance-critical paths (matrix multiplication, attention) that dominate 95%+ of inference time, making the absolute overhead unmeasurable in practice. The architectural shift to centralized scheduler-managed allocation reduces memory fragmentation and improves system-wide cache locality, explaining the indirect STL performance gains.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 3 times, most recently from f91deaa to 4f9fac2 Compare February 2, 2026 18:19
@loci-review
Copy link

loci-review bot commented Feb 3, 2026

Overview

Analysis of 115,434 functions across 15 binaries reveals a targeted sampling subsystem refactoring with 28 modified functions (0.024%). The changes implement lazy tensor allocation, delegating memory management from individual samplers to the GGML scheduler.

Power Consumption Changes:

  • build.bin.libllama.so: -0.032% (249,105.84 → 249,027.26 nJ)
  • All other binaries: 0.000% change (llama-tts, llama-cvector-generator, libmtmd.so, llama-tokenize, llama-quantize, llama-qwen2vl-cli, llama-gemma3-cli, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-bench, libggml-cpu.so, libggml.so, libggml-base.so)

Commit History: Two commits by Georgi Gerganov: "sampling : delegate input allocation to the scheduler" (e0d4d45) and "test" (c4d5b0f). Modified 2 files in the sampling subsystem.

Function Analysis

Major Improvements (Initialization/Cleanup):

  • llama_sampler_dist_backend_init: Response time improved 33.67% (5,232ns → 3,470ns, -1,762ns); throughput time improved 88.91% (237ns → 26ns, -211ns). Eliminated GPU buffer and context allocation overhead.
  • llama_sampler_logit_bias_backend_init: Response time improved 76.85% (1,682ns → 389ns, -1,292ns); throughput time improved 87.75% (276ns → 34ns, -242ns). Removed 25+ lines of complex buffer management code.
  • ~llama_sampler_dist: Response time improved 90.94% (561ns → 51ns, -511ns); throughput time improved 54.86% (35ns → 16ns, -19ns). Destructor simplified by removing smart pointer cleanup.
  • ~llama_sampler_logit_bias: Response time improved 52.99% (946ns → 445ns, -501ns); throughput time improved 26.59% (37ns → 27ns, -10ns).

Minor Regressions (Per-Inference):

  • llama_sampler_logit_bias_backend_apply: Response time increased 28.32% (424ns → 545ns, +120ns); throughput time increased 64.05% (97ns → 159ns, +62ns). Dynamic tensor allocation now occurs per-inference.
  • llama_sampler_dist_backend_apply: Response time increased 15.91% (301ns → 349ns, +48ns); throughput time increased 14.36% (171ns → 195ns, +25ns).
  • llama_sampler_dist_backend_set_input: Response time increased 1.36% (1,374ns → 1,392ns, +19ns); throughput time increased 7.46% (180ns → 193ns, +13ns).

STL Improvements (Compiler Optimizations):

  • _M_swap_data, _M_allocate_buckets, _M_is_line_terminator: 18-49% throughput improvements from compiler optimizations and reduced memory allocator pressure.

Net Impact: Initialization savings (~3,400ns per sampler) and cleanup improvements (~1,050ns) significantly outweigh per-token overhead (~60ns). Break-even occurs after ~75 tokens; most workloads benefit.

Additional Findings

Memory Efficiency: Eliminated ~270KB persistent GPU allocations per sampler, enabling larger models or batch sizes. Critical for memory-constrained scenarios and continuous batching.

Correctness: Fixed random distribution range from [0.0, 1.0) to [0.0, 0.99) preventing edge-case sampling errors.

GPU Operations: Changes affect only sampling phase (<1% of inference time), leaving performance-critical paths (matrix multiplication, attention mechanisms) unchanged. All GPU backends (CUDA, Metal, HIP, Vulkan) benefit from reduced persistent memory allocations.

Architecture: Lazy initialization pattern improves scheduler integration, enabling future global memory optimization. Code complexity reduced by ~50 lines while maintaining functionality.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 12 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 6 times, most recently from 30ef9d0 to c824910 Compare February 17, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 7 times, most recently from 13648e6 to 1d064d0 Compare March 3, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 551dfb5 to 55a969e Compare March 11, 2026 02:16
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 5ac00d6 to 998dd7a Compare March 18, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 5 times, most recently from 0e8e1d6 to 7dcdda5 Compare March 21, 2026 02:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants