UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138
UPSTREAM PR #19266: sampling : delegate input allocation to the scheduler#1138
Conversation
OverviewCommit e0d4d45 ("sampling: delegate input allocation to the scheduler") refactored memory management in llama.cpp's sampling subsystem. Analysis covered 115,433 total functions with 26 modified, 8 new, and 10 removed. Power consumption changes across binaries:
The refactoring eliminated persistent Function AnalysisInitialization functions showed dramatic improvements:
Destructors achieved exceptional speedups:
Per-inference functions showed minor regressions from on-demand tensor allocation:
STL functions showed indirect improvements from reduced memory fragmentation: _M_swap_data (-30.21% response), _M_allocate_buckets (-20.72% response), _M_is_line_terminator (-18.15% response). The set_sampler function improved -2.80% (6162ns → 5989ns) by simplifying buffer type selection logic. Additional FindingsThe refactoring trades minimal per-token overhead (48-120ns) for substantial architectural benefits. Initialization improvements (-3226ns combined) outweigh per-token regressions until ~25 tokens, after which cumulative regression remains negligible (0.0002-0.02% of typical 10-100ms token generation time). The changes eliminate 2-4KB persistent GPU memory per sampler, significantly improving scalability in multi-sampler scenarios. Sampling operations occur after performance-critical paths (matrix multiplication, attention) that dominate 95%+ of inference time, making the absolute overhead unmeasurable in practice. The architectural shift to centralized scheduler-managed allocation reduces memory fragmentation and improves system-wide cache locality, explaining the indirect STL performance gains. 🔎 Full breakdown: Loci Inspector. |
f91deaa to
4f9fac2
Compare
OverviewAnalysis of 115,434 functions across 15 binaries reveals a targeted sampling subsystem refactoring with 28 modified functions (0.024%). The changes implement lazy tensor allocation, delegating memory management from individual samplers to the GGML scheduler. Power Consumption Changes:
Commit History: Two commits by Georgi Gerganov: "sampling : delegate input allocation to the scheduler" (e0d4d45) and "test" (c4d5b0f). Modified 2 files in the sampling subsystem. Function AnalysisMajor Improvements (Initialization/Cleanup):
Minor Regressions (Per-Inference):
STL Improvements (Compiler Optimizations):
Net Impact: Initialization savings (~3,400ns per sampler) and cleanup improvements (~1,050ns) significantly outweigh per-token overhead (~60ns). Break-even occurs after ~75 tokens; most workloads benefit. Additional FindingsMemory Efficiency: Eliminated ~270KB persistent GPU allocations per sampler, enabling larger models or batch sizes. Critical for memory-constrained scenarios and continuous batching. Correctness: Fixed random distribution range from [0.0, 1.0) to [0.0, 0.99) preventing edge-case sampling errors. GPU Operations: Changes affect only sampling phase (<1% of inference time), leaving performance-critical paths (matrix multiplication, attention mechanisms) unchanged. All GPU backends (CUDA, Metal, HIP, Vulkan) benefit from reduced persistent memory allocations. Architecture: Lazy initialization pattern improves scheduler integration, enabling future global memory optimization. Code complexity reduced by ~50 lines while maintaining functionality. 🔎 Full breakdown: Loci Inspector. |
048ad94 to
6c1fde6
Compare
30ef9d0 to
c824910
Compare
13648e6 to
1d064d0
Compare
551dfb5 to
55a969e
Compare
5ac00d6 to
998dd7a
Compare
0e8e1d6 to
7dcdda5
Compare
Note
Source pull request: ggml-org/llama.cpp#19266
fix #18622
alt #18636