UPSTREAM PR #19278: ggml: added cleanups in ggml_quantize_free #1139
Conversation
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.
**Overview**

Analysis of llama.cpp across 115,472 functions (6 modified, 0 new, 0 removed) reveals minimal performance impact from a single commit fixing memory leaks in quantization cleanup. Power consumption changes are negligible across all 15 binaries analyzed. Critical inference paths (llama_decode, matrix operations, attention, KV cache) remain unchanged.

**Function Analysis**

- `ggml_quantize_free` (libggml-base.so): response time increased 2,787 ns → 4,656 ns (+67%, +1,869 ns); throughput time 26 ns → 34 ns (+32%, +8 ns). This intentional regression adds cleanup for the IQ2_S, IQ1_M, and IQ3_S-512 quantization formats, fixing memory leaks. The impact occurs only at program shutdown, not during inference.
- `std::map::_M_emplace_hint_unique` (libggml-base.so): response time improved 3,512 ns → 3,456 ns (-1.6%, -57 ns); throughput time 195 ns → 139 ns (-29%, -57 ns). Used in graph construction for tensor relationship tracking; the improvement is likely due to reduced heap fragmentation after the leak fixes.
- `std::vector<gguf_kv>::cbegin` (libggml-base.so): response time increased 84 ns → 172 ns (+105%, +88 ns); throughput time 62 ns → 151 ns (+141%, +88 ns). A standard-library accessor showing a compiler-optimization artifact during GGUF metadata parsing, a one-time model-loading operation.
- Other analyzed functions (`gguf_type_name`, `std::vector::resize`, `std::vector::_M_realloc_insert`) showed changes under ±26 ns with no meaningful impact.

**Additional Findings**

The commit (d3f8406) prevents memory leaks for three quantization formats without affecting inference performance. The fixed leaks improve heap-allocator efficiency, yielding beneficial side effects in container operations. All GPU backends (CUDA, Metal, HIP, Vulkan) and performance-critical operations remain unmodified. The 45.49 nanojoule power increase represents an unmeasurable energy cost in any deployment scenario.

🔎 Full breakdown: Loci Inspector.
Note
Source pull request: ggml-org/llama.cpp#19278
Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.