
UPSTREAM PR #19278: ggml: added cleanups in ggml_quantize_free #1139

Open
loci-dev wants to merge 1 commit into main from loci/pr-19278-master

Conversation


@loci-dev loci-dev commented Feb 2, 2026

Note

Source pull request: ggml-org/llama.cpp#19278

Add missing cleanup calls for IQ2_S, IQ1_M quantization types and IQ3XS with 512 blocks during quantization cleanup.


loci-review bot commented Feb 3, 2026

Overview

Analysis of llama.cpp across 115,472 functions (6 modified, 0 new, 0 removed) reveals minimal performance impact from a single commit fixing memory leaks in quantization cleanup. Power consumption changes are negligible across all 15 binaries:

Binaries analyzed:

  • build.bin.libggml-base.so: +0.062%
  • build.bin.llama-tts: -0.000%
  • build.bin.libmtmd.so: +0.000%
  • build.bin.llama-cvector-generator: +0.000%
  • build.bin.libllama.so: -0.000%
  • build.bin.llama-bench: 0.000%
  • build.bin.llama-tokenize: 0.000%
  • build.bin.llama-quantize: 0.000%
  • build.bin.llama-qwen2vl-cli: 0.000%
  • build.bin.libggml-cpu.so: 0.000%
  • build.bin.libggml.so: 0.000%
  • build.bin.llama-gemma3-cli: 0.000%
  • build.bin.llama-gguf-split: 0.000%
  • build.bin.llama-llava-cli: 0.000%
  • build.bin.llama-minicpmv-cli: 0.000%

Critical inference paths (llama_decode, matrix operations, attention, KV cache) remain unchanged.

Function Analysis

ggml_quantize_free (libggml-base.so): Response time increased 2,787ns → 4,656ns (+67%, +1,869ns), throughput time 26ns → 34ns (+32%, +8ns). This intentional regression adds cleanup for IQ2_S, IQ1_M, and IQ3_S-512 quantization formats, fixing memory leaks. Impact occurs only at program shutdown, not during inference.

std::map::_M_emplace_hint_unique (libggml-base.so): Response time improved 3,512ns → 3,456ns (-1.6%, -57ns), throughput time 195ns → 139ns (-29%, -57ns). Used in graph construction for tensor relationship tracking. Improvement likely from reduced heap fragmentation after leak fixes.

std::vector<gguf_kv>::cbegin (libggml-base.so): Response time increased 84ns → 172ns (+105%, +88ns), throughput time 62ns → 151ns (+141%, +88ns). This is a standard-library accessor; the change is a compiler optimization artifact observed during GGUF metadata parsing, a one-time model-loading operation.

Other analyzed functions (gguf_type_name, std::vector::resize, std::vector::_M_realloc_insert) showed changes under ±26ns with no meaningful impact.

Additional Findings

The commit (d3f8406) prevents memory leaks for three quantization formats without affecting inference performance. Fixing the leaks improves heap allocator efficiency, yielding beneficial side effects in container operations. All GPU backends (CUDA, Metal, HIP, Vulkan) and performance-critical operations remain unmodified. The 45.49-nanojoule increase in energy per run is a negligible cost in any deployment scenario.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 12 times, most recently from 048ad94 to 6c1fde6 Compare February 3, 2026 13:32
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 823244c to bab7d39 Compare February 19, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 4 times, most recently from 9ea4a65 to c001e9f Compare February 22, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 9 times, most recently from ef246cc to 8c889a6 Compare March 2, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 17452e3 to 551dfb5 Compare March 10, 2026 02:17
@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 3c7b997 to 5ac00d6 Compare March 17, 2026 02:18
