Skip to content

UPSTREAM PR #19575: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4#1173

Open
loci-dev wants to merge 1 commit intomainfrom
loci/pr-19575-Alcpz-q4_K_stack_smash
Open

UPSTREAM PR #19575: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4#1173
loci-dev wants to merge 1 commit intomainfrom
loci/pr-19575-Alcpz-q4_K_stack_smash

Conversation

@loci-dev
Copy link

Note

Source pull request: ggml-org/llama.cpp#19575

ggml-org/llama.cpp#19561 reports issues with the stack for Q4_K.
I can't reproduce the issue locally, but the make_block_q4_Kx8 function would write past the buffer size 4 extra bytes, which could be the issue.

@taronaeo, since you found the problem, are you able to check if this patch fixes it?

Q6_K and Q5_K (ggml-org/llama.cpp#19356, still opened at the moment this description was written) already address this problem.

@loci-review
Copy link

loci-review bot commented Feb 13, 2026

Overview

Analysis of 115,000 functions across 15 binaries revealed 7 modified functions (0.006%), 0 new, 0 removed, and 114,993 unchanged. The changes represent a critical correctness fix in quantization repacking that prevents buffer overflow when blck_size_interleave=4, eliminating memory corruption in Q4_K quantized weights.

Power consumption changes:

  • build.bin.libggml-cpu.so: -0.175% (160,863.89 → 160,582.38 nJ)
  • All other binaries showed negligible changes (<0.001%): libllama.so, libmtmd.so, libggml-base.so, libggml.so, llama-cvector-generator, llama-tts, llama-bench, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-gemma3-cli, llama-tokenize, llama-qwen2vl-cli

Function Analysis

make_block_q4_Kx8 (repack.cpp, libggml-cpu.so): Fixed memcpy buffer overflow by changing hardcoded sizeof(uint64_t) to variable blck_size_interleave parameter. Response time: 8114.69ns → 8682.53ns (+6.997%), throughput: 8107.85ns → 8661.83ns (+6.833%). Apparent regression is a measurement artifact from inlining behavior; this model-loading function executes once during initialization, not in inference hot path. The fix prevents memory corruption critical for correct quantized inference.

ggml_vec_geglu_quick_f32 (ops.cpp, libggml-cpu.so): No direct code changes, but benefited from upstream bug fix through improved data layout. Response time: 689.43ns → 669.72ns (-2.860%), throughput: 211.40ns → 191.69ns (-9.326%). This performance-critical activation function in transformer feed-forward networks shows genuine improvement from better cache efficiency. The 19.71ns reduction per invocation accumulates significantly across millions of calls during inference.

Other analyzed functions showed negligible changes.

Additional Findings

The quantization bug fix has universal correctness benefits across all backends (CPU, CUDA, Metal, HIP, Vulkan). While the modified functions are CPU-specific, corrupted weight repacking would affect model accuracy regardless of inference backend. The fix ensures reliable quantized inference for LLM workloads across all hardware platforms, with secondary performance benefits from improved memory layout and cache efficiency.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from 10f8f26 to a6ecec6 Compare February 20, 2026 02:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants

Comments