UPSTREAM PR #19575: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4 by loci-dev · Pull Request #1173 · auroralabs-loci/llama.cpp

loci-dev · 2026-02-13T03:09:24Z

Note

Source pull request: ggml-org/llama.cpp#19575

ggml-org/llama.cpp#19561 reports issues with the stack for Q4_K.
I can't reproduce the issue locally, but the make_block_q4_Kx8 function would write past the buffer size 4 extra bytes, which could be the issue.

@taronaeo, since you found the problem, are you able to check if this patch fixes it?

Q6_K and Q5_K (ggml-org/llama.cpp#19356, still opened at the moment this description was written) already address this problem.

loci-review · 2026-02-13T04:10:07Z

Overview

Analysis of 115,000 functions across 15 binaries revealed 7 modified functions (0.006%), 0 new, 0 removed, and 114,993 unchanged. The changes represent a critical correctness fix in quantization repacking that prevents buffer overflow when blck_size_interleave=4, eliminating memory corruption in Q4_K quantized weights.

Power consumption changes:

build.bin.libggml-cpu.so: -0.175% (160,863.89 → 160,582.38 nJ)
All other binaries showed negligible changes (<0.001%): libllama.so, libmtmd.so, libggml-base.so, libggml.so, llama-cvector-generator, llama-tts, llama-bench, llama-gguf-split, llama-llava-cli, llama-minicpmv-cli, llama-quantize, llama-gemma3-cli, llama-tokenize, llama-qwen2vl-cli

Function Analysis

make_block_q4_Kx8 (repack.cpp, libggml-cpu.so): Fixed memcpy buffer overflow by changing hardcoded sizeof(uint64_t) to variable blck_size_interleave parameter. Response time: 8114.69ns → 8682.53ns (+6.997%), throughput: 8107.85ns → 8661.83ns (+6.833%). Apparent regression is a measurement artifact from inlining behavior; this model-loading function executes once during initialization, not in inference hot path. The fix prevents memory corruption critical for correct quantized inference.

ggml_vec_geglu_quick_f32 (ops.cpp, libggml-cpu.so): No direct code changes, but benefited from upstream bug fix through improved data layout. Response time: 689.43ns → 669.72ns (-2.860%), throughput: 211.40ns → 191.69ns (-9.326%). This performance-critical activation function in transformer feed-forward networks shows genuine improvement from better cache efficiency. The 19.71ns reduction per invocation accumulates significantly across millions of calls during inference.

Other analyzed functions showed negligible changes.

Additional Findings

The quantization bug fix has universal correctness benefits across all backends (CPU, CUDA, Metal, HIP, Vulkan). While the modified functions are CPU-specific, corrupted weight repacking would affect model accuracy regardless of inference backend. The fix ensures reliable quantized inference for LLM workloads across all hardware platforms, with secondary performance benefits from improved memory layout and cache efficiency.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

Fix wrong memcpy length for block_interleave == 4

97c579d

loci-dev temporarily deployed to PROD__AL_DEMO February 13, 2026 03:09 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 10 times, most recently from 10f8f26 to a6ecec6 Compare February 20, 2026 02:17

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #19575: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4#1173

UPSTREAM PR #19575: ggml-cpu: arm64: Fix wrong memcpy length for q4_K block_interleave == 4#1173
loci-dev wants to merge 1 commit intomainfrom
loci/pr-19575-Alcpz-q4_K_stack_smash

loci-dev commented Feb 13, 2026

Uh oh!

loci-review bot commented Feb 13, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments

Conversation

loci-dev commented Feb 13, 2026

Uh oh!

loci-review bot commented Feb 13, 2026

Overview

Function Analysis

Additional Findings

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Comments