
UPSTREAM PR #19209: ggml-cpu: split across kv for faster TG (#1081)

Open
loci-dev wants to merge 1 commit into main from upstream-PR19209-branch_am17an-opt-fa-decode

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19209

Continuing on #19012: in the FA CPU implementation we don't parallelize across the context size, so each thread reads the entire sequence when calculating its values. This PR introduces chunking across the context size, where each thread maintains partial accumulators for the sum, maximum, and running VKQ sum of the soft-max calculation, computing these for all query heads. An extra reduction step at the end combines the partials. The original idea is from https://pytorch.org/blog/flash-decoding/, though @JohannesGaessler pointed out that the CUDA FA kernel already does this.

Tested on three models, and the results are good: larger (>2x) speed-ups as the context size grows. Note that at lower contexts the results are a bit noisy.

Attachment: benchmark_comparison_log

@loci-review

loci-review bot commented Jan 30, 2026

Overview

Commit b8e5c58 ("ggml-cpu: split across kv for faster TG") by Aman Gupta implements a split KV cache optimization for token generation in the CPU backend. Analysis covers 115,129 total functions with 13 modified, 3 new, and 2 removed. Power consumption increased minimally in build.bin.libggml-cpu.so (+0.483%, +761.65 nJ) with no measurable change in the remaining 14 binaries: build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-gemma3-cli, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli, and build.bin.llama-bench.

Function Analysis

ggml_compute_forward_flash_attn_ext_f16 (performance-critical): Response time increased 909ns (+3.23%), throughput time increased 639ns (+27.07%). The changes add a split-KV path for parallel processing during token generation: threads process disjoint KV chunks, synchronize at a barrier, then reduce partial results using log-sum-exp. The 27% throughput increase reflects coordination overhead (barriers, partial buffers, reduction logic) not offset by parallelization benefits in static analysis. The path activates only for single-query decode, F32/F16 types, and sequences of ≥512 tokens.

ggml_graph_plan (planning phase): Response time increased 130ns (+1.08%), throughput time increased 107ns (+2.71%). Enhanced with dual-path memory calculation for Flash Attention: separate sizing for prefill (tiled) vs decode (chunked KV) phases, allocating MAX of both. Achieves ~90% reduction in decode-phase scratch buffers. Overhead justified as one-time per-batch cost for significant memory efficiency.

apply_binary_op (multiply): Response time improved 43ns (-1.25%), throughput time improved 51ns (-3.38%). No source changes; improvement attributed to reduced memory pressure from Flash Attention optimization, yielding better cache locality system-wide.

apply_unary_op (floor/bf16): Response time improved 67ns (-4.11%), throughput time improved 68ns (-6.97%). No source changes; benefits from improved cache behavior.

Other analyzed functions (argmax, unary_op variants for sqr/hardsigmoid/relu, CPU feature detection) showed minor changes (±5-153ns) from compiler artifacts or cache improvements, with no source modifications and negligible practical impact.

Additional Findings

Changes target attention mechanisms and KV cache management—two of llama.cpp's top three performance-critical areas. The split KV optimization specifically addresses token generation latency for long-context scenarios (≥512 tokens) with multi-threaded CPU execution. Static analysis shows coordination overhead but cannot capture real-world parallelization benefits. Memory efficiency improvements cascade system-wide, validating the optimization strategy. No GPU backend impact; changes isolated to CPU operations.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 8 times, most recently from cfee0bd to c1b35fd on January 31, 2026 02:05
@loci-dev force-pushed the main branch 17 times, most recently from dcfc127 to e04dda7 on January 31, 2026 19:09
@loci-dev force-pushed the main branch 30 times, most recently from b5cfcd3 to c1988fc on February 2, 2026 15:21