
UPSTREAM PR #19209: ggml-cpu: split across kv for faster TG (#1081)

Open
loci-dev wants to merge 1 commit into main from upstream-PR19209-branch_am17an-opt-fa-decode

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#19209

Continuing on #19012: in the FA CPU implementation we don't parallelize across the context size, so each thread reads the entire sequence when calculating its values. This PR introduces chunking across the context size, where each thread maintains partial accumulators for the sum, maximum, and running VKQ sum of the soft-max calculation, computing these for all query heads. An extra reduction step at the end combines the partials. The original idea is from https://pytorch.org/blog/flash-decoding/, though @JohannesGaessler pointed out that the CUDA FA kernel already does this.

Tested on three models, and the results are good: larger (>2x) speed-ups as the context size grows. Note that at lower contexts the results are a bit noisy.

Attachment: benchmark_comparison_log

@loci-review

loci-review bot commented Jan 30, 2026

Overview

Commit b8e5c58 ("ggml-cpu: split across kv for faster TG") by Aman Gupta implements a split KV cache optimization for token generation in the CPU backend. Analysis covers 115,129 total functions with 13 modified, 3 new, and 2 removed. Power consumption increased minimally in build.bin.libggml-cpu.so (+0.483%, +761.65 nJ) with no measurable change in the remaining 14 binaries: build.bin.llama-tts, build.bin.llama-cvector-generator, build.bin.libllama.so, build.bin.libmtmd.so, build.bin.libggml-base.so, build.bin.libggml.so, build.bin.llama-gguf-split, build.bin.llama-llava-cli, build.bin.llama-minicpmv-cli, build.bin.llama-quantize, build.bin.llama-gemma3-cli, build.bin.llama-tokenize, build.bin.llama-qwen2vl-cli, and build.bin.llama-bench.

Function Analysis

ggml_compute_forward_flash_attn_ext_f16 (performance-critical): Response time increased 909ns (+3.23%), throughput time increased 639ns (+27.07%). The changes add a split-KV path for parallel processing during token generation: threads process disjoint KV chunks, synchronize at a barrier, then reduce partial results using log-sum-exp. The 27% throughput increase reflects coordination overhead (barriers, partial buffers, reduction logic) not offset by parallelization benefits in static analysis. The path activates only for single-query decode, F32/F16 types, and sequences of ≥512 tokens.

ggml_graph_plan (planning phase): Response time increased 130ns (+1.08%), throughput time increased 107ns (+2.71%). Enhanced with dual-path memory calculation for Flash Attention: separate sizing for prefill (tiled) vs decode (chunked KV) phases, allocating MAX of both. Achieves ~90% reduction in decode-phase scratch buffers. Overhead justified as one-time per-batch cost for significant memory efficiency.

apply_binary_op (multiply): Response time improved 43ns (-1.25%), throughput time improved 51ns (-3.38%). No source changes; improvement attributed to reduced memory pressure from Flash Attention optimization, yielding better cache locality system-wide.

apply_unary_op (floor/bf16): Response time improved 67ns (-4.11%), throughput time improved 68ns (-6.97%). No source changes; benefits from improved cache behavior.

Other analyzed functions (argmax, unary_op variants for sqr/hardsigmoid/relu, CPU feature detection) showed minor changes (±5-153ns) from compiler artifacts or cache improvements, with no source modifications and negligible practical impact.

Additional Findings

Changes target attention mechanisms and KV cache management—two of llama.cpp's top three performance-critical areas. The split KV optimization specifically addresses token generation latency for long-context scenarios (≥512 tokens) with multi-threaded CPU execution. Static analysis shows coordination overhead but cannot capture real-world parallelization benefits. Memory efficiency improvements cascade system-wide, validating the optimization strategy. No GPU backend impact; changes isolated to CPU operations.

🔎 Full breakdown: Loci Inspector.
💬 Questions? Tag @loci-dev.

@loci-dev force-pushed the main branch 8 times, most recently from cfee0bd to c1b35fd on January 31, 2026 02:05
@loci-dev force-pushed the main branch 17 times, most recently from dcfc127 to e04dda7 on January 31, 2026 19:09
@loci-dev force-pushed the main branch 30 times, most recently from b5cfcd3 to c1988fc on February 2, 2026 15:21