UPSTREAM PR #18986: mla : make the V tensor a view of K by loci-dev · Pull Request #990 · auroralabs-loci/llama.cpp

loci-dev · 2026-01-21T08:44:51Z

This is based on top of the changes in ggml-org/llama.cpp#18953.

Currently, the CUDA FA implementation has certain hardcoded assumptions about the layout of the K and V tensors when MLA is involved (see ggml-org/llama.cpp#13435). The goal of this change is to make things more generic and avoid these assumptions:

Change the concat layout of the MLA cache. The old was [pe, kv]. The new one is [kv, pe]. This has certain implications for backends such as Vulkan and overall it's better layout in terms of memory alignment
Update the graph build code to pass the V tensor as a view of K. This can be used as a signal for the CUDA (and later other) backends to avoid loading extra V data during compute. (the elimination of the V component from the llama_kv_cache will be done in follow-up PR - for now it's just a redundant data)
Added tests to exercise the "V is view of K" path of the FA - currently these tests will still fail (for CUDA only) because we changed the layout (see the point above). This needs to be taken into account in the CUDA implementation. For more info, see the comments in the test-backend-ops.cpp

loci-review · 2026-01-21T09:43:50Z

Explore the complete analysis inside the Version Insights

Performance Review Report

Summary

This review analyzes performance changes across 5 commits (33 modified, 37 added, 3 deleted files) in llama.cpp. The changes primarily focus on CUDA Flash Attention optimizations for GLM 4.7 model support and Multi-Latent Attention (MLA) memory efficiency improvements. Performance impacts are negligible at the measured function level, with absolute changes under 200 nanoseconds per function call.

Analysis

The top 4 functions by performance change are all standard library components showing compiler-level optimizations rather than algorithmic modifications:

std::vector::begin() for tensor weights improved throughput by 289% (180ns absolute) but increased response time by 215%. This compiler optimization trade-off favors batch processing during model quantization sorting operations—a non-critical path executed once during model loading, not during inference.

std::vector::begin() for llm_symbol improved 68% (181ns faster) due to better ARM64 code generation with enhanced inlining and register allocation. This accessor is used in tokenization but represents trivial O(1) pointer dereference overhead compared to actual tokenization costs (bigram merging, vocabulary lookups).

_M_destroy destructor for shared_ptr cleanup improved 37.5% (189ns faster). The primary driver is commit 4e23861's MLA optimization making V tensor a view of K, reducing tensor allocations and associated destructor invocations during model loading's parallel validation pipeline.

_M_rep_once_more regex executor improved 13.4% (44ns faster). This standard library function benefits indirectly from commit e659d81's CUDA Flash Attention GQA ratio 4 support, which reduced GPU overhead 2-4x and freed CPU resources for better regex matching performance in structured output parsing.

None of these functions are in performance-critical inference paths identified in project insights (matrix multiplication, attention computation, KV cache operations, quantization kernels). The absolute improvements total under 600 nanoseconds across all measured functions—negligible compared to typical inference latency of millions of nanoseconds per token.

Conclusion

The measured changes represent compiler optimizations and indirect benefits from GPU kernel improvements rather than performance regressions. The MLA memory optimization and CUDA Flash Attention enhancements are beneficial architectural changes that justify any minor overhead in standard library operations.

loci-review · 2026-01-21T11:49:51Z

Explore the complete analysis inside the Version Insights

Performance Review Report: llama.cpp Binary Analysis

Impact Classification: Minor

Total Functions Analyzed: 4
Commit Count: 6 (all CUDA Flash Attention and MLA optimizations)
Files Changed: 33 modified, 37 added, 3 deleted

Summary

Analysis of llama.cpp reveals performance variations in four C++ standard library functions (STL vector accessors, shared_ptr destructor, regex executor) with no source code changes in the application. All differences stem from compiler optimization variations between builds.

Performance Changes:

vector<llm_symbol>::begin(): 180.81 ns faster (264.82→84.01 ns, -68.3%) - tokenization accessor
_M_destroy (shared_ptr): 188.76 ns faster (503.20→314.44 ns, -37.5%) - cleanup phase
_M_rep_once_more (regex): 45.18 ns faster (337.20→292.02 ns, -13.4%) - chat template parsing
vector<tensor_weight*>::begin(): 180.81 ns slower (84.01→264.82 ns, +215.2%) - model loading

Net cumulative throughput: 232.78 ns faster (-28.8%), representing ~116 nanojoules energy savings per execution cycle.

Assessment

None of the analyzed functions are performance-critical. They are not in inference hot paths (matrix operations, attention, KV cache) that dominate 70-90% of execution time. The most impactful change is the tokenization accessor improvement, saving approximately 1 millisecond for 1000-token inputs—representing 0.1% of typical inference latency.

All six commits target GPU operations (CUDA Flash Attention, MLA) with zero overlap to the analyzed CPU-side STL functions. Performance differences reflect build environment variance (compiler version, optimization flags) rather than algorithmic changes. No optimization action required.

noamshiovits-Aurora · 2026-01-22T08:00:57Z

@loci-dev how many binaries were analyzed ?

loci-review · 2026-01-22T08:01:05Z

Based on the performance review report, 1 binary was analyzed: llama.cpp

The report shows analysis of the llama.cpp binary comparing two versions (base and target), examining 4 functions across those versions to understand the performance differences between them.

am17an and others added 5 commits January 20, 2026 10:54

CUDA: add gqa_ratio 4 for GLM 4.7

e659d81

add config for tile kernels

aea53cc

fix mma FA

74f8c2f

address rest of review comments

fe1703a

mla : pass V as a view of K to the FA op

4e23861

loci-dev temporarily deployed to PROD__AL_DEMO January 21, 2026 08:44 — with GitHub Actions Inactive

cuda : adjust mla logic to new layout

f07c65b

loci-dev temporarily deployed to PROD__AL_DEMO January 21, 2026 10:43 — with GitHub Actions Inactive

loci-dev force-pushed the main branch 8 times, most recently from 984fada to 54e0744 Compare January 22, 2026 06:14

loci-dev force-pushed the main branch 10 times, most recently from a54730b to 4f9b49b Compare January 23, 2026 13:22

loci-dev force-pushed the main branch 30 times, most recently from 7236b49 to abd043c Compare January 30, 2026 04:38

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

UPSTREAM PR #18986: mla : make the V tensor a view of K#990

UPSTREAM PR #18986: mla : make the V tensor a view of K#990
loci-dev wants to merge 6 commits intomainfrom
upstream-PR18986-branch_ggml-org-gg/mla-improve

loci-dev commented Jan 21, 2026

Uh oh!

loci-review bot commented Jan 21, 2026

Uh oh!

loci-review bot commented Jan 21, 2026

Uh oh!

noamshiovits-Aurora commented Jan 22, 2026

Uh oh!

loci-review bot commented Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

loci-dev commented Jan 21, 2026

Uh oh!

loci-review bot commented Jan 21, 2026

Performance Review Report

Summary

Analysis

Conclusion

Uh oh!

loci-review bot commented Jan 21, 2026

Performance Review Report: llama.cpp Binary Analysis

Impact Classification: Minor

Summary

Assessment

Uh oh!

noamshiovits-Aurora commented Jan 22, 2026

Uh oh!

loci-review bot commented Jan 22, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants