Skip to content

UPSTREAM PR #18986: mla : make the V tensor a view of K#990

Open
loci-dev wants to merge 6 commits intomainfrom
upstream-PR18986-branch_ggml-org-gg/mla-improve
Open

UPSTREAM PR #18986: mla : make the V tensor a view of K#990
loci-dev wants to merge 6 commits intomainfrom
upstream-PR18986-branch_ggml-org-gg/mla-improve

Conversation

@loci-dev
Copy link

Mirrored from ggml-org/llama.cpp#18986

cont ggml-org/llama.cpp#18953

This is based on top of the changes in ggml-org/llama.cpp#18953.

Currently, the CUDA FA implementation has certain hardcoded assumptions about the layout of the K and V tensors when MLA is involved (see ggml-org/llama.cpp#13435). The goal of this change is to make things more generic and avoid these assumptions:

  • Change the concat layout of the MLA cache. The old was [pe, kv]. The new one is [kv, pe]. This has certain implications for backends such as Vulkan and overall it's better layout in terms of memory alignment
  • Update the graph build code to pass the V tensor as a view of K. This can be used as a signal for the CUDA (and later other) backends to avoid loading extra V data during compute. (the elimination of the V component from the llama_kv_cache will be done in follow-up PR - for now it's just a redundant data)
  • Added tests to exercise the "V is view of K" path of the FA - currently these tests will still fail (for CUDA only) because we changed the layout (see the point above). This needs to be taken into account in the CUDA implementation. For more info, see the comments in the test-backend-ops.cpp

@loci-review
Copy link

loci-review bot commented Jan 21, 2026

Explore the complete analysis inside the Version Insights

Performance Review Report

Summary

This review analyzes performance changes across 5 commits (33 modified, 37 added, 3 deleted files) in llama.cpp. The changes primarily focus on CUDA Flash Attention optimizations for GLM 4.7 model support and Multi-Latent Attention (MLA) memory efficiency improvements. Performance impacts are negligible at the measured function level, with absolute changes under 200 nanoseconds per function call.

Analysis

The top 4 functions by performance change are all standard library components showing compiler-level optimizations rather than algorithmic modifications:

std::vector::begin() for tensor weights improved throughput by 289% (180ns absolute) but increased response time by 215%. This compiler optimization trade-off favors batch processing during model quantization sorting operations—a non-critical path executed once during model loading, not during inference.

std::vector::begin() for llm_symbol improved 68% (181ns faster) due to better ARM64 code generation with enhanced inlining and register allocation. This accessor is used in tokenization but represents trivial O(1) pointer dereference overhead compared to actual tokenization costs (bigram merging, vocabulary lookups).

_M_destroy destructor for shared_ptr cleanup improved 37.5% (189ns faster). The primary driver is commit 4e23861's MLA optimization making V tensor a view of K, reducing tensor allocations and associated destructor invocations during model loading's parallel validation pipeline.

_M_rep_once_more regex executor improved 13.4% (44ns faster). This standard library function benefits indirectly from commit e659d81's CUDA Flash Attention GQA ratio 4 support, which reduced GPU overhead 2-4x and freed CPU resources for better regex matching performance in structured output parsing.

None of these functions are in performance-critical inference paths identified in project insights (matrix multiplication, attention computation, KV cache operations, quantization kernels). The absolute improvements total under 600 nanoseconds across all measured functions—negligible compared to typical inference latency of millions of nanoseconds per token.

Conclusion

The measured changes represent compiler optimizations and indirect benefits from GPU kernel improvements rather than performance regressions. The MLA memory optimization and CUDA Flash Attention enhancements are beneficial architectural changes that justify any minor overhead in standard library operations.

@loci-review
Copy link

loci-review bot commented Jan 21, 2026

Explore the complete analysis inside the Version Insights

Performance Review Report: llama.cpp Binary Analysis

Impact Classification: Minor

Total Functions Analyzed: 4
Commit Count: 6 (all CUDA Flash Attention and MLA optimizations)
Files Changed: 33 modified, 37 added, 3 deleted

Summary

Analysis of llama.cpp reveals performance variations in four C++ standard library functions (STL vector accessors, shared_ptr destructor, regex executor) with no source code changes in the application. All differences stem from compiler optimization variations between builds.

Performance Changes:

  • vector<llm_symbol>::begin(): 180.81 ns faster (264.82→84.01 ns, -68.3%) - tokenization accessor
  • _M_destroy (shared_ptr): 188.76 ns faster (503.20→314.44 ns, -37.5%) - cleanup phase
  • _M_rep_once_more (regex): 45.18 ns faster (337.20→292.02 ns, -13.4%) - chat template parsing
  • vector<tensor_weight*>::begin(): 180.81 ns slower (84.01→264.82 ns, +215.2%) - model loading

Net cumulative throughput: 232.78 ns faster (-28.8%), representing ~116 nanojoules energy savings per execution cycle.

Assessment

None of the analyzed functions are performance-critical. They are not in inference hot paths (matrix operations, attention, KV cache) that dominate 70-90% of execution time. The most impactful change is the tokenization accessor improvement, saving approximately 1 millisecond for 1000-token inputs—representing 0.1% of typical inference latency.

All six commits target GPU operations (CUDA Flash Attention, MLA) with zero overlap to the analyzed CPU-side STL functions. Performance differences reflect build environment variance (compiler version, optimization flags) rather than algorithmic changes. No optimization action required.

@loci-dev loci-dev force-pushed the main branch 8 times, most recently from 984fada to 54e0744 Compare January 22, 2026 06:14
@noamshiovits-Aurora
Copy link

@loci-dev how many binaries were analyzed ?

@loci-review
Copy link

loci-review bot commented Jan 22, 2026

Based on the performance review report, 1 binary was analyzed: llama.cpp

The report shows analysis of the llama.cpp binary comparing two versions (base and target), examining 4 functions across those versions to understand the performance differences between them.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a54730b to 4f9b49b Compare January 23, 2026 13:22
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 7236b49 to abd043c Compare January 30, 2026 04:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants