Conversation

@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16899

This is similar to what I did for the K/V bindings in the FA shader, just a lot more bindings. I did it this way to be able to keep using a single descriptor set layout, for simplicity. In the long run I think we should switch all shaders to use a single binding with N array elements, to make it possible to do indexing. But that's a bigger change than I want to do to fix this bug.

I removed the code that disabled this path for Intel, presumably it'll work after this.
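For illustration, the single-binding idea could look roughly like this on the shader side (a sketch, not code from this PR; the array size and names are made up):

```glsl
#version 450

layout(local_size_x = 64) in;

// Current style (simplified): one named binding per tensor, e.g.
//   layout (binding = 0) readonly buffer A { float data_a[]; };
//   layout (binding = 1) writeonly buffer D { float data_d[]; };

// Single-binding alternative: N array elements under one binding
// (one descriptor with descriptorCount = 8), which makes runtime
// indexing into the buffer array possible:
layout (binding = 0) buffer Tensors { float data[]; } tensors[8];

void main() {
    uint i = gl_GlobalInvocationID.x;
    // pick source/destination buffers by index instead of by name;
    // non-uniform indices would additionally need GL_EXT_nonuniform_qualifier
    tensors[1].data[i] = tensors[0].data[i];
}
```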

danbev and others added 30 commits September 29, 2025 17:43
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixes for review comments: file renamed and formatted

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fix code format after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: fix cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* fix syntax error in build-linux-cross.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <[email protected]>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <[email protected]>

* Add missing headers

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
* fix ccache key for ubuntu-cpu-cmake

* set it for release as well [no ci]
…#16359)

* Make a few GLM tensors not required

layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6, which caused the model to fail to load after conversion/quantization. Marking those tensors as not required makes it work.

* Update llama-model.cpp

layer.nextn.shared_head_norm is also marked as not required, in case of future models
…(#16345)

* make ggml_vk_default_dispatcher support older vulkan headers

* simplify with `using`
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

on windows we now install rocwmma from ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
jeffbolznv and others added 5 commits October 29, 2025 14:44
* vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges

* Add sync logging to show fusion effects

* handle clamp added in #16655

* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <[email protected]>
* llama: store mrope data in KV cell

* correct x,y ordering

* address review comments

* add consistency checks

* Update src/llama-kv-cache.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* add TODO

* fix asan error

* kv-cells : improve ext handling

* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.

Add new backend tests.
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Based on the performance analysis data, I'll provide a summary focusing on the critical functions and their impact on key performance indicators.

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Changes

Response Time Degradations

  • Worst Impact: unicode.cpp__ZZ15unicode_tolowerjENKUlRKSt4pairIjjEjE_clES2_j (operator function)
    • Change: +0.08% (19 ns vs 19 ns base)
    • Location: Unicode processing within tokenization pipeline
    • Impact: Affects text preprocessing in llama_tokenize() workflow

Throughput Degradations

  • Primary Impact: Same unicode operator function
    • Self-time increase: +0.08%
    • Function role: Unicode lowercase conversion for token processing

Bottleneck Analysis

  • Most Significant: _ZSt10_ConstructISt6vectorI21llama_grammar_elementSaIS1_EEJRKS3_EEvPT_DpOT0_ (_Construct)
    • Change: +0.13% (20 ns vs 20 ns base)
    • Location: STL vector construction for grammar elements
    • Context: Grammar processing within tokenization

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact Assessment: Minimal degradation expected

Affected Functions:

  • llama_tokenize(): Indirectly affected through unicode processing degradation
    • Unicode operator function shows 0.08% increase in processing time
    • Grammar element construction shows 0.13% bottleneck increase

Quantified Impact:

  • Based on reference data (7% tokens/sec reduction for 2ms llama_decode slowdown)
  • Current unicode function degradation: 0.015 ns increase
  • Estimated impact: <0.001% reduction in tokens per second
  • Reasoning: Unicode processing represents small fraction of total tokenization time
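The per-call figure follows directly from the numbers above (a trivial sketch; `abs_increase_ns` is illustrative, the 19 ns baseline and +0.08% delta are the figures quoted above):

```cpp
#include <cassert>
#include <cmath>

// Absolute per-call slowdown implied by a relative regression,
// e.g. a +0.08% change on a 19 ns baseline gives ~0.015 ns per call.
double abs_increase_ns(double base_ns, double rel_change) {
    return base_ns * rel_change;
}
```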

Critical Functions Status:

  • llama_decode(): No performance data available in current analysis
  • llama_encode(): No performance data available in current analysis
  • Core inference functions appear unaffected by measured changes

2. Power Consumption

Binary-Level Analysis:

Impacted Binaries:

  • build.bin.libllama.so: -0.82 nJ decrease (305,211.61 nJ vs 305,212.44 nJ base)
  • build.bin.libggml-base.so: No change (90,434.19 nJ)
  • build.bin.libggml-cpu.so: No change (151,692.17 nJ)
  • build.bin.libggml.so: No change (6,339.24 nJ)

Net Effect: Negligible power consumption improvement (-0.0% overall change)

3. Quantization Efficiency

Analysis: No direct impact detected

Reasoning:

  • Performance changes isolated to unicode processing and STL operations
  • Core quantization functions (llama_model_quantize(), quantization backends) not affected
  • No changes in GGML tensor operations or quantization algorithms

4. Memory Usage

Affected Areas:

  • Grammar Element Allocation: STL vector construction bottleneck (+0.13%)
    • Function: _Construct for llama_grammar_element vectors
    • Impact: Slight increase in memory allocation overhead
    • Scope: Grammar processing during tokenization setup

Memory Management Functions:

  • Core memory functions (llama_memory_clear(), llama_memory_seq_rm()) not directly impacted
  • KV cache operations appear unaffected

5. Batch Processing

Analysis: No measurable impact

Reasoning:

  • Batch processing functions (llama_batch_init(), llama_decode() with batches) not affected
  • Performance changes limited to preprocessing stages
  • Parallel token processing efficiency maintained

Root Cause Analysis

Unicode Processing Degradation

  • Assembly Analysis: Identical instruction sequences between versions
  • Likely Cause: Compiler optimization differences or memory layout changes
  • Function Behavior: No logical changes detected

Grammar Construction Bottleneck

  • Pattern: STL template instantiation performance variation
  • Context: Grammar element vector allocation during tokenization setup
  • Scope: Initialization phase, not runtime inference

Actionable Recommendations

Immediate Actions

  1. Build Environment Validation

    • Verify identical compiler versions and optimization flags between builds
    • Ensure consistent build environment to eliminate measurement noise
  2. Unicode Processing Optimization

    • Consider inlining unicode operator functions to eliminate function call overhead
    • Review compiler optimization flags specifically for unicode processing modules
  3. Grammar Element Management

    • Evaluate custom allocators for llama_grammar_element vectors
    • Consider pre-allocation strategies to reduce dynamic allocation overhead

Code-Level Optimizations

  1. Function Inlining

    // Consider marking hot unicode helpers as inline or constexpr
    // (illustrative sketch; the helper name is hypothetical)
    inline uint32_t unicode_tolower_cp(uint32_t cp) { /* ... */ }
  2. Memory Allocation Strategy

    // Pre-allocate grammar element vectors with expected capacity
    std::vector<llama_grammar_element> elements;
    elements.reserve(expected_size);

Build System Improvements

  1. Optimization Flag Review

    • Ensure -O3 or equivalent optimization levels for unicode processing
    • Consider profile-guided optimization (PGO) for hot paths
  2. Compiler-Specific Tuning

    • Evaluate architecture-specific optimizations (-march=native)
    • Review template instantiation optimization settings

Performance Impact Summary

Overall Assessment: Changes represent measurement noise rather than functional regressions

Critical Insights:

  • Core inference functions (llama_decode(), llama_encode()) remain unaffected
  • Power consumption shows slight improvement despite localized degradations
  • Tokenization preprocessing shows minimal overhead increases
  • Batch processing and quantization efficiency maintained

Priority Focus: Build environment consistency and compiler optimization validation rather than algorithmic changes, as performance variations appear to be compilation-related rather than code logic changes.
