Conversation

@DajanaV DajanaV commented Oct 31, 2025

Mirrored from ggml-org/llama.cpp#16899

This is similar to what I did for the K/V bindings in the FA shader, just a lot more bindings. I did it this way to be able to keep using a single descriptor set layout, for simplicity. In the long run I think we should switch all shaders to use a single binding with N array elements, to make it possible to do indexing. But that's a bigger change than I want to do to fix this bug.

I removed the code that disabled this path for Intel, presumably it'll work after this.
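For illustration, the single-binding idea could look roughly like this on the shader side (a sketch, not code from this PR; the array size and names are made up):

```glsl
#version 450

layout(local_size_x = 64) in;

// Current style (simplified): one named binding per tensor, e.g.
//   layout (binding = 0) readonly buffer A { float data_a[]; };
//   layout (binding = 1) writeonly buffer D { float data_d[]; };

// Single-binding alternative: N array elements under one binding
// (one descriptor with descriptorCount = 8), which makes runtime
// indexing into the buffer array possible:
layout (binding = 0) buffer Tensors { float data[]; } tensors[8];

void main() {
    uint i = gl_GlobalInvocationID.x;
    // pick source/destination buffers by index instead of by name;
    // non-uniform indices would additionally need GL_EXT_nonuniform_qualifier
    tensors[1].data[i] = tensors[0].data[i];
}
```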

danbev and others added 30 commits September 29, 2025 17:43
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixes for review comments: file renamed and formatted

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fix code format after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: fix cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* fix syntax error in build-linux-cross.yml

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <[email protected]>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
* common : disable progress bar without a tty

Signed-off-by: Adrien Gallouët <[email protected]>

* Add missing headers

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
* fix ccache key for ubuntu-cpu-cmake

* set it for release as well [no ci]
…#16359)

* Make a few GLM tensors not required

layer.nextn.shared_head_head and layer.nextn.embed_tokens are both excluded from GLM 4.6, which caused the model to fail to load after conversion/quantization. Marking those tensors as not required makes it work.

* Update llama-model.cpp

layer.nextn.shared_head_norm is also marked as not required, in case of future models
…(#16345)

* make ggml_vk_default_dispatcher support older vulkan headers

* simplify with `using`
* feat: Add a setting to include model name used to generate the message

* feat: UI improvements

* feat: Save model info along with the database message entry creation

* chore: Build webui static output
* feat: Improve code block theming

* chore: update webui build output

* chore: Update webui static build
…onditional rendering for Actions Dropdown for Chat Conversation Items (#16369)

* fix: Render Conversation action dialogs as singletons from Chat Sidebar level

* chore: update webui build output

* fix: Render Actions Dropdown conditionally only when user hovers conversation item + remove unused markup

* chore: Update webui static build

* fix: Always truncate conversation names

* chore: Update webui static build
* common: introduce http.h for httplib-based client

This change moves cpp-httplib based URL parsing and client setup into
a new header `common/http.h`, and integrates it in `arg.cpp` and `run.cpp`.

It is an iteration towards removing libcurl, while intentionally
minimizing changes to existing code to guarantee the same behavior when
`LLAMA_CURL` is used.

Signed-off-by: Adrien Gallouët <[email protected]>

* tools : add missing WIN32_LEAN_AND_MEAN

Signed-off-by: Adrien Gallouët <[email protected]>

---------

Signed-off-by: Adrien Gallouët <[email protected]>
Signed-off-by: Adrien Gallouët <[email protected]>
* CI: Properly install rocwmma for hip builds

on windows we now install rocwmma from ubuntu packages

* CI: update linux rocm docker build to use rocm 7.0
jeffbolznv and others added 5 commits October 29, 2025 14:44
* vulkan: Update topk_moe fusion to handle gpt's late softmax

Based on #16649.

* Add ggml_check_edges

* Add sync logging to show fusion effects

* handle clamp added in #16655

* Update ggml/src/ggml-impl.h

Co-authored-by: Diego Devesa <[email protected]>
* llama: store mrope data in KV cell

* correct x,y ordering

* address review comments

* add consistency checks

* Update src/llama-kv-cache.cpp

Co-authored-by: Georgi Gerganov <[email protected]>

* add TODO

* fix asan error

* kv-cells : improve ext handling

* cont : fix headers

---------

Co-authored-by: Georgi Gerganov <[email protected]>
This pattern appears in a lot of models, the rope operation is applied right
before storing into the KV cache (usually on the K tensor).

Add a path to some of the rope shaders that computes the destination address
based on the set_rows tensor. Compile variants of the shader with D_TYPE of
f16 (the usual KV cache type).

Add a src3 operand to ggml_vk_op_f32 - sometimes rope uses three srcs and needs
the fourth for the row indices.

Add fused_ops_write_mask to indicate which intermediate tensors need to write
their results to memory. Skipping writing the roped K value helps to allow more
nodes to run concurrently.

Add logic to ggml_vk_graph_optimize to make ROPE+VIEW+SET_ROWS consecutive. It
rarely starts out that way in the graph.

Add new backend tests.
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Based on the performance analysis data, I'll provide a summary focusing on the critical functions and their impact on key performance indicators.

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Changes

Response Time Degradations

  • Worst Impact: unicode.cpp__ZZ15unicode_tolowerjENKUlRKSt4pairIjjEjE_clES2_j (operator function)
    • Change: +0.08% (19 ns vs 19 ns base)
    • Location: Unicode processing within tokenization pipeline
    • Impact: Affects text preprocessing in llama_tokenize() workflow

Throughput Degradations

  • Primary Impact: Same unicode operator function
    • Self-time increase: +0.08%
    • Function role: Unicode lowercase conversion for token processing

Bottleneck Analysis

  • Most Significant: _ZSt10_ConstructISt6vectorI21llama_grammar_elementSaIS1_EEJRKS3_EEvPT_DpOT0_ (_Construct)
    • Change: +0.13% (20 ns vs 20 ns base)
    • Location: STL vector construction for grammar elements
    • Context: Grammar processing within tokenization

Key Performance Indicator Impact Analysis

1. Tokens Per Second

Impact Assessment: Minimal degradation expected

Affected Functions:

  • llama_tokenize(): Indirectly affected through unicode processing degradation
    • Unicode operator function shows 0.08% increase in processing time
    • Grammar element construction shows 0.13% bottleneck increase

Quantified Impact:

  • Based on reference data (7% tokens/sec reduction for 2ms llama_decode slowdown)
  • Current unicode function degradation: 0.015 ns increase
  • Estimated impact: <0.001% reduction in tokens per second
  • Reasoning: Unicode processing represents small fraction of total tokenization time
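The per-call figure follows directly from the numbers above (a trivial sketch; `abs_increase_ns` is illustrative, the 19 ns baseline and +0.08% delta are the figures quoted above):

```cpp
#include <cassert>
#include <cmath>

// Absolute per-call slowdown implied by a relative regression,
// e.g. a +0.08% change on a 19 ns baseline gives ~0.015 ns per call.
double abs_increase_ns(double base_ns, double rel_change) {
    return base_ns * rel_change;
}
```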

Critical Functions Status:

  • llama_decode(): No performance data available in current analysis
  • llama_encode(): No performance data available in current analysis
  • Core inference functions appear unaffected by measured changes

2. Power Consumption

Binary-Level Analysis:

Impacted Binaries:

  • build.bin.libllama.so: -0.82 nJ decrease (305,211.61 nJ vs 305,212.44 nJ base)
  • build.bin.libggml-base.so: No change (90,434.19 nJ)
  • build.bin.libggml-cpu.so: No change (151,692.17 nJ)
  • build.bin.libggml.so: No change (6,339.24 nJ)

Net Effect: Negligible power consumption improvement (-0.0% overall change)

3. Quantization Efficiency

Analysis: No direct impact detected

Reasoning:

  • Performance changes isolated to unicode processing and STL operations
  • Core quantization functions (llama_model_quantize(), quantization backends) not affected
  • No changes in GGML tensor operations or quantization algorithms

4. Memory Usage

Affected Areas:

  • Grammar Element Allocation: STL vector construction bottleneck (+0.13%)
    • Function: _Construct for llama_grammar_element vectors
    • Impact: Slight increase in memory allocation overhead
    • Scope: Grammar processing during tokenization setup

Memory Management Functions:

  • Core memory functions (llama_memory_clear(), llama_memory_seq_rm()) not directly impacted
  • KV cache operations appear unaffected

5. Batch Processing

Analysis: No measurable impact

Reasoning:

  • Batch processing functions (llama_batch_init(), llama_decode() with batches) not affected
  • Performance changes limited to preprocessing stages
  • Parallel token processing efficiency maintained

Root Cause Analysis

Unicode Processing Degradation

  • Assembly Analysis: Identical instruction sequences between versions
  • Likely Cause: Compiler optimization differences or memory layout changes
  • Function Behavior: No logical changes detected

Grammar Construction Bottleneck

  • Pattern: STL template instantiation performance variation
  • Context: Grammar element vector allocation during tokenization setup
  • Scope: Initialization phase, not runtime inference

Actionable Recommendations

Immediate Actions

  1. Build Environment Validation

    • Verify identical compiler versions and optimization flags between builds
    • Ensure consistent build environment to eliminate measurement noise
  2. Unicode Processing Optimization

    • Consider inlining unicode operator functions to eliminate function call overhead
    • Review compiler optimization flags specifically for unicode processing modules
  3. Grammar Element Management

    • Evaluate custom allocators for llama_grammar_element vectors
    • Consider pre-allocation strategies to reduce dynamic allocation overhead

Code-Level Optimizations

  1. Function Inlining

    // Consider marking hot unicode helpers as inline or constexpr
    // (illustrative sketch; the helper name is hypothetical)
    inline uint32_t unicode_tolower_cp(uint32_t cp) { /* ... */ }
  2. Memory Allocation Strategy

    // Pre-allocate grammar element vectors with expected capacity
    std::vector<llama_grammar_element> elements;
    elements.reserve(expected_size);

Build System Improvements

  1. Optimization Flag Review

    • Ensure -O3 or equivalent optimization levels for unicode processing
    • Consider profile-guided optimization (PGO) for hot paths
  2. Compiler-Specific Tuning

    • Evaluate architecture-specific optimizations (-march=native)
    • Review template instantiation optimization settings

Performance Impact Summary

Overall Assessment: Changes represent measurement noise rather than functional regressions

Critical Insights:

  • Core inference functions (llama_decode(), llama_encode()) remain unaffected
  • Power consumption shows slight improvement despite localized degradations
  • Tokenization preprocessing shows minimal overhead increases
  • Batch processing and quantization efficiency maintained

Priority Focus: Build environment consistency and compiler optimization validation rather than algorithmic changes, as performance variations appear to be compilation-related rather than code logic changes.
