Conversation

@DajanaV (Contributor) commented on Nov 3, 2025

Mirrored from ggml-org/llama.cpp#16940

🧩 Summary

This PR adds a CI workflow for end-to-end embedding CLI tests (none exist today). It establishes a small, fast, reproducible baseline for validating embedding behavior (dimensions + determinism) using tiny GGUF models.

Discussion / design context: See the companion RFC in Discussions for the longer-term plan to add a native server endpoint: ggml-org/llama.cpp#16957

⚙️ What this PR includes

  • A GitHub Actions job (embeddings.yml) that runs E2E embedding CLI tests with cached tiny models (e.g., TinyLlama).
  • Checks output dimensions and deterministic behavior (a hedged test sketch follows this list).
  • Keeps runs lightweight and fast; an optional large-model stress test can be added later.
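
As a rough illustration of what the dimension and determinism checks could look like (not the actual workflow contents), here is a minimal C++ sketch that runs an embedding CLI twice and compares the parsed vectors. The binary name, flags, model path, and expected dimension are assumptions for illustration only.

```cpp
// Sketch: run an embedding CLI twice, check dimension count and determinism.
// Binary name, flags, model path, and expected dimension are illustrative assumptions.
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

static std::vector<float> run_embedding(const std::string & cmd) {
    std::string out;
    FILE * pipe = popen(cmd.c_str(), "r");
    if (!pipe) return {};
    char buf[4096];
    while (fgets(buf, sizeof(buf), pipe)) out += buf;
    pclose(pipe);
    std::istringstream iss(out);
    std::vector<float> emb;
    float v;
    while (iss >> v) emb.push_back(v);   // parse whitespace-separated floats
    return emb;
}

int main() {
    const std::string cmd =
        "./llama-embedding -m models/tiny.gguf -p \"hello world\" --embd-output-format raw 2>/dev/null";
    const std::vector<float> a = run_embedding(cmd);
    const std::vector<float> b = run_embedding(cmd);
    const size_t expected_dim = 2048;                      // illustrative; depends on the tiny model used
    if (a.empty() || a.size() != expected_dim) return 1;   // dimension check
    if (a != b)                                return 2;   // determinism check (identical output across runs)
    std::puts("embedding CLI check passed");
    return 0;
}
```

In the workflow itself, a check of this shape would run against the cached tiny GGUF models mentioned above.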

CISC and others added 30 commits September 28, 2025 23:15
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32

* add test that fails on simd
Adds additional percentile data to the output of `llama-perplexity --kl-divergence` (a small computation sketch follows this list):
- Added the 95th percentile (mirroring the existing 5th percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
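
For context on what these rows report, here is a hedged sketch of nearest-rank percentile extraction over per-token KL-divergence values; the exact interpolation used by `llama-perplexity` may differ.

```cpp
// Sketch: nearest-rank percentile over a vector of per-token KL divergences.
// The interpolation scheme used by llama-perplexity may differ; this is illustrative.
#include <algorithm>
#include <cstdio>
#include <vector>

static float percentile(std::vector<float> v, float p) {
    if (v.empty()) return 0.0f;
    std::sort(v.begin(), v.end());
    const size_t idx = std::min(v.size() - 1,
        static_cast<size_t>(p / 100.0f * (v.size() - 1) + 0.5f));
    return v[idx];
}

int main() {
    std::vector<float> kld = {0.01f, 0.02f, 0.05f, 0.10f, 0.50f, 1.20f};   // toy values
    for (float p : {0.1f, 5.0f, 95.0f, 99.9f}) {                           // the reported percentiles
        std::printf("%5.1f%%: %.4f\n", p, percentile(kld, p));
    }
    return 0;
}
```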
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
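
A self-contained toy sketch of the described guard (toy data and names, not the actual main.cpp diff): the assistant message is assembled only from newly sampled, non-EOG tokens, never from prompt-side history.

```cpp
// Toy model of the described fix: assemble the assistant message only from
// newly sampled, non-EOG tokens, never from prompt-side tokens in the sampler history.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static const std::string EOG = "<eog>";   // stand-in for an end-of-generation token

int main() {
    std::vector<std::string> prompt  = {"<|user|>", "Hi"};    // ingested, must not be echoed
    std::vector<std::string> sampled = {"Hello", "!", EOG};   // newly sampled tokens

    std::ostringstream assistant_ss;   // mirrors the role of assistant_ss

    // Prompt ingestion: tokens enter the sampler history (for penalties) but are
    // NOT appended to the assistant message -- appending here was the bug.
    for (const auto & tok : prompt) { (void) tok; }

    // Generation: append only newly sampled tokens that are not end-of-generation.
    for (const auto & tok : sampled) {
        if (tok == EOG) break;
        assistant_ss << tok << ' ';
    }

    std::cout << assistant_ss.str() << "\n";   // "Hello ! " -- no leaked prompt-side piece
    return 0;
}
```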

Fixes #13402.

Signed-off-by: Vinkal Chudgar <[email protected]>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <[email protected]>

---------

Signed-off-by: Vinkal Chudgar <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…witching to nullish coalescing for field values and default placeholders (#16312)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <[email protected]>
* check cuda argsort limits and add test

* add metal check
…rary fails (#16172)

This PR adds additional information to the error message emitted when loading a backend library via ld_load_library() fails. This helps with spotting why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
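
As a hedged illustration of the kind of diagnostic that helps here, the generic POSIX `dlopen`/`dlerror` pattern below surfaces the loader's own explanation; the actual ggml loader wraps this differently, and the library name is an assumption.

```cpp
// Sketch: surface the dynamic loader's diagnostic when a backend library fails to load.
// Generic POSIX dlopen/dlerror pattern for illustration; not the ggml loader code itself.
#include <dlfcn.h>
#include <cstdio>

int main() {
    const char * path = "./libggml-cuda.so";   // illustrative backend library name
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        // dlerror() explains why: missing file, missing dependency, unresolved symbol, ...
        std::fprintf(stderr, "failed to load backend library %s: %s\n", path, dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}
```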
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixed for review comments, file renamed and format

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed for code format, after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* bugfix for build-linux-cross.yml syntax error

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <[email protected]>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
tamarPal and others added 19 commits October 27, 2025 09:20
* sycl: add ROLL operation support

- Implement ggml_sycl_roll function for F32 tensors
- Add multi-axis roll operation with SYCL kernel
- Support all 4 tensor dimensions with proper shift normalization (a 1D reference sketch of the roll semantics follows this list)
- Add roll.cpp and roll.hpp to SYCL backend
- Update backend dispatch and supports_op for GGML_OP_ROLL
- Tests: 17662/17662 pass with identical CPU reference results
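
A hedged 1D reference sketch of roll semantics with shift normalization (negative shifts handled via modulo); the SYCL kernel itself operates on all four ggml dimensions and is organized differently.

```cpp
// 1D reference for roll with shift normalization; the SYCL kernel generalizes this to 4 dims.
#include <cstdio>
#include <vector>

static std::vector<float> roll_1d(const std::vector<float> & x, int shift) {
    const int n = (int) x.size();
    std::vector<float> y(n);
    const int s = ((shift % n) + n) % n;   // normalize shift into [0, n)
    for (int i = 0; i < n; ++i) {
        y[(i + s) % n] = x[i];             // element i moves forward by s, wrapping around
    }
    return y;
}

int main() {
    const std::vector<float> x = {0, 1, 2, 3, 4};
    for (float v : roll_1d(x, -2)) std::printf("%g ", v);   // prints: 2 3 4 0 1
    std::printf("\n");
    return 0;
}
```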

* fix: remove trailing whitespace from roll.cpp

- Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
- Remove trailing spaces from lines 6, 11, 28, 47, 58, 60

* ci: retrigger

* sycl: remove wait() calls from ROLL operation

* fix: editorconfig — LF endings + final newline for roll.hpp

---------

Co-authored-by: tamarPal <[email protected]>
* model : add LightOnOCR-1B model

* add test
* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken (a coordinate-mapping sketch follows this commit)

* fix clang warning
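
A hedged sketch of one common align-corners coordinate mapping and why a size-1 dimension needs special handling (the scale denominator becomes zero); the actual ggml kernels may place the guard and clamp differently.

```cpp
// Sketch: align-corners source-coordinate mapping for 1D interpolation.
// With align-corners, x_src = x_dst * (ne_src - 1) / (ne_dst - 1); when ne_dst == 1
// the denominator is zero, so the scale must be special-cased (and the result clamped).
#include <algorithm>
#include <cstdio>

static float src_coord_align_corners(int x_dst, int ne_src, int ne_dst) {
    const float scale = ne_dst > 1 ? float(ne_src - 1) / float(ne_dst - 1) : 0.0f;   // guard ne_dst == 1
    const float x_src = x_dst * scale;
    return std::min(std::max(x_src, 0.0f), float(ne_src - 1));                       // clamp into range
}

int main() {
    // Upsample a 4-wide row to 7 wide, plus the degenerate ne_dst == 1 case.
    for (int x = 0; x < 7; ++x) std::printf("%.3f ", src_coord_align_corners(x, 4, 7));
    std::printf("\n%.3f\n", src_coord_align_corners(0, 4, 1));   // would be NaN without the guard
    return 0;
}
```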
* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite
* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev
* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.
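
As a hedged reference for what the operation computes, here is a minimal CPU sketch of the sliding-window (depthwise, causal) 1D convolution used by state-space models; the tensor layout and names are illustrative and differ from the ggml/SYCL implementation.

```cpp
// Sketch: depthwise sliding-window 1D convolution (SSM_CONV-style), CPU reference.
// x: [n_channels][n_tokens + d_conv - 1]  (history prepended per channel)
// w: [n_channels][d_conv]                 (one filter per channel)
// y: [n_channels][n_tokens]
// Layout and names are illustrative; the ggml/SYCL kernel organizes data differently.
#include <cstdio>
#include <vector>

static void ssm_conv_ref(const std::vector<std::vector<float>> & x,
                         const std::vector<std::vector<float>> & w,
                         std::vector<std::vector<float>> & y, int d_conv, int n_tokens) {
    for (size_t c = 0; c < x.size(); ++c) {
        for (int t = 0; t < n_tokens; ++t) {
            float acc = 0.0f;
            for (int k = 0; k < d_conv; ++k) {
                acc += w[c][k] * x[c][t + k];   // window of d_conv samples ending at token t
            }
            y[c][t] = acc;
        }
    }
}

int main() {
    const int d_conv = 4, n_tokens = 3, n_channels = 2;
    std::vector<std::vector<float>> x(n_channels, std::vector<float>(n_tokens + d_conv - 1, 1.0f));
    std::vector<std::vector<float>> w(n_channels, std::vector<float>(d_conv, 0.25f));
    std::vector<std::vector<float>> y(n_channels, std::vector<float>(n_tokens, 0.0f));
    ssm_conv_ref(x, w, y, d_conv, n_tokens);
    std::printf("y[0][0] = %.2f\n", y[0][0]);   // 4 * 0.25 * 1.0 = 1.00
    return 0;
}
```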

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <[email protected]>
* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var
* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.
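
As a hedged illustration of the "downstream vector pipelines" use case, the sketch below reads two raw, space-separated embeddings (one per line on stdin) and computes their cosine similarity; the one-embedding-per-line layout is an assumption for the example, not part of this change.

```cpp
// Sketch: consume raw space-separated embeddings and compute cosine similarity.
// Assumes two embeddings, one per line on stdin (illustrative layout).
#include <cmath>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static std::vector<float> parse_line(const std::string & line) {
    std::istringstream iss(line);
    std::vector<float> v;
    float x;
    while (iss >> x) v.push_back(x);
    return v;
}

int main() {
    std::string la, lb;
    std::getline(std::cin, la);
    std::getline(std::cin, lb);
    const std::vector<float> a = parse_line(la), b = parse_line(lb);
    if (a.size() != b.size() || a.empty()) return 1;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    std::cout << dot / (std::sqrt(na) * std::sqrt(nb)) << "\n";   // cosine similarity
    return 0;
}
```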

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions

All critical inference functions show no measurable performance changes between versions:

  • llama_decode: 49ms response time (unchanged) - Primary inference function
  • llama_encode: 12ms response time (unchanged) - Encoder processing
  • llama_tokenize: 833μs response time (unchanged) - Text tokenization
  • llama_model_load_from_file: 332ms response time (unchanged) - Model loading
  • llama_batch_init: 257ns response time (unchanged) - Batch initialization
  • llama_memory_clear: 49ns response time (unchanged) - Memory management

Function Modification Status

Analysis confirms no source code modifications to any critical functions between the compared versions.

Key Performance Indicators Impact Assessment

1. Tokens Per Second

Impact: None

  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)
  • Response times remain stable across all tokenization and inference pathways
  • Based on the reference figure that a 2 ms slower llama_decode reduces tokens per second by about 7%, the unchanged response times maintain baseline throughput

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: 0.0001% increase (306,896 nJ vs 306,896 nJ base; the absolute difference is below the displayed precision)
  • build.bin.libggml-base.so: No change (90,434 nJ)
  • build.bin.libggml-cpu.so: No change (151,692 nJ)
  • build.bin.libggml.so: No change (6,339 nJ)

3. Quantization Efficiency

Impact: None

  • llama_model_quantize function shows no performance changes
  • Quantization-related functions maintain baseline performance
  • No modifications to quantization algorithms or data paths

4. Memory Usage

Impact: None

  • Memory management functions (llama_memory_clear, KV cache operations) unchanged
  • No modifications to memory allocation patterns
  • Batch processing memory efficiency maintained

5. Batch Processing

Impact: None

  • llama_batch_init performance unchanged (257ns)
  • Batch allocation and processing functions maintain baseline metrics
  • No changes to parallel processing efficiency

Root Cause Analysis

The minimal performance variations observed (0.096% in llm_graph_input_out_ids::can_reuse) stem from:

  • Build Environment Factors: Compiler optimization differences or system-level changes
  • Measurement Precision: Sub-nanosecond variations within measurement tolerance
  • Binary Metadata: Different debug information or symbol placement

Action Items

Build Optimization

  • Verify consistent compiler flags between builds (-O3, -march=native)
  • Ensure identical build environment configuration
  • Review linker optimization settings for binary layout consistency

Code Analysis

  • The observed 0.06ns increase in graph parameter validation is within measurement noise
  • No code-level optimizations required for core inference functions
  • Focus performance efforts on higher-impact areas if needed

Monitoring

  • Current performance baseline is stable across all critical functions
  • No regression in core inference pathways
  • Power consumption remains within expected bounds

Conclusion

The analysis reveals no significant performance impact on LLaMA.cpp's core functionality. All critical functions maintain baseline performance, with observed variations falling within measurement precision limits. The inference pipeline efficiency, memory management, and batch processing capabilities remain unchanged, ensuring stable tokens-per-second throughput and power consumption characteristics.
