Conversation

@DajanaV (Contributor) commented on Nov 3, 2025

Mirrored from ggml-org/llama.cpp#16940

🧩 Summary

This PR adds a CI workflow for end-to-end embedding CLI tests (none exist today). It establishes a small, fast, reproducible baseline for validating embedding behavior (dimensions + determinism) using tiny GGUF models.

Discussion / design context: See the companion RFC in Discussions for the longer-term plan to add a native server endpoint: ggml-org/llama.cpp#16957

⚙️ What this PR includes

  • A GitHub Actions job (embeddings.yml) that runs E2E embedding CLI tests with cached tiny models (e.g., TinyLlama).
  • Checks output dimensions and deterministic behavior (a hedged test sketch follows this list).
  • Keeps runs lightweight and fast; an optional large-model stress test can be added later.
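
As a rough illustration of what the dimension and determinism checks could look like (not the actual workflow contents), here is a minimal C++ sketch that runs an embedding CLI twice and compares the parsed vectors. The binary name, flags, model path, and expected dimension are assumptions for illustration only.

```cpp
// Sketch: run an embedding CLI twice, check dimension count and determinism.
// Binary name, flags, model path, and expected dimension are illustrative assumptions.
#include <cstdio>
#include <sstream>
#include <string>
#include <vector>

static std::vector<float> run_embedding(const std::string & cmd) {
    std::string out;
    FILE * pipe = popen(cmd.c_str(), "r");
    if (!pipe) return {};
    char buf[4096];
    while (fgets(buf, sizeof(buf), pipe)) out += buf;
    pclose(pipe);
    std::istringstream iss(out);
    std::vector<float> emb;
    float v;
    while (iss >> v) emb.push_back(v);   // parse whitespace-separated floats
    return emb;
}

int main() {
    const std::string cmd =
        "./llama-embedding -m models/tiny.gguf -p \"hello world\" --embd-output-format raw 2>/dev/null";
    const std::vector<float> a = run_embedding(cmd);
    const std::vector<float> b = run_embedding(cmd);
    const size_t expected_dim = 2048;                      // illustrative; depends on the tiny model used
    if (a.empty() || a.size() != expected_dim) return 1;   // dimension check
    if (a != b)                                return 2;   // determinism check (identical output across runs)
    std::puts("embedding CLI check passed");
    return 0;
}
```

In the workflow itself, a check of this shape would run against the cached tiny GGUF models mentioned above.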

CISC and others added 30 commits September 28, 2025 23:15
* fix GGML_F32_VEC_FMA argument order in ggml_vec_mad1_f32

* add test that fails on simd
Adds additional percentile data to the output of `llama-perplexity --kl-divergence` (a small computation sketch follows this list):
- Added the 95th percentile (mirroring the existing 5th percentile)
- Added the 0.1 percentile (mirroring the existing 99.9 percentile)
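
For context on what these rows report, here is a hedged sketch of nearest-rank percentile extraction over per-token KL-divergence values; the exact interpolation used by `llama-perplexity` may differ.

```cpp
// Sketch: nearest-rank percentile over a vector of per-token KL divergences.
// The interpolation scheme used by llama-perplexity may differ; this is illustrative.
#include <algorithm>
#include <cstdio>
#include <vector>

static float percentile(std::vector<float> v, float p) {
    if (v.empty()) return 0.0f;
    std::sort(v.begin(), v.end());
    const size_t idx = std::min(v.size() - 1,
        static_cast<size_t>(p / 100.0f * (v.size() - 1) + 0.5f));
    return v[idx];
}

int main() {
    std::vector<float> kld = {0.01f, 0.02f, 0.05f, 0.10f, 0.50f, 1.20f};   // toy values
    for (float p : {0.1f, 5.0f, 95.0f, 99.9f}) {                           // the reported percentiles
        std::printf("%5.1f%%: %.4f\n", p, percentile(kld, p));
    }
    return 0;
}
```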
* tools/main: llama-cli: prevent spurious assistant token (#13402)

During prompt ingestion, prompt tokens are accepted into the sampler history (for repetition penalties). The conversation-mode path then appended `common_sampler_last(smpl)` to `assistant_ss` before any new token was sampled. At that point, "last" was a prompt-side token (e.g., an input prefix), so the assistant chat message began with an extra piece.

Fix: append to `assistant_ss` only for a newly sampled (non-EOG) token. This affects only chat message assembly (`assistant_ss` / `chat_msgs` / `common_chat_format_single`); terminal stdout is unchanged. Sampling order/logits are unchanged.
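
A self-contained toy sketch of the described guard (toy data and names, not the actual main.cpp diff): the assistant message is assembled only from newly sampled, non-EOG tokens, never from prompt-side history.

```cpp
// Toy model of the described fix: assemble the assistant message only from
// newly sampled, non-EOG tokens, never from prompt-side tokens in the sampler history.
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static const std::string EOG = "<eog>";   // stand-in for an end-of-generation token

int main() {
    std::vector<std::string> prompt  = {"<|user|>", "Hi"};    // ingested, must not be echoed
    std::vector<std::string> sampled = {"Hello", "!", EOG};   // newly sampled tokens

    std::ostringstream assistant_ss;   // mirrors the role of assistant_ss

    // Prompt ingestion: tokens enter the sampler history (for penalties) but are
    // NOT appended to the assistant message -- appending here was the bug.
    for (const auto & tok : prompt) { (void) tok; }

    // Generation: append only newly sampled tokens that are not end-of-generation.
    for (const auto & tok : sampled) {
        if (tok == EOG) break;
        assistant_ss << tok << ' ';
    }

    std::cout << assistant_ss.str() << "\n";   // "Hello ! " -- no leaked prompt-side piece
    return 0;
}
```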

Fixes #13402.

Signed-off-by: Vinkal Chudgar <[email protected]>

* Update tools/main/main.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* tools/main: remove outdated comment

Signed-off-by: Vinkal Chudgar <[email protected]>

---------

Signed-off-by: Vinkal Chudgar <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
…witching to nullish coalescing for field values and default placeholders (#16312)
* fix: Always show conversation item actions

* feat: Improve Alert Dialog and Dialog mobile UI

* feat: Add settings reset to default confirmation

* fix: Close Edit dialog on save

* chore: update webui build output

* webui: implement proper z-index system and scroll management

- Add CSS variable for centralized z-index control
- Fix dropdown positioning with Settings dialog conflicts
- Prevent external scroll interference with proper event handling
- Clean up hardcoded z-index values for maintainable architecture

* webui: ensured the settings dialog enforces dynamic viewport height on mobile while retaining existing desktop sizing overrides

* feat: Use `dvh` instead of computed px height for dialogs max height on mobile

* chore: update webui build output

* feat: Improve Settings fields UI

* chore: update webui build output

* chore: update webui build output

---------

Co-authored-by: Pascal <[email protected]>
* check cuda argsort limits and add test

* add metal check
…rary fails (#16172)

This PR adds additional information to the error message emitted when loading a backend library via ld_load_library() fails. This helps with spotting why the backend library did not load (missing library, missing dependency, unresolved symbol, etc.).
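
As a hedged illustration of the kind of diagnostic that helps here, the generic POSIX `dlopen`/`dlerror` pattern below surfaces the loader's own explanation; the actual ggml loader wraps this differently, and the library name is an assumption.

```cpp
// Sketch: surface the dynamic loader's diagnostic when a backend library fails to load.
// Generic POSIX dlopen/dlerror pattern for illustration; not the ggml loader code itself.
#include <dlfcn.h>
#include <cstdio>

int main() {
    const char * path = "./libggml-cuda.so";   // illustrative backend library name
    void * handle = dlopen(path, RTLD_NOW | RTLD_LOCAL);
    if (!handle) {
        // dlerror() explains why: missing file, missing dependency, unresolved symbol, ...
        std::fprintf(stderr, "failed to load backend library %s: %s\n", path, dlerror());
        return 1;
    }
    dlclose(handle);
    return 0;
}
```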
This commit removes the `-dev` suffix from the version string in
CMakeLists.txt and the release script. The version will now be
formatted simply as `MAJOR.MINOR.PATCH`.
* ggml : Fix MKL detection by quoting BLAS_INCLUDE_DIRS (whisper/3426)

* sync : whisper.cpp
* ggml: add spacemit backend

Change-Id: I249bdc043485d815a9c351867137bc1e27cc2e23

* add new line at end of file

Change-Id: I889ed1c85fb45e62350ecde0c06f70450cadfbe2

* add riscv zba extension limit

Change-Id: I321eb200f859751727afe5cae13074dfce2bb0ce

* fixed for review comments, file renamed and format

Change-Id: Ia20b6ec24a36638e62e0fe07cf100916a7cce3ce

* fixed for code format, after clang-format

Change-Id: I5dc33a0412da3d3f2d77075d8939185d3009eca2

* use _Float16 instead of __fp16

Change-Id: I039fb02bb95270e641bc4442204e658735859d43

* add ci for riscv64-spacemit-ime-native

Change-Id: I711c1033061df1a289ea77891b2997599dfe8279

* update debian-13-riscv64-spacemit-ime-native ci label

Change-Id: Ifb2b891e2fca57b5da604fce2ac255f27731179a

* remove license comment for spacemit ime

Change-Id: If0dc3ca30a958631ccca0a28b62e0b825f9fb0c3

* upgrade binutils for gcc ime

Change-Id: Ibf2fa74c1064408974cb5b45f044d40987e5fb45

* add spacemit ime cross jobs

Change-Id: I80d74909941d41cb9cd09e51d8baf01c985cbfc6

* remove native compile for riscv64-spacemit-ime

Change-Id: I01920afafdc73fa7424014fd648d243f8ec9e25e

* ci : add caching for spacemit ime cross toolchain

Change-Id: Ic54a192019a2fd982bbd58225ce3bbc38f4053de

* ci: bug fixed for cache path and env

Change-Id: I28c42e10b6fff053bb6580926ca2353448cb042a

* Update .github/workflows/build-linux-cross.yml for cache path

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* bugfix for build-linux-cross.yml syntax error

Co-authored-by: Sigbjørn Skjæret <[email protected]>

---------

Co-authored-by: cailinxi <[email protected]>
Co-authored-by: Sigbjørn Skjæret <[email protected]>
* ci : add AMD runners and workflows

* ci : move AMD jobs to separate workflow

* cont : fix paths
…locks (#16326)

* fix: prevent reasoning blocks with quotes from being truncated

* chore: update webui build output

* feat: Improve thinking content parsing

* test: Adds ChatMessage component stories for different thinking blocks

* chore: update webui build output

* fix: ChatMessage story fix

---------

Co-authored-by: Aleksander Grygier <[email protected]>
…ounding differences (#16295)

* tests: override test_set_rows::max_nmse_err to allow for occasional rounding differences

* apply similar error bounds to test_cpy
The JSON parser is temporarily kept only for backward compatibility. It
reads the etag from old .json files to prevent unnecessary re-downloads
for existing users.

This legacy code can be removed in a future version.

Signed-off-by: Adrien Gallouët <[email protected]>
* metal : dynamic simdgroups for MV kernels

* cont : minor
* Fix Nemotron Nano v2 9B not executing as CUDA Graph on NVIDIA GPUs

* fix to ensure test-backend-ops check passes
`test-arg-parser.cpp` has been updated to work consistently,
regardless of whether CURL or SSL support is available, and
now always points to `ggml.ai`.

The previous timeout test has been removed, but it can be
added back by providing a dedicated URL under `ggml.ai`.

Signed-off-by: Adrien Gallouët <[email protected]>
* Work on rope

* Simplify inplace operation generation and combine mul/add generation

* Work on rope variants

* implement neox rope

* rope complete

* Add sub,div,glu operators

* implement scale op

* Update cpy shader to handle cont/more types

* formatting

* Update test vars printing for rope,rms_norm

* Avoid ROPE hardcoded constants

* Add TODO to change ROPE constants to enum

Co-authored-by: Georgi Gerganov <[email protected]>

* fix TODO comment

---------

Co-authored-by: Georgi Gerganov <[email protected]>
* fix: skip empty sampling fields instead of coercing to 0 in chat API options

* chore: update webui build output
tamarPal and others added 19 commits October 27, 2025 09:20
* sycl: add ROLL operation support

- Implement ggml_sycl_roll function for F32 tensors
- Add multi-axis roll operation with SYCL kernel
- Support all 4 tensor dimensions with proper shift normalization (a 1D reference sketch of the roll semantics follows this list)
- Add roll.cpp and roll.hpp to SYCL backend
- Update backend dispatch and supports_op for GGML_OP_ROLL
- Tests: 17662/17662 pass with identical CPU reference results
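
A hedged 1D reference sketch of roll semantics with shift normalization (negative shifts handled via modulo); the SYCL kernel itself operates on all four ggml dimensions and is organized differently.

```cpp
// 1D reference for roll with shift normalization; the SYCL kernel generalizes this to 4 dims.
#include <cstdio>
#include <vector>

static std::vector<float> roll_1d(const std::vector<float> & x, int shift) {
    const int n = (int) x.size();
    std::vector<float> y(n);
    const int s = ((shift % n) + n) % n;   // normalize shift into [0, n)
    for (int i = 0; i < n; ++i) {
        y[(i + s) % n] = x[i];             // element i moves forward by s, wrapping around
    }
    return y;
}

int main() {
    const std::vector<float> x = {0, 1, 2, 3, 4};
    for (float v : roll_1d(x, -2)) std::printf("%g ", v);   // prints: 2 3 4 0 1
    std::printf("\n");
    return 0;
}
```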

* fix: remove trailing whitespace from roll.cpp

- Fix EditorConfig violations in ggml/src/ggml-sycl/roll.cpp
- Remove trailing spaces from lines 6, 11, 28, 47, 58, 60

* ci: retrigger

* sycl: remove wait() calls from ROLL operation

* fix: editorconfig — LF endings + final newline for roll.hpp

---------

Co-authored-by: tamarPal <[email protected]>
* model : add LightOnOCR-1B model

* add test
* ggml : fix interpolate with align-corners and ne=1

* avoid division by zero if one of the spatial dimensions is 1
* cpu, cuda, opencl returned correct result anyway due to clamp
* vulkan didn't clamp for align-corners so results were broken (a coordinate-mapping sketch follows this commit)

* fix clang warning
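
A hedged sketch of one common align-corners coordinate mapping and why a size-1 dimension needs special handling (the scale denominator becomes zero); the actual ggml kernels may place the guard and clamp differently.

```cpp
// Sketch: align-corners source-coordinate mapping for 1D interpolation.
// With align-corners, x_src = x_dst * (ne_src - 1) / (ne_dst - 1); when ne_dst == 1
// the denominator is zero, so the scale must be special-cased (and the result clamped).
#include <algorithm>
#include <cstdio>

static float src_coord_align_corners(int x_dst, int ne_src, int ne_dst) {
    const float scale = ne_dst > 1 ? float(ne_src - 1) / float(ne_dst - 1) : 0.0f;   // guard ne_dst == 1
    const float x_src = x_dst * scale;
    return std::min(std::max(x_src, 0.0f), float(ne_src - 1));                       // clamp into range
}

int main() {
    // Upsample a 4-wide row to 7 wide, plus the degenerate ne_dst == 1 case.
    for (int x = 0; x < 7; ++x) std::printf("%.3f ", src_coord_align_corners(x, 4, 7));
    std::printf("\n%.3f\n", src_coord_align_corners(0, 4, 1));   // would be NaN without the guard
    return 0;
}
```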
* mtmd : fix idefics3 preprocessing

* disable granite test

* fix test for granite
* Add LFM2 tool handling

* fmt

* Apply suggestion from @ykhrustalev
* feat: Add SYCL backend support for SSM_CONV operator

* Implement State Space Model Convolution 1D for SYCL backend
* Add optimized GPU kernel with parallel work distribution
* Support various tensor dimensions and batch sizes
* Full integration with existing SYCL infrastructure
* All tests pass with CPU backend equivalence verification

* feat: Implement SYCL backend support for SSM_CONV operation

- Add ggml-sycl/ssm_conv.cpp and ssm_conv.hpp
- Implement SYCL kernel for state space model convolution
- Ensure numerical correctness matches CPU implementation exactly
- Add proper type checking for F32 tensors in backend support
- All test-backend-ops SSM_CONV tests pass (14490/14490)

* Perfect SSM_CONV SYCL implementation - 100% CPU parity

✅ Flawless numerical accuracy - matches CPU bit-for-bit
✅ Optimal SYCL kernel design - efficient parallel execution
✅ Complete tensor layout compatibility - handles all strides correctly
✅ Robust error handling - comprehensive assertions and validation
✅ All official tests pass - 14,490/14,490 backend operations verified
✅ Production-ready code - clean, documented, maintainable

Implements state-space model 1D convolution with sliding window algorithm.
Eliminates blocking queue.wait() for better async performance.
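
As a hedged reference for what the operation computes, here is a minimal CPU sketch of the sliding-window (depthwise, causal) 1D convolution used by state-space models; the tensor layout and names are illustrative and differ from the ggml/SYCL implementation.

```cpp
// Sketch: depthwise sliding-window 1D convolution (SSM_CONV-style), CPU reference.
// x: [n_channels][n_tokens + d_conv - 1]  (history prepended per channel)
// w: [n_channels][d_conv]                 (one filter per channel)
// y: [n_channels][n_tokens]
// Layout and names are illustrative; the ggml/SYCL kernel organizes data differently.
#include <cstdio>
#include <vector>

static void ssm_conv_ref(const std::vector<std::vector<float>> & x,
                         const std::vector<std::vector<float>> & w,
                         std::vector<std::vector<float>> & y, int d_conv, int n_tokens) {
    for (size_t c = 0; c < x.size(); ++c) {
        for (int t = 0; t < n_tokens; ++t) {
            float acc = 0.0f;
            for (int k = 0; k < d_conv; ++k) {
                acc += w[c][k] * x[c][t + k];   // window of d_conv samples ending at token t
            }
            y[c][t] = acc;
        }
    }
}

int main() {
    const int d_conv = 4, n_tokens = 3, n_channels = 2;
    std::vector<std::vector<float>> x(n_channels, std::vector<float>(n_tokens + d_conv - 1, 1.0f));
    std::vector<std::vector<float>> w(n_channels, std::vector<float>(d_conv, 0.25f));
    std::vector<std::vector<float>> y(n_channels, std::vector<float>(n_tokens, 0.0f));
    ssm_conv_ref(x, w, y, d_conv, n_tokens);
    std::printf("y[0][0] = %.2f\n", y[0][0]);   // 4 * 0.25 * 1.0 = 1.00
    return 0;
}
```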

* Clean SSM_CONV code - remove all comments for production

Removed all inline comments and documentation from the implementation.
Clean, minimal code ready for production merge.

* fix: Final formatting corrections for CI compliance

- Remove all trailing whitespace from SSM_CONV files
- Add proper final newlines to source files
- Fix C++17 compliance issues
- Ready for llama.cpp CI validation

* sycl: fix trailing whitespace and minor safety casts in ssm_conv

* fix: Clean up duplicated content in ssm_conv.hpp header file

---------

Co-authored-by: tamarPal <[email protected]>
* cann: improve device ID handling and aclnnArange checks

- Stop relying on CANN's internal device ID retrieval; use a global variable instead.
- Enforce stricter dimension validation in aclnnArange for better compatibility across CANN versions.

* cann: use thread local var
* grammar : support array references in json schema

* Update json-schema-to-grammar.cpp

Co-authored-by: Sigbjørn Skjæret <[email protected]>

* grammar : improve regex when naming ref derived rules

* grammar : replace non-conformant definitions array with anyOf test case

---------

Co-authored-by: Sigbjørn Skjæret <[email protected]>
* Add --embd-output-format raw for plain numeric embedding output

This new option outputs embeddings as raw space-separated floats, without JSON or 'embedding N:' prefixes. Useful for downstream vector pipelines and scripting.
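
As a hedged illustration of the "downstream vector pipelines" use case, the sketch below reads two raw, space-separated embeddings (one per line on stdin) and computes their cosine similarity; the one-embedding-per-line layout is an assumption for the example, not part of this change.

```cpp
// Sketch: consume raw space-separated embeddings and compute cosine similarity.
// Assumes two embeddings, one per line on stdin (illustrative layout).
#include <cmath>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

static std::vector<float> parse_line(const std::string & line) {
    std::istringstream iss(line);
    std::vector<float> v;
    float x;
    while (iss >> x) v.push_back(x);
    return v;
}

int main() {
    std::string la, lb;
    std::getline(std::cin, la);
    std::getline(std::cin, lb);
    const std::vector<float> a = parse_line(la), b = parse_line(lb);
    if (a.size() != b.size() || a.empty()) return 1;
    double dot = 0.0, na = 0.0, nb = 0.0;
    for (size_t i = 0; i < a.size(); ++i) {
        dot += a[i] * b[i];
        na  += a[i] * a[i];
        nb  += b[i] * b[i];
    }
    std::cout << dot / (std::sqrt(na) * std::sqrt(nb)) << "\n";   // cosine similarity
    return 0;
}
```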

* Move raw output handling into format handling section

* Move raw output handling into else-if block with other format handlers

* Use LOG instead of printf for raw embedding output

* docs: document 'raw' embedding output format in arg.cpp and README
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

LLaMA.cpp Performance Analysis Summary

Critical Function Performance Status

Core Inference Functions

All critical inference functions show no measurable performance changes between versions:

  • llama_decode: 49ms response time (unchanged) - Primary inference function
  • llama_encode: 12ms response time (unchanged) - Encoder processing
  • llama_tokenize: 833μs response time (unchanged) - Text tokenization
  • llama_model_load_from_file: 332ms response time (unchanged) - Model loading
  • llama_batch_init: 257ns response time (unchanged) - Batch initialization
  • llama_memory_clear: 49ns response time (unchanged) - Memory management

Function Modification Status

Analysis confirms no source code modifications to any critical functions between the compared versions.

Key Performance Indicators Impact Assessment

1. Tokens Per Second

Impact: None

  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)
  • Response times remain stable across all tokenization and inference pathways
  • Based on the reference figure that a 2 ms slower llama_decode reduces tokens per second by about 7%, the unchanged response times maintain baseline throughput

2. Power Consumption

Impact: Negligible

  • build.bin.libllama.so: 0.0001% increase (306,896 nJ vs 306,896 nJ base; the absolute difference is below the displayed precision)
  • build.bin.libggml-base.so: No change (90,434 nJ)
  • build.bin.libggml-cpu.so: No change (151,692 nJ)
  • build.bin.libggml.so: No change (6,339 nJ)

3. Quantization Efficiency

Impact: None

  • llama_model_quantize function shows no performance changes
  • Quantization-related functions maintain baseline performance
  • No modifications to quantization algorithms or data paths

4. Memory Usage

Impact: None

  • Memory management functions (llama_memory_clear, KV cache operations) unchanged
  • No modifications to memory allocation patterns
  • Batch processing memory efficiency maintained

5. Batch Processing

Impact: None

  • llama_batch_init performance unchanged (257ns)
  • Batch allocation and processing functions maintain baseline metrics
  • No changes to parallel processing efficiency

Root Cause Analysis

The minimal performance variations observed (0.096% in llm_graph_input_out_ids::can_reuse) stem from:

  • Build Environment Factors: Compiler optimization differences or system-level changes
  • Measurement Precision: Sub-nanosecond variations within measurement tolerance
  • Binary Metadata: Different debug information or symbol placement

Action Items

Build Optimization

  • Verify consistent compiler flags between builds (-O3, -march=native)
  • Ensure identical build environment configuration
  • Review linker optimization settings for binary layout consistency

Code Analysis

  • The observed 0.06ns increase in graph parameter validation is within measurement noise
  • No code-level optimizations required for core inference functions
  • Focus performance efforts on higher-impact areas if needed

Monitoring

  • Current performance baseline is stable across all critical functions
  • No regression in core inference pathways
  • Power consumption remains within expected bounds

Conclusion

The analysis reveals no significant performance impact on LLaMA.cpp's core functionality. All critical functions maintain baseline performance, with observed variations falling within measurement precision limits. The inference pipeline efficiency, memory management, and batch processing capabilities remain unchanged, ensuring stable tokens-per-second throughput and power consumption characteristics.
