
Conversation

Contributor

@DajanaV DajanaV commented Nov 4, 2025

Mirrored from ggml-org/llama.cpp#17005

Forgot to update the ops I added.
@pwilkin @am17an
ref: ggml-org/llama.cpp#16917
ref: ggml-org/llama.cpp#15635

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 971c7425 compared to baseline 523c96f3 reveals minimal performance variations within statistical noise thresholds. The highest observed changes were in non-core functions with negligible impact on inference performance.

Key Findings

Performance Metrics:

  • Highest Response Time Change: std::codecvt_abstract_base::in() in build.bin.llama-tts with +0.068% increase (29.43 ns vs 29.41 ns baseline)
  • Highest Throughput Change: std::make_unique<llm_graph_input_attn_no_cache>() in build.bin.libllama.so with +0.111% increase (70.34 ns vs 70.26 ns baseline)
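These percentages are plain relative changes between the baseline and current per-call timings. A minimal sanity-check sketch (values copied from the bullets above; the second figure differs slightly from the reported +0.111% because the displayed nanosecond values are rounded):

```python
# Relative change between baseline and current per-call timings.
def relative_change_pct(baseline_ns: float, current_ns: float) -> float:
    return (current_ns - baseline_ns) / baseline_ns * 100.0

print(f"{relative_change_pct(29.41, 29.43):+.3f}%")  # +0.068% -> codecvt_abstract_base::in()
print(f"{relative_change_pct(70.26, 70.34):+.3f}%")  # +0.114% from rounded values (reported +0.111%)
```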

Core Function Impact:
No changes detected in critical inference functions (llama_decode, llama_encode, llama_tokenize). The affected functions are standard library utilities for character encoding and memory allocation, not part of the core tokenization/inference pipeline.

Inference Performance Impact:
Given the reference point that a 2 ms slowdown in llama_decode corresponds to roughly 7% fewer tokens per second, the observed nanosecond-level changes in non-core functions have no measurable impact on tokens-per-second performance for the smollm:135m model on the specified hardware configuration.
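
A minimal back-of-envelope sketch of that claim, assuming the 2 ms / 7% reference describes a simple reciprocal relationship between per-token decode time and tokens per second, and (pessimistically) that the worst-case observed timing delta sat directly on the decode path:

```python
# Infer a plausible per-token decode time from the stated rule of thumb
# (a 2 ms slowdown in llama_decode ~= 7% fewer tokens/s), then compare it
# with the largest timing delta observed in this report.

RULE_SLOWDOWN_MS = 2.0   # reference slowdown
RULE_TPS_LOSS = 0.07     # reference ~7% tokens/s loss

# If 1/(t + 2) = (1 - 0.07) * (1/t), then t = 2 * (1 - 0.07) / 0.07.
baseline_decode_ms = RULE_SLOWDOWN_MS * (1 - RULE_TPS_LOSS) / RULE_TPS_LOSS
print(f"implied baseline decode time: {baseline_decode_ms:.1f} ms/token")  # ~26.6 ms

# Largest observed per-call change: 29.43 ns vs 29.41 ns baseline.
observed_delta_ms = (29.43 - 29.41) * 1e-6  # ns -> ms
tps_loss = observed_delta_ms / (baseline_decode_ms + observed_delta_ms)
print(f"worst-case tokens/s impact: {tps_loss:.1e}")  # ~7.5e-10
```

Even under these worst-case assumptions, the relative impact is on the order of 10⁻⁹, orders of magnitude below run-to-run variance.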

Power Consumption Analysis:
Negligible changes across all binaries:

  • build.bin.libllama.so: 0.0004% reduction (280,665.56 nJ vs 280,666.64 nJ baseline)
  • build.bin.llama-tts: 0.0001% reduction (322,782.38 nJ vs 322,782.77 nJ baseline)
  • All other binaries: No measurable change (0.0%)

Flame Graph and CFG Analysis:
The std::codecvt_abstract_base::in() function shows a single-frame, leaf-node execution pattern with 100% self-time execution. CFG comparison reveals identical assembly code between versions, confirming that timing differences represent measurement variance rather than code changes.

GitHub Code Review:
PR #84 contains only documentation updates to CUDA operations support status. No source code modifications were made, confirming that observed timing variations are unrelated to code changes.

Conclusion:
Version 971c7425 maintains performance parity with the baseline. All observed changes fall within measurement precision limits and do not affect core inference functionality or overall system performance.

@DajanaV DajanaV force-pushed the main branch 16 times, most recently from 40efe8b to 3e9b10f on November 7, 2025 at 11:07
