
Conversation


@DajanaV DajanaV commented Nov 13, 2025

Mirrored from ggml-org/llama.cpp#16618

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to the Developer settings section, directly under
  the model selector option, that toggles tool call visibility (see the sketch below)
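
A minimal TypeScript sketch of the capture, parsing, and fallback behavior described above; the types and function names (`ToolCallDelta`, `StoredToolCall`, `accumulateToolCalls`, `toolCallLabel`) are illustrative assumptions rather than the actual WebUI code:

```ts
// Hypothetical shape of an OpenAI-style streaming tool call fragment;
// the real llama.cpp WebUI types may differ.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

// What gets persisted in chat state: the raw JSON arguments are kept verbatim
// so the badge tooltip and click-to-copy action can expose them unchanged.
interface StoredToolCall {
  id: string;
  name: string;
  arguments: string;
}

// Accumulate streamed fragments into complete payloads, keyed by tool call index.
function accumulateToolCalls(
  acc: Map<number, StoredToolCall>,
  deltas: ToolCallDelta[] = []
): void {
  for (const delta of deltas) {
    const current = acc.get(delta.index) ?? { id: '', name: '', arguments: '' };
    const fn = delta.function;
    if (delta.id) current.id = delta.id;
    if (fn?.name) current.name += fn.name;
    if (fn?.arguments) current.arguments += fn.arguments;
    acc.set(delta.index, current);
  }
}

// Derive the label shown beneath the model info line, falling back to the bare
// function name when the accumulated arguments are not valid JSON.
function toolCallLabel(call: StoredToolCall): string {
  try {
    JSON.parse(call.arguments);
    return `${call.name}(${call.arguments})`;
  } catch {
    return call.name || 'tool call';
  }
}
```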

Closes ggml-org/llama.cpp#16597

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 3545ef8a-43a2-4dc6-bddd-ddbf7cb4fc06 against baseline 80eddb5a-e7bc-47b4-ae95-1998086194aa reveals minimal performance variations with no impact on core inference functions. The changes are primarily related to WebUI tool call visualization features that do not affect the LLaMA.cpp inference engine.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::vector<llm_bigram_spm>::pop_back() with +0.10% increase (67 ns vs 66.87 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_attn_no_cache>() with +0.11% increase (70 ns self-time)
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)

Core Function Impact:
The performance changes affect auxiliary functions in the tokenization subsystem but do not impact critical inference paths. Since core functions like llama_decode show no measurable changes, there is no expected impact on tokens per second performance for inference workloads.

Power Consumption Analysis:
Minimal power consumption changes across binaries:

  • build.bin.libllama.so: +0.001% increase (281,186 nJ vs 281,185 nJ)
  • build.bin.llama-cvector-generator: negligible decrease (reported as -0.0%)
  • Other binaries show negligible changes within measurement precision

Assembly and Control Flow Analysis:
CFG comparison revealed identical assembly code between versions for the affected functions, indicating the 0.06 ns performance difference stems from environmental factors (memory layout, compiler metadata) rather than algorithmic changes.

GitHub Code Review Insights:
The PR implements WebUI tool call visualization features with:

  • Well-architected optional functionality, disabled by default (see the sketch after this list)
  • No modifications to core inference pipelines
  • Minimal overhead limited to UI components when enabled
  • Proper error handling and graceful fallbacks
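
A minimal sketch, assuming a hypothetical showToolCalls settings key, of how the disabled-by-default toggle keeps badge parsing out of the default render path; the names are illustrative, not the actual implementation:

```ts
// Hypothetical developer settings shape; the toggle defaults to off, so badge
// parsing only runs when the user opts in from the Developer settings section.
interface DeveloperSettings {
  showToolCalls: boolean;
}

const defaultSettings: DeveloperSettings = { showToolCalls: false };

// Compute badge labels only when the setting is enabled, with a graceful
// fallback label when a stored payload fails to parse.
function visibleToolCallLabels(
  settings: DeveloperSettings,
  rawToolCalls: string[]
): string[] {
  if (!settings.showToolCalls) return [];
  return rawToolCalls.map((raw) => {
    try {
      const parsed = JSON.parse(raw) as { name?: string };
      return parsed.name ?? 'tool call';
    } catch {
      return 'tool call';
    }
  });
}

// Example: with defaults the result is empty, so no extra UI work happens.
const labels = visibleToolCallLabels(defaultSettings, ['{"name":"get_weather"}']);
console.log(labels); // []
```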

Conclusion:
The analysis shows no meaningful performance impact on the LLaMA.cpp inference engine. The observed micro-variations are within normal measurement variance and do not affect core functionality or inference performance.

@DajanaV DajanaV force-pushed the main branch 6 times, most recently from ef7ca13 to c65ae84 Compare November 14, 2025 15:09
ServeurpersoCom and others added 5 commits November 14, 2025 21:32
…and persistence in chat UI

…atMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>
…atMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>
@DajanaV DajanaV force-pushed the upstream-PR16618-branch_ServeurpersoCom-harmony-toolcall-debug-option branch from 0ba18eb to 73e4023 Compare November 14, 2025 20:36
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version d25c62ba-0d04-4c19-bfbd-c5c6f09619dc against baseline 1cdba291-d66d-4e7a-b133-996d29ab9acc reveals minimal performance variations within measurement tolerance. The highest percentage changes occur in non-core functions with negligible absolute impact.

Performance Metrics

Highest Response Time Change:

  • Function: llm_graph_input_out_ids::can_reuse() (+0.096%, +0.063 ns)
  • Absolute values: 65.164 ns vs 65.101 ns

Highest Throughput Change:

  • Function: std::make_unique<llm_graph_input_attn_no_cache>() (+0.111%, +0.078 ns)
  • Absolute values: 70.342 ns vs 70.264 ns

Power Consumption Analysis:
All binaries maintain identical power consumption profiles with 0.0% change across libllama.so, libggml.so, and all executable binaries, indicating stable energy efficiency.

Core Function Impact Assessment

No Core Function Changes Detected:

  • Critical inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications
  • Memory management functions (llama_memory_*) remain unchanged
  • Model processing functions (llama_model_*) exhibit no variations
  • Token processing pipeline maintains identical performance characteristics

Inference Performance Impact:
Given the reference that a 2 ms slowdown in llama_decode reduces tokens per second by 7% on the test configuration (smollm:135m, Intel i7-1255U), the observed sub-nanosecond changes in non-core functions will have no measurable impact on inference throughput.
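
As a rough sanity check (assuming the 7% figure corresponds to a 2 ms increase in per-token decode time on that configuration), the implied baseline decode time and the relative size of the observed change are:

$$
r = \frac{\Delta t}{t + \Delta t} = 0.07,\qquad \Delta t = 2\ \text{ms} \;\Rightarrow\; t = \frac{\Delta t\,(1 - r)}{r} \approx 26.6\ \text{ms per token}
$$

$$
\frac{0.063\ \text{ns}}{26.6\ \text{ms}} \approx 2.4 \times 10^{-9} \approx 0.00000024\%\ \text{of a single decode call}
$$

Even accumulated over thousands of decode calls, a change of this size stays far below the measurement resolution of a tokens-per-second benchmark.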

Technical Analysis

Flame Graph Analysis:
The affected function exhibits a single-node execution pattern with 65 ns self-contained execution, indicating optimal call structure with no complex dependencies.

CFG Comparison:
Identical assembly code across versions confirms that performance variations stem from measurement noise rather than functional changes.

Code Review Findings:
GitHub analysis reveals the changes are purely frontend WebUI enhancements for tool call visualization, completely separate from the C++ performance metrics being measured.

Conclusion

The performance analysis indicates stable system behavior with variations well within measurement tolerance. No actionable optimizations are required as the detected changes represent measurement variance rather than functional regressions.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 57daa6aa-3505-412f-8494-7c7ee25cfb88 compared to baseline 1cdba291-d66d-4e7a-b133-996d29ab9acc reveals minimal performance variations with no meaningful impact on inference capabilities. The changes are limited to WebUI enhancements for tool call visualization with no modifications to core C++ inference components.

Key Findings

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse shows +0.096% (+0.06 ns) increase
  • Throughput Time: std::__detail::_Executor::_M_match_multiline shows -0.109% (-0.04 ns) improvement

Core Function Impact Assessment:
The functions showing performance changes are not part of the core inference pipeline. Critical functions such as llama_decode(), llama_encode(), and llama_tokenize() remain unaffected. Based on the reference that a 2 ms slowdown in llama_decode() reduces tokens per second by 7%, the observed nanosecond-level changes (0.06 ns) represent a negligible impact on inference throughput.

Power Consumption Analysis:
All 15 analyzed binaries show stable power consumption (0.0% change), including core libraries build.bin.libllama.so (280.7 kJ) and build.bin.libggml.so. No energy efficiency regression detected across the system.

Flame Graph and CFG Analysis:
The can_reuse function exhibits identical assembly code between versions with single-node execution (65 ns self-time). CFG comparison reveals no structural or instruction-level differences, confirming the 0.06 ns variation represents measurement noise rather than algorithmic changes.

GitHub Code Review Insights:
PR #200 implements WebUI tool call visualization features with no C++ core modifications. Changes are purely frontend enhancements (TypeScript/Svelte) for debugging and diagnostic purposes, maintaining API compatibility and system stability.

Conclusion:
The performance variations detected are within measurement noise tolerance and unrelated to the actual code changes. The WebUI enhancements provide valuable debugging capabilities without impacting inference performance or system efficiency.

@DajanaV DajanaV force-pushed the main branch 10 times, most recently from 0f3e62f to a483926 Compare November 15, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 2baff0f to 92ef8cd Compare November 26, 2025 14:09