
Conversation


@DajanaV DajanaV commented Nov 13, 2025

Mirrored from ggml-org/llama.cpp#16618

- Purely visual and diagnostic change, no effect on model context, prompt
  construction, or inference behavior

- Captured assistant tool call payloads during streaming and non-streaming
  completions, and persisted them in chat state and storage for downstream use

- Exposed parsed tool call labels beneath the assistant's model info line
  with graceful fallback when parsing fails

- Added tool call badges beneath assistant responses that expose JSON tooltips
  and copy their payloads when clicked, matching the existing model badge styling

- Added a user-facing setting to the Developer settings section, directly under
  the model selector option, that toggles tool call visibility (see the sketch below)
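
A minimal TypeScript sketch of the capture, parsing, and fallback behavior described above; the types and function names (`ToolCallDelta`, `StoredToolCall`, `accumulateToolCalls`, `toolCallLabel`) are illustrative assumptions rather than the actual WebUI code:

```ts
// Hypothetical shape of an OpenAI-style streaming tool call fragment;
// the real llama.cpp WebUI types may differ.
interface ToolCallDelta {
  index: number;
  id?: string;
  function?: { name?: string; arguments?: string };
}

// What gets persisted in chat state: the raw JSON arguments are kept verbatim
// so the badge tooltip and click-to-copy action can expose them unchanged.
interface StoredToolCall {
  id: string;
  name: string;
  arguments: string;
}

// Accumulate streamed fragments into complete payloads, keyed by tool call index.
function accumulateToolCalls(
  acc: Map<number, StoredToolCall>,
  deltas: ToolCallDelta[] = []
): void {
  for (const delta of deltas) {
    const current = acc.get(delta.index) ?? { id: '', name: '', arguments: '' };
    const fn = delta.function;
    if (delta.id) current.id = delta.id;
    if (fn?.name) current.name += fn.name;
    if (fn?.arguments) current.arguments += fn.arguments;
    acc.set(delta.index, current);
  }
}

// Derive the label shown beneath the model info line, falling back to the bare
// function name when the accumulated arguments are not valid JSON.
function toolCallLabel(call: StoredToolCall): string {
  try {
    JSON.parse(call.arguments);
    return `${call.name}(${call.arguments})`;
  } catch {
    return call.name || 'tool call';
  }
}
```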

Closes ggml-org/llama.cpp#16597

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 3545ef8a-43a2-4dc6-bddd-ddbf7cb4fc06 against baseline 80eddb5a-e7bc-47b4-ae95-1998086194aa reveals minimal performance variations with no impact on core inference functions. The changes are primarily related to WebUI tool call visualization features that do not affect the LLaMA.cpp inference engine.

Key Findings

Performance Metrics:

  • Highest Response Time change: std::vector<llm_bigram_spm>::pop_back() with +0.10% increase (67 ns vs 66.87 ns)
  • Highest Throughput change: std::make_unique<llm_graph_input_attn_no_cache>() with +0.11% increase (70 ns self-time)
  • No changes detected in core inference functions (llama_decode, llama_encode, llama_tokenize)

Core Function Impact:
The performance changes affect auxiliary functions in the tokenization subsystem but do not impact critical inference paths. Since core functions like llama_decode show no measurable changes, there is no expected impact on tokens per second performance for inference workloads.

Power Consumption Analysis:
Minimal power consumption changes across binaries:

  • build.bin.libllama.so: +0.001% increase (281,186 nJ vs 281,185 nJ)
  • build.bin.llama-cvector-generator: negligible decrease (reported as -0.0%)
  • Other binaries show negligible changes within measurement precision

Assembly and Control Flow Analysis:
CFG comparison revealed identical assembly code between versions for the affected functions, indicating the 0.06 ns performance difference stems from environmental factors (memory layout, compiler metadata) rather than algorithmic changes.

GitHub Code Review Insights:
The PR implements WebUI tool call visualization features with:

  • Well-architected optional functionality, disabled by default (see the sketch after this list)
  • No modifications to core inference pipelines
  • Minimal overhead limited to UI components when enabled
  • Proper error handling and graceful fallbacks
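
A minimal sketch, assuming a hypothetical showToolCalls settings key, of how the disabled-by-default toggle keeps badge parsing out of the default render path; the names are illustrative, not the actual implementation:

```ts
// Hypothetical developer settings shape; the toggle defaults to off, so badge
// parsing only runs when the user opts in from the Developer settings section.
interface DeveloperSettings {
  showToolCalls: boolean;
}

const defaultSettings: DeveloperSettings = { showToolCalls: false };

// Compute badge labels only when the setting is enabled, with a graceful
// fallback label when a stored payload fails to parse.
function visibleToolCallLabels(
  settings: DeveloperSettings,
  rawToolCalls: string[]
): string[] {
  if (!settings.showToolCalls) return [];
  return rawToolCalls.map((raw) => {
    try {
      const parsed = JSON.parse(raw) as { name?: string };
      return parsed.name ?? 'tool call';
    } catch {
      return 'tool call';
    }
  });
}

// Example: with defaults the result is empty, so no extra UI work happens.
const labels = visibleToolCallLabels(defaultSettings, ['{"name":"get_weather"}']);
console.log(labels); // []
```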

Conclusion:
The analysis shows no meaningful performance impact on the LLaMA.cpp inference engine. The observed micro-variations are within normal measurement variance and do not affect core functionality or inference performance.

@DajanaV DajanaV force-pushed the main branch 6 times, most recently from ef7ca13 to c65ae84 Compare November 14, 2025 15:09
ServeurpersoCom and others added 5 commits November 14, 2025 21:32
…and persistence in chat UI

…atMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>
…atMessageAssistant.svelte

Co-authored-by: Aleksander Grygier <[email protected]>
@DajanaV DajanaV force-pushed the upstream-PR16618-branch_ServeurpersoCom-harmony-toolcall-debug-option branch from 0ba18eb to 73e4023 Compare November 14, 2025 20:36
@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version d25c62ba-0d04-4c19-bfbd-c5c6f09619dc against baseline 1cdba291-d66d-4e7a-b133-996d29ab9acc reveals minimal performance variations within measurement tolerance. The highest percentage changes occur in non-core functions with negligible absolute impact.

Performance Metrics

Highest Response Time Change:

  • Function: llm_graph_input_out_ids::can_reuse() (+0.096%, +0.063 ns)
  • Absolute values: 65.164 ns vs 65.101 ns

Highest Throughput Change:

  • Function: std::make_unique<llm_graph_input_attn_no_cache>() (+0.111%, +0.078 ns)
  • Absolute values: 70.342 ns vs 70.264 ns

Power Consumption Analysis:
All binaries maintain identical power consumption profiles with 0.0% change across libllama.so, libggml.so, and all executable binaries, indicating stable energy efficiency.

Core Function Impact Assessment

No Core Function Changes Detected:

  • Critical inference functions (llama_decode, llama_encode, llama_tokenize) show no modifications
  • Memory management functions (llama_memory_*) remain unchanged
  • Model processing functions (llama_model_*) exhibit no variations
  • Token processing pipeline maintains identical performance characteristics

Inference Performance Impact:
Given the reference that a 2 ms slowdown in llama_decode reduces tokens per second by 7% on the test configuration (smollm:135m, Intel i7-1255U), the observed sub-nanosecond changes in non-core functions will have no measurable impact on inference throughput.
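
As a rough sanity check (assuming the 7% figure corresponds to a 2 ms increase in per-token decode time on that configuration), the implied baseline decode time and the relative size of the observed change are:

$$
r = \frac{\Delta t}{t + \Delta t} = 0.07,\qquad \Delta t = 2\ \text{ms} \;\Rightarrow\; t = \frac{\Delta t\,(1 - r)}{r} \approx 26.6\ \text{ms per token}
$$

$$
\frac{0.063\ \text{ns}}{26.6\ \text{ms}} \approx 2.4 \times 10^{-9} \approx 0.00000024\%\ \text{of a single decode call}
$$

Even accumulated over thousands of decode calls, a change of this size stays far below the measurement resolution of a tokens-per-second benchmark.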

Technical Analysis

Flame Graph Analysis:
The affected function exhibits a single-node execution pattern with 65 ns self-contained execution, indicating optimal call structure with no complex dependencies.

CFG Comparison:
Identical assembly code across versions confirms that performance variations stem from measurement noise rather than functional changes.

Code Review Findings:
GitHub analysis reveals the changes are purely frontend WebUI enhancements for tool call visualization, completely separate from the C++ performance metrics being measured.

Conclusion

The performance analysis indicates stable system behavior with variations well within measurement tolerance. No actionable optimizations are required as the detected changes represent measurement variance rather than functional regressions.

@loci-agentic-ai

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 57daa6aa-3505-412f-8494-7c7ee25cfb88 compared to baseline 1cdba291-d66d-4e7a-b133-996d29ab9acc reveals minimal performance variations with no meaningful impact on inference capabilities. The changes are limited to WebUI enhancements for tool call visualization with no modifications to core C++ inference components.

Key Findings

Highest Performance Changes:

  • Response Time: llm_graph_input_out_ids::can_reuse shows +0.096% (+0.06 ns) increase
  • Throughput Time: std::__detail::_Executor::_M_match_multiline shows -0.109% (-0.04 ns) improvement

Core Function Impact Assessment:
The functions showing performance changes are not part of the core inference pipeline. Critical functions such as llama_decode(), llama_encode(), and llama_tokenize() remain unaffected. Based on the reference that a 2 ms slowdown in llama_decode() reduces tokens per second by 7%, the observed nanosecond-level changes (0.06 ns) represent a negligible impact on inference throughput.

Power Consumption Analysis:
All 15 analyzed binaries show stable power consumption (0.0% change), including core libraries build.bin.libllama.so (280.7 kJ) and build.bin.libggml.so. No energy efficiency regression detected across the system.

Flame Graph and CFG Analysis:
The can_reuse function exhibits identical assembly code between versions with single-node execution (65 ns self-time). CFG comparison reveals no structural or instruction-level differences, confirming the 0.06 ns variation represents measurement noise rather than algorithmic changes.

GitHub Code Review Insights:
PR #200 implements WebUI tool call visualization features with no C++ core modifications. Changes are purely frontend enhancements (TypeScript/Svelte) for debugging and diagnostic purposes, maintaining API compatibility and system stability.

Conclusion:
The performance variations detected are within measurement noise tolerance and unrelated to the actual code changes. The WebUI enhancements provide valuable debugging capabilities without impacting inference performance or system efficiency.

@DajanaV DajanaV force-pushed the main branch 10 times, most recently from 0f3e62f to a483926 Compare November 15, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 2baff0f to 92ef8cd Compare November 26, 2025 14:09