
UPSTREAM PR #17554: New llama-run #349

Open
loci-dev wants to merge 1 commit into main from upstream-PR17554-branch_ericcurtin-llama-server-chat

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17554

  • Added readline.cpp include
  • Created run_chat_mode():
    • Initializes readline with command history
    • Maintains conversation history
    • Applies chat templates to format messages
    • Submits completion tasks to the server queue
    • Displays assistant responses interactively
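The flow in the bullets above can be sketched as a minimal chat loop. This is purely illustrative: `apply_chat_template` and `submit_completion` are hypothetical stand-ins, not the actual llama.cpp or llama-server APIs.

```python
# Minimal sketch of the run_chat_mode() flow described above.
# All names (apply_chat_template, submit_completion) are illustrative
# stand-ins, not the actual llama.cpp / llama-server APIs.

def apply_chat_template(messages):
    """Flatten the conversation history into a single prompt string."""
    parts = [f"<|{m['role']}|>\n{m['content']}" for m in messages]
    return "\n".join(parts) + "\n<|assistant|>\n"

def submit_completion(prompt):
    """Stand-in for posting a completion task to the server queue."""
    return f"(reply to {len(prompt)}-char prompt)"

def chat_turn(history, user_input):
    """One interactive turn: record input, format, submit, record reply."""
    history.append({"role": "user", "content": user_input})
    reply = submit_completion(apply_chat_template(history))
    history.append({"role": "assistant", "content": reply})
    return reply

history = []
chat_turn(history, "hello")
```

Each turn re-applies the template to the full history, which is the "maintains conversation history" behavior the bullets describe.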

@loci-review

loci-review bot commented Nov 28, 2025

Explore the complete analysis inside the Version Insights

Based on the PR analysis, this change represents a major architectural refactoring of the llama-run tool. The code has been restructured from a monolithic implementation into a modular design that reuses llama-server infrastructure, with the addition of a new readline library for interactive input handling.

Key Findings

Performance-Critical Areas:
The refactoring extracts server-context.h (2474 lines) from server.cpp and introduces run-chat.cpp (210 lines) for interactive chat functionality. The core inference path remains unchanged as the same server_context infrastructure is used. No modifications were made to llama_decode, llama_encode, or llama_tokenize functions, meaning token generation throughput is unaffected.

Inference Impact:
No impact on tokens per second. The inference pipeline functions (llama_decode, llama_encode, llama_tokenize) show no changes in response time or throughput. The refactoring affects only the application layer structure and user interaction handling, not the model execution path.

Power Consumption:
The llama-run binary shows changes due to code reorganization and the replacement of linenoise with readline.cpp. However, since the core inference functions remain identical and no changes were made to the computation-intensive paths, power consumption during model inference remains constant. The architectural changes affect only initialization and I/O handling.

Code Structure:
The PR removes 5994 lines and adds 4316 lines across 20 files. The main changes include extracting server-context.h as a reusable component, replacing linenoise.cpp (1995 lines) with readline.cpp (multiple smaller files totaling approximately 1400 lines), and creating dedicated run-chat.cpp for interactive mode. The run.cpp file was reduced from approximately 1400 lines to 95 lines by delegating functionality to the extracted components.

@loci-dev loci-dev force-pushed the main branch 14 times, most recently from 1854a53 to 1b177fe on November 30, 2025 at 15:08
- Added readline.cpp include
- Created run_chat_mode():
  - Initializes readline with command history
  - Maintains conversation history
  - Applies chat templates to format messages
  - Submits completion tasks to the server queue
  - Displays assistant responses interactively

Signed-off-by: Eric Curtin <eric.curtin@docker.com>
@loci-dev loci-dev force-pushed the upstream-PR17554-branch_ericcurtin-llama-server-chat branch from a449065 to cded4e3 on November 30, 2025 at 15:34

loci-review bot commented Nov 30, 2025

Explore the complete analysis inside the Version Insights

Performance Review Summary: PR #349 - New llama-run

Project: llama.cpp
Change Scope: 19 files modified, major architectural refactoring
Binary: build.bin.llama-run


Overview

This PR refactors llama-run from a standalone implementation to a server-infrastructure-based architecture. The changes replace 1,381 lines in run.cpp with 86 lines that delegate to server_context, add run-chat.cpp (178 lines) for interactive functionality, and replace linenoise with readline.cpp library (1,419 lines). The refactoring introduces server-style request handling with JSON serialization and streaming response parsing.


Key Findings

Performance-Critical Areas Impact

main() Function:

  • Response time increased by 268 ms (from 217 ms to 485 ms)
  • Throughput increased by 523 ns (from 284 ns to 808 ns)
  • The response time increase stems from server_context initialization, task queue setup, and thread management overhead

JSON Processing Functions:

  • to_chars: throughput increased by 19 ns (from 27 ns to 46 ns)
  • grisu2: throughput increased by 24 ns (from 39 ns to 63 ns)
  • handle_value: throughput increased by 17 ns (from 33 ns to 50 ns)
  • sub: throughput increased by 13 ns (from 25 ns to 38 ns)

The new architecture uses nlohmann::json for message handling where the old code used direct string manipulation. Each user message and assistant response now requires JSON serialization, and streaming responses require SSE format parsing with JSON deserialization per token.
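The per-token parsing cost described above can be illustrated with a small sketch of SSE stream handling: every streamed chunk arrives as a `data: ...` line whose JSON payload must be deserialized. The payload shape here is an assumption for illustration, not the server's actual response schema.

```python
import json

# Sketch of the per-token overhead described above: each streamed
# chunk is an SSE "data: ..." line requiring one json.loads() call.
# The {"content": ...} payload shape is illustrative, not the
# actual llama-server schema.

def parse_sse_stream(lines):
    tokens = []
    for line in lines:
        if not line.startswith("data: "):
            continue  # skip comments, blank keep-alive lines, etc.
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        tokens.append(json.loads(payload)["content"])  # per-token parse
    return "".join(tokens)

stream = [
    'data: {"content": "Hel"}',
    'data: {"content": "lo"}',
    "data: [DONE]",
]
print(parse_sse_stream(stream))  # → Hello
```

The point of the sketch is that deserialization work scales with the number of generated tokens, which matches the measured per-operation increases in the JSON formatting functions.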

Hash Table Operations:

  • _M_bucket_index: throughput increased by 29 ns (from 28 ns to 57 ns), response time increased by 37 ns (from 43 ns to 80 ns)

This indicates increased hash table usage in server infrastructure for request routing and parameter lookups.

Inference Performance Impact

Tokenization/Inference Functions:
No direct changes to llama_decode, llama_encode, or llama_tokenize functions were detected in this PR. These functions are now called indirectly through server_context and server_routes layers. The architectural change adds overhead in request preparation and response handling but does not modify the core inference loop. Therefore, tokens per second should remain largely unchanged for the generation phase itself, though the end-to-end latency increases due to initialization overhead.

Power Consumption Analysis

build.bin.llama-run:

  • Power consumption increased by 43.16%
  • The increase is driven by the 268 ms response time increase in main() and the additional computational overhead from JSON processing functions
  • The server infrastructure maintains persistent task queues and thread pools, contributing to baseline power consumption even during idle periods
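The "persistent task queues and thread pools" point can be illustrated with a minimal worker-pool sketch (purely illustrative, not llama-server's actual implementation): worker threads outlive individual requests and block on the queue even when no work arrives, which is the idle baseline the review refers to.

```python
import queue
import threading

# Illustrative sketch (not llama-server's actual code): a persistent
# worker pool whose threads outlive individual requests. Even with no
# tasks queued, the threads exist and block on q.get() — the idle
# baseline cost the review mentions.

q = queue.Queue()
results = []

def worker():
    while True:
        task = q.get()          # blocks while idle
        if task is None:        # shutdown sentinel
            break
        results.append(task * 2)
        q.task_done()

threads = [threading.Thread(target=worker) for _ in range(2)]
for t in threads:
    t.start()

for task in (1, 2, 3):
    q.put(task)
q.join()                        # wait until all queued tasks finish

for _ in threads:
    q.put(None)                 # stop workers
for t in threads:
    t.join()

print(sorted(results))          # → [2, 4, 6]
```

A standalone tool pays this pool-management cost only for the duration of one run, whereas a server design keeps it resident, which is why baseline consumption rises even though per-token inference is unchanged.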

Summary

The refactoring achieves code consolidation and maintainability improvements by reusing server infrastructure. The 268 ms increase in main() response time represents the cost of server initialization and abstraction layers, and JSON processing adds 13-24 ns per operation across formatting functions. The core inference functions remain unmodified, preserving generation performance. Power consumption increases by 43.16% due to the cumulative effect of initialization overhead and persistent thread management.

@loci-dev loci-dev force-pushed the main branch 10 times, most recently from a2a0d0e to 8c4a3c3 on December 2, 2025 at 00:36
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 806b364 to ca4155f on December 5, 2025 at 21:07