
UPSTREAM PR #17595: server: move server-context to its own cpp|h #364

Open
loci-dev wants to merge 8 commits into main from upstream-PR17595-branch_ngxson-xsn/create_server_context

Conversation

@loci-dev

Mirrored from ggml-org/llama.cpp#17595

Extracted part of the changes in ggml-org/llama.cpp#17554 into this dedicated PR, so that if something goes wrong it's easier to trace back.

Compared to the approach proposed in the mentioned PR, which simply moves everything to the .h file, this PR does some extra things:

  • Move the code in a dedicated commit using git mv, so that auto-merge can handle it better (I hope so; will need to test)
  • Expose only a subset of the infrastructure via server-context.h; for example, server_slot is now a private implementation detail
  • Simplify the public API of server_context, consolidating everything into 4 main functions: init(), load_model(), start_loop(), terminate()
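The dedicated-move commit described in the first bullet can be sketched as follows; this is an illustrative scratch-repo demo of the git mv technique, not the PR's actual paths or commit messages:

```shell
# Sketch: perform the file move as its own commit, with no content
# edits mixed in, so git records it as a pure rename (R100) that
# merge machinery and `git log --follow` can track.
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git config user.email demo@example.com
git config user.name demo

echo 'int main() { return 0; }' > server.cpp
git add server.cpp
git commit -qm "add server.cpp"

# The dedicated move commit: a pure rename, nothing else.
git mv server.cpp server-context.cpp
git commit -qm "server: move server.cpp to server-context.cpp (pure move)"

# git reports the change as a 100% rename.
git show --format= --name-status HEAD
```

Because the move commit contains no edits, git's default rename detection scores it as an exact (R100) rename, which is what makes later automated merges across the move "more happy".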

This should allow easier integration of the server inside the CLI, while allowing downstream projects to incorporate the server as a library (cc @bandoti; probably a precursor to llamax)

@loci-review

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary: PR #364

Overview

This PR performs a pure code refactoring that extracts server context management from server.cpp into dedicated server-context.cpp and server-context.h files. The changes reorganize 3,610 lines of code without modifying any algorithmic logic or performance-critical paths.

Performance Impact

Analysis across all 16 binaries shows zero measurable performance impact:

  • Response Time: No changes detected
  • Throughput Time: No changes detected
  • Power Consumption: 0.0% change across all binaries

The refactoring does not touch inference or tokenization functions (llama_decode, llama_encode, llama_tokenize). No functions within the Performance-Critical Areas (Model Processing, Token Processing, Memory Management, Batch Processing) were modified. The changes are limited to code organization and API surface definition.

Tokens per Second Impact: None. The inference pipeline remains unchanged as no tokenization or decoding functions were modified.

Power Consumption: All binaries maintain identical power consumption profiles, confirming the refactoring produces functionally equivalent machine code.

This is a maintenance-focused change that improves code organization without affecting runtime characteristics.

@loci-dev loci-dev force-pushed the upstream-PR17595-branch_ngxson-xsn/create_server_context branch from 22039aa to 239c7a2 Compare November 29, 2025 18:40
@loci-review

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #364

Overview

This PR performs a code refactoring that extracts server context implementation into dedicated files (server-context.cpp and server-context.h). The changes reorganize 3,719 lines of code without modifying any algorithmic logic or performance-critical execution paths.

Performance Impact

Zero measurable performance impact detected across all metrics:

  • Response Time: No function-level changes detected
  • Throughput Time: No function-level changes detected
  • Power Consumption: 0.0% change across all 16 binaries
    • libllama.so: +0.50 nJ absolute (within measurement noise)
    • All other binaries: 0.00 nJ delta

Inference Performance: No impact on tokens per second. The refactoring does not modify any tokenization or inference functions (llama_decode, llama_encode, llama_tokenize). All performance-critical paths remain unchanged.

Code Changes

The PR implements architectural improvements through the pimpl idiom, moving implementation details from server.cpp (reduced from 3,671 to 15 lines) into server-context.cpp. The public API is simplified to four methods: init(), load_model(), start_loop(), and terminate(). Internal structures (server_slot, server_metrics) are now encapsulated as private implementation details. This is purely a code organization change with identical compiled output.

@loci-review

loci-review bot commented Nov 29, 2025

Explore the complete analysis inside the Version Insights

Performance Analysis Summary - PR #364

Analysis Type: Code Refactoring
Scope: Server infrastructure reorganization
Performance Impact: None

Summary

This PR implements a pure code refactoring that extracts 3,619 lines from server.cpp into dedicated server-context.cpp and server-context.h files using the Pimpl design pattern. The change reorganizes server context management without modifying any algorithmic logic or data structures.

Performance measurements show zero impact across all metrics. No functions within the Performance-Critical Areas (Model Processing, Token Processing, Memory Management, Batch Processing) were modified. The refactoring affects only server infrastructure code responsible for HTTP request handling and task queue management, which operates outside the inference pipeline.

Power consumption analysis confirms negligible variance (< 0.001%) across all binaries, with the maximum observed change being 1.09 nJ in libllama.so, well within measurement noise.

Inference Impact: None. Token processing functions (llama_decode, llama_encode, llama_tokenize) remain unchanged. No impact on tokens per second throughput.

@loci-dev loci-dev force-pushed the main branch 14 times, most recently from 82b1c0b to 8c7587c Compare December 1, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from df48f9e to cb46586 Compare December 6, 2025 12:13