UPSTREAM PR #17216: server: split HTTP into its own interface #208

Open
DajanaV wants to merge 17 commits into main from upstream-PR17216-branch_ngxson-xsn/split_http_server_context

Conversation

Collaborator

@DajanaV DajanaV commented Nov 14, 2025

Mirrored from ggml-org/llama.cpp#17216

Fix ggml-org/llama.cpp#16488

How it works:

```mermaid
sequenceDiagram
    participant User
    participant server_http_context
    participant server_http_req
    participant handler
    participant server_http_res

    User->>server_http_context: request
    server_http_context->>server_http_req: create request
    server_http_req->>handler: dispatch
    handler->>server_http_res: create response

    loop for each result
        server_http_res->>server_http_context: response chunk
        server_http_context->>User: response chunk
        server_http_context->>server_http_res: next()
    end

    server_http_res->>server_http_context: terminate
    server_http_context->>User: close connection
```
  • Each endpoint handler returns a server_res_generator, a class derived from server_http_res
  • The server_res_generator operates in one of two modes: stream or non-stream
    • In non-stream mode, the data is simply returned to the user in a single response
    • In stream mode, server_res_generator::next() is called until it returns false; each call to next() yields a new chunk of data

TODO:

  • fix error handling
  • add exception handler at server_routes level

Testing:

  • passed the automated tests.sh
  • tested normal usage with the web UI (including multimodal input)
  • tested usage with the web UI under concurrent requests and random interruptions

@loci-review

loci-review bot commented Nov 14, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 5a826a3f-1520-4c95-b83c-f7d1e9a34016 compared to baseline db6b13d2-721e-4c28-bff5-4329efb2d26b reveals minimal performance changes. The highest percentage change occurred in the linenoiseBeep function with a 0.17% Response Time improvement (0.13 ns reduction from 76 ns to 75 ns) and 0.21% Throughput improvement.

Key Findings

Performance Metrics:

  • Highest Response Time Change: linenoiseBeep function (-0.17%, -0.13 ns)
  • Highest Throughput Change: linenoiseBeep function (-0.21%, -0.12 ns)
  • Core Function Impact: None of the critical inference functions (llama_decode, llama_encode, llama_tokenize) show performance changes
  • Tokens per Second Impact: No impact expected as tokenization/inference functions remain unchanged

Power Consumption Analysis:
All binaries show negligible power consumption changes (±0.0%):

  • build.bin.libllama.so: -0.0% (280,731 nJ)
  • build.bin.llama-run: +0.0% (282,850 nJ)
  • All other binaries: 0.0% change

Flame Graph Analysis:
The linenoiseBeep function exhibits a simple linear execution structure with 75 time units total execution cost. The function makes two sequential GLIBC system calls (fputc and fflush) with 81% of execution time spent in function overhead rather than I/O operations.

CFG Comparison:
Control flow graphs show identical structure between versions with no assembly code differences. The 0.13 ns improvement stems from build-time or runtime environmental factors rather than code modifications.

GitHub Code Review:
The PR implements a major HTTP server interface refactoring, introducing server_http_context abstraction and generator-pattern streaming responses. The changes enhance code maintainability and separation of concerns without affecting core inference performance.

Conclusion:
The observed performance changes are within normal measurement variance and represent stable build characteristics. The architectural improvements in HTTP handling provide better code organization without impacting inference performance.

@DajanaV DajanaV force-pushed the main branch 10 times, most recently from 0f3e62f to a483926 Compare November 15, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 3163acc to 409b78f Compare November 26, 2025 22:08

2 participants