UPSTREAM PR #17216: server: split HTTP into its own interface #208

Open
DajanaV wants to merge 17 commits into main from upstream-PR17216-branch_ngxson-xsn/split_http_server_context

Conversation

Collaborator

@DajanaV DajanaV commented Nov 14, 2025

Mirrored from ggml-org/llama.cpp#17216

Fix ggml-org/llama.cpp#16488

How it works:

```mermaid
sequenceDiagram
    participant User
    participant server_http_context
    participant server_http_req
    participant handler
    participant server_http_res

    User->>server_http_context: request
    server_http_context->>server_http_req: create request
    server_http_req->>handler: dispatch
    handler->>server_http_res: create response

    loop for each result
        server_http_res->>server_http_context: response chunk
        server_http_context->>User: response chunk
        server_http_context->>server_http_res: next()
    end

    server_http_res->>server_http_context: terminate
    server_http_context->>User: close connection
```
  • Each endpoint handler returns a server_res_generator, a class derived from server_http_res
  • The server_res_generator operates in one of two modes: stream or non-stream
    • In non-stream mode, the data is simply returned to the user in a single response
    • In stream mode, server_res_generator::next() is called until it returns false; each call to next() yields a new chunk of data

TODO:

  • fix error handling
  • add exception handler at server_routes level

Testing:

  • passed the automated tests.sh
  • tested normal usage with the web UI (including multimodal input)
  • tested usage with the web UI under concurrent requests and random interruptions

@loci-review

loci-review bot commented Nov 14, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary

Overview

Analysis of version 5a826a3f-1520-4c95-b83c-f7d1e9a34016 compared to baseline db6b13d2-721e-4c28-bff5-4329efb2d26b reveals minimal performance changes. The highest percentage change occurred in the linenoiseBeep function with a 0.17% Response Time improvement (0.13 ns reduction from 76 ns to 75 ns) and 0.21% Throughput improvement.

Key Findings

Performance Metrics:

  • Highest Response Time Change: linenoiseBeep function (-0.17%, -0.13 ns)
  • Highest Throughput Change: linenoiseBeep function (-0.21%, -0.12 ns)
  • Core Function Impact: None of the critical inference functions (llama_decode, llama_encode, llama_tokenize) show performance changes
  • Tokens per Second Impact: No impact expected as tokenization/inference functions remain unchanged

Power Consumption Analysis:
All binaries show negligible power consumption changes (±0.0%):

  • build.bin.libllama.so: -0.0% (280,731 nJ)
  • build.bin.llama-run: +0.0% (282,850 nJ)
  • All other binaries: 0.0% change

Flame Graph Analysis:
The linenoiseBeep function exhibits a simple linear execution structure with 75 time units total execution cost. The function makes two sequential GLIBC system calls (fputc and fflush) with 81% of execution time spent in function overhead rather than I/O operations.

CFG Comparison:
Control flow graphs show identical structure between versions with no assembly code differences. The 0.13 ns improvement stems from build-time or runtime environmental factors rather than code modifications.

GitHub Code Review:
The PR implements a major HTTP server interface refactoring, introducing server_http_context abstraction and generator-pattern streaming responses. The changes enhance code maintainability and separation of concerns without affecting core inference performance.

Conclusion:
The observed performance changes are within normal measurement variance and represent stable build characteristics. The architectural improvements in HTTP handling provide better code organization without impacting inference performance.

@DajanaV DajanaV force-pushed the main branch 10 times, most recently from 0f3e62f to a483926 Compare November 15, 2025 21:07
@loci-dev loci-dev force-pushed the main branch 30 times, most recently from 3163acc to 409b78f Compare November 26, 2025 22:08

2 participants