UPSTREAM PR #17072: [RFC] ggml: new backend for API Remoting #114

Open
DajanaV wants to merge 2 commits into main from upstream-PR17072-branch_kpouget-up/remoting

Conversation


@DajanaV DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17072

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend/backend pair allows escaping VM isolation with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation that intercepts GGML API calls and forwards them to the virt-gpu virtual device
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.

The code is currently a POC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-rpc. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.
  • The supports_op method is implemented in a hacky way: I've copied the ggml-metal definition into the frontend library and exposed the few properties required to compute it from the ggml-metal backend. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).

Here is the context behind this PR:

(image omitted)


loci-review bot commented Nov 7, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: API Remoting Backend Integration

Overview

Pull Request #114 introduces a comprehensive API Remoting backend system for VM-based GPU acceleration, adding 7,423 lines across 55 files. The changes implement virtualization-aware transport layers enabling native GPU performance in containerized environments.

Key Findings

Performance Improvements:

  • Highest Response Time Change: std::basic_string::operator[] improved by 12.6% (151 ns → 132 ns)
  • Highest Throughput Change: std::basic_string::back() improved by 17.6% (108 ns → 89 ns)
  • Impact Assessment: These are C++ standard library string operations, not core LLaMA inference functions

Core Function Impact:
No direct changes to critical inference functions (llama_decode, llama_encode, llama_tokenize). The performance improvements are indirect effects from build system optimizations and binary layout reorganization.

Tokens Per Second Impact:
Zero impact on inference throughput. The optimized functions are utility string operations used in text processing pipelines, not the primary inference path. Core tokenization and model processing functions remain unchanged.

Power Consumption Analysis:
Minimal power impact across binaries:

  • build.bin.libggml.so: 0.081% reduction (5636 nJ → 5632 nJ)
  • All other binaries show negligible changes
  • Total system power consumption effectively unchanged

Flame Graph & CFG Analysis:
The improved string operations show optimized memory addressing patterns with better string literal placement. CFG comparison reveals identical control flow structures with enhanced assembly-level optimizations in assertion handling paths (44.9% improvement in debug infrastructure).

Code Review Insights:
The remoting implementation introduces sophisticated buffer management and serialization mechanisms. The architecture maintains clean separation between frontend/backend with minimal impact on existing GGML code paths.

Recommendations

Buffer Management: Implement comprehensive bounds checking in tensor deserialization to prevent memory corruption across VM boundaries.

Error Handling: Add robust error propagation mechanisms for cross-VM communication failures.

The changes successfully achieve VM GPU acceleration goals while delivering beneficial side effects through build optimizations, with no negative impact on core inference performance.

@DajanaV DajanaV force-pushed the main branch 26 times, most recently from 6aa5dc2 to 81cedf2 Compare November 10, 2025 16:10
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 35c840d to 0f3e62f Compare November 15, 2025 20:08
