UPSTREAM PR #17072: [RFC] ggml: new backend for API Remoting #114

Open
DajanaV wants to merge 2 commits into main from upstream-PR17072-branch_kpouget-up/remoting

Conversation


@DajanaV DajanaV commented Nov 7, 2025

Mirrored from ggml-org/llama.cpp#17072

Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting frontend/backend pair allows escaping VM isolation with the help of virt-gpu paravirtualization (and the virglrenderer library on the host side).

  • ggml-remotingfrontend is a GGML API implementation that intercepts GGML API calls and forwards them to the virt-gpu virtual device
  • ggml-remotingbackend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.

The code is currently a POC; I will refine it after the first round of feedback.

  • Some serialization functions have been borrowed from ggml-rpc. The overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.
  • The supports_op method is implemented in a hacky way: I've copied the ggml-metal definition into the frontend library and exposed the few properties required to compute it from the ggml-metal backend. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).

Here is the context behind this PR:

(image omitted)


loci-review bot commented Nov 7, 2025

Access the complete analysis in the LOCI Dashboard

Performance Analysis Summary: API Remoting Backend Integration

Overview

Pull Request #114 introduces a comprehensive API Remoting backend system for VM-based GPU acceleration, adding 7,423 lines across 55 files. The changes implement virtualization-aware transport layers enabling native GPU performance in containerized environments.

Key Findings

Performance Improvements:

  • Highest Response Time Change: std::basic_string::operator[] improved by 12.6% (151 ns → 132 ns)
  • Highest Throughput Change: std::basic_string::back() improved by 17.6% (108 ns → 89 ns)
  • Impact Assessment: These are C++ standard library string operations, not core LLaMA inference functions

Core Function Impact:
No direct changes to critical inference functions (llama_decode, llama_encode, llama_tokenize). The performance improvements are indirect effects from build system optimizations and binary layout reorganization.

Tokens Per Second Impact:
Zero impact on inference throughput. The optimized functions are utility string operations used in text processing pipelines, not the primary inference path. Core tokenization and model processing functions remain unchanged.

Power Consumption Analysis:
Minimal power impact across binaries:

  • build.bin.libggml.so: 0.081% reduction (5636 nJ → 5632 nJ)
  • All other binaries show negligible changes
  • Total system power consumption effectively unchanged

Flame Graph & CFG Analysis:
The improved string operations show optimized memory addressing patterns with better string literal placement. CFG comparison reveals identical control flow structures with enhanced assembly-level optimizations in assertion handling paths (44.9% improvement in debug infrastructure).

Code Review Insights:
The remoting implementation introduces sophisticated buffer management and serialization mechanisms. The architecture maintains clean separation between frontend/backend with minimal impact on existing GGML code paths.

Recommendations

Buffer Management: Implement comprehensive bounds checking in tensor deserialization to prevent memory corruption across VM boundaries.

Error Handling: Add robust error propagation mechanisms for cross-VM communication failures.

The changes successfully achieve VM GPU acceleration goals while delivering beneficial side effects through build optimizations, with no negative impact on core inference performance.

@DajanaV DajanaV force-pushed the main branch 26 times, most recently from 6aa5dc2 to 81cedf2 Compare November 10, 2025 16:10
@DajanaV DajanaV force-pushed the main branch 30 times, most recently from 35c840d to 0f3e62f Compare November 15, 2025 20:08
