UPSTREAM PR #17072: [RFC] ggml: new backend for API Remoting (#114)
Conversation
Access the complete analysis in the LOCI Dashboard.

Performance Analysis Summary: API Remoting Backend Integration

Overview: Pull Request #114 introduces a comprehensive API Remoting backend system for VM-based GPU acceleration, adding 7,423 lines across 55 files. The changes implement virtualization-aware transport layers enabling native GPU performance in containerized environments.

Key Findings (section details elided in this mirror):
- Performance Improvements
- Core Function Impact
- Tokens Per Second Impact
- Power Consumption Analysis
- Flame Graph & CFG Analysis
- Code Review Insights

Recommendations:
- Buffer Management: implement comprehensive bounds checking in tensor deserialization to prevent memory corruption across VM boundaries.
- Error Handling: add robust error propagation mechanisms for cross-VM communication failures.

The changes successfully achieve VM GPU acceleration goals while delivering beneficial side effects through build optimizations, with no negative impact on core inference performance.
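The bounds-checking recommendation above can be sketched as follows. This is a minimal illustration, not the PR's actual wire format: the `tensor_wire_header` layout and the `validate_tensor_header` helper are hypothetical names invented for this example.

```cpp
#include <cstdint>
#include <cstddef>

// Hypothetical wire header for a serialized tensor crossing the VM boundary.
struct tensor_wire_header {
    uint32_t n_dims;     // number of dimensions; GGML tensors use 1..4
    uint64_t ne[4];      // element count per dimension
    uint64_t data_size;  // payload size claimed by the guest
};

// Validate a guest-supplied header before touching the payload, instead of
// trusting the sizes it claims. Returns false on any inconsistency.
bool validate_tensor_header(const tensor_wire_header &h,
                            size_t buf_remaining,
                            size_t elem_size) {
    if (h.n_dims == 0 || h.n_dims > 4) return false;
    if (elem_size == 0) return false;
    uint64_t n_elems = 1;
    for (uint32_t i = 0; i < h.n_dims; ++i) {
        if (h.ne[i] == 0) return false;
        if (h.ne[i] > UINT64_MAX / n_elems) return false; // overflow guard
        n_elems *= h.ne[i];
    }
    if (n_elems > UINT64_MAX / elem_size) return false;   // overflow guard
    const uint64_t expected = n_elems * elem_size;
    if (expected != h.data_size) return false;            // size mismatch
    if (h.data_size > buf_remaining) return false;        // out of bounds
    return true;
}
```

Rejecting the message on any mismatch (rather than clamping) keeps a compromised guest from steering the host-side backend into reading or writing outside the shared buffer.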
Mirrored from ggml-org/llama.cpp#17072
Hello, I would like to discuss whether this work could be integrated into the llama.cpp codebase.

The API Remoting backend/frontend allows escaping the VM isolation, with the help of the virt-gpu paravirtualization (and the virglrenderer library on the host side).

- The ggml-remoting frontend is a GGML API implementation, which intercepts the GGML API calls and forwards them to the virt-gpu virtual device.
- The ggml-remoting backend is a library loaded by virglrenderer (a PR will be opened soon for discussion), which opens a GGML library and forwards the calls received from virglrenderer.

The code is currently a POC; I will refine it after the first round of feedback.
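The frontend/backend split described above can be illustrated with a minimal command-forwarding shape. Everything here is hypothetical: the command IDs, the length-prefixed framing, and the `encode_cmd`/`decode_cmd_header` helpers are stand-ins for the PR's actual protocol, and a `std::vector` stands in for the virt-gpu transport buffer.

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// Hypothetical command identifiers for GGML calls intercepted by the frontend.
enum remoting_cmd : uint32_t {
    CMD_BUFFER_ALLOC  = 1,
    CMD_TENSOR_SET    = 2,
    CMD_GRAPH_COMPUTE = 3,
};

// Guest side: frame a command as [cmd:u32][size:u32][payload] and hand the
// bytes to the transport (here, just a returned vector).
std::vector<uint8_t> encode_cmd(remoting_cmd cmd, const void *payload, uint32_t size) {
    std::vector<uint8_t> out(8 + size);
    std::memcpy(out.data(), &cmd, 4);
    std::memcpy(out.data() + 4, &size, 4);
    if (size) std::memcpy(out.data() + 8, payload, size);
    return out;
}

// Host side: parse the header the backend would dispatch on, checking that the
// claimed payload size matches what actually arrived.
bool decode_cmd_header(const std::vector<uint8_t> &buf,
                       remoting_cmd &cmd, uint32_t &size) {
    if (buf.size() < 8) return false;
    std::memcpy(&cmd, buf.data(), 4);
    std::memcpy(&size, buf.data() + 4, 4);
    return buf.size() == 8 + static_cast<size_t>(size);
}
```

Because guest and host share memory through the virt-gpu device, a real implementation can pass large tensor payloads by reference into the shared buffer rather than copying them into the command stream, which is the "virtualization-aware transport" advantage over a socket-based design.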
The design is similar to ggml-RPC: the overall idea is the same, but the transport layer is virtualization-aware, which helps limit buffer copies.

The supports_op method is implemented in a hacky way: I've copied the ggml-metal definition into the frontend library, and I expose the few properties required to compute it from the ggml-metal backend. IIRC, this was only needed for the micro-benchmark to work correctly (ggml-rpc simply returns true to avoid this bottleneck).

Here is the context behind this PR: