Conversation
Hello, in this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported. Will this feature in llama.cpp offer wider GGUF coverage via something like the parameter mapping described there? A few other questions:

Thank you for your work!
Hi @SearchSavior,
Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming we map/translate all the GGML operators to OpenVINO.)
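To make the idea of op-level translation concrete, here is a toy sketch in Python. This is not the actual backend code (which lives in ggml/src/ggml-openvino and targets the OpenVINO C++ API); the op names, the tuple-based "nodes", and the tiny graph are all made up for illustration.

```python
# Toy sketch of op-level translation: each GGML op maps to a builder for
# the equivalent target-runtime op. The real backend walks GGML cgraph
# nodes and constructs OpenVINO nodes instead of these tuples.

# Translation table: GGML op name -> builder for the target op.
TRANSLATORS = {
    "GGML_OP_ADD":     lambda a, b: ("ov::Add", a, b),
    "GGML_OP_MUL":     lambda a, b: ("ov::Multiply", a, b),
    "GGML_OP_MUL_MAT": lambda a, b: ("ov::MatMul", a, b),
}

def translate_graph(ggml_nodes):
    """Walk a (toy) GGML graph in topological order, translating node by node."""
    translated = {}
    for name, op, inputs in ggml_nodes:
        if op == "INPUT":
            translated[name] = ("ov::Parameter", name)
            continue
        builder = TRANSLATORS.get(op)
        if builder is None:
            # An unmapped op means the architecture is not supported yet.
            raise NotImplementedError(f"no OpenVINO translation for {op}")
        translated[name] = builder(*(translated[i] for i in inputs))
    return translated

# A tiny "graph": out = matmul(x, w) + b
graph = [
    ("x",   "INPUT",           []),
    ("w",   "INPUT",           []),
    ("b",   "INPUT",           []),
    ("xw",  "GGML_OP_MUL_MAT", ["x", "w"]),
    ("out", "GGML_OP_ADD",     ["xw", "b"]),
]
result = translate_graph(graph)
print(result["out"][0])  # ov::Add
```

Because the mapping is per-operation rather than per-architecture, any model whose graph consists only of mapped ops translates automatically; an unmapped op fails fast instead of silently producing a broken graph.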
The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.
The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.
We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm the integration details.
Hey @ravi9, thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.
I can't wait for OpenVINO support to get upstreamed.
Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header.
.github/workflows/build.yml (outdated):

```
sudo mkdir -p /opt/intel
wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino

# ...

- name: Build
  id: cmake_build
  run: |
    source /opt/intel/openvino/setupvars.sh
```
Please cache this similarly to the vulkan and spacemit SDKs:

- llama.cpp/.github/workflows/build.yml, lines 449 to 466 in 8415f61
- llama.cpp/.github/workflows/build-cache.yml, lines 26 to 38 in 8415f61
- llama.cpp/.github/actions/linux-setup-vulkan/action.yml, lines 14 to 20 in 8415f61

(add `type: z` for gzip)
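For illustration only, a generic cache-then-install pattern could look roughly like the sketch below. The step names, paths, and cache key here are hypothetical and use the stock actions/cache action; the llama.cpp workflows referenced above use the repo's own setup actions and cache layout, so the actual fix should mirror those.

```yaml
- name: Cache OpenVINO toolkit
  id: cache-openvino
  uses: actions/cache@v4
  with:
    path: /opt/intel/openvino_${{ env.OPENVINO_VERSION_MAJOR }}
    key: openvino-${{ env.OPENVINO_VERSION_FULL }}-x86_64

- name: Install OpenVINO
  if: steps.cache-openvino.outputs.cache-hit != 'true'
  run: |
    # download, extract, and install only on a cache miss
    ...
```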
@slaren We have a fix to support Ubuntu 25.04, will update soon.
@slaren: Could you try again? We fixed CMakeLists.txt to resolve the TBB issue.
Thanks. I was able to build it now, but I get different exceptions when trying to run it.
@slaren Thanks for testing.
@CISC Regarding #20446 (comment), are you able to approve this workflow now?
Ah, no, it's gone on new PRs now. :(
ggerganov left a comment:
@ravi9 I've disabled the test-llama-archs test as it was failing. I recommend trying to get it fixed as a priority, because this will increase confidence that the OpenVINO backend is general enough to support all LLMs.
Waiting for CI to pass and merging.
Hello. Intel Arc A770. Command:
Hi @savvadesogle, I just tested it on an Intel Core Ultra Series 2 with 32GB RAM and it runs fine. Thank you for testing this PR. We will continue to improve this backend with more validation on more hardware and extensive testing.
I tried it on my B580 inside a container, but llama.cpp didn't detect my GPU:
Hi @WizardlyBump17, if you plan to test it now, use this openvino.Dockerfile. You could also verify whether the GPU is detected by running
Hi @ravi9 |






Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference with the existing llama.cpp GGUF model ecosystem, and enables performance improvements via OpenVINO's graph compilation and kernel fusion.

Key Features:

- New backend implementation in ggml/src/ggml-openvino.
- Supported precisions
- Supported devices
- For NPU: prompt processing is currently slow; a smaller context size is recommended for better performance, e.g., -c 512.
- For llama-bench: -fa 1 is required.

Tested Models

The following models are validated for functionality. Accuracy and performance are WIP.

- Llama-3.2-1B-Instruct-GGUF
- Llama-3.1-8B-Instruct
- microsoft/Phi-3-mini-4k-instruct-gguf
- Qwen/Qwen2.5-1.5B-Instruct-GGUF
- Qwen/Qwen3-8B
- openbmb/MiniCPM-1B-sft-bf16
- tencent/Hunyuan-7B-Instruct
- mistralai/Mistral-7B-Instruct-v0.3

Work in Progress
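As a rough intuition for what the kernel fusion mentioned above buys, here is a toy Python pass over a list of ops. This is purely illustrative and not how OpenVINO implements fusion; the op names and node format are invented for the example.

```python
# Toy fusion pass: collapse a MUL immediately followed by an ADD that
# consumes its result into one fused multiply-add (FMA) node, so the
# intermediate tensor never has to be materialized in memory.
# Simplifying assumption: the MUL result has no other consumers.

def fuse_mul_add(ops):
    """ops: list of (name, op, inputs). Returns a new list with MUL+ADD fused."""
    fused, skip = [], set()
    for i, (name, op, inputs) in enumerate(ops):
        if i in skip:
            continue
        nxt = ops[i + 1] if i + 1 < len(ops) else None
        if (op == "MUL" and nxt is not None and nxt[1] == "ADD"
                and name in nxt[2]):
            other = [x for x in nxt[2] if x != name]
            # Fused node takes the MUL inputs plus the ADD's other operand.
            fused.append((nxt[0], "FMA", inputs + other))
            skip.add(i + 1)
        else:
            fused.append((name, op, inputs))
    return fused

ops = [
    ("t0", "MUL",  ["a", "b"]),   # t0 = a * b
    ("t1", "ADD",  ["t0", "c"]),  # t1 = t0 + c  -> fuses into FMA(a, b, c)
    ("t2", "RELU", ["t1"]),
]
print(fuse_mul_add(ops))
```

Fewer nodes means fewer kernel launches and fewer intermediate buffers, which is where much of the speedup from graph compilation comes from.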
Notes on quantization support

- CPU
- GPU
- NPU

Other notes:

NOTE: optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).
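For readers unfamiliar with the int8 asymmetric channel-wise scheme mentioned in the note, the basic math is sketched below in pure Python. This is a simplified illustration; optimum-intel's actual quantization config, integer range, and rounding details may differ.

```python
# Asymmetric, channel-wise int8 quantization: each output channel (row)
# gets its own scale and zero-point derived from that row's min/max,
# which preserves accuracy better than one scale for the whole tensor.

QMIN, QMAX = 0, 255  # one common asymmetric 8-bit range (uint8-style)

def quantize_row(row):
    lo, hi = min(row), max(row)
    scale = (hi - lo) / (QMAX - QMIN) or 1.0  # avoid div-by-zero on flat rows
    zero_point = round(QMIN - lo / scale)
    q = [max(QMIN, min(QMAX, round(x / scale) + zero_point)) for x in row]
    return q, scale, zero_point

def dequantize_row(q, scale, zero_point):
    return [(v - zero_point) * scale for v in q]

weights = [[0.10, -0.30, 0.25], [1.5, 2.0, 1.75]]  # 2 channels, per-row params
for row in weights:
    q, s, zp = quantize_row(row)
    deq = dequantize_row(q, s, zp)
    # round-trip error per element is bounded by roughly scale/2
    assert all(abs(a - b) <= s for a, b in zip(row, deq))
```

Because the zero-point shifts the range, asymmetric quantization handles tensors whose values are not centered on zero (like token embeddings) with less wasted range than the symmetric variant.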