
Add OpenVINO backend #15307

Merged
ggerganov merged 320 commits into ggml-org:master from ravi9:dev_backend_openvino
Mar 14, 2026

Conversation

@wine99
Contributor

@wine99 wine99 commented Aug 14, 2025

Overview

This PR introduces an OpenVINO backend for llama.cpp, enabling hardware-accelerated inference on Intel® CPUs, GPUs, and NPUs. The backend leverages OpenVINO to deliver optimized inference with the existing llama.cpp GGUF model ecosystem, and it enables performance improvements via OpenVINO’s graph compilation and kernel fusion.

Key Features:

  • New backend implementation

    • Added OpenVINO backend in ggml/src/ggml-openvino.
    • Implemented translations for core GGML operations
  • Supported precisions

    • FP16/BF16 GGUF models supported.
    • Q4_0, Q4_1, Q4_K_M, Q6_K models partially supported. (See notes below)
  • Supported devices

    • Intel CPUs
    • Intel integrated and discrete GPUs
    • Intel NPUs (requires UD32+ driver).

For NPU: prompt processing is currently slow; a smaller context size is recommended for better performance, e.g., -c 512.

For llama-bench: -fa 1 is required.

Tested Models

The following models are validated for functionality.

Accuracy and performance are WIP.

Work in Progress

  • Performance and memory optimizations
  • Broader quantization coverage.
  • Support for additional model architectures.
  • Extensive accuracy testing.

Notes on quantization support

CPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are converted to int8 gs16 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

GPU

  • Q4_0, Q4_1, Q4_K_M and Q6_K models are supported.
  • Q6_K tensors (6-bit gs16 sym) are requantized to int8 gs32 sym.
  • Q5_K tensors (5-bit gs32 asym) are converted to int8 gs32 asym.

NPU

  • Main quantization scheme for the supported models in this PR is Q4_0.
  • Q4_0 and Q4_1 tensors are requantized to int4 gs128 sym.
  • Q6_K tensors are dequantized to fp16.

Other notes:

  • Both Q4_0 and Q4_1 models use Q6_K for the token_embedding tensor and for the weight tensor in the last matmul (in most models this is the same tensor as token_emb).
  • Q4_0 models will contain some Q4_1 tensors if an imatrix is provided during quantization of the model with the llama-quantize utility.
  • Q4_K_M models additionally have Q6_K tensors and Q5_K tensors (the latter only in Phi3 among the models validated in this PR).

NOTE: Optimum-intel converts the fp16/bf16 token embedding tensor and the weight tensor in the last matmul to int8 asym channel-wise (config code).
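For readers unfamiliar with the shorthand above, here is an illustrative sketch (not the backend's actual code) of what a layout like "int8 gs16 sym" means: weights are split into groups of 16, each group shares one scale, and the zero-point is fixed at 0 (symmetric). Group size 16 vs 32 only changes `group_size` below.

```python
def quantize_int8_sym(weights, group_size=16):
    """Quantize a flat list of floats to int8 values plus per-group scales."""
    assert len(weights) % group_size == 0
    q, scales = [], []
    for i in range(0, len(weights), group_size):
        group = weights[i:i + group_size]
        amax = max(abs(w) for w in group)        # symmetric range [-amax, amax]
        scale = amax / 127.0 if amax > 0 else 1.0
        scales.append(scale)
        q.extend(max(-127, min(127, round(w / scale))) for w in group)
    return q, scales

def dequantize_int8_sym(q, scales, group_size=16):
    """Reconstruct approximate floats from int8 values and per-group scales."""
    return [q[i] * scales[i // group_size] for i in range(len(q))]
```

Under this picture, requantizing e.g. from gs16 to gs32 (as on GPU) simply means regrouping the weights and recomputing one scale per 32-element group.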

@wine99 wine99 marked this pull request as draft August 14, 2025 09:09
@github-actions github-actions bot added the documentation, testing, devops and ggml labels Aug 14, 2025
@SearchSavior

SearchSavior commented Aug 19, 2025

Hello,

In this repo https://github.com/yangsu2022/GGUF-to-OpenVINO and the article https://blog.openvino.ai/blog-posts/openvino-genai-supports-gguf-models, only a small set of models is supported.

Will this feature in llama.cpp offer wider gguf coverage via something like the parameter mapping described here,

https://github.com/yangsu2022/GGUF-to-OpenVINO/blob/405a95e300f8307fb4b779a12d46cf86adf5a441/convert_llama3.1_gguf_to_torch.py#L14

A few other questions:

  • What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

  • Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

  • Pipeline parallelism is coming in the next release (I think); will that be implemented here for heterogeneous execution in llama.cpp?

Thank you for your work!

@ravi9
Contributor

ravi9 commented Aug 21, 2025

Hi @SearchSavior ,

Q: Will this feature in llama.cpp offer wider GGUF coverage via something like parameter mapping?

Instead of converting GGUF models to PyTorch format with parameter mapping, this implementation uses OpenVINO's GGML frontend to directly translate GGML computation graphs to OpenVINO operations at runtime. The translation happens through a comprehensive operation-mapping system that covers the core GGML operations. Since it works at the GGML operation level, it should support any model architecture that llama.cpp supports (assuming all GGML operators are mapped/translated to OpenVINO).
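The dispatch pattern described above can be sketched as a toy example: a table maps GGML op names to translator functions, and the compute graph is walked node by node. All names here are illustrative; the real frontend lives in C++ under ggml/src/ggml-openvino and emits OpenVINO operations, not strings.

```python
# Hypothetical translators; real ones build ov::Node graphs.
def translate_add(node):
    return f"Add({node['src'][0]}, {node['src'][1]})"

def translate_mul_mat(node):
    return f"MatMul({node['src'][0]}, {node['src'][1]})"

# Op-name -> translator table; the real table has one entry per supported GGML op.
OP_TRANSLATORS = {
    "ADD": translate_add,
    "MUL_MAT": translate_mul_mat,
}

def translate_graph(nodes):
    """Walk a list of GGML-like nodes; unmapped ops fail loudly,
    analogous to the frontend's NotImplementedFailure."""
    converted = []
    for node in nodes:
        translator = OP_TRANSLATORS.get(node["op"])
        if translator is None:
            raise NotImplementedError(f"{node['op']} is not implemented for this FrontEnd")
        converted.append(translator(node))
    return converted
```

This is why coverage hinges on the operator mapping: any architecture whose graph only uses mapped ops translates for free, while an unmapped op aborts the conversion.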

Q: What parts of the OpenVINO feature set are intended to be brought into llama.cpp?

The immediate focus is on runtime acceleration: kernel fusion, optimized graph execution, memory optimizations, and hardware scheduling on CPU, GPU, and NPU.

Q: Is this PR trying to bring in only performance from the OpenVINO runtime to support the llama.cpp use case?

The scope of this PR is primarily performance enablement using OpenVINO runtime to accelerate llama.cpp inference while preserving compatibility with the GGUF ecosystem. It’s not introducing a new model conversion flow, so everything remains driven by GGUF models in llama.cpp.

Q: Will pipeline parallel / heterogeneous execution be supported here?

We are currently reviewing this. llama.cpp already has infrastructure for pipeline parallelism, and the OpenVINO backend exposes async operations and events, so it should be possible. Further evaluation is needed to confirm integration details.

@SearchSavior

Hey @ravi9 ,

Thanks for the detailed answer. It's nice to see more serious work bringing OpenVINO to the rest of the ecosystem.

@Bionic-Squash

I can't wait for OpenVINO support to get upstreamed

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from e180b86 to 80f0969 on September 5, 2025 08:36
@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from 76ab76e to 2e1dd8d on September 28, 2025 03:25
@wine99 wine99 force-pushed the dev_backend_openvino branch from e727c65 to 66e503b on October 11, 2025 05:45
@slaren
Member

slaren commented Oct 11, 2025

Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@wine99 wine99 marked this pull request as ready for review October 14, 2025 00:03
@wine99 wine99 requested review from CISC and slaren as code owners October 14, 2025 00:03
Comment on lines +691 to +703
sudo mkdir -p /opt/intel
wget -O openvino_${OPENVINO_VERSION_MAJOR}.tgz https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz
tar -xf openvino_${OPENVINO_VERSION_MAJOR}.tgz
sudo mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
rm openvino_${OPENVINO_VERSION_MAJOR}.tgz
cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR}
echo "Y" | sudo -E ./install_dependencies/install_openvino_dependencies.sh && cd -
sudo ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino

- name: Build
  id: cmake_build
  run: |
    source /opt/intel/openvino/setupvars.sh
Member

Please cache this similarly to vulkan and spacemit SDKs:

- name: Use Vulkan SDK Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}
- name: Build
  id: cmake_build
  run: |
    source ./vulkan_sdk/setup-env.sh

- name: Setup Cache
  uses: actions/cache@v4
  id: cache-sdk
  with:
    path: ./vulkan_sdk
    key: vulkan-sdk-${{ env.VULKAN_SDK_VERSION }}-${{ runner.os }}
- name: Setup Vulkan SDK
  if: steps.cache-sdk.outputs.cache-hit != 'true'
  uses: ./.github/actions/linux-setup-vulkan
  with:
    path: ./vulkan_sdk
    version: ${{ env.VULKAN_SDK_VERSION }}

- name: Setup Vulkan SDK
  id: setup
  uses: ./.github/actions/unarchive-tar
  with:
    url: https://sdk.lunarg.com/sdk/download/${{ inputs.version }}/linux/vulkan_sdk.tar.xz
    path: ${{ inputs.path }}
    strip: 1

(add type: z for gzip)

Contributor Author

@CISC Thanks for the suggestion. I have made the changes. @ravi9 Please also review this.

@ggerganov
Member

@wine99 Could you address the comment by @slaren earlier?

@ravi9
Contributor

ravi9 commented Oct 14, 2025

Is there any way to build this on Ubuntu 25.04? OpenVINO doesn't seem to support this version of Ubuntu. I tried anyway, installing the dependencies manually, but the build fails due to a missing header tbb/blocked_range.h. The file exists, but the include directory does not seem to be set up correctly.

@slaren We have a fix to support Ubuntu 25.04 and will update soon.

@ravi9
Contributor

ravi9 commented Oct 15, 2025

@slaren: Could you try again? We fixed CMakeLists.txt to resolve the TBB issue.
Also, we created a patch to install OV-2025.3 on Ubuntu 25.04.

# Script to Install OpenVINO from archive
wget https://raw.githubusercontent.com/ravi9/misc-scripts/main/openvino/ov-archive-install/install-openvino-from-archive.sh
chmod +x install-openvino-from-archive.sh
./install-openvino-from-archive.sh

@wine99 wine99 force-pushed the dev_backend_openvino branch 2 times, most recently from ade4a2d to f89292d on October 15, 2025 08:06
@slaren
Member

slaren commented Oct 15, 2025

@slaren: Could you try again? We fixed CMakeLists.txt to resolve the TBB issue.
Also, we created a patch to install OV-2025.3 on Ubuntu 25.04.

Thanks. I was able to build it now, but I get different exceptions when trying to run it.

Details
$ llama-bench -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
| model                          |       size |     params | backend    | ngl |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | --------------: | -------------------: |
Function(s) ^std::(move|forward|as_const|(__)?addressof) will be skipped when stepping.
Function(s) ^std::(shared|unique)_ptr<.*>::(get|operator) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|(forward_)?list|(unordered_|flat_)?(multi)?(map|set)|span)<.*>::(c?r?(begin|end)|front|back|data|size|empty) will be skipped when stepping.
Function(s) ^std::(basic_string|vector|array|deque|span)<.*>::operator.] will be skipped when stepping.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007fa0509c1b63 in __internal_syscall_cancel (a1=2057930, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2057930, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007fa050a3de9f in __GI___wait4 (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007fa050a3deeb in __GI___waitpid (pid=pid@entry=2057930, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007fa050f1cf23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007fa050f2b3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007fa050d290aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007fa050d12a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007fa050d29361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007fa04f9d7d38 in ov::frontend::NotImplementedFailure::create(char const*, int, char const*, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007fa050797e99 in ov::frontend::ggml::op::translate_permute (context=...) at /usr/include/c++/14/bits/basic_string.tcc:242
242               ~_Guard() { if (_M_guarded) _M_guarded->_M_dispose(); }
#12 0x00007fa0507afa52 in std::__invoke_impl<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__f=<optimized out>) at /usr/include/c++/14/bits/invoke.h:60
60          __invoke_impl(__invoke_other, _Fn&& __f, _Args&&... __args)
#13 std::__invoke_r<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >, std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*&)(ov::frontend::ggml::NodeContext const&), ov::frontend::ggml::NodeContext const&> (__fn=<optimized out>) at /usr/include/c++/14/bits/invoke.h:116
116                                               std::forward<_Args>(__args)...);
#14 std::_Function_handler<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&), std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > > (*)(ov::frontend::ggml::NodeContext const&)>::_M_invoke (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:291
291                                          std::forward<_ArgTypes>(__args)...);
#15 0x00007fa0507c3bbb in std::function<std::vector<ov::Output<ov::Node>, std::allocator<ov::Output<ov::Node> > >(ov::frontend::ggml::NodeContext const&)>::operator() (this=0x560333e2cb18, __args#0=...) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#16 operator() (__closure=0x7ffc0bc79240, node=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (use count 3, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:207
207             converted_outputs = it->second(node_context);
#17 0x00007fa0507c43ed in std::__invoke_impl<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__f=...) at /usr/include/c++/14/bits/shared_ptr_base.h:1095
1095          _M_swap(__shared_count& __r) noexcept
#18 std::__invoke_r<void, ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>&, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> > (__fn=...) at /usr/include/c++/14/bits/invoke.h:111
111             std::__invoke_impl<__type>(__tag{}, std::forward<_Callable>(__fn),
#19 std::_Function_handler<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>), ov::frontend::ggml::TranslateSession::translate_graph(const ov::frontend::InputModel::Ptr&)::<lambda(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)> >::_M_invoke(const std::_Any_data &, std::shared_ptr<ov::frontend::ggml::GgmlDecoder> &&) (__functor=..., __args#0=...) at /usr/include/c++/14/bits/std_function.h:290
290             return std::__invoke_r<_Res>(*_Base::_M_get_pointer(__functor),
#20 0x00007fa05076935c in std::function<void(std::shared_ptr<ov::frontend::ggml::GgmlDecoder>)>::operator() (this=0x7ffc0bc79240, __args#0=std::shared_ptr<ov::frontend::ggml::GgmlDecoder> (empty) = {...}) at /usr/include/c++/14/bits/std_function.h:591
591             return _M_invoker(_M_functor, std::forward<_ArgTypes>(__args)...);
#21 GgmlOvDecoder::visit_subgraph (this=0x5603373a9440, node_visitor=...) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-decoder.cpp:761
761             node_visitor(decoder);
#22 0x00007fa0507c7143 in ov::frontend::ggml::TranslateSession::translate_graph (this=this@entry=0x7ffc0bc79450, input_model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:230
230         ggml_model_decoder->visit_subgraph(node_visitor);
#23 0x00007fa0507c8d46 in ov::frontend::ggml::TranslateSession::get_converted_model (this=this@entry=0x7ffc0bc79450) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/translate_session.cpp:167
167         m_ov_model = translate_graph(m_input_model);
#24 0x00007fa05078127b in ov::frontend::ggml::FrontEnd::convert (model=std::shared_ptr<ov::frontend::InputModel> (use count 5, weak count 0) = {...}, naive=naive@entry=false) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/openvino/frontend.cpp:20
20              converted_model = translate_session.get_converted_model();
#25 0x00007fa0507db371 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:173
173                     model = ov::frontend::ggml::FrontEnd::convert(input_model);
#26 0x00007fa05076d75d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#27 0x00007fa050f32e60 in ggml_backend_sched_compute_splits (sched=0x560333e04ec0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#28 ggml_backend_sched_graph_compute_async (sched=0x560333e04ec0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#29 0x00007fa05102bf31 in llama_context::graph_compute (this=this@entry=0x5603373ab970, gf=0x5603342642f0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#30 0x00007fa05102cf8a in llama_context::process_ubatch (this=this@entry=0x5603373ab970, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560333e03000, ret=@0x7ffc0bc7db58: 1360825900) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#31 0x00007fa0510303bf in llama_context::decode (this=0x5603373ab970, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#32 0x00007fa05103126f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#33 0x0000560326cc51c1 in test_prompt (ctx=ctx@entry=0x5603373ab970, n_prompt=512, n_batch=2048, n_threads=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:1939
1939            int res = llama_decode(ctx, llama_batch_get_one(tokens.data(), n_tokens));
#34 0x0000560326cc0131 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/llama-bench/llama-bench.cpp:2115
2115                    bool res = test_prompt(ctx, t.n_prompt, t.n_batch, t.n_threads);
[Inferior 1 (process 2057893) detached]
terminate called after throwing an instance of 'ov::frontend::NotImplementedFailure'
  what():  Check '(op_case == 1 || op_case == 2 || op_case == 3)' failed at openvino/op/permute.cpp:25:
FrontEnd API failed with NotImplementedFailure:
"Unsupported PERMUTE case" is not implemented for this FrontEnd class


$ llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf
[...]
__syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
warning: 56     ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S: No such file or directory
#0  __syscall_cancel_arch () at ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S:56
56      in ../sysdeps/unix/sysv/linux/x86_64/syscall_cancel.S
#1  0x00007ff5bae77b63 in __internal_syscall_cancel (a1=2058374, a2=0, a3=0, a4=0, a5=0, a6=0, nr=61) at ./nptl/cancellation.c:49
warning: 49     ./nptl/cancellation.c: No such file or directory
#2  __syscall_cancel (a1=a1@entry=2058374, a2=a2@entry=0, a3=a3@entry=0, a4=a4@entry=0, a5=a5@entry=0, a6=a6@entry=0, nr=61) at ./nptl/cancellation.c:75
75      in ./nptl/cancellation.c
#3  0x00007ff5baef3e9f in __GI___wait4 (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0, usage=usage@entry=0x0) at ../sysdeps/unix/sysv/linux/wait4.c:30
warning: 30     ../sysdeps/unix/sysv/linux/wait4.c: No such file or directory
#4  0x00007ff5baef3eeb in __GI___waitpid (pid=pid@entry=2058374, stat_loc=stat_loc@entry=0x0, options=options@entry=0) at ./posix/waitpid.c:38
warning: 38     ./posix/waitpid.c: No such file or directory
#5  0x00007ff5bb4cff23 in ggml_print_backtrace () at /home/diego/code/llama.cpp/ggml/src/ggml.c:196
196             waitpid(child_pid, NULL, 0);
#6  0x00007ff5bb4de3af in ggml_uncaught_exception () at /home/diego/code/llama.cpp/ggml/src/ggml.cpp:9
9           ggml_print_backtrace();
#7  0x00007ff5bb1df0aa in ?? () from /lib/x86_64-linux-gnu/libstdc++.so.6
#8  0x00007ff5bb1c8a9e in std::terminate() () from /lib/x86_64-linux-gnu/libstdc++.so.6
#9  0x00007ff5bb1df361 in __cxa_throw () from /lib/x86_64-linux-gnu/libstdc++.so.6
#10 0x00007ff5b936f710 in ov::Exception::create(char const*, int, std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const&) () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#11 0x00007ff5b93dee79 in ?? () from /opt/intel/openvino_2025.3/runtime/lib/intel64/libopenvino.so.2530
#12 0x00007ff5bac8fde7 in openvino_frontend_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/utils.cpp:223
223         infer_request.infer();
#13 0x00007ff5bac2375d in ggml_backend_openvino_graph_compute (backend=<optimized out>, cgraph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-openvino/ggml-openvino.cpp:54
54          openvino_frontend_compute(backend, cgraph);
#14 0x00007ff5bb4e5e60 in ggml_backend_sched_compute_splits (sched=0x56052ad349e0) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1553
1553                enum ggml_status ec = ggml_backend_graph_compute_async(split_backend, &split->graph);
#15 ggml_backend_sched_graph_compute_async (sched=0x56052ad349e0, graph=<optimized out>) at /home/diego/code/llama.cpp/ggml/src/ggml-backend.cpp:1753
1753        return ggml_backend_sched_compute_splits(sched);
#16 0x00007ff5bb5def31 in llama_context::graph_compute (this=this@entry=0x56052aef8b60, gf=0x56052b1fabb0, batched=<optimized out>) at /usr/include/c++/14/bits/unique_ptr.h:193
193           pointer    _M_ptr() const noexcept { return std::get<0>(_M_t); }
#17 0x00007ff5bb5dff8a in llama_context::process_ubatch (this=this@entry=0x56052aef8b60, ubatch=..., gtype=gtype@entry=LLM_GRAPH_TYPE_DECODER, mctx=mctx@entry=0x560532946fc0, ret=@0x7ffdfd4892e8: -45570912) at /home/diego/code/llama.cpp/src/llama-context.cpp:784
784         const auto status = graph_compute(res->get_gf(), ubatch.n_tokens > 1);
#18 0x00007ff5bb5e33bf in llama_context::decode (this=0x56052aef8b60, batch_inp=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:1088
1088            const auto * res = process_ubatch(ubatch, LLM_GRAPH_TYPE_DECODER, mctx.get(), status);
#19 0x00007ff5bb5e426f in llama_decode (ctx=<optimized out>, batch=...) at /home/diego/code/llama.cpp/src/llama-context.cpp:2747
2747        const int ret = ctx->decode(batch);
#20 0x00005604f48c2e16 in main (argc=<optimized out>, argv=<optimized out>) at /home/diego/code/llama.cpp/tools/main/main.cpp:671
671                     if (llama_decode(ctx, llama_batch_get_one(&embd[i], n_eval))) {
[Inferior 1 (process 2058328) detached]
terminate called after throwing an instance of 'ov::Exception'
  what():  Exception from src/inference/src/cpp/infer_request.cpp:223:
Exception from src/plugins/intel_cpu/src/node.cpp:792:
[CPU] Add node with name 'Add_19058' Check 'input_shape[j] == 1' failed at src/plugins/intel_cpu/src/shape_inference/custom/eltwise.cpp:52:
Eltwise shape infer input shapes dim index: 3 mismatch

@ravi9
Contributor

ravi9 commented Oct 15, 2025

@slaren Thanks for testing.

  • We are working on fixing llama-bench.
  • For llama-cli and llama-server, please run with --no-warmup for now; the input shapes for the warmup need to be fixed, and we are working on a solution. Example: llama-cli -m models/Llama-3.2-1B-Instruct.Q4_K_M.gguf --no-warmup
  • llama-simple should work fine.

@ggerganov
Member

@CISC Regarding #20446 (comment), are you able to approve this workflow now?

@CISC
Member

CISC commented Mar 13, 2026

@CISC Regarding #20446 (comment), are you able to approve this workflow now?

Someone already did, but I still see the Approval button on other PRs...

Ah, no, it's gone on new PRs now. :(

Member

@ggerganov ggerganov left a comment


@ravi9 I've disabled the test-llama-archs test as it was failing. I recommend prioritizing a fix, because this will increase confidence that the OpenVINO backend is general enough to support all LLMs.

Waiting for CI to pass and merging.

@savvadesogle

savvadesogle commented Mar 13, 2026

Hello

Intel Arc A770
Windows 11
driver: 8509

Command

llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 0,1 --verbose
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36
GGML OpenVINO backend std::exception: might be outdated RESHAPE case
graph_compute: ggml_backend_sched_graph_compute_async failed with error -1
process_ubatch: failed to compute graph, compute status: -1
decode: removing memory module entries for seq_id = 0, pos = [0, +inf)
llama_decode: failed to decode, ret = -3
test_prompt: failed to decode prompt batch, res = -3
main: error: failed to run prompt warmup
~llama_context:  OPENVINO0 compute buffer size is 258.5000 MiB, matches expectation of 258.5000 MiB
~llama_context: OPENVINO0_HOST compute buffer size is  55.0098 MiB, matches expectation of  55.0098 MiB
C:\llm\llama-cpp\OpenVINO\openvino_llama\build\ReleaseOV\bin>llama-cli --version
OpenVINO: using device GPU.0
version: 8508 (5237965bb)
built with Clang 20.1.0 for Windows AMD64

The same happens with llama-2-7b.Q4_0.gguf.

@ravi9
Contributor

ravi9 commented Mar 13, 2026

Hi @savvadesogle ,
Could you try with just -fa 1?
Apologies for not being clear about the flag. We did mention that llama-bench requires -fa 1 in the PR description and in our docs. We will improve the documentation.

I just tested it on an Intel Core Ultra Series 2 with 32GB RAM and it runs fine.
Also, as mentioned in the docs: although OpenVINO supports a wide range of Intel hardware, the llama.cpp OpenVINO backend has been validated specifically on AI PCs such as the Intel® Core™ Ultra Series 1 and Series 2.

Thank you for testing this PR. We will continue to improve this backend with more validation, broader hardware coverage, and extensive testing.

llamauser@pdx ~/ravi/llama.cpp [dev_backend_openvino]$ ./build/ReleaseOV/bin/llama-bench -m ~/models/Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -fa 1
OpenVINO: using device GPU
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   |  99 |  1 |           pp512 |       498.80 ± 42.18 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   |  99 |  1 |           tg128 |          9.45 ± 0.63 |

build: 5237965bb (8508)

@savvadesogle

savvadesogle commented Mar 13, 2026

We did mention that llama-bench requires -fa 1

Thank you ❤️

Now it's working 🔥🔥🔥🔥
Prompt processing (PP) is great, but token generation (TG) involves a lot of copy operations and the speed is low.

C:\llm\llama-cpp\OpenVINO\openvino_llama\build\ReleaseOV\bin>llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 1
OpenVINO: using device GPU.0
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   | 100 |  1 |           pp512 |     2246.27 ± 118.90 |
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   | 100 |  1 |           tg128 |          8.05 ± 0.09 |

build: 5237965bb (8508)
C:\llm\llama-cpp\OpenVINO\openvino_llama\build\ReleaseOV\bin>llama-bench -m T:\models\TheBloke\Llama-2-7B-GGUF\llama-2-7b.Q4_0.gguf -ngl 100 -fa 1
OpenVINO: using device GPU.0
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | OPENVINO   | 100 |  1 |           pp512 |       1548.81 ± 9.83 |
| llama 7B Q4_0                  |   3.56 GiB |     6.74 B | OPENVINO   | 100 |  1 |           tg128 |          6.38 ± 0.06 |

build: 5237965bb (8508)

I would also like to add that you need to specify GPU.0 or GPU.1 instead of just GPU
set GGML_OPENVINO_DEVICE=GPU.0
Otherwise, the backend does not see the GPU.
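The device-selection workaround above can be sketched as follows. This is an illustrative Python helper, not the backend's actual (C++) logic; only the GGML_OPENVINO_DEVICE variable name and the GPU.0 / GPU.1 naming come from the comment above, the fallback behavior is an assumption.

```python
import os

def pick_openvino_device(available, default="CPU"):
    """Prefer an explicitly requested device (e.g. "GPU.0") from the
    GGML_OPENVINO_DEVICE environment variable; otherwise fall back."""
    requested = os.environ.get("GGML_OPENVINO_DEVICE")
    if requested and requested in available:
        return requested
    return default

os.environ["GGML_OPENVINO_DEVICE"] = "GPU.0"
print(pick_openvino_device(["CPU", "GPU.0", "GPU.1"]))  # GPU.0
```

The point of the report is that on multi-GPU systems a bare "GPU" apparently does not match an enumerated device like "GPU.0", so the explicit index is needed.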

LOG --verbose

C:\llm\llama-cpp\OpenVINO\openvino_llama\build\ReleaseOV\bin>llama-bench -m T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf -ngl 100 -fa 1 --verbose
OpenVINO: using device GPU.0
| model                          |       size |     params | backend    | ngl | fa |            test |                  t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------: | -------------------: |
llama_model_load_from_file_impl: using device OPENVINO0 (OpenVINO Runtime) (unknown id) - 119701 MiB free
llama_model_loader: loaded meta data with 33 key-value pairs and 292 tensors from T:\models\lmstudio-community\Meta-Llama-3.1-8B-Instruct-GGUF\Meta-Llama-3.1-8B-Instruct-Q4_K_M.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llama_model_loader: - kv   0:                       general.architecture str              = llama
llama_model_loader: - kv   1:                               general.type str              = model
llama_model_loader: - kv   2:                               general.name str              = Meta Llama 3.1 8B Instruct
llama_model_loader: - kv   3:                           general.finetune str              = Instruct
llama_model_loader: - kv   4:                           general.basename str              = Meta-Llama-3.1
llama_model_loader: - kv   5:                         general.size_label str              = 8B
llama_model_loader: - kv   6:                            general.license str              = llama3.1
llama_model_loader: - kv   7:                               general.tags arr[str,6]       = ["facebook", "meta", "pytorch", "llam...
llama_model_loader: - kv   8:                          general.languages arr[str,8]       = ["en", "de", "fr", "it", "pt", "hi", ...
llama_model_loader: - kv   9:                          llama.block_count u32              = 32
llama_model_loader: - kv  10:                       llama.context_length u32              = 131072
llama_model_loader: - kv  11:                     llama.embedding_length u32              = 4096
llama_model_loader: - kv  12:                  llama.feed_forward_length u32              = 14336
llama_model_loader: - kv  13:                 llama.attention.head_count u32              = 32
llama_model_loader: - kv  14:              llama.attention.head_count_kv u32              = 8
llama_model_loader: - kv  15:                       llama.rope.freq_base f32              = 500000.000000
llama_model_loader: - kv  16:     llama.attention.layer_norm_rms_epsilon f32              = 0.000010
llama_model_loader: - kv  17:                          general.file_type u32              = 15
llama_model_loader: - kv  18:                           llama.vocab_size u32              = 128256
llama_model_loader: - kv  19:                 llama.rope.dimension_count u32              = 128
llama_model_loader: - kv  20:                       tokenizer.ggml.model str              = gpt2
llama_model_loader: - kv  21:                         tokenizer.ggml.pre str              = llama-bpe
llama_model_loader: - kv  22:                      tokenizer.ggml.tokens arr[str,128256]  = ["!", "\"", "#", "$", "%", "&", "'", ...
llama_model_loader: - kv  23:                  tokenizer.ggml.token_type arr[i32,128256]  = [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...
llama_model_loader: - kv  24:                      tokenizer.ggml.merges arr[str,280147]  = ["Ġ Ġ", "Ġ ĠĠĠ", "ĠĠ ĠĠ", "...
llama_model_loader: - kv  25:                tokenizer.ggml.bos_token_id u32              = 128000
llama_model_loader: - kv  26:                tokenizer.ggml.eos_token_id u32              = 128009
llama_model_loader: - kv  27:                    tokenizer.chat_template str              = {{- bos_token }}\n{%- if custom_tools ...
llama_model_loader: - kv  28:               general.quantization_version u32              = 2
llama_model_loader: - kv  29:                      quantize.imatrix.file str              = /models_out/Meta-Llama-3.1-8B-Instruc...
llama_model_loader: - kv  30:                   quantize.imatrix.dataset str              = /training_dir/calibration_datav3.txt
llama_model_loader: - kv  31:             quantize.imatrix.entries_count i32              = 224
llama_model_loader: - kv  32:              quantize.imatrix.chunks_count i32              = 125
llama_model_loader: - type  f32:   66 tensors
llama_model_loader: - type q4_K:  193 tensors
llama_model_loader: - type q6_K:   33 tensors
print_info: file format = GGUF V3 (latest)
print_info: file type   = Q4_K - Medium
print_info: file size   = 4.58 GiB (4.89 BPW)
init_tokenizer: initializing tokenizer for type 2
load: 0 unused tokens
load: control token: 128098 '<|reserved_special_token_90|>' is not marked as EOG
load: control token: 128191 '<|reserved_special_token_183|>' is not marked as EOG
load: control token: 128130 '<|reserved_special_token_122|>' is not marked as EOG
load: control token: 128119 '<|reserved_special_token_111|>' is not marked as EOG
load: control token: 128136 '<|reserved_special_token_128|>' is not marked as EOG
load: control token: 128155 '<|reserved_special_token_147|>' is not marked as EOG
load: control token: 128196 '<|reserved_special_token_188|>' is not marked as EOG
load: control token: 128101 '<|reserved_special_token_93|>' is not marked as EOG
load: control token: 128138 '<|reserved_special_token_130|>' is not marked as EOG
load: control token: 128181 '<|reserved_special_token_173|>' is not marked as EOG
load: control token: 128034 '<|reserved_special_token_26|>' is not marked as EOG
load: control token: 128209 '<|reserved_special_token_201|>' is not marked as EOG
load: control token: 128031 '<|reserved_special_token_23|>' is not marked as EOG
load: control token: 128050 '<|reserved_special_token_42|>' is not marked as EOG
load: control token: 128244 '<|reserved_special_token_236|>' is not marked as EOG
load: control token: 128148 '<|reserved_special_token_140|>' is not marked as EOG
load: control token: 128198 '<|reserved_special_token_190|>' is not marked as EOG
load: control token: 128229 '<|reserved_special_token_221|>' is not marked as EOG
load: control token: 128165 '<|reserved_special_token_157|>' is not marked as EOG
load: control token: 128246 '<|reserved_special_token_238|>' is not marked as EOG
load: control token: 128017 '<|reserved_special_token_9|>' is not marked as EOG
load: control token: 128216 '<|reserved_special_token_208|>' is not marked as EOG
load: control token: 128161 '<|reserved_special_token_153|>' is not marked as EOG
load: control token: 128224 '<|reserved_special_token_216|>' is not marked as EOG
load: control token: 128082 '<|reserved_special_token_74|>' is not marked as EOG
load: control token: 128004 '<|finetune_right_pad_id|>' is not marked as EOG
load: control token: 128249 '<|reserved_special_token_241|>' is not marked as EOG
load: control token: 128107 '<|reserved_special_token_99|>' is not marked as EOG
load: control token: 128079 '<|reserved_special_token_71|>' is not marked as EOG
load: control token: 128225 '<|reserved_special_token_217|>' is not marked as EOG
load: control token: 128175 '<|reserved_special_token_167|>' is not marked as EOG
load: control token: 128223 '<|reserved_special_token_215|>' is not marked as EOG
load: control token: 128182 '<|reserved_special_token_174|>' is not marked as EOG
load: control token: 128068 '<|reserved_special_token_60|>' is not marked as EOG
load: control token: 128252 '<|reserved_special_token_244|>' is not marked as EOG
load: control token: 128178 '<|reserved_special_token_170|>' is not marked as EOG
load: control token: 128221 '<|reserved_special_token_213|>' is not marked as EOG
load: control token: 128052 '<|reserved_special_token_44|>' is not marked as EOG
load: control token: 128122 '<|reserved_special_token_114|>' is not marked as EOG
load: control token: 128151 '<|reserved_special_token_143|>' is not marked as EOG
load: control token: 128121 '<|reserved_special_token_113|>' is not marked as EOG
load: control token: 128158 '<|reserved_special_token_150|>' is not marked as EOG
load: control token: 128096 '<|reserved_special_token_88|>' is not marked as EOG
load: control token: 128090 '<|reserved_special_token_82|>' is not marked as EOG
load: control token: 128238 '<|reserved_special_token_230|>' is not marked as EOG
load: control token: 128139 '<|reserved_special_token_131|>' is not marked as EOG
load: control token: 128176 '<|reserved_special_token_168|>' is not marked as EOG
load: control token: 128077 '<|reserved_special_token_69|>' is not marked as EOG
load: control token: 128214 '<|reserved_special_token_206|>' is not marked as EOG
load: control token: 128171 '<|reserved_special_token_163|>' is not marked as EOG
load: control token: 128112 '<|reserved_special_token_104|>' is not marked as EOG
load: control token: 128180 '<|reserved_special_token_172|>' is not marked as EOG
load: control token: 128060 '<|reserved_special_token_52|>' is not marked as EOG
load: control token: 128000 '<|begin_of_text|>' is not marked as EOG
load: control token: 128152 '<|reserved_special_token_144|>' is not marked as EOG
load: control token: 128116 '<|reserved_special_token_108|>' is not marked as EOG
load: control token: 128072 '<|reserved_special_token_64|>' is not marked as EOG
load: control token: 128059 '<|reserved_special_token_51|>' is not marked as EOG
load: control token: 128094 '<|reserved_special_token_86|>' is not marked as EOG
load: control token: 128187 '<|reserved_special_token_179|>' is not marked as EOG
load: control token: 128103 '<|reserved_special_token_95|>' is not marked as EOG
load: control token: 128127 '<|reserved_special_token_119|>' is not marked as EOG
load: control token: 128023 '<|reserved_special_token_15|>' is not marked as EOG
load: control token: 128037 '<|reserved_special_token_29|>' is not marked as EOG
load: control token: 128228 '<|reserved_special_token_220|>' is not marked as EOG
load: control token: 128002 '<|reserved_special_token_0|>' is not marked as EOG
load: control token: 128006 '<|start_header_id|>' is not marked as EOG
load: control token: 128091 '<|reserved_special_token_83|>' is not marked as EOG
load: control token: 128044 '<|reserved_special_token_36|>' is not marked as EOG
load: control token: 128218 '<|reserved_special_token_210|>' is not marked as EOG
load: control token: 128211 '<|reserved_special_token_203|>' is not marked as EOG
load: control token: 128073 '<|reserved_special_token_65|>' is not marked as EOG
load: control token: 128168 '<|reserved_special_token_160|>' is not marked as EOG
load: control token: 128183 '<|reserved_special_token_175|>' is not marked as EOG
load: control token: 128234 '<|reserved_special_token_226|>' is not marked as EOG
load: control token: 128235 '<|reserved_special_token_227|>' is not marked as EOG
load: control token: 128067 '<|reserved_special_token_59|>' is not marked as EOG
load: control token: 128039 '<|reserved_special_token_31|>' is not marked as EOG
load: control token: 128106 '<|reserved_special_token_98|>' is not marked as EOG
load: control token: 128250 '<|reserved_special_token_242|>' is not marked as EOG
load: control token: 128173 '<|reserved_special_token_165|>' is not marked as EOG
load: control token: 128126 '<|reserved_special_token_118|>' is not marked as EOG
load: control token: 128047 '<|reserved_special_token_39|>' is not marked as EOG
load: control token: 128240 '<|reserved_special_token_232|>' is not marked as EOG
load: control token: 128045 '<|reserved_special_token_37|>' is not marked as EOG
load: control token: 128195 '<|reserved_special_token_187|>' is not marked as EOG
load: control token: 128078 '<|reserved_special_token_70|>' is not marked as EOG
load: control token: 128137 '<|reserved_special_token_129|>' is not marked as EOG
load: control token: 128186 '<|reserved_special_token_178|>' is not marked as EOG
load: control token: 128048 '<|reserved_special_token_40|>' is not marked as EOG
load: control token: 128076 '<|reserved_special_token_68|>' is not marked as EOG
load: control token: 128029 '<|reserved_special_token_21|>' is not marked as EOG
load: control token: 128013 '<|reserved_special_token_5|>' is not marked as EOG
load: control token: 128197 '<|reserved_special_token_189|>' is not marked as EOG
load: control token: 128056 '<|reserved_special_token_48|>' is not marked as EOG
load: control token: 128123 '<|reserved_special_token_115|>' is not marked as EOG
load: control token: 128095 '<|reserved_special_token_87|>' is not marked as EOG
load: control token: 128089 '<|reserved_special_token_81|>' is not marked as EOG
load: control token: 128057 '<|reserved_special_token_49|>' is not marked as EOG
load: control token: 128163 '<|reserved_special_token_155|>' is not marked as EOG
load: control token: 128011 '<|reserved_special_token_3|>' is not marked as EOG
load: control token: 128053 '<|reserved_special_token_45|>' is not marked as EOG
load: control token: 128160 '<|reserved_special_token_152|>' is not marked as EOG
load: control token: 128222 '<|reserved_special_token_214|>' is not marked as EOG
load: control token: 128035 '<|reserved_special_token_27|>' is not marked as EOG
load: control token: 128162 '<|reserved_special_token_154|>' is not marked as EOG
load: control token: 128205 '<|reserved_special_token_197|>' is not marked as EOG
load: control token: 128109 '<|reserved_special_token_101|>' is not marked as EOG
load: control token: 128185 '<|reserved_special_token_177|>' is not marked as EOG
load: control token: 128114 '<|reserved_special_token_106|>' is not marked as EOG
load: control token: 128159 '<|reserved_special_token_151|>' is not marked as EOG
load: control token: 128179 '<|reserved_special_token_171|>' is not marked as EOG
load: control token: 128115 '<|reserved_special_token_107|>' is not marked as EOG
load: control token: 128087 '<|reserved_special_token_79|>' is not marked as EOG
load: control token: 128113 '<|reserved_special_token_105|>' is not marked as EOG
load: control token: 128054 '<|reserved_special_token_46|>' is not marked as EOG
load: control token: 128030 '<|reserved_special_token_22|>' is not marked as EOG
load: control token: 128170 '<|reserved_special_token_162|>' is not marked as EOG
load: control token: 128012 '<|reserved_special_token_4|>' is not marked as EOG
load: control token: 128064 '<|reserved_special_token_56|>' is not marked as EOG
load: control token: 128118 '<|reserved_special_token_110|>' is not marked as EOG
load: control token: 128206 '<|reserved_special_token_198|>' is not marked as EOG
load: control token: 128099 '<|reserved_special_token_91|>' is not marked as EOG
load: control token: 128133 '<|reserved_special_token_125|>' is not marked as EOG
load: control token: 128190 '<|reserved_special_token_182|>' is not marked as EOG
load: control token: 128097 '<|reserved_special_token_89|>' is not marked as EOG
load: control token: 128086 '<|reserved_special_token_78|>' is not marked as EOG
load: control token: 128120 '<|reserved_special_token_112|>' is not marked as EOG
load: control token: 128193 '<|reserved_special_token_185|>' is not marked as EOG
load: control token: 128049 '<|reserved_special_token_41|>' is not marked as EOG
load: control token: 128242 '<|reserved_special_token_234|>' is not marked as EOG
load: control token: 128142 '<|reserved_special_token_134|>' is not marked as EOG
load: control token: 128188 '<|reserved_special_token_180|>' is not marked as EOG
load: control token: 128144 '<|reserved_special_token_136|>' is not marked as EOG
load: control token: 128247 '<|reserved_special_token_239|>' is not marked as EOG
load: control token: 128065 '<|reserved_special_token_57|>' is not marked as EOG
load: control token: 128117 '<|reserved_special_token_109|>' is not marked as EOG
load: control token: 128033 '<|reserved_special_token_25|>' is not marked as EOG
load: control token: 128184 '<|reserved_special_token_176|>' is not marked as EOG
load: control token: 128040 '<|reserved_special_token_32|>' is not marked as EOG
load: control token: 128204 '<|reserved_special_token_196|>' is not marked as EOG
load: control token: 128210 '<|reserved_special_token_202|>' is not marked as EOG
load: control token: 128245 '<|reserved_special_token_237|>' is not marked as EOG
load: control token: 128135 '<|reserved_special_token_127|>' is not marked as EOG
load: control token: 128071 '<|reserved_special_token_63|>' is not marked as EOG
load: control token: 128153 '<|reserved_special_token_145|>' is not marked as EOG
load: control token: 128194 '<|reserved_special_token_186|>' is not marked as EOG
load: control token: 128177 '<|reserved_special_token_169|>' is not marked as EOG
load: control token: 128236 '<|reserved_special_token_228|>' is not marked as EOG
load: control token: 128248 '<|reserved_special_token_240|>' is not marked as EOG
load: control token: 128241 '<|reserved_special_token_233|>' is not marked as EOG
load: control token: 128212 '<|reserved_special_token_204|>' is not marked as EOG
load: control token: 128207 '<|reserved_special_token_199|>' is not marked as EOG
load: control token: 128003 '<|reserved_special_token_1|>' is not marked as EOG
load: control token: 128005 '<|reserved_special_token_2|>' is not marked as EOG
load: control token: 128007 '<|end_header_id|>' is not marked as EOG
load: control token: 128010 '<|python_tag|>' is not marked as EOG
load: control token: 128014 '<|reserved_special_token_6|>' is not marked as EOG
load: control token: 128015 '<|reserved_special_token_7|>' is not marked as EOG
load: control token: 128016 '<|reserved_special_token_8|>' is not marked as EOG
load: control token: 128018 '<|reserved_special_token_10|>' is not marked as EOG
load: control token: 128019 '<|reserved_special_token_11|>' is not marked as EOG
load: control token: 128020 '<|reserved_special_token_12|>' is not marked as EOG
load: control token: 128021 '<|reserved_special_token_13|>' is not marked as EOG
load: control token: 128022 '<|reserved_special_token_14|>' is not marked as EOG
load: control token: 128024 '<|reserved_special_token_16|>' is not marked as EOG
load: control token: 128025 '<|reserved_special_token_17|>' is not marked as EOG
load: control token: 128026 '<|reserved_special_token_18|>' is not marked as EOG
load: control token: 128027 '<|reserved_special_token_19|>' is not marked as EOG
load: control token: 128028 '<|reserved_special_token_20|>' is not marked as EOG
load: control token: 128032 '<|reserved_special_token_24|>' is not marked as EOG
load: control token: 128036 '<|reserved_special_token_28|>' is not marked as EOG
load: control token: 128038 '<|reserved_special_token_30|>' is not marked as EOG
load: control token: 128041 '<|reserved_special_token_33|>' is not marked as EOG
load: control token: 128042 '<|reserved_special_token_34|>' is not marked as EOG
load: control token: 128043 '<|reserved_special_token_35|>' is not marked as EOG
load: control token: 128046 '<|reserved_special_token_38|>' is not marked as EOG
load: control token: 128051 '<|reserved_special_token_43|>' is not marked as EOG
load: control token: 128055 '<|reserved_special_token_47|>' is not marked as EOG
load: control token: 128058 '<|reserved_special_token_50|>' is not marked as EOG
load: control token: 128061 '<|reserved_special_token_53|>' is not marked as EOG
load: control token: 128062 '<|reserved_special_token_54|>' is not marked as EOG
load: control token: 128063 '<|reserved_special_token_55|>' is not marked as EOG
load: control token: 128066 '<|reserved_special_token_58|>' is not marked as EOG
load: control token: 128069 '<|reserved_special_token_61|>' is not marked as EOG
load: control token: 128070 '<|reserved_special_token_62|>' is not marked as EOG
load: control token: 128074 '<|reserved_special_token_66|>' is not marked as EOG
load: control token: 128075 '<|reserved_special_token_67|>' is not marked as EOG
load: control token: 128080 '<|reserved_special_token_72|>' is not marked as EOG
load: control token: 128081 '<|reserved_special_token_73|>' is not marked as EOG
load: control token: 128083 '<|reserved_special_token_75|>' is not marked as EOG
load: control token: 128084 '<|reserved_special_token_76|>' is not marked as EOG
load: control token: 128085 '<|reserved_special_token_77|>' is not marked as EOG
load: control token: 128088 '<|reserved_special_token_80|>' is not marked as EOG
load: control token: 128092 '<|reserved_special_token_84|>' is not marked as EOG
load: control token: 128093 '<|reserved_special_token_85|>' is not marked as EOG
load: control token: 128100 '<|reserved_special_token_92|>' is not marked as EOG
load: control token: 128102 '<|reserved_special_token_94|>' is not marked as EOG
load: control token: 128104 '<|reserved_special_token_96|>' is not marked as EOG
load: control token: 128105 '<|reserved_special_token_97|>' is not marked as EOG
load: control token: 128108 '<|reserved_special_token_100|>' is not marked as EOG
load: control token: 128110 '<|reserved_special_token_102|>' is not marked as EOG
load: control token: 128111 '<|reserved_special_token_103|>' is not marked as EOG
load: control token: 128124 '<|reserved_special_token_116|>' is not marked as EOG
load: control token: 128125 '<|reserved_special_token_117|>' is not marked as EOG
load: control token: 128128 '<|reserved_special_token_120|>' is not marked as EOG
load: control token: 128129 '<|reserved_special_token_121|>' is not marked as EOG
load: control token: 128131 '<|reserved_special_token_123|>' is not marked as EOG
load: control token: 128132 '<|reserved_special_token_124|>' is not marked as EOG
load: control token: 128134 '<|reserved_special_token_126|>' is not marked as EOG
load: control token: 128140 '<|reserved_special_token_132|>' is not marked as EOG
load: control token: 128141 '<|reserved_special_token_133|>' is not marked as EOG
load: control token: 128143 '<|reserved_special_token_135|>' is not marked as EOG
load: control token: 128145 '<|reserved_special_token_137|>' is not marked as EOG
load: control token: 128146 '<|reserved_special_token_138|>' is not marked as EOG
load: control token: 128147 '<|reserved_special_token_139|>' is not marked as EOG
load: control token: 128149 '<|reserved_special_token_141|>' is not marked as EOG
load: control token: 128150 '<|reserved_special_token_142|>' is not marked as EOG
load: control token: 128154 '<|reserved_special_token_146|>' is not marked as EOG
load: control token: 128156 '<|reserved_special_token_148|>' is not marked as EOG
load: control token: 128157 '<|reserved_special_token_149|>' is not marked as EOG
load: control token: 128164 '<|reserved_special_token_156|>' is not marked as EOG
load: control token: 128166 '<|reserved_special_token_158|>' is not marked as EOG
load: control token: 128167 '<|reserved_special_token_159|>' is not marked as EOG
load: control token: 128169 '<|reserved_special_token_161|>' is not marked as EOG
load: control token: 128172 '<|reserved_special_token_164|>' is not marked as EOG
load: control token: 128174 '<|reserved_special_token_166|>' is not marked as EOG
load: control token: 128189 '<|reserved_special_token_181|>' is not marked as EOG
load: control token: 128192 '<|reserved_special_token_184|>' is not marked as EOG
load: control token: 128199 '<|reserved_special_token_191|>' is not marked as EOG
load: control token: 128200 '<|reserved_special_token_192|>' is not marked as EOG
load: control token: 128201 '<|reserved_special_token_193|>' is not marked as EOG
load: control token: 128202 '<|reserved_special_token_194|>' is not marked as EOG
load: control token: 128203 '<|reserved_special_token_195|>' is not marked as EOG
load: control token: 128208 '<|reserved_special_token_200|>' is not marked as EOG
load: control token: 128213 '<|reserved_special_token_205|>' is not marked as EOG
load: control token: 128215 '<|reserved_special_token_207|>' is not marked as EOG
load: control token: 128217 '<|reserved_special_token_209|>' is not marked as EOG
load: control token: 128219 '<|reserved_special_token_211|>' is not marked as EOG
load: control token: 128220 '<|reserved_special_token_212|>' is not marked as EOG
load: control token: 128226 '<|reserved_special_token_218|>' is not marked as EOG
load: control token: 128227 '<|reserved_special_token_219|>' is not marked as EOG
load: control token: 128230 '<|reserved_special_token_222|>' is not marked as EOG
load: control token: 128231 '<|reserved_special_token_223|>' is not marked as EOG
load: control token: 128232 '<|reserved_special_token_224|>' is not marked as EOG
load: control token: 128233 '<|reserved_special_token_225|>' is not marked as EOG
load: control token: 128237 '<|reserved_special_token_229|>' is not marked as EOG
load: control token: 128239 '<|reserved_special_token_231|>' is not marked as EOG
load: control token: 128243 '<|reserved_special_token_235|>' is not marked as EOG
load: control token: 128251 '<|reserved_special_token_243|>' is not marked as EOG
load: control token: 128253 '<|reserved_special_token_245|>' is not marked as EOG
load: control token: 128254 '<|reserved_special_token_246|>' is not marked as EOG
load: control token: 128255 '<|reserved_special_token_247|>' is not marked as EOG
load: printing all EOG tokens:
load:   - 128001 ('<|end_of_text|>')
load:   - 128008 ('<|eom_id|>')
load:   - 128009 ('<|eot_id|>')
load: special tokens cache size = 256
load: token to piece cache size = 0.7999 MB
print_info: arch                  = llama
print_info: vocab_only            = 0
print_info: no_alloc              = 0
print_info: n_ctx_train           = 131072
print_info: n_embd                = 4096
print_info: n_embd_inp            = 4096
print_info: n_layer               = 32
print_info: n_head                = 32
print_info: n_head_kv             = 8
print_info: n_rot                 = 128
print_info: n_swa                 = 0
print_info: is_swa_any            = 0
print_info: n_embd_head_k         = 128
print_info: n_embd_head_v         = 128
print_info: n_gqa                 = 4
print_info: n_embd_k_gqa          = 1024
print_info: n_embd_v_gqa          = 1024
print_info: f_norm_eps            = 0.0e+00
print_info: f_norm_rms_eps        = 1.0e-05
print_info: f_clamp_kqv           = 0.0e+00
print_info: f_max_alibi_bias      = 0.0e+00
print_info: f_logit_scale         = 0.0e+00
print_info: f_attn_scale          = 0.0e+00
print_info: n_ff                  = 14336
print_info: n_expert              = 0
print_info: n_expert_used         = 0
print_info: n_expert_groups       = 0
print_info: n_group_used          = 0
print_info: causal attn           = 1
print_info: pooling type          = 0
print_info: rope type             = 0
print_info: rope scaling          = linear
print_info: freq_base_train       = 500000.0
print_info: freq_scale_train      = 1
print_info: n_ctx_orig_yarn       = 131072
print_info: rope_yarn_log_mul     = 0.0000
print_info: rope_finetuned        = unknown
print_info: model type            = 8B
print_info: model params          = 8.03 B
print_info: general.name          = Meta Llama 3.1 8B Instruct
print_info: vocab type            = BPE
print_info: n_vocab               = 128256
print_info: n_merges              = 280147
print_info: BOS token             = 128000 '<|begin_of_text|>'
print_info: EOS token             = 128009 '<|eot_id|>'
print_info: EOT token             = 128009 '<|eot_id|>'
print_info: EOM token             = 128008 '<|eom_id|>'
print_info: LF token              = 198 'Ċ'
print_info: EOG token             = 128001 '<|end_of_text|>'
print_info: EOG token             = 128008 '<|eom_id|>'
print_info: EOG token             = 128009 '<|eot_id|>'
print_info: max token length      = 256
load_tensors: loading model tensors, this can take a while... (mmap = false, direct_io = false)
load_tensors: layer   0 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   1 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   2 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   3 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   4 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   5 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   6 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   7 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   8 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer   9 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  10 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  11 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  12 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  13 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  14 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  15 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  16 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  17 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  18 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  19 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  20 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  21 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  22 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  23 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  24 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  25 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  26 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  27 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  28 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  29 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  30 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  31 assigned to device OPENVINO0, is_swa = 0
load_tensors: layer  32 assigned to device OPENVINO0, is_swa = 0
create_tensor: loading tensor token_embd.weight
create_tensor: loading tensor output_norm.weight
create_tensor: loading tensor output.weight
create_tensor: loading tensor blk.0.attn_norm.weight
create_tensor: loading tensor blk.0.attn_q.weight
create_tensor: loading tensor blk.0.attn_k.weight
create_tensor: loading tensor blk.0.attn_v.weight
create_tensor: loading tensor blk.0.attn_output.weight
create_tensor: loading tensor blk.0.ffn_norm.weight
create_tensor: loading tensor rope_freqs.weight
create_tensor: loading tensor blk.0.ffn_gate.weight
create_tensor: loading tensor blk.0.ffn_down.weight
create_tensor: loading tensor blk.0.ffn_up.weight
create_tensor: loading tensor blk.1.attn_norm.weight
create_tensor: loading tensor blk.1.attn_q.weight
create_tensor: loading tensor blk.1.attn_k.weight
create_tensor: loading tensor blk.1.attn_v.weight
create_tensor: loading tensor blk.1.attn_output.weight
create_tensor: loading tensor blk.1.ffn_norm.weight
create_tensor: loading tensor blk.1.ffn_gate.weight
create_tensor: loading tensor blk.1.ffn_down.weight
create_tensor: loading tensor blk.1.ffn_up.weight
create_tensor: loading tensor blk.2.attn_norm.weight
create_tensor: loading tensor blk.2.attn_q.weight
create_tensor: loading tensor blk.2.attn_k.weight
create_tensor: loading tensor blk.2.attn_v.weight
create_tensor: loading tensor blk.2.attn_output.weight
create_tensor: loading tensor blk.2.ffn_norm.weight
create_tensor: loading tensor blk.2.ffn_gate.weight
create_tensor: loading tensor blk.2.ffn_down.weight
create_tensor: loading tensor blk.2.ffn_up.weight
create_tensor: loading tensor blk.3.attn_norm.weight
create_tensor: loading tensor blk.3.attn_q.weight
create_tensor: loading tensor blk.3.attn_k.weight
create_tensor: loading tensor blk.3.attn_v.weight
create_tensor: loading tensor blk.3.attn_output.weight
create_tensor: loading tensor blk.3.ffn_norm.weight
create_tensor: loading tensor blk.3.ffn_gate.weight
create_tensor: loading tensor blk.3.ffn_down.weight
create_tensor: loading tensor blk.3.ffn_up.weight
create_tensor: loading tensor blk.4.attn_norm.weight
create_tensor: loading tensor blk.4.attn_q.weight
create_tensor: loading tensor blk.4.attn_k.weight
create_tensor: loading tensor blk.4.attn_v.weight
create_tensor: loading tensor blk.4.attn_output.weight
create_tensor: loading tensor blk.4.ffn_norm.weight
create_tensor: loading tensor blk.4.ffn_gate.weight
create_tensor: loading tensor blk.4.ffn_down.weight
create_tensor: loading tensor blk.4.ffn_up.weight
create_tensor: loading tensor blk.5.attn_norm.weight
create_tensor: loading tensor blk.5.attn_q.weight
create_tensor: loading tensor blk.5.attn_k.weight
create_tensor: loading tensor blk.5.attn_v.weight
create_tensor: loading tensor blk.5.attn_output.weight
create_tensor: loading tensor blk.5.ffn_norm.weight
create_tensor: loading tensor blk.5.ffn_gate.weight
create_tensor: loading tensor blk.5.ffn_down.weight
create_tensor: loading tensor blk.5.ffn_up.weight
create_tensor: loading tensor blk.6.attn_norm.weight
create_tensor: loading tensor blk.6.attn_q.weight
create_tensor: loading tensor blk.6.attn_k.weight
create_tensor: loading tensor blk.6.attn_v.weight
create_tensor: loading tensor blk.6.attn_output.weight
create_tensor: loading tensor blk.6.ffn_norm.weight
create_tensor: loading tensor blk.6.ffn_gate.weight
create_tensor: loading tensor blk.6.ffn_down.weight
create_tensor: loading tensor blk.6.ffn_up.weight
create_tensor: loading tensor blk.7.attn_norm.weight
create_tensor: loading tensor blk.7.attn_q.weight
create_tensor: loading tensor blk.7.attn_k.weight
create_tensor: loading tensor blk.7.attn_v.weight
create_tensor: loading tensor blk.7.attn_output.weight
create_tensor: loading tensor blk.7.ffn_norm.weight
create_tensor: loading tensor blk.7.ffn_gate.weight
create_tensor: loading tensor blk.7.ffn_down.weight
create_tensor: loading tensor blk.7.ffn_up.weight
create_tensor: loading tensor blk.8.attn_norm.weight
create_tensor: loading tensor blk.8.attn_q.weight
create_tensor: loading tensor blk.8.attn_k.weight
create_tensor: loading tensor blk.8.attn_v.weight
create_tensor: loading tensor blk.8.attn_output.weight
create_tensor: loading tensor blk.8.ffn_norm.weight
create_tensor: loading tensor blk.8.ffn_gate.weight
create_tensor: loading tensor blk.8.ffn_down.weight
create_tensor: loading tensor blk.8.ffn_up.weight
create_tensor: loading tensor blk.9.attn_norm.weight
create_tensor: loading tensor blk.9.attn_q.weight
create_tensor: loading tensor blk.9.attn_k.weight
create_tensor: loading tensor blk.9.attn_v.weight
create_tensor: loading tensor blk.9.attn_output.weight
create_tensor: loading tensor blk.9.ffn_norm.weight
create_tensor: loading tensor blk.9.ffn_gate.weight
create_tensor: loading tensor blk.9.ffn_down.weight
create_tensor: loading tensor blk.9.ffn_up.weight
create_tensor: loading tensor blk.10.attn_norm.weight
create_tensor: loading tensor blk.10.attn_q.weight
create_tensor: loading tensor blk.10.attn_k.weight
create_tensor: loading tensor blk.10.attn_v.weight
create_tensor: loading tensor blk.10.attn_output.weight
create_tensor: loading tensor blk.10.ffn_norm.weight
create_tensor: loading tensor blk.10.ffn_gate.weight
create_tensor: loading tensor blk.10.ffn_down.weight
create_tensor: loading tensor blk.10.ffn_up.weight
create_tensor: loading tensor blk.11.attn_norm.weight
create_tensor: loading tensor blk.11.attn_q.weight
create_tensor: loading tensor blk.11.attn_k.weight
create_tensor: loading tensor blk.11.attn_v.weight
create_tensor: loading tensor blk.11.attn_output.weight
create_tensor: loading tensor blk.11.ffn_norm.weight
create_tensor: loading tensor blk.11.ffn_gate.weight
create_tensor: loading tensor blk.11.ffn_down.weight
create_tensor: loading tensor blk.11.ffn_up.weight
create_tensor: loading tensor blk.12.attn_norm.weight
create_tensor: loading tensor blk.12.attn_q.weight
create_tensor: loading tensor blk.12.attn_k.weight
create_tensor: loading tensor blk.12.attn_v.weight
create_tensor: loading tensor blk.12.attn_output.weight
create_tensor: loading tensor blk.12.ffn_norm.weight
create_tensor: loading tensor blk.12.ffn_gate.weight
create_tensor: loading tensor blk.12.ffn_down.weight
create_tensor: loading tensor blk.12.ffn_up.weight
create_tensor: loading tensor blk.13.attn_norm.weight
create_tensor: loading tensor blk.13.attn_q.weight
create_tensor: loading tensor blk.13.attn_k.weight
create_tensor: loading tensor blk.13.attn_v.weight
create_tensor: loading tensor blk.13.attn_output.weight
create_tensor: loading tensor blk.13.ffn_norm.weight
create_tensor: loading tensor blk.13.ffn_gate.weight
create_tensor: loading tensor blk.13.ffn_down.weight
create_tensor: loading tensor blk.13.ffn_up.weight
create_tensor: loading tensor blk.14.attn_norm.weight
create_tensor: loading tensor blk.14.attn_q.weight
create_tensor: loading tensor blk.14.attn_k.weight
create_tensor: loading tensor blk.14.attn_v.weight
create_tensor: loading tensor blk.14.attn_output.weight
create_tensor: loading tensor blk.14.ffn_norm.weight
create_tensor: loading tensor blk.14.ffn_gate.weight
create_tensor: loading tensor blk.14.ffn_down.weight
create_tensor: loading tensor blk.14.ffn_up.weight
create_tensor: loading tensor blk.15.attn_norm.weight
create_tensor: loading tensor blk.15.attn_q.weight
create_tensor: loading tensor blk.15.attn_k.weight
create_tensor: loading tensor blk.15.attn_v.weight
create_tensor: loading tensor blk.15.attn_output.weight
create_tensor: loading tensor blk.15.ffn_norm.weight
create_tensor: loading tensor blk.15.ffn_gate.weight
create_tensor: loading tensor blk.15.ffn_down.weight
create_tensor: loading tensor blk.15.ffn_up.weight
create_tensor: loading tensor blk.16.attn_norm.weight
create_tensor: loading tensor blk.16.attn_q.weight
create_tensor: loading tensor blk.16.attn_k.weight
create_tensor: loading tensor blk.16.attn_v.weight
create_tensor: loading tensor blk.16.attn_output.weight
create_tensor: loading tensor blk.16.ffn_norm.weight
create_tensor: loading tensor blk.16.ffn_gate.weight
create_tensor: loading tensor blk.16.ffn_down.weight
create_tensor: loading tensor blk.16.ffn_up.weight
create_tensor: loading tensor blk.17.attn_norm.weight
create_tensor: loading tensor blk.17.attn_q.weight
create_tensor: loading tensor blk.17.attn_k.weight
create_tensor: loading tensor blk.17.attn_v.weight
create_tensor: loading tensor blk.17.attn_output.weight
create_tensor: loading tensor blk.17.ffn_norm.weight
create_tensor: loading tensor blk.17.ffn_gate.weight
create_tensor: loading tensor blk.17.ffn_down.weight
create_tensor: loading tensor blk.17.ffn_up.weight
create_tensor: loading tensor blk.18.attn_norm.weight
create_tensor: loading tensor blk.18.attn_q.weight
create_tensor: loading tensor blk.18.attn_k.weight
create_tensor: loading tensor blk.18.attn_v.weight
create_tensor: loading tensor blk.18.attn_output.weight
create_tensor: loading tensor blk.18.ffn_norm.weight
create_tensor: loading tensor blk.18.ffn_gate.weight
create_tensor: loading tensor blk.18.ffn_down.weight
create_tensor: loading tensor blk.18.ffn_up.weight
create_tensor: loading tensor blk.19.attn_norm.weight
create_tensor: loading tensor blk.19.attn_q.weight
create_tensor: loading tensor blk.19.attn_k.weight
create_tensor: loading tensor blk.19.attn_v.weight
create_tensor: loading tensor blk.19.attn_output.weight
create_tensor: loading tensor blk.19.ffn_norm.weight
create_tensor: loading tensor blk.19.ffn_gate.weight
create_tensor: loading tensor blk.19.ffn_down.weight
create_tensor: loading tensor blk.19.ffn_up.weight
create_tensor: loading tensor blk.20.attn_norm.weight
create_tensor: loading tensor blk.20.attn_q.weight
create_tensor: loading tensor blk.20.attn_k.weight
create_tensor: loading tensor blk.20.attn_v.weight
create_tensor: loading tensor blk.20.attn_output.weight
create_tensor: loading tensor blk.20.ffn_norm.weight
create_tensor: loading tensor blk.20.ffn_gate.weight
create_tensor: loading tensor blk.20.ffn_down.weight
create_tensor: loading tensor blk.20.ffn_up.weight
create_tensor: loading tensor blk.21.attn_norm.weight
create_tensor: loading tensor blk.21.attn_q.weight
create_tensor: loading tensor blk.21.attn_k.weight
create_tensor: loading tensor blk.21.attn_v.weight
create_tensor: loading tensor blk.21.attn_output.weight
create_tensor: loading tensor blk.21.ffn_norm.weight
create_tensor: loading tensor blk.21.ffn_gate.weight
create_tensor: loading tensor blk.21.ffn_down.weight
create_tensor: loading tensor blk.21.ffn_up.weight
create_tensor: loading tensor blk.22.attn_norm.weight
create_tensor: loading tensor blk.22.attn_q.weight
create_tensor: loading tensor blk.22.attn_k.weight
create_tensor: loading tensor blk.22.attn_v.weight
create_tensor: loading tensor blk.22.attn_output.weight
create_tensor: loading tensor blk.22.ffn_norm.weight
create_tensor: loading tensor blk.22.ffn_gate.weight
create_tensor: loading tensor blk.22.ffn_down.weight
create_tensor: loading tensor blk.22.ffn_up.weight
create_tensor: loading tensor blk.23.attn_norm.weight
create_tensor: loading tensor blk.23.attn_q.weight
create_tensor: loading tensor blk.23.attn_k.weight
create_tensor: loading tensor blk.23.attn_v.weight
create_tensor: loading tensor blk.23.attn_output.weight
create_tensor: loading tensor blk.23.ffn_norm.weight
create_tensor: loading tensor blk.23.ffn_gate.weight
create_tensor: loading tensor blk.23.ffn_down.weight
create_tensor: loading tensor blk.23.ffn_up.weight
create_tensor: loading tensor blk.24.attn_norm.weight
create_tensor: loading tensor blk.24.attn_q.weight
create_tensor: loading tensor blk.24.attn_k.weight
create_tensor: loading tensor blk.24.attn_v.weight
create_tensor: loading tensor blk.24.attn_output.weight
create_tensor: loading tensor blk.24.ffn_norm.weight
create_tensor: loading tensor blk.24.ffn_gate.weight
create_tensor: loading tensor blk.24.ffn_down.weight
create_tensor: loading tensor blk.24.ffn_up.weight
create_tensor: loading tensor blk.25.attn_norm.weight
create_tensor: loading tensor blk.25.attn_q.weight
create_tensor: loading tensor blk.25.attn_k.weight
create_tensor: loading tensor blk.25.attn_v.weight
create_tensor: loading tensor blk.25.attn_output.weight
create_tensor: loading tensor blk.25.ffn_norm.weight
create_tensor: loading tensor blk.25.ffn_gate.weight
create_tensor: loading tensor blk.25.ffn_down.weight
create_tensor: loading tensor blk.25.ffn_up.weight
create_tensor: loading tensor blk.26.attn_norm.weight
create_tensor: loading tensor blk.26.attn_q.weight
create_tensor: loading tensor blk.26.attn_k.weight
create_tensor: loading tensor blk.26.attn_v.weight
create_tensor: loading tensor blk.26.attn_output.weight
create_tensor: loading tensor blk.26.ffn_norm.weight
create_tensor: loading tensor blk.26.ffn_gate.weight
create_tensor: loading tensor blk.26.ffn_down.weight
create_tensor: loading tensor blk.26.ffn_up.weight
create_tensor: loading tensor blk.27.attn_norm.weight
create_tensor: loading tensor blk.27.attn_q.weight
create_tensor: loading tensor blk.27.attn_k.weight
create_tensor: loading tensor blk.27.attn_v.weight
create_tensor: loading tensor blk.27.attn_output.weight
create_tensor: loading tensor blk.27.ffn_norm.weight
create_tensor: loading tensor blk.27.ffn_gate.weight
create_tensor: loading tensor blk.27.ffn_down.weight
create_tensor: loading tensor blk.27.ffn_up.weight
create_tensor: loading tensor blk.28.attn_norm.weight
create_tensor: loading tensor blk.28.attn_q.weight
create_tensor: loading tensor blk.28.attn_k.weight
create_tensor: loading tensor blk.28.attn_v.weight
create_tensor: loading tensor blk.28.attn_output.weight
create_tensor: loading tensor blk.28.ffn_norm.weight
create_tensor: loading tensor blk.28.ffn_gate.weight
create_tensor: loading tensor blk.28.ffn_down.weight
create_tensor: loading tensor blk.28.ffn_up.weight
create_tensor: loading tensor blk.29.attn_norm.weight
create_tensor: loading tensor blk.29.attn_q.weight
create_tensor: loading tensor blk.29.attn_k.weight
create_tensor: loading tensor blk.29.attn_v.weight
create_tensor: loading tensor blk.29.attn_output.weight
create_tensor: loading tensor blk.29.ffn_norm.weight
create_tensor: loading tensor blk.29.ffn_gate.weight
create_tensor: loading tensor blk.29.ffn_down.weight
create_tensor: loading tensor blk.29.ffn_up.weight
create_tensor: loading tensor blk.30.attn_norm.weight
create_tensor: loading tensor blk.30.attn_q.weight
create_tensor: loading tensor blk.30.attn_k.weight
create_tensor: loading tensor blk.30.attn_v.weight
create_tensor: loading tensor blk.30.attn_output.weight
create_tensor: loading tensor blk.30.ffn_norm.weight
create_tensor: loading tensor blk.30.ffn_gate.weight
create_tensor: loading tensor blk.30.ffn_down.weight
create_tensor: loading tensor blk.30.ffn_up.weight
create_tensor: loading tensor blk.31.attn_norm.weight
create_tensor: loading tensor blk.31.attn_q.weight
create_tensor: loading tensor blk.31.attn_k.weight
create_tensor: loading tensor blk.31.attn_v.weight
create_tensor: loading tensor blk.31.attn_output.weight
create_tensor: loading tensor blk.31.ffn_norm.weight
create_tensor: loading tensor blk.31.ffn_gate.weight
create_tensor: loading tensor blk.31.ffn_down.weight
create_tensor: loading tensor blk.31.ffn_up.weight
load_tensors: offloading output layer to GPU
load_tensors: offloading 31 repeating layers to GPU
load_tensors: offloaded 33/33 layers to GPU
load_tensors:    OPENVINO0 model buffer size =  4755.42 MiB
load_tensors: OPENVINO0_HOST model buffer size =   501.24 MiB
load_all_data: device OPENVINO0 does not support async, host buffers or events
......................................................................................load_all_data: buffer type OPENVINO0_HOST is not the default buffer type for device OPENVINO0 for async uploads
.
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 512
llama_context: n_ctx_seq     = 512
llama_context: n_batch       = 512
llama_context: n_ubatch      = 512
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (512) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: OPENVINO0_HOST  output buffer size =     0.49 MiB
llama_kv_cache: layer   0: dev = OPENVINO0
llama_kv_cache: layer   1: dev = OPENVINO0
llama_kv_cache: layer   2: dev = OPENVINO0
llama_kv_cache: layer   3: dev = OPENVINO0
llama_kv_cache: layer   4: dev = OPENVINO0
llama_kv_cache: layer   5: dev = OPENVINO0
llama_kv_cache: layer   6: dev = OPENVINO0
llama_kv_cache: layer   7: dev = OPENVINO0
llama_kv_cache: layer   8: dev = OPENVINO0
llama_kv_cache: layer   9: dev = OPENVINO0
llama_kv_cache: layer  10: dev = OPENVINO0
llama_kv_cache: layer  11: dev = OPENVINO0
llama_kv_cache: layer  12: dev = OPENVINO0
llama_kv_cache: layer  13: dev = OPENVINO0
llama_kv_cache: layer  14: dev = OPENVINO0
llama_kv_cache: layer  15: dev = OPENVINO0
llama_kv_cache: layer  16: dev = OPENVINO0
llama_kv_cache: layer  17: dev = OPENVINO0
llama_kv_cache: layer  18: dev = OPENVINO0
llama_kv_cache: layer  19: dev = OPENVINO0
llama_kv_cache: layer  20: dev = OPENVINO0
llama_kv_cache: layer  21: dev = OPENVINO0
llama_kv_cache: layer  22: dev = OPENVINO0
llama_kv_cache: layer  23: dev = OPENVINO0
llama_kv_cache: layer  24: dev = OPENVINO0
llama_kv_cache: layer  25: dev = OPENVINO0
llama_kv_cache: layer  26: dev = OPENVINO0
llama_kv_cache: layer  27: dev = OPENVINO0
llama_kv_cache: layer  28: dev = OPENVINO0
llama_kv_cache: layer  29: dev = OPENVINO0
llama_kv_cache: layer  30: dev = OPENVINO0
llama_kv_cache: layer  31: dev = OPENVINO0
llama_kv_cache:  OPENVINO0 KV buffer size =    64.00 MiB
llama_kv_cache: size =   64.00 MiB (   512 cells,  32 layers,  1/1 seqs), K (f16):   32.00 MiB, V (f16):   32.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 2336
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 512, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  512, n_seqs =  1, n_outputs =  512
sched_reserve:  OPENVINO0 compute buffer size =   258.50 MiB
sched_reserve: OPENVINO0_HOST compute buffer size =     9.01 MiB
sched_reserve: graph nodes  = 999
sched_reserve: graph splits = 1
sched_reserve: reserve took 71.53 ms, sched copies = 1
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   | 100 |  1 |           pp512 |     2241.38 + 131.05 |
llama_perf_context_print:        load time =   71746.34 ms
llama_perf_context_print: prompt eval time =       0.00 ms /  3072 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /     1 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =   72899.41 ms /  3073 tokens
llama_perf_context_print:    graphs reused =          5
~llama_context:  OPENVINO0 compute buffer size is 258.5000 MiB, matches expectation of 258.5000 MiB
~llama_context: OPENVINO0_HOST compute buffer size is   9.0137 MiB, matches expectation of   9.0137 MiB
llama_context: constructing llama_context
llama_context: n_seq_max     = 1
llama_context: n_ctx         = 256
llama_context: n_ctx_seq     = 256
llama_context: n_batch       = 128
llama_context: n_ubatch      = 128
llama_context: causal_attn   = 1
llama_context: flash_attn    = enabled
llama_context: kv_unified    = false
llama_context: freq_base     = 500000.0
llama_context: freq_scale    = 1
llama_context: n_ctx_seq (256) < n_ctx_train (131072) -- the full capacity of the model will not be utilized
set_abort_callback: call
llama_context: OPENVINO0_HOST  output buffer size =     0.49 MiB
llama_kv_cache: layer   0: dev = OPENVINO0
llama_kv_cache: layer   1: dev = OPENVINO0
llama_kv_cache: layer   2: dev = OPENVINO0
llama_kv_cache: layer   3: dev = OPENVINO0
llama_kv_cache: layer   4: dev = OPENVINO0
llama_kv_cache: layer   5: dev = OPENVINO0
llama_kv_cache: layer   6: dev = OPENVINO0
llama_kv_cache: layer   7: dev = OPENVINO0
llama_kv_cache: layer   8: dev = OPENVINO0
llama_kv_cache: layer   9: dev = OPENVINO0
llama_kv_cache: layer  10: dev = OPENVINO0
llama_kv_cache: layer  11: dev = OPENVINO0
llama_kv_cache: layer  12: dev = OPENVINO0
llama_kv_cache: layer  13: dev = OPENVINO0
llama_kv_cache: layer  14: dev = OPENVINO0
llama_kv_cache: layer  15: dev = OPENVINO0
llama_kv_cache: layer  16: dev = OPENVINO0
llama_kv_cache: layer  17: dev = OPENVINO0
llama_kv_cache: layer  18: dev = OPENVINO0
llama_kv_cache: layer  19: dev = OPENVINO0
llama_kv_cache: layer  20: dev = OPENVINO0
llama_kv_cache: layer  21: dev = OPENVINO0
llama_kv_cache: layer  22: dev = OPENVINO0
llama_kv_cache: layer  23: dev = OPENVINO0
llama_kv_cache: layer  24: dev = OPENVINO0
llama_kv_cache: layer  25: dev = OPENVINO0
llama_kv_cache: layer  26: dev = OPENVINO0
llama_kv_cache: layer  27: dev = OPENVINO0
llama_kv_cache: layer  28: dev = OPENVINO0
llama_kv_cache: layer  29: dev = OPENVINO0
llama_kv_cache: layer  30: dev = OPENVINO0
llama_kv_cache: layer  31: dev = OPENVINO0
llama_kv_cache:  OPENVINO0 KV buffer size =    32.00 MiB
llama_kv_cache: size =   32.00 MiB (   256 cells,  32 layers,  1/1 seqs), K (f16):   16.00 MiB, V (f16):   16.00 MiB
llama_context: enumerating backends
llama_context: backend_ptrs.size() = 2
sched_reserve: reserving ...
sched_reserve: max_nodes = 2336
sched_reserve: reserving full memory module
sched_reserve: worst-case: n_tokens = 128, n_seqs = 1, n_outputs = 1
graph_reserve: reserving a graph for ubatch with n_tokens =  128, n_seqs =  1, n_outputs =  128
graph_reserve: reserving a graph for ubatch with n_tokens =    1, n_seqs =  1, n_outputs =    1
graph_reserve: reserving a graph for ubatch with n_tokens =  128, n_seqs =  1, n_outputs =  128
sched_reserve:  OPENVINO0 compute buffer size =    64.62 MiB
sched_reserve: OPENVINO0_HOST compute buffer size =     2.13 MiB
sched_reserve: graph nodes  = 999
sched_reserve: graph splits = 1
sched_reserve: reserve took 30.06 ms, sched copies = 1
attach_threadpool: call
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
set_n_threads: n_threads = 36, n_threads_batch = 36
| llama 8B Q4_K - Medium         |   4.58 GiB |     8.03 B | OPENVINO   | 100 |  1 |           tg128 |          8.01 + 0.08 |
llama_perf_context_print:        load time =   73603.48 ms
llama_perf_context_print: prompt eval time =       0.00 ms /     1 tokens (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:        eval time =       0.00 ms /   641 runs   (    0.00 ms per token,      inf tokens per second)
llama_perf_context_print:       total time =  153525.08 ms /   642 tokens
llama_perf_context_print:    graphs reused =        640
~llama_context:  OPENVINO0 compute buffer size is  64.6250 MiB, matches expectation of  64.6250 MiB
~llama_context: OPENVINO0_HOST compute buffer size is   2.1284 MiB, matches expectation of   2.1284 MiB

build: 5237965bb (8508)

@WizardlyBump17

Hello. The default OpenVINO version in .devops/openvino.Dockerfile is broken: there is no 2026.0.0 package on the storage server.

[1/5] STEP 7/13: RUN mkdir -p /opt/intel &&     wget https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz &&     tar -xf openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz &&     mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} &&     cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} &&     echo "Y" | ./install_dependencies/install_openvino_dependencies.sh &&     cd - &&     ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino
--2026-03-13 18:23:09--  https://storage.openvinotoolkit.org/repositories/openvino/packages/2026.0.0/linux/openvino_toolkit_ubuntu24_2026.0.0.20965.c6d6a13a886_x86_64.tgz
Resolving storage.openvinotoolkit.org (storage.openvinotoolkit.org)... 2600:9000:28b5:fe00:8:6691:6300:93a1, 2600:9000:28b5:6000:8:6691:6300:93a1, 2600:9000:28b5:8400:8:6691:6300:93a1, ...
Connecting to storage.openvinotoolkit.org (storage.openvinotoolkit.org)|2600:9000:28b5:fe00:8:6691:6300:93a1|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1061 (1.0K) [text/html]
Saving to: 'openvino_toolkit_ubuntu24_2026.0.0.20965.c6d6a13a886_x86_64.tgz'

     0K .                                                     100% 1.90G=0s

2026-03-13 18:23:11 (1.90 GB/s) - 'openvino_toolkit_ubuntu24_2026.0.0.20965.c6d6a13a886_x86_64.tgz' saved [1061/1061]


gzip: stdin: not in gzip format
tar: Child returned status 1
tar: Error is not recoverable: exiting now
Error: building at STEP "RUN mkdir -p /opt/intel &&     wget https://storage.openvinotoolkit.org/repositories/openvino/packages/${OPENVINO_VERSION_MAJOR}/linux/openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz &&     tar -xf openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64.tgz &&     mv openvino_toolkit_ubuntu24_${OPENVINO_VERSION_FULL}_x86_64 /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} &&     cd /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} &&     echo "Y" | ./install_dependencies/install_openvino_dependencies.sh &&     cd - &&     ln -s /opt/intel/openvino_${OPENVINO_VERSION_MAJOR} /opt/intel/openvino": while running runtime: exit status 2
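The failure above is subtle: the server answered 200 OK with a 1 KiB HTML page (Length: 1061), so wget saved that page under the .tgz name and tar then failed with "not in gzip format". A hedged sketch (the filename here is synthetic, for illustration only) of a guard the Dockerfile's RUN step could apply before extracting: check the gzip magic bytes of the downloaded file.

```python
# Sketch: refuse to extract a "tarball" that is really an HTML error page.
# A gzip stream always starts with the two magic bytes 0x1f 0x8b.
from pathlib import Path

def looks_like_gzip(path: str) -> bool:
    with open(path, "rb") as f:
        return f.read(2) == b"\x1f\x8b"

# Simulate the failing download: the server returned HTML, not an archive.
Path("artifact.tgz").write_bytes(b"<html>Not Found</html>")

if looks_like_gzip("artifact.tgz"):
    print("gzip archive: ok to extract")
else:
    print("not a gzip archive: refusing to extract")

Path("artifact.tgz").unlink()
```

With such a check, the build would fail at the download step with a clear message instead of inside tar.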

@WizardlyBump17

I tried it on my B580 inside a container, but llama.cpp didn't detect my GPU:

davi@davi:/tmp/a/llama.cpp$ podman build --file=.devops/openvino.Dockerfile --tag=test --target=full --env=GGML_OPENVINO_DEVICE=GPU --device=/dev/dri/renderD128 .

davi@davi:/tmp/a/llama.cpp$ podman run --interactive --tty --device=/dev/dri/renderD128 --entrypoint=/bin/bash --volume=/home/davi/AI/models/:/models/ --publish=1234:8080 --env=GGML_OPENVINO_DEVICE=GPU localhost/test:latest 
root@c5e2d71d4e5e:/app# ./llama-bench --list-devices
GGML OpenVINO Backend: device GPU is not available, fallback to CPU
OpenVINO: using device CPU
Available devices:
  OPENVINO0: OpenVINO Runtime (30976 MiB, 30976 MiB free)
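To debug device visibility independently of llama.cpp, a small sketch (assuming the openvino Python package is available; guarded so it degrades gracefully when it is not) that asks the OpenVINO runtime which devices it can actually see inside the container:

```python
# Minimal sketch: list the devices the OpenVINO Core can enumerate.
# If no GPU entry appears here, the backend's fallback to CPU above is expected.
try:
    import openvino as ov
    devices = ov.Core().available_devices  # e.g. ['CPU'] or ['CPU', 'GPU']
except ImportError:
    devices = []  # OpenVINO Python package not installed in this environment

print("OpenVINO devices:", devices)
print("GPU visible:", any(d.startswith("GPU") for d in devices))
```

If the GPU is missing from this list, the problem is at the driver/container level (device nodes and drivers) rather than in llama.cpp.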

@ravi9
Contributor

ravi9 commented Mar 13, 2026

Hi @WizardlyBump17,
Apologies for the lack of clarity: GPU drivers need to be installed inside the container for the GPU to be detected.
We plan to include the GPU and NPU drivers in the Docker image in the next PR.

If you plan to test it now, use this openvino.Dockerfile

You could also verify that the GPU is detected by running clinfo inside the container.

@savvadesogle

Hi @ravi9
I may be wrong, but isn't installing just intel-opencl-icd enough for the GPU to be detected in a container?

@savvadesogle


Thanks to Zijun Yu, Ravi Panchumarthy, Su Yang, Mustafa Cavus, Arshath, Xuejun Zhai, Yamini Nimmagadda, and Wang Yang, you've done such a great job!

And thanks to reviewers Sigbjørn Skjæret, Georgi Gerganov, and Daniel Bevenius for their strict supervision!

And please don't be offended if I missed anyone, you're all amazing!!!


Labels

devops: improvements to build systems and github actions
documentation: improvements or additions to documentation
ggml: changes relating to the ggml tensor library for machine learning
testing: everything test related


Development

Successfully merging this pull request may close these issues.

Feature Request: OpenVINO backend support request