From e67b3aa29796727c979903dd6e1341386942bee0 Mon Sep 17 00:00:00 2001 From: Todd Malsbary Date: Mon, 11 May 2026 14:49:49 -0700 Subject: [PATCH 1/5] Tidy up SYCL doc a bit - Add explicit links to referenced items - Fix spelling errors Signed-off-by: Todd Malsbary --- docs/backend/SYCL.md | 20 ++++++++++---------- 1 file changed, 10 insertions(+), 10 deletions(-) diff --git a/docs/backend/SYCL.md b/docs/backend/SYCL.md index f66facc856a7..6e28f156bbb3 100644 --- a/docs/backend/SYCL.md +++ b/docs/backend/SYCL.md @@ -43,11 +43,11 @@ The following releases are verified and recommended: ### Ubuntu 24.04 -The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to release.yml: ubuntu-24-sycl -> Download & Install oneAPI. +The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to [.github/workflows/release.yml#L713](../../.github/workflows/release.yml#L713): ubuntu-24-sycl -> Download & Install oneAPI. -It is recommended to use them with Intel Docker. +It is recommended to use them with [Intel Docker](https://hub.docker.com/r/intel/deep-learning-essentials). -The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it acording to the test result. +The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it according to the test result. 
## News @@ -190,7 +190,7 @@ docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/d - **Intel GPU** -Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps). +Intel data center GPUs drivers installation guide and download page can be found here: [Get Intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps). *Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html). @@ -240,7 +240,7 @@ Please follow the instructions for downloading and installing the Toolkit for Li Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable. -Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs. +Upon a successful installation, SYCL is enabled for the available Intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs. |Verified release| |-| @@ -319,7 +319,7 @@ Similar to the native `sycl-ls`, available SYCL devices can be queried as follow ./build/bin/llama-ls-sycl-device ``` -This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following: +This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. 
For example, in a system with 2 *Intel GPU* it would look like the following: ``` found 2 SYCL devices: @@ -465,7 +465,7 @@ In the oneAPI command line, run the following to print the available SYCL device sycl-ls.exe ``` -There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device: +There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *Intel Iris Xe* GPU as a Level-zero SYCL device: Output (example): ``` @@ -731,7 +731,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512 |-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------| | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG | | GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.| -| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) | +| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for Intel devices older than Gen 10) | | GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. | | GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. | | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory.
Recommended to use when --split-mode = layer | @@ -773,8 +773,8 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo - `Split-mode:[row]` is not supported. -- Missed the AOT (Ahead-of-Time) in buiding. - - Good: build quickly, smaller size of binary file. +- Missed the AOT (Ahead-of-Time) in building. + - Good: Builds quickly, smaller size of binary file. - Bad: The startup is slow (JIT) in first time, but subsequent performance is unaffected. ## Q&A From 4bf20d90e74405967748b43df051a92cd7c09a0c Mon Sep 17 00:00:00 2001 From: Todd Malsbary Date: Mon, 11 May 2026 14:51:16 -0700 Subject: [PATCH 2/5] Correct documented default for GGML_SYCL_GRAPH The default is ON, not OFF: $ cmake -LAH -B build | grep GGML_SYCL_GRAPH ... GGML_SYCL_GRAPH:BOOL=ON Signed-off-by: Todd Malsbary --- docs/backend/SYCL.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/backend/SYCL.md b/docs/backend/SYCL.md index 6e28f156bbb3..1bcb4a65a6fd 100644 --- a/docs/backend/SYCL.md +++ b/docs/backend/SYCL.md @@ -717,7 +717,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512 | GGML_SYCL_TARGET | INTEL *(default)* | Set the SYCL target device type. | | GGML_SYCL_DEVICE_ARCH | Optional | Set the SYCL device architecture. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. | | GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) | -| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). | +| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). 
| | GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. | | GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. | | CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. | From e25af0b981bd199bb8e8def246c5f88caacc1683 Mon Sep 17 00:00:00 2001 From: Todd Malsbary Date: Tue, 12 May 2026 10:07:17 -0700 Subject: [PATCH 3/5] Move docker instructions from SYCL.md to docker.md This makes them directly accessible from the Quick Start section of the top-level README.md. Signed-off-by: Todd Malsbary --- docs/backend/SYCL.md | 30 +----------------------------- docs/docker.md | 44 ++++++++++++++++++++++++++++++++++++++++++++ 2 files changed, 45 insertions(+), 29 deletions(-) diff --git a/docs/backend/SYCL.md b/docs/backend/SYCL.md index 1bcb4a65a6fd..ec6086a995fe 100644 --- a/docs/backend/SYCL.md +++ b/docs/backend/SYCL.md @@ -152,35 +152,7 @@ NA ## Docker -The docker build option is currently limited to *Intel GPU* targets. - -### Build image - -```sh -# Using FP32 -docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=OFF" --target light -f .devops/intel.Dockerfile . - -# Using FP16 -docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=ON" --target light -f .devops/intel.Dockerfile . -``` - -*Notes*: - -You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative. -Check the [documentation for Docker](../docker.md) to see the available images. - -### Run container - -```sh -# First, find all the DRI cards -ls -la /dev/dri -# Then, pick the card that you want to use (here for e.g. /dev/dri/card1). 
-docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0 -``` - -*Notes:* -- Docker has been tested successfully on native Linux. WSL support has not been verified yet. -- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*. +Please refer to [Docker with SYCL](../docker.md#docker-with-sycl) for details. ## Linux diff --git a/docs/docker.md b/docs/docker.md index 7f99bfaad628..cd6cd9806008 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -140,3 +140,47 @@ docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1 docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1 ``` + +## Docker With SYCL + +## Building Docker locally + +```bash +docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile . +docker build -t local/llama.cpp:light-intel --target light -f .devops/intel.Dockerfile . +docker build -t local/llama.cpp:server-intel --target server -f .devops/intel.Dockerfile . +``` + +You may want to pass in some different `ARGS`, depending on the SYCL environment supported by your container host, as well as the GPU architecture. 
+ +The defaults are: + +- `GGML_SYCL_F16` set to `OFF` +- `IGC_VERSION` set to `v2.30.1` +- `IGC_VERSION_FULL` set to `2_2.30.1+20950` +- `COMPUTE_RUNTIME_VERSION` set to `26.09.37435.1` +- `COMPUTE_RUNTIME_VERSION_FULL` set to `26.09.37435.1-0` +- `IGDGMM_VERSION` set to `22.9.0` + +The resulting images, are essentially the same as the non-SYCL images: + +1. `local/llama.cpp:full-intel`: This image includes both the `llama-cli` and `llama-completion` executables and the tools to convert LLaMA models into ggml and convert into 4-bit quantization. +2. `local/llama.cpp:light-intel`: This image only includes the `llama-cli` and `llama-completion` executables. +3. `local/llama.cpp:server-intel`: This image only includes the `llama-server` executable. + +## Usage + +After building locally, usage is similar to the non-SYCL examples, but you'll need to add the `--device` flag. + +```bash +# First, find all the DRI cards +ls -la /dev/dri +# Then, pick the card that you want to use (here for e.g. /dev/dri/card0). +docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:full-intel -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 99 +docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:light-intel -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 99 +docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:server-intel -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 99 +``` + +*Notes:* +- Docker has been tested successfully on native Linux. WSL support has not been verified yet. 
+- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](./backend/SYCL.md#linux) for details)*. From 0e8aed12df8c838630525e3d3ae6de82af2dda07 Mon Sep 17 00:00:00 2001 From: Todd Malsbary Date: Mon, 1 Jun 2026 17:21:52 +0000 Subject: [PATCH 4/5] Refer to intel.Dockerfile for ARGs and their defaults The defaults are always changing; this avoids accuracy errors from duplicating the information. Signed-off-by: Todd Malsbary --- docs/docker.md | 10 +--------- 1 file changed, 1 insertion(+), 9 deletions(-) diff --git a/docs/docker.md b/docs/docker.md index cd6cd9806008..b1c6c1f6f9f8 100644 --- a/docs/docker.md +++ b/docs/docker.md @@ -152,15 +152,7 @@ docker build -t local/llama.cpp:server-intel --target server -f .devops/intel.Do ``` You may want to pass in some different `ARGS`, depending on the SYCL environment supported by your container host, as well as the GPU architecture. - -The defaults are: - -- `GGML_SYCL_F16` set to `OFF` -- `IGC_VERSION` set to `v2.30.1` -- `IGC_VERSION_FULL` set to `2_2.30.1+20950` -- `COMPUTE_RUNTIME_VERSION` set to `26.09.37435.1` -- `COMPUTE_RUNTIME_VERSION_FULL` set to `26.09.37435.1-0` -- `IGDGMM_VERSION` set to `22.9.0` +Refer to [.devops/intel.Dockerfile](../.devops/intel.Dockerfile) for the available `ARGS` and their defaults. The resulting images, are essentially the same as the non-SYCL images: From 8e0b2f3ed96af7844c6c4f84cdfb71ed2114edcb Mon Sep 17 00:00:00 2001 From: Todd Malsbary Date: Mon, 1 Jun 2026 17:23:07 +0000 Subject: [PATCH 5/5] Remove mention of Nvidia in SYCL row of backend table This support was removed in 2026.02 - refer to the SYCL.md News. 
Signed-off-by: Todd Malsbary --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index de93f17eb773..a823c5b18283 100644 --- a/README.md +++ b/README.md @@ -279,7 +279,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo | [Metal](docs/build.md#metal-build) | Apple Silicon | | [BLAS](docs/build.md#blas-build) | All | | [BLIS](docs/backend/BLIS.md) | All | -| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU | +| [SYCL](docs/backend/SYCL.md) | Intel GPU | | [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs | | [MUSA](docs/build.md#musa) | Moore Threads GPU | | [CUDA](docs/build.md#cuda) | Nvidia GPU |