diff --git a/README.md b/README.md
index de93f17eb773..a823c5b18283 100644
--- a/README.md
+++ b/README.md
@@ -279,7 +279,7 @@ Instructions for adding support for new models: [HOWTO-add-model.md](docs/develo
 | [Metal](docs/build.md#metal-build) | Apple Silicon |
 | [BLAS](docs/build.md#blas-build) | All |
 | [BLIS](docs/backend/BLIS.md) | All |
-| [SYCL](docs/backend/SYCL.md) | Intel and Nvidia GPU |
+| [SYCL](docs/backend/SYCL.md) | Intel GPU |
 | [OpenVINO [In Progress]](docs/backend/OPENVINO.md) | Intel CPUs, GPUs, and NPUs |
 | [MUSA](docs/build.md#musa) | Moore Threads GPU |
 | [CUDA](docs/build.md#cuda) | Nvidia GPU |
diff --git a/docs/backend/SYCL.md b/docs/backend/SYCL.md
index f66facc856a7..ec6086a995fe 100644
--- a/docs/backend/SYCL.md
+++ b/docs/backend/SYCL.md
@@ -43,11 +43,11 @@ The following releases are verified and recommended:

 ### Ubuntu 24.04

-The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to release.yml: ubuntu-24-sycl -> Download & Install oneAPI.
+The release packages for Ubuntu 24.04 x64 (FP32/FP16) only include the binary files of the llama.cpp SYCL backend. They require the target machine to have pre-installed Intel GPU drivers and oneAPI packages that are the same version as the build package. To get the version and installation info, refer to [.github/workflows/release.yml#L713](../../.github/workflows/release.yml#L713): ubuntu-24-sycl -> Download & Install oneAPI.

-It is recommended to use them with Intel Docker.
+It is recommended to use them with [Intel Docker](https://hub.docker.com/r/intel/deep-learning-essentials).

-The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it acording to the test result.
+The packages for FP32 and FP16 would have different accuracy and performance on LLMs. Please choose it according to the test result.

 ## News

@@ -152,35 +152,7 @@ NA

 ## Docker

-The docker build option is currently limited to *Intel GPU* targets.
-
-### Build image
-
-```sh
-# Using FP32
-docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=OFF" --target light -f .devops/intel.Dockerfile .
-
-# Using FP16
-docker build -t llama-cpp-sycl --build-arg="GGML_SYCL_F16=ON" --target light -f .devops/intel.Dockerfile .
-```
-
-*Notes*:
-
-You can also use the `.devops/llama-server-intel.Dockerfile`, which builds the *"server"* alternative.
-Check the [documentation for Docker](../docker.md) to see the available images.
-
-### Run container
-
-```sh
-# First, find all the DRI cards
-ls -la /dev/dri
-# Then, pick the card that you want to use (here for e.g. /dev/dri/card1).
-docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 llama-cpp-sycl -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 400 -e -ngl 33 -c 4096 -s 0
-```
-
-*Notes:*
-- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
-- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](#linux) for details)*.
+Please refer to [Docker with SYCL](../docker.md#docker-with-sycl) for details.

 ## Linux

@@ -190,7 +162,7 @@ docker run -it --rm -v "/path/to/models:/models" --device /dev/dri/renderD128:/d

 - **Intel GPU**

-Intel data center GPUs drivers installation guide and download page can be found here: [Get intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).
+Intel data center GPUs drivers installation guide and download page can be found here: [Get Intel dGPU Drivers](https://dgpu-docs.intel.com/driver/installation.html#ubuntu-install-steps).

 *Note*: for client GPUs *(iGPU & Arc A-Series)*, please refer to the [client iGPU driver installation](https://dgpu-docs.intel.com/driver/client/overview.html).

@@ -240,7 +212,7 @@ Please follow the instructions for downloading and installing the Toolkit for Li

 Following guidelines/code snippets assume the default installation values. Otherwise, please make sure the necessary changes are reflected where applicable.

-Upon a successful installation, SYCL is enabled for the available intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs.
+Upon a successful installation, SYCL is enabled for the available Intel devices, along with relevant libraries such as oneAPI oneDNN for Intel GPUs.

 |Verified release|
 |-|
@@ -319,7 +291,7 @@ Similar to the native `sycl-ls`, available SYCL devices can be queried as follow
 ./build/bin/llama-ls-sycl-device
 ```

-This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *intel GPU* it would look like the following:
+This command will only display the selected backend that is supported by SYCL. The default backend is level_zero. For example, in a system with 2 *Intel GPU* it would look like the following:

 ```
 found 2 SYCL devices:
@@ -465,7 +437,7 @@ In the oneAPI command line, run the following to print the available SYCL device
 sycl-ls.exe
 ```

-There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *intel Iris Xe* GPU as a Level-zero SYCL device:
+There should be one or more *level-zero* GPU devices displayed as **[ext_oneapi_level_zero:gpu]**. Below is example of such output detecting an *Intel Iris Xe* GPU as a Level-zero SYCL device:

 Output (example):
 ```
@@ -717,7 +689,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 | GGML_SYCL_TARGET | INTEL *(default)* | Set the SYCL target device type. |
 | GGML_SYCL_DEVICE_ARCH | Optional | Set the SYCL device architecture. Setting the device architecture can improve the performance. See the table [--offload-arch](https://github.com/intel/llvm/blob/sycl/sycl/doc/design/OffloadDesign.md#--offload-arch) for a list of valid architectures. |
 | GGML_SYCL_F16 | OFF *(default)* \|ON *(optional)* | Enable FP16 build with SYCL code path. (1.) |
-| GGML_SYCL_GRAPH | OFF *(default)* \|ON *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
+| GGML_SYCL_GRAPH | ON *(default)* \|OFF *(Optional)* | Enable build with [SYCL Graph extension](https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc). |
 | GGML_SYCL_DNN | ON *(default)* \|OFF *(Optional)* | Enable build with oneDNN. |
 | GGML_SYCL_HOST_MEM_FALLBACK | ON *(default)* \|OFF *(Optional)* | Allow host memory fallback when device memory is full during quantized weight reorder. Enables inference to continue at reduced speed (reading over PCIe) instead of failing. Requires Linux kernel 6.8+. |
 | CMAKE_C_COMPILER | `icx` *(Linux)*, `icx/cl` *(Windows)* | Set `icx` compiler for SYCL code path. |
@@ -731,7 +703,7 @@ use 1 SYCL GPUs: [0] with Max compute units:512
 |-------------------|------------------|---------------------------------------------------------------------------------------------------------------------------|
 | GGML_SYCL_DEBUG | 0 (default) or 1 | Enable log function by macro: GGML_SYCL_DEBUG |
 | GGML_SYCL_ENABLE_FLASH_ATTN | 1 (default) or 0| Enable Flash-Attention. It can reduce memory usage. The performance impact depends on the LLM.|
-| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for intel devices older than Gen 10) |
+| GGML_SYCL_DISABLE_OPT | 0 (default) or 1 | Disable optimize features for Intel GPUs. (Recommended to 1 for Intel devices older than Gen 10) |
 | GGML_SYCL_DISABLE_GRAPH | 0 or 1 (default) | Disable running computations through SYCL Graphs feature. Disabled by default because SYCL Graph is still on development, no better performance. |
 | GGML_SYCL_DISABLE_DNN | 0 (default) or 1 | Disable running computations through oneDNN and always use oneMKL. |
 | ZES_ENABLE_SYSMAN | 0 (default) or 1 | Support to get free memory of GPU by sycl::aspect::ext_intel_free_memory. Recommended to use when --split-mode = layer |
@@ -773,8 +745,8 @@ Pass these via `CXXFLAGS` or add a one-off `#define` to enable a flag on the spo

 - `Split-mode:[row]` is not supported.

-- Missed the AOT (Ahead-of-Time) in buiding.
-  - Good: build quickly, smaller size of binary file.
+- Missed the AOT (Ahead-of-Time) in building.
+  - Good: Builds quickly, smaller size of binary file.
   - Bad: The startup is slow (JIT) in first time, but subsequent performance is unaffected.

 ## Q&A
diff --git a/docs/docker.md b/docs/docker.md
index 7f99bfaad628..b1c6c1f6f9f8 100644
--- a/docs/docker.md
+++ b/docs/docker.md
@@ -140,3 +140,39 @@ docker run -v /path/to/models:/models local/llama.cpp:full-musa --run -m /models
 docker run -v /path/to/models:/models local/llama.cpp:light-musa -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 1
 docker run -v /path/to/models:/models local/llama.cpp:server-musa -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 1
 ```
+
+## Docker With SYCL
+
+## Building Docker locally
+
+```bash
+docker build -t local/llama.cpp:full-intel --target full -f .devops/intel.Dockerfile .
+docker build -t local/llama.cpp:light-intel --target light -f .devops/intel.Dockerfile .
+docker build -t local/llama.cpp:server-intel --target server -f .devops/intel.Dockerfile .
+```
+
+You may want to pass in some different `ARGS`, depending on the SYCL environment supported by your container host, as well as the GPU architecture.
+Refer to [.devops/intel.Dockerfile](../.devops/intel.Dockerfile) for the available `ARGS` and their defaults.
+
+The resulting images, are essentially the same as the non-SYCL images:
+
+1. `local/llama.cpp:full-intel`: This image includes both the `llama-cli` and `llama-completion` executables and the tools to convert LLaMA models into ggml and convert into 4-bit quantization.
+2. `local/llama.cpp:light-intel`: This image only includes the `llama-cli` and `llama-completion` executables.
+3. `local/llama.cpp:server-intel`: This image only includes the `llama-server` executable.
+
+## Usage
+
+After building locally, usage is similar to the non-SYCL examples, but you'll need to add the `--device` flag.
+
+```bash
+# First, find all the DRI cards
+ls -la /dev/dri
+# Then, pick the card that you want to use (here for e.g. /dev/dri/card0).
+docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:full-intel -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 99
+docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:light-intel -m /models/7B/ggml-model-q4_0.gguf -p "Building a website can be done in 10 simple steps:" -n 512 --n-gpu-layers 99
+docker run --device /dev/dri/renderD128:/dev/dri/renderD128 --device /dev/dri/card0:/dev/dri/card0 -v /path/to/models:/models local/llama.cpp:server-intel -m /models/7B/ggml-model-q4_0.gguf --port 8080 --host 0.0.0.0 -n 512 --n-gpu-layers 99
+```
+
+*Notes:*
+- Docker has been tested successfully on native Linux. WSL support has not been verified yet.
+- You may need to install Intel GPU driver on the **host** machine *(Please refer to the [Linux configuration](./backend/SYCL.md#linux) for details)*.
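
Reviewer-side sketch, not part of this patch: the new Usage section in `docs/docker.md` writes out the `--device` flags by hand for `renderD128` and `card0`. Assuming the host exposes its GPUs as `renderD*` and `card*` nodes under `/dev/dri`, a small helper could generate one flag per node. The function name `dri_device_flags` and its optional directory argument are hypothetical and exist only for illustration.

```bash
# Hypothetical helper, not part of llama.cpp or this patch.
# Prints one `--device SRC:DST` flag per DRI node found in a directory
# (defaults to /dev/dri), render nodes first, then card nodes.
dri_device_flags() {
    dri_dir="${1:-/dev/dri}"
    for node in "$dri_dir"/renderD* "$dri_dir"/card*; do
        # An unmatched glob stays literal, so skip paths that do not exist.
        [ -e "$node" ] || continue
        printf '%s\n' "--device $node:$node"
    done
}
```

It could then be used as `docker run $(dri_device_flags) -v /path/to/models:/models local/llama.cpp:light-intel ...`, left unquoted so the shell splits the output into separate arguments. This passes every node it finds into the container, so on a multi-GPU host where only one card should be used, the explicit flags from the patch remain the better choice.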