diff --git a/README.md b/README.md
index 9cc325e924f7..477e61489d5a 100644
--- a/README.md
+++ b/README.md
@@ -17,6 +17,7 @@ Easy, fast, and cheap LLM serving for everyone
 ---

 *Latest News* 🔥
+- [2023/12] Added ROCm support to vLLM.
 - [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
 - [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
 - [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
@@ -43,6 +44,7 @@ vLLM is flexible and easy to use with:
 - Tensor parallelism support for distributed inference
 - Streaming outputs
 - OpenAI-compatible API server
+- Support NVIDIA CUDA and AMD ROCm.

 vLLM seamlessly supports many Hugging Face models, including the following architectures:

diff --git a/docs/source/getting_started/amd-installation.rst b/docs/source/getting_started/amd-installation.rst
index 57f902bf0518..7d8db353906b 100644
--- a/docs/source/getting_started/amd-installation.rst
+++ b/docs/source/getting_started/amd-installation.rst
@@ -1,11 +1,11 @@
-.. _installation:
+.. _installation_rocm:

 Installation with ROCm
-============
+======================

 vLLM 0.2.x onwards supports model inferencing and serving on AMD GPUs with ROCm.
 At the moment AWQ quantization is not supported in ROCm, but SqueezeLLM quantization has been ported.
-Datatypes currently supported in ROCm are FP16 and BF16.
+Data types currently supported in ROCm are FP16 and BF16.

 Requirements
 ------------
@@ -13,14 +13,84 @@ Requirements
 * OS: Linux
 * Python: 3.8 -- 3.11 (Verified on 3.10)
 * GPU: MI200s
-* Pytorch 2.0.1/2.1.1
-* ROCm >= 5.7.0
+* Pytorch 2.0.1/2.1.1/2.2
+* ROCm 5.7

+Installation options:

-.. _build_from_source:
+#. :ref:`(Recommended) Quick start with vLLM pre-installed in Docker Image `
+#. :ref:`Build from source `
+#. :ref:`Build from source with docker `

-Build from source with docker
------------------
+.. _quick_start_docker_rocm:
+
+(Recommended) Option 1: Quick start with vLLM pre-installed in Docker Image
+---------------------------------------------------------------------------
+
+.. code-block:: console
+
+    $ docker pull embeddedllminfo/vllm-rocm:vllm-v0.2.3
+    $ docker run -it \
+       --network=host \
+       --group-add=video \
+       --ipc=host \
+       --cap-add=SYS_PTRACE \
+       --security-opt seccomp=unconfined \
+       --shm-size 8G \
+       --device /dev/kfd \
+       --device /dev/dri \
+       -v :/app/model \
+       embeddedllminfo/vllm-rocm \
+       bash
+
+
+.. _build_from_source_rocm:
+
+Option 2: Build from source
+---------------------------
+
+You can build and install vLLM from source:
+
+0. Install prerequisites (skip if you are already in an environment/docker with the following installed):
+
+- `ROCm `_
+- `Pytorch `_
+
+    .. code-block:: console
+
+        $ pip install torch==2.2.0.dev20231206+rocm5.7 --index-url https://download.pytorch.org/whl/nightly/rocm5.7 # tested version
+
+
+1. Install `flash attention for ROCm `_
+
+    Install ROCm's flash attention (v2.0.4) following the instructions from `ROCmSoftwarePlatform/flash-attention `_
+
+.. note::
+    - If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the `hipify_python.patch`. You can build the ROCm flash attention directly.
+    - If you fail to install `ROCmSoftwarePlatform/flash-attention`, try cloning from the commit `6fd2f8e572805681cd67ef8596c7e2ce521ed3c6`.
+    - ROCm's Flash-attention-2 (v2.0.4) does not support sliding windows attention.
+    - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)
+
+2. Setup `xformers==0.0.22.post7` without dependencies, and apply patches to adapt for ROCm flash attention
+
+    .. code-block:: console
+
+        $ pip install xformers==0.0.22.post7 --no-deps
+        $ bash patch_xformers-0.0.22.post7.rocm.sh
+
+3. Build vLLM.
+
+    .. code-block:: console
+
+        $ cd vllm
+        $ pip install -U -r requirements-rocm.txt
+        $ python setup.py install # This may take 5-10 minutes. Currently, `pip install .` does not work for ROCm installation
+
+
+.. _build_from_source_docker_rocm:
+
+Option 3: Build from source with docker
+-----------------------------------------------------

 You can build and install vLLM from source:

@@ -54,21 +124,22 @@ Alternatively, if you plan to install vLLM-ROCm on a local machine or start from
     Install ROCm's flash attention (v2.0.4) following the instructions from `ROCmSoftwarePlatform/flash-attention `_

 .. note::
+    - If you are using rocm5.7 with pytorch 2.1.0 onwards, you don't need to apply the `hipify_python.patch`. You can build the ROCm flash attention directly.
+    - If you fail to install `ROCmSoftwarePlatform/flash-attention`, try cloning from the commit `6fd2f8e572805681cd67ef8596c7e2ce521ed3c6`.
     - ROCm's Flash-attention-2 (v2.0.4) does not support sliding windows attention.
     - You might need to downgrade the "ninja" version to 1.10 it is not used when compiling flash-attention-2 (e.g. `pip install ninja==1.10.2.4`)

-2. Setup xformers==0.0.22.post7 without dependencies, and apply patches to adapt for ROCm flash attention
+2. Setup `xformers==0.0.22.post7` without dependencies, and apply patches to adapt for ROCm flash attention

     .. code-block:: console

         $ pip install xformers==0.0.22.post7 --no-deps
         $ bash patch_xformers-0.0.22.post7.rocm.sh

-3. Build vllm.
+3. Build vLLM.

     .. code-block:: console

         $ cd vllm
         $ pip install -U -r requirements-rocm.txt
         $ python setup.py install # This may take 5-10 minutes.
-
diff --git a/docs/source/index.rst b/docs/source/index.rst
index 300c22762df4..04af09073a44 100644
--- a/docs/source/index.rst
+++ b/docs/source/index.rst
@@ -39,6 +39,7 @@ vLLM is flexible and easy to use with:
 * Tensor parallelism support for distributed inference
 * Streaming outputs
 * OpenAI-compatible API server
+* Support NVIDIA CUDA and AMD ROCm.

 For more information, check out the following:

@@ -56,6 +57,7 @@ Documentation
    :caption: Getting Started

    getting_started/installation
+   getting_started/amd-installation
    getting_started/quickstart

 .. toctree::
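
Reviewer note (not part of the patch): the "Install prerequisites" step in Option 2 pins a ROCm build of PyTorch, and it may help to confirm that the wheel that ended up installed is a ROCm build before compiling flash attention or vLLM. The following is a minimal sketch under stated assumptions: it relies on `torch.version.hip` being a version string on ROCm wheels and `None` on other builds, and the helper name `describe_torch_backend` is hypothetical, not something the vLLM docs or this patch define.

```python
import importlib.util


def describe_torch_backend() -> str:
    """Return a short label for the PyTorch build found in this environment.

    Hypothetical helper for illustration only; it is not part of vLLM.
    """
    # Avoid an ImportError when PyTorch is not installed at all.
    if importlib.util.find_spec("torch") is None:
        return "torch-not-installed"

    import torch

    # Assumption: ROCm wheels set torch.version.hip to a version string,
    # while CUDA and CPU-only wheels leave it as None.
    hip_version = getattr(torch.version, "hip", None)
    if hip_version:
        return f"rocm:{hip_version}"

    # CUDA wheels set torch.version.cuda; CPU-only wheels leave it as None.
    cuda_version = getattr(torch.version, "cuda", None)
    if cuda_version:
        return f"cuda:{cuda_version}"

    return "cpu-only"


if __name__ == "__main__":
    print(describe_torch_backend())
```

A label starting with `rocm:` suggests the ROCm wheel was picked up; any other label suggests the environment is not the one these instructions target. The check only inspects the installed wheel and says nothing about whether a GPU is visible to the process.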