Merge EmbeddedLLM/vllm-rocm into vLLM main #1749
Changes from all commits
b6f4f4b
eea4631
fc2d074
998a80d
cddb9b2
726cddf
bf999b1
7f5cf5b
edab2f4
1c1bb0f
1815c0a
749bc86
b4d6f2e
9be4bba
077c77c
3a0eea4
89e8cf4
168b6e6
343d234
5abe1e5
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,72 +1,64 @@ | ||
| FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev | ||
|
|
||
| RUN apt-get update -y \ | ||
| && apt-get install -y python3-pip | ||
|
|
||
| WORKDIR /workspace | ||
|
|
||
| # install build and runtime dependencies | ||
| COPY requirements.txt requirements.txt | ||
| RUN --mount=type=cache,target=/root/.cache/pip \ | ||
| pip install -r requirements.txt | ||
|
|
||
| # install development dependencies | ||
| COPY requirements-dev.txt requirements-dev.txt | ||
| RUN --mount=type=cache,target=/root/.cache/pip \ | ||
| pip install -r requirements-dev.txt | ||
|
|
||
| # image to build pytorch extensions | ||
| FROM dev AS build | ||
|
|
||
| # copy input files | ||
| COPY csrc csrc | ||
| COPY setup.py setup.py | ||
| COPY requirements.txt requirements.txt | ||
| COPY pyproject.toml pyproject.toml | ||
| COPY vllm/__init__.py vllm/__init__.py | ||
|
|
||
| # max jobs used by Ninja to build extensions | ||
| ENV MAX_JOBS=$max_jobs | ||
| RUN python3 setup.py build_ext --inplace | ||
|
|
||
| # image to run unit testing suite | ||
| FROM dev AS test | ||
|
|
||
| # copy pytorch extensions separately to avoid having to rebuild | ||
| # when python code changes | ||
| COPY --from=build /workspace/vllm/*.so /workspace/vllm/ | ||
| COPY tests tests | ||
| COPY vllm vllm | ||
|
|
||
| ENTRYPOINT ["python3", "-m", "pytest", "tests"] | ||
|
|
||
| # use CUDA base as CUDA runtime dependencies are already installed via pip | ||
| FROM nvidia/cuda:11.8.0-base-ubuntu22.04 AS vllm-base | ||
|
|
||
| # libnccl required for ray | ||
| RUN apt-get update -y \ | ||
| && apt-get install -y python3-pip | ||
|
|
||
| WORKDIR /workspace | ||
| COPY requirements.txt requirements.txt | ||
| RUN --mount=type=cache,target=/root/.cache/pip \ | ||
| pip install -r requirements.txt | ||
|
|
||
| FROM vllm-base AS vllm | ||
| COPY --from=build /workspace/vllm/*.so /workspace/vllm/ | ||
| COPY vllm vllm | ||
|
|
||
| EXPOSE 8000 | ||
| ENTRYPOINT ["python3", "-m", "vllm.entrypoints.api_server"] | ||
|
|
||
| # openai api server alternative | ||
| FROM vllm-base AS vllm-openai | ||
| # install additional dependencies for openai api server | ||
| RUN --mount=type=cache,target=/root/.cache/pip \ | ||
| pip install accelerate fschat | ||
|
|
||
| COPY --from=build /workspace/vllm/*.so /workspace/vllm/ | ||
| COPY vllm vllm | ||
|
|
||
| ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"] | ||
|
|
||
| FROM rocm/pytorch:rocm5.7_ubuntu22.04_py3.10_pytorch_2.0.1 | ||
|
|
||
| # Install some basic utilities | ||
| RUN apt-get update && apt-get install python3 python3-pip -y | ||
|
|
||
| # Install some basic utilities | ||
| RUN apt-get update && apt-get install -y \ | ||
| curl \ | ||
| ca-certificates \ | ||
| sudo \ | ||
| git \ | ||
| bzip2 \ | ||
| libx11-6 \ | ||
| build-essential \ | ||
| wget \ | ||
| unzip \ | ||
| nvidia-cuda-toolkit \ | ||
| tmux \ | ||
| && rm -rf /var/lib/apt/lists/* | ||
|
|
||
| ### Mount Point ### | ||
| # When launching the container, mount the code directory to /app | ||
| ARG APP_MOUNT=/app | ||
| VOLUME ${APP_MOUNT} | ||
| WORKDIR ${APP_MOUNT} | ||
|
|
||
| RUN python3 -m pip install --upgrade pip | ||
| RUN python3 -m pip install --no-cache-dir fastapi ninja tokenizers | ||
|
|
||
| ENV LLVM_SYMBOLIZER_PATH=/opt/rocm/llvm/bin/llvm-symbolizer | ||
| ENV PATH=$PATH:/opt/rocm/bin:/libtorch/bin: | ||
| ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/opt/rocm/lib/:/libtorch/lib: | ||
| ENV CPLUS_INCLUDE_PATH=$CPLUS_INCLUDE_PATH:/libtorch/include:/libtorch/include/torch/csrc/api/include/:/opt/rocm/include/: | ||
| ENV PYTORCH_ROCM_ARCH=gfx900;gfx906;gfx908;gfx90a;gfx1030;gfx1101 | ||
|
|
||
| # Install ROCm flash-attention | ||
| RUN mkdir libs \ | ||
| && cd libs \ | ||
| && git clone https://github.com/ROCmSoftwarePlatform/flash-attention.git \ | ||
| && cd flash-attention \ | ||
| && git submodule update --init \ | ||
| && sed -i -e "s/--offload-arch=native/--offload-arch=$(/opt/rocm/llvm/bin/amdgpu-offload-arch)/g" setup.py \ | ||
|
Collaborator
Thank you for the pull request.

We implemented a temporary solution during the build process with ROCm/flash-attention@edc7698.
Collaborator
Yes, use a specific commit of a named branch rather than the default branch: the default branch may keep changing, so pinning a commit gives a more stable and reproducible result. Regarding the name of the Dockerfile, you might want to rename it to Dockerfile.rocm_xxx, where xxx is the version of ROCm you are using.
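
The pinning advice above could be sketched roughly as follows. This is a minimal sketch and not the change made in the PR: `clone_at_commit` is a hypothetical helper name, and the hash in the usage comment is the short hash quoted in this thread, which should be checked against the repository before use.

```shell
# Hypothetical helper (not part of the PR): clone a repository and pin the
# working tree to one commit so that rebuilds do not follow a moving branch.
clone_at_commit() {
    repo_url=$1   # repository to clone
    commit=$2     # commit hash to pin to
    dest=$3       # target directory

    git clone "$repo_url" "$dest" || return 1
    # Detach HEAD at the requested commit instead of tracking the default branch.
    git -C "$dest" checkout --detach "$commit" || return 1
    # Fetch submodules at the revisions recorded by that commit.
    git -C "$dest" submodule update --init
}

# Example usage (assumes network access; hash taken from the review thread):
# clone_at_commit https://github.com/ROCmSoftwarePlatform/flash-attention.git edc7698 flash-attention
```

In a Dockerfile, the same three git steps would replace the plain `git clone` inside the `RUN` instruction, so the image no longer depends on whatever the default branch points to on the day of the build.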
||
| && patch /opt/conda/envs/py_3.10/lib/python3.10/site-packages/torch/utils/hipify/hipify_python.py hipify_patch.patch \ | ||
| && python3 setup.py install \ | ||
| && cd .. | ||
|
|
||
| COPY ./ /app/vllm-rocm/ | ||
|
|
||
| RUN cd /app \ | ||
| && cd vllm-rocm \ | ||
| && git checkout v0.2.1.post1-rocm \ | ||
| && python3 setup.py install \ | ||
| && cd .. | ||
|
|
||
| RUN cd /app \ | ||
| && mkdir dataset \ | ||
| && cd .. | ||
|
|
||
| COPY ./benchmark_throughput.sh /app/benchmark_throughput.sh | ||
|
|
||
| RUN python3 -m pip install --upgrade pip | ||
| RUN python3 -m pip install --no-cache-dir ray[all] | ||
|
|
||
| CMD ["/bin/bash"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,92 +1,147 @@ | ||
|
|
||
| <p align="center"> | ||
| <picture> | ||
| <source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-dark.png"> | ||
| <img alt="vLLM" src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/logos/vllm-logo-text-light.png" width=55%> | ||
| <source media="(prefers-color-scheme: dark)" srcset="docs/source/assets/logos/vllm-logo-text-dark.png"> | ||
| <img alt="vLLM" src="docs/source/assets/logos/vllm-logo-text-light.png" width=60%> | ||
| </picture> | ||
| <picture> | ||
| <img alt="ROCm" src="docs/source/assets/logos/16900649.png" width=15%> | ||
| </picture> | ||
| </p> | ||
|
|
||
| <h3 align="center"> | ||
| Easy, fast, and cheap LLM serving for everyone | ||
| </h3> | ||
| <h1> | ||
| vLLM ROCm port | ||
| </h1> | ||
|
|
||
| This version of vLLM 0.2.x supports model inference and serving on AMD GPUs with ROCm. This ROCm port was adapted from [vLLM](https://github.com/vllm-project/vllm), a ROCm [community port](https://github.com/pcmoritz/vllm-public/tree/port-to-rocm) and [xformers](https://github.com/facebookresearch/xformers), replacing the attention forward method employed in xformers with the ROCm implementation of [flash attention](https://github.com/ROCmSoftwarePlatform/flash-attention). This port does not yet support AWQ quantization, but SqueezeLLM has been incorporated. | ||
|
|
||
| This port is an extension of our previous [vLLM v0.1.4 ROCm port](https://github.com/EmbeddedLLM/vllm-rocm/tree/v0.1.4-rocm). Compared with our previous port, vLLM v0.2.x achieves a speedup of more than 2x for the LLaMA-70B model and more than 3x for LLaMA-7B/13B on MI210, thanks to the introduction of [efficient de-tokenization, vectorized sampling](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit#slide=id.g24c1f26d37c_10_117) and [paged attention v2](https://github.com/vllm-project/vllm/pull/1348). | ||
|
|
||
| <p align="center"> | ||
| | <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> | | ||
| <picture> | ||
| <img alt="throughput_tokens" src="docs/source/assets/benchmarks/throughput_tokens.png" width=70%> | ||
| </picture> | ||
| </p> | ||
|
|
||
| <p align="center"> | ||
| <picture> | ||
| <img alt="throughput_requests" src="docs/source/assets/benchmarks/throughput_requests.png" width=70%> | ||
| </picture> | ||
| </p> | ||
|
|
||
| --- | ||
|
|
||
| *Latest News* 🔥 | ||
| - [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing). | ||
| - [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there. | ||
| - [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv! | ||
| - [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM. | ||
| - [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command! | ||
| - [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds. | ||
| - [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai). | ||
| *Latest News* | ||
| - [2023/11] We have updated our ROCm port for vLLM v0.2.x. | ||
| - [2023/10] LLaMA-2 models are now supported. 7B/13B/70B models can be run and served on AMD GPUs! | ||
|
|
||
| --- | ||
|
|
||
| vLLM is a fast and easy-to-use library for LLM inference and serving. | ||
| ## Getting Started | ||
|
|
||
| vLLM is fast with: | ||
| The following sections describe the installation of this ROCm port. If you intend to use our provided container, please skip to the [using docker](#using-docker) section. | ||
|
|
||
| - State-of-the-art serving throughput | ||
| - Efficient management of attention key and value memory with **PagedAttention** | ||
| - Continuous batching of incoming requests | ||
| - Optimized CUDA kernels | ||
| ## Dependencies | ||
|
|
||
| vLLM is flexible and easy to use with: | ||
| To build this project, the following pre-requisites must be met: | ||
|
|
||
| - Seamless integration with popular Hugging Face models | ||
| - High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more | ||
| - Tensor parallelism support for distributed inference | ||
| - Streaming outputs | ||
| - OpenAI-compatible API server | ||
| - [PyTorch](https://pytorch.org/) with ROCm (5.7.0 or later) support | ||
|
|
||
| vLLM seamlessly supports many Hugging Face models, including the following architectures: | ||
| - Install ROCm [flash-attention](https://github.com/ROCmSoftwarePlatform/flash-attention) following the instructions in [AMD ROCm Support](https://github.com/ROCmSoftwarePlatform/flash-attention#amd-gpurocm-support) | ||
|
|
||
| - Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.) | ||
| - Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.) | ||
| - BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.) | ||
| - Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.) | ||
| - GPT-2 (`gpt2`, `gpt2-xl`, etc.) | ||
| - GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.) | ||
| - GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.) | ||
| - GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.) | ||
| - InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.) | ||
| - LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.) | ||
| - Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.) | ||
| - MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.) | ||
| - OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.) | ||
| - Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.) | ||
| ## Installation | ||
|
|
||
| Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source): | ||
| Build the repository | ||
|
|
||
| ```bash | ||
| pip install vllm | ||
| git clone https://github.com/EmbeddedLLM/vllm-rocm.git | ||
| cd vllm-rocm/ | ||
| python3 setup.py install | ||
| ``` | ||
|
|
||
| ## Getting Started | ||
| ## Using Docker | ||
|
|
||
| Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started. | ||
| - [Installation](https://vllm.readthedocs.io/en/latest/getting_started/installation.html) | ||
| - [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html) | ||
| - [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html) | ||
| A base docker image can be built from this repository: | ||
|
|
||
| ## Contributing | ||
| ```bash | ||
| docker build -t vllm-rocm . | ||
| ``` | ||
|
|
||
| Run a docker container with | ||
|
|
||
| We welcome and value any contributions and collaborations. | ||
| Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved. | ||
| ```bash | ||
| docker run -it \ | ||
| --network=host \ | ||
| --group-add=video \ | ||
| --ipc=host \ | ||
| --cap-add=SYS_PTRACE \ | ||
| --security-opt seccomp=unconfined \ | ||
| --shm-size 8G \ | ||
| --device /dev/kfd \ | ||
| --device /dev/dri \ | ||
| vllm-rocm \ | ||
| bash | ||
| ``` | ||
|
|
||
| ## Citation | ||
| Alternatively, you can pull from our pre-built docker image: | ||
|
|
||
| If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180): | ||
| ```bibtex | ||
| @inproceedings{kwon2023efficient, | ||
| title={Efficient Memory Management for Large Language Model Serving with PagedAttention}, | ||
| author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica}, | ||
| booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles}, | ||
| year={2023} | ||
| } | ||
| ```bash | ||
| docker pull embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1 | ||
|
|
||
| docker run -it \ | ||
| --network=host \ | ||
| --group-add=video \ | ||
| --ipc=host \ | ||
| --cap-add=SYS_PTRACE \ | ||
| --security-opt seccomp=unconfined \ | ||
| --shm-size 8G \ | ||
| --device /dev/kfd \ | ||
| --device /dev/dri \ | ||
| embeddedllminfo/vllm-rocm:vllm-v0.2.1.post1 \ | ||
| bash | ||
| ``` | ||
|
|
||
| ## Serving | ||
|
|
||
| The project supports native vLLM serving | ||
|
|
||
| ```bash | ||
| python -m vllm.entrypoints.api_server \ | ||
| --model lmsys/vicuna-7b-v1.5 \ | ||
| --tensor-parallel-size 2 | ||
| ``` | ||
|
|
||
| ## Benchmarking | ||
|
|
||
| The benchmark results were obtained by running the vLLM benchmark scripts under the *benchmark* directory. | ||
|
|
||
| If your vLLM is installed using the provided [docker environment](#using-docker), you can benchmark the inferencing throughput following the steps below: | ||
| - Download the model you would like to evaluate to a directory of your choice (say a vicuna-7b model is downloaded to /path/to/your/model/vicuna-7b-v1.5) | ||
| - Run the docker and mount the model to /app/model | ||
|
|
||
| ```bash | ||
| docker run -it \ | ||
| --network=host \ | ||
| --group-add=video \ | ||
| --ipc=host \ | ||
| --cap-add=SYS_PTRACE \ | ||
| --security-opt seccomp=unconfined \ | ||
| --shm-size 8G \ | ||
| --device /dev/kfd \ | ||
| --device /dev/dri \ | ||
| -v /path/to/your/model/vicuna-7b-v1.5:/app/model \ | ||
| vllm-rocm \ | ||
| bash | ||
| ``` | ||
| Inside the container, run | ||
| ```bash | ||
| bash /app/benchmark_throughput.sh | ||
| ``` | ||
|
|
||
| ## Acknowledgement | ||
|
|
||
| This ROCm port was built upon the following amazing projects: | ||
|
|
||
| - [vLLM](https://github.com/vllm-project/vllm) and [pcmoritz's ROCm fork](https://github.com/pcmoritz/vllm-public/tree/port-to-rocm) | ||
| - [flash-attention](https://github.com/ROCmSoftwarePlatform/flash-attention) | ||
| - [xformers](https://github.com/facebookresearch/xformers) |
Please make a new docker file.
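
A separate Dockerfile can be selected at build time with the `-f` flag of `docker build`. The sketch below only composes and prints such a command. The file-name pattern follows the `Dockerfile.rocm_xxx` suggestion from the earlier comment, and `ROCM_VERSION=5.7` is an assumption read off the `rocm5.7` base image tag; neither is fixed by the PR.

```shell
# Assumptions: ROCM_VERSION and the file-name pattern follow the reviewer's
# "Dockerfile.rocm_xxx" suggestion; neither is fixed by the PR.
ROCM_VERSION=5.7
DOCKERFILE="Dockerfile.rocm_${ROCM_VERSION}"

# Compose the build command; -f selects a Dockerfile other than ./Dockerfile.
BUILD_CMD="docker build -f ${DOCKERFILE} -t vllm-rocm ."

# Print the command instead of running it, so the sketch works without Docker.
echo "$BUILD_CMD"
```

Keeping the ROCm build in its own file would leave the existing CUDA Dockerfile untouched, which appears to be the intent of the request above.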