diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md
index ac3309b23414..a0eb56302056 100644
--- a/docs/getting_started/installation/README.md
+++ b/docs/getting_started/installation/README.md
@@ -3,9 +3,10 @@
 vLLM supports the following hardware platforms:
 
 - [GPU](gpu.md)
-    - [NVIDIA CUDA](gpu.md#nvidia-cuda)
-    - [AMD ROCm](gpu.md#amd-rocm)
-    - [Intel XPU](gpu.md#intel-xpu)
+    - [NVIDIA CUDA](gpu.md)
+    - [AMD ROCm](gpu.md)
+    - [Intel XPU](gpu.md)
+    - [Apple Silicon](gpu.md) (via [vLLM-Metal](https://github.com/vllm-project/vllm-metal))
 - [CPU](cpu.md)
     - [Intel/AMD x86](cpu.md#intelamd-x86)
     - [ARM AArch64](cpu.md#arm-aarch64)
diff --git a/docs/getting_started/installation/gpu.apple.inc.md b/docs/getting_started/installation/gpu.apple.inc.md
new file mode 100644
index 000000000000..30e6244e05c3
--- /dev/null
+++ b/docs/getting_started/installation/gpu.apple.inc.md
@@ -0,0 +1,125 @@
+
+--8<-- [start:installation]
+
+For GPU-accelerated inference on Apple Silicon, use [vLLM-Metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend and provides native GPU acceleration via Apple's Metal framework.
+
+vLLM-Metal works with MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face, which provides quantized versions of popular models optimized for Apple Silicon.
+
+!!! tip
+    For installation and usage instructions, see the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below.
+
+--8<-- [end:installation]
+--8<-- [start:requirements]
+
+- OS: macOS Sonoma or later
+- Hardware: Apple Silicon
+- Metal support enabled
+
+!!! note
+    See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below for installation instructions.
+
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+
+## Set up using vLLM-Metal
+
+vLLM-Metal is distributed as a separate package that provides native GPU acceleration on Apple Silicon.
+
+To install vLLM-Metal, follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+The installation will:
+
+1. Set up the appropriate Python environment
+2. Install MLX and required dependencies
+3. Install the vLLM-Metal package
+
+After installation, you can start using vLLM with Metal GPU acceleration.
+
+!!! tip
+    When using vLLM-Metal, use models from the [mlx-community](https://huggingface.co/mlx-community) on Hugging Face for best performance. These models are optimized for MLX and often include quantized versions (4-bit, 8-bit) that run efficiently on Apple Silicon.
+
+    Example model: `mlx-community/Qwen2.5-0.5B-Instruct-4bit`
+
+### Using vLLM-Metal
+
+After installation, vLLM-Metal provides an easy-to-use CLI for running an OpenAI-compatible API server:
+
+```bash
+# Activate the vLLM-Metal environment
+source ~/.venv-vllm-metal/bin/activate
+
+# Start the API server (specify your mlx-community model or it will use default)
+vllm serve
+```
+
+Once the server is running, you have multiple options to interact with it:
+
+#### Option 1: Interactive chat
+
+Open a new terminal and start an interactive chat session:
+
+```bash
+source ~/.venv-vllm-metal/bin/activate
+vllm chat
+```
+
+#### Option 2: API requests with curl
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
+
+#### Option 3: Python with OpenAI SDK
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="dummy"  # No auth required for local server
+)
+
+response = client.chat.completions.create(
+    model="mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    messages=[{"role": "user", "content": "Hello!"}]
+)
+
+print(response.choices[0].message.content)
+```
+
+For more details on the `vllm` CLI commands, see the [OpenAI-compatible server documentation](../../serving/openai_compatible_server.md).
+
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
+
+vLLM-Metal is installed via the vLLM-Metal package. See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section above.
+
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
+
+For build instructions from source, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
+
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
+
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
+
+vLLM-Metal provides:
+
+- Native GPU acceleration using Metal
+- MLX-based compute backend optimized for Apple Silicon
+- OpenAI-compatible API server
+- Support for popular model architectures
+
+For specific feature support and limitations, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal).
+
+--8<-- [end:supported-features]
diff --git a/docs/getting_started/installation/gpu.md b/docs/getting_started/installation/gpu.md
index 475c67ce9d05..91d933dd4e86 100644
--- a/docs/getting_started/installation/gpu.md
+++ b/docs/getting_started/installation/gpu.md
@@ -18,6 +18,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:installation"
+
 ## Requirements
 
 - OS: Linux
@@ -38,6 +42,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:requirements"
+
 ## Set up using Python
 
 ### Create a new Python environment
@@ -56,6 +64,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:set-up-using-python"
+
 ### Pre-built wheels {#pre-built-wheels}
 
 === "NVIDIA CUDA"
@@ -70,6 +82,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-wheels"
+
 ### Build wheel from source
 
 === "NVIDIA CUDA"
@@ -84,6 +100,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-wheel-from-source"
+
 ## Set up using Docker
 
 ### Pre-built images
@@ -102,6 +122,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-images"
+
 --8<-- [end:pre-built-images]
 
 ### Build image from source
@@ -120,6 +144,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-image-from-source"
+
 --8<-- [end:build-image-from-source]
 
 ## Supported features
@@ -135,3 +163,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "Intel XPU"
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
+
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:supported-features"
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
index 015514def33f..a748ba4a9300 100644
--- a/docs/getting_started/quickstart.md
+++ b/docs/getting_started/quickstart.md
@@ -10,6 +10,9 @@ This guide will help you quickly get started with vLLM to perform:
 - OS: Linux
 - Python: 3.10 -- 3.13
 
+!!! note
+    vLLM also works on macOS with [vLLM-Metal](https://github.com/vllm-project/vllm-metal) for Apple Silicon GPU acceleration. See the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.
+
 ## Installation
 
 === "NVIDIA CUDA"
@@ -73,6 +76,18 @@ This guide will help you quickly get started with vLLM to perform:
     !!! note
         For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
 
+=== "Apple Silicon (Mac)"
+
+    If you are using Apple Silicon Macs, you can use vLLM-Metal for GPU-accelerated inference via Apple's Metal framework.
+
+    Follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+    !!! note
+        vLLM-Metal uses MLX instead of PyTorch as the compute backend and requires MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) on Hugging Face.
+
+    !!! tip
+        For more detailed instructions, please refer to the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.
+
 !!! note
     For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.
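
Note on "Option 3: Python with OpenAI SDK" in `gpu.apple.inc.md`: the SDK call hard-codes a `model` name, while the `vllm serve` example leaves the model to the server default, so the two may not match. One way to avoid the mismatch is to ask the server which model it is serving. The sketch below is a minimal illustration, assuming the server exposes the standard OpenAI-compatible `/v1/models` listing at the same `http://localhost:8000/v1` base URL used in the diff. `pick_served_model` and `fetch_models` are hypothetical helper names, not part of vLLM or vLLM-Metal.

```python
import json
from urllib.request import urlopen

# Same base URL as the curl and SDK examples in the diff (assumed unchanged).
BASE_URL = "http://localhost:8000/v1"


def pick_served_model(models_payload: dict) -> str:
    """Return the first model ID from an OpenAI-style model-list response.

    Hypothetical helper: expects {"object": "list", "data": [{"id": ...}, ...]}.
    """
    data = models_payload.get("data", [])
    if not data:
        raise ValueError("server reported no models")
    return data[0]["id"]


def fetch_models(base_url: str = BASE_URL) -> dict:
    """GET {base_url}/models and decode the JSON body.

    Hypothetical helper: requires a running server, so it is not called here.
    """
    with urlopen(f"{base_url}/models", timeout=5) as resp:
        return json.load(resp)
```

With a server running, `pick_served_model(fetch_models())` yields an ID that can be passed as `model=` to `client.chat.completions.create(...)` in place of the fixed string.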