vllm-project · simon-mo · May 11, 2026 · May 7, 2026 · May 11, 2026 · May 11, 2026
diff --git a/docs/getting_started/installation/README.md b/docs/getting_started/installation/README.md
@@ -3,9 +3,10 @@
 vLLM supports the following hardware platforms:
 
 - [GPU](gpu.md)
-    - [NVIDIA CUDA](gpu.md#nvidia-cuda)
-    - [AMD ROCm](gpu.md#amd-rocm)
-    - [Intel XPU](gpu.md#intel-xpu)
+    - [NVIDIA CUDA](gpu.md)
+    - [AMD ROCm](gpu.md)
+    - [Intel XPU](gpu.md)
+    - [Apple Silicon](gpu.md) (via [vLLM-Metal](https://github.com/vllm-project/vllm-metal))
 - [CPU](cpu.md)
     - [Intel/AMD x86](cpu.md#intelamd-x86)
     - [ARM AArch64](cpu.md#arm-aarch64)

diff --git a/docs/getting_started/installation/gpu.apple.inc.md b/docs/getting_started/installation/gpu.apple.inc.md
@@ -0,0 +1,125 @@
+<!-- markdownlint-disable MD041 -->
+--8<-- [start:installation]
+
+For GPU-accelerated inference on Apple Silicon, use [vLLM-Metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend and provides native GPU acceleration via Apple's Metal framework.
+
+vLLM-Metal works with MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face, which provides quantized versions of popular models optimized for Apple Silicon.
+
+!!! tip
+    For installation and usage instructions, see the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below.
+
+--8<-- [end:installation]
+--8<-- [start:requirements]
+
+- OS: macOS Sonoma or later
+- Hardware: Apple Silicon
+- Metal support enabled
+
+!!! note
+    See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below for installation instructions.
+
+--8<-- [end:requirements]
+--8<-- [start:set-up-using-python]
+
+## Set up using vLLM-Metal
+
+vLLM-Metal is distributed as a separate package that provides native GPU acceleration on Apple Silicon.
+
+To install vLLM-Metal, follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+The installation will:
+
+1. Set up the appropriate Python environment
+2. Install MLX and required dependencies
+3. Install the vLLM-Metal package
+
+After installation, you can start using vLLM with Metal GPU acceleration.
+
+!!! tip
+    When using vLLM-Metal, use models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face for best performance. These models are optimized for MLX, and many are available in quantized versions (4-bit, 8-bit) that run efficiently on Apple Silicon.
+
+    Example model: `mlx-community/Qwen2.5-0.5B-Instruct-4bit`
+
+### Using vLLM-Metal
+
+After installation, use the `vllm` CLI to run an OpenAI-compatible API server:
+
+```bash
+# Activate the vLLM-Metal environment
+source ~/.venv-vllm-metal/bin/activate
+
+# Start the API server with an mlx-community model (if no model is given, the default is used)
+vllm serve mlx-community/Qwen2.5-0.5B-Instruct-4bit
+```
+
+Once the server is running, you can interact with it in several ways:
+
+#### Option 1: Interactive chat
+
+Open a new terminal and start an interactive chat session:
+
+```bash
+source ~/.venv-vllm-metal/bin/activate
+vllm chat
+```
+
+#### Option 2: API requests with curl
+
+```bash
+curl http://localhost:8000/v1/chat/completions \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [{"role": "user", "content": "Hello!"}],
+    "max_tokens": 50
+  }'
+```
+
+#### Option 3: Python with OpenAI SDK
+
+```python
+from openai import OpenAI
+
+client = OpenAI(
+    base_url="http://localhost:8000/v1",
+    api_key="dummy"  # No auth required for local server
+)
+
+response = client.chat.completions.create(
+    model="mlx-community/Qwen2.5-0.5B-Instruct-4bit",
+    messages=[{"role": "user", "content": "Hello!"}]
+)
+
+print(response.choices[0].message.content)
+```
+
+For more details on the `vllm` CLI commands, see the [OpenAI-compatible server documentation](../../serving/openai_compatible_server.md).
+
+--8<-- [end:set-up-using-python]
+--8<-- [start:pre-built-wheels]
+
+vLLM-Metal is distributed as its own package rather than as a vLLM pre-built wheel. See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section above.
+
+--8<-- [end:pre-built-wheels]
+--8<-- [start:build-wheel-from-source]
+
+For instructions on building from source, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+--8<-- [end:build-wheel-from-source]
+--8<-- [start:pre-built-images]
+
+--8<-- [end:pre-built-images]
+--8<-- [start:build-image-from-source]
+
+--8<-- [end:build-image-from-source]
+--8<-- [start:supported-features]
+
+vLLM-Metal provides:
+
+- Native GPU acceleration using Metal
+- MLX-based compute backend optimized for Apple Silicon
+- OpenAI-compatible API server
+- Support for popular model architectures
+
+For specific feature support and limitations, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal).
+
+--8<-- [end:supported-features]
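
Review note on the client examples in `gpu.apple.inc.md`: the Python example hard-codes a model name, which presumably has to match the model the server was started with. A minimal sketch of one way to avoid the mismatch, assuming the server from the `vllm serve` example is listening on `http://localhost:8000` and exposes the OpenAI-compatible `/v1/models` route, is to ask the server which model it serves and reuse that id. The helper names (`served_model_ids`, `build_chat_request`, `fetch_json`, `demo`) are hypothetical and not part of vLLM or vLLM-Metal; only the Python standard library is used.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def build_chat_request(
    base_url: str, model: str, prompt: str, max_tokens: int = 50
) -> urllib.request.Request:
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def fetch_json(request_or_url):
    """Send a request (or GET a URL) and decode the JSON response body."""
    with urllib.request.urlopen(request_or_url, timeout=30) as resp:
        return json.load(resp)


def demo(base_url: str = "http://localhost:8000/v1") -> str:
    """Assumes a local server is already running; not called automatically."""
    models = served_model_ids(fetch_json(f"{base_url}/models"))
    reply = fetch_json(build_chat_request(base_url, models[0], "Hello!"))
    return reply["choices"][0]["message"]["content"]
```

With the server running, `print(demo())` sends the same "Hello!" prompt as the curl example, using whichever model id the server reports first.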
diff --git a/docs/getting_started/installation/gpu.md b/docs/getting_started/installation/gpu.md
@@ -18,6 +18,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:installation"
+
 ## Requirements
 
 - OS: Linux
@@ -38,6 +42,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:requirements"
+
 ## Set up using Python
 
 ### Create a new Python environment
@@ -56,6 +64,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:set-up-using-python"
+
 ### Pre-built wheels {#pre-built-wheels}
 
 === "NVIDIA CUDA"
@@ -70,6 +82,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-wheels"
+
 ### Build wheel from source
 
 === "NVIDIA CUDA"
@@ -84,6 +100,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-wheel-from-source"
+
 ## Set up using Docker
 
 ### Pre-built images
@@ -102,6 +122,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-images"
+
 --8<-- [end:pre-built-images]
 
 ### Build image from source
@@ -120,6 +144,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"
 
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-image-from-source"
+
 --8<-- [end:build-image-from-source]
 
 ## Supported features
@@ -135,3 +163,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
 === "Intel XPU"
 
     --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"
+
+=== "Apple Silicon"
+
+    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:supported-features"
diff --git a/docs/getting_started/quickstart.md b/docs/getting_started/quickstart.md
@@ -10,6 +10,9 @@ This guide will help you quickly get started with vLLM to perform:
 - OS: Linux
 - Python: 3.10 -- 3.13
 
+!!! note
+    vLLM also runs on macOS: [vLLM-Metal](https://github.com/vllm-project/vllm-metal) provides GPU acceleration on Apple Silicon. See the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.
+
 ## Installation
 
 === "NVIDIA CUDA"
@@ -73,6 +76,18 @@ This guide will help you quickly get started with vLLM to perform:
     !!! note
         For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).
 
+=== "Apple Silicon (Mac)"
+
+    On an Apple Silicon Mac, you can use vLLM-Metal for GPU-accelerated inference via Apple's Metal framework.
+
+    Follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).
+
+    !!! note
+        vLLM-Metal uses MLX instead of PyTorch as the compute backend and requires MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face.
+
+    !!! tip
+        For more detailed instructions, please refer to the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.
+
 !!! note
     For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.