Merged
7 changes: 4 additions & 3 deletions docs/getting_started/installation/README.md
@@ -3,9 +3,10 @@
vLLM supports the following hardware platforms:

- [GPU](gpu.md)
- [NVIDIA CUDA](gpu.md#nvidia-cuda)
- [AMD ROCm](gpu.md#amd-rocm)
- [Intel XPU](gpu.md#intel-xpu)
- [NVIDIA CUDA](gpu.md)
- [AMD ROCm](gpu.md)
- [Intel XPU](gpu.md)
- [Apple Silicon](gpu.md) (via [vLLM-Metal](https://github.com/vllm-project/vllm-metal))
- [CPU](cpu.md)
- [Intel/AMD x86](cpu.md#intelamd-x86)
- [ARM AArch64](cpu.md#arm-aarch64)
125 changes: 125 additions & 0 deletions docs/getting_started/installation/gpu.apple.inc.md
@@ -0,0 +1,125 @@
<!-- markdownlint-disable MD041 -->
--8<-- [start:installation]

For GPU-accelerated inference on Apple Silicon, use [vLLM-Metal](https://github.com/vllm-project/vllm-metal), a community-maintained hardware plugin that uses MLX as the compute backend and provides native GPU acceleration via Apple's Metal framework.

vLLM-Metal works with MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face, which provides quantized versions of popular models optimized for Apple Silicon.

!!! tip
    For installation and usage instructions, see the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below.

--8<-- [end:installation]
--8<-- [start:requirements]

- OS: macOS Sonoma or later
- Hardware: Apple Silicon
- Metal support enabled

!!! note
    See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section below for installation instructions.
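
The requirements above can be checked from Python before installing anything. This is a minimal sketch, not part of vLLM-Metal: the function name `meets_requirements` is hypothetical, it relies on macOS Sonoma being version 14 and on Apple Silicon reporting `arm64`, and it does not verify Metal support. A Python interpreter running under Rosetta reports `x86_64` and would be rejected.

```python
import platform


def meets_requirements(system: str, machine: str, macos_release: str) -> bool:
    """Return True for macOS 14 (Sonoma) or later on Apple Silicon."""
    # Apple Silicon Macs report "Darwin" / "arm64" to a native interpreter.
    if system != "Darwin" or machine != "arm64":
        return False
    try:
        major = int(macos_release.split(".")[0])
    except ValueError:
        # Empty or malformed version string (for example on non-macOS hosts).
        return False
    return major >= 14


if __name__ == "__main__":
    ok = meets_requirements(
        platform.system(), platform.machine(), platform.mac_ver()[0]
    )
    print("Requirements met" if ok else "Requirements not met")
```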

--8<-- [end:requirements]
--8<-- [start:set-up-using-python]

## Set up using vLLM-Metal

vLLM-Metal is distributed as a separate package that provides native GPU acceleration on Apple Silicon.

To install vLLM-Metal, follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).

The installation will:

1. Set up the appropriate Python environment
2. Install MLX and required dependencies
3. Install the vLLM-Metal package

After installation, you can start using vLLM with Metal GPU acceleration.

!!! tip
    When using vLLM-Metal, use models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face for best performance. These models are optimized for MLX and often include quantized versions (4-bit, 8-bit) that run efficiently on Apple Silicon.

Example model: `mlx-community/Qwen2.5-0.5B-Instruct-4bit`

### Using vLLM-Metal

After installation, vLLM-Metal provides an easy-to-use CLI for running an OpenAI-compatible API server:

```bash
# Activate the vLLM-Metal environment
source ~/.venv-vllm-metal/bin/activate

# Start the API server (pass an mlx-community model name, or omit it to use the default model)
vllm serve
```
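
Loading a model can take a while, so scripts that talk to the server may want to wait until it is ready. The following is a minimal sketch that polls the `/health` endpoint of vLLM's OpenAI-compatible server; it assumes the default address `http://localhost:8000` and that vLLM-Metal exposes the same endpoint. The helper names `server_is_healthy` and `wait_until_ready` are illustrative, not part of vLLM.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def server_is_healthy(url: str = "http://localhost:8000/health") -> bool:
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_ready(
    probe: Callable[[], bool] = server_is_healthy,
    timeout_s: float = 120.0,
    interval_s: float = 1.0,
) -> bool:
    """Poll `probe` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

Calling `wait_until_ready()` returns `True` once the server answers, or `False` if the timeout expires first.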

Once the server is running, you have multiple options to interact with it:

#### Option 1: Interactive chat

Open a new terminal and start an interactive chat session:

```bash
source ~/.venv-vllm-metal/bin/activate
vllm chat
```

#### Option 2: API requests with curl

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "Hello!"}],
    "max_tokens": 50
  }'
```
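
The same request can be sent without curl or third-party packages. The sketch below uses only the Python standard library. The function names are illustrative and not part of vLLM or vLLM-Metal, and the response layout is assumed to follow the OpenAI chat completions format.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    max_tokens: int = 50,
    base_url: str = "http://localhost:8000/v1",
) -> urllib.request.Request:
    """Build the same POST request that the curl command sends."""
    body = {
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_reply(response_body: dict) -> str:
    """Return the assistant text from a chat completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt: str) -> str:
    """Send the prompt to the running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=60) as resp:
        return extract_reply(json.load(resp))
```

With the server running, `print(chat("Hello!"))` prints the model's reply.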

#### Option 3: Python with OpenAI SDK

```python
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="dummy",  # Placeholder; the local server does not require auth by default
)

response = client.chat.completions.create(
    # Must match the name of the model the server is serving
    model="mlx-community/Qwen2.5-0.5B-Instruct-4bit",
    messages=[{"role": "user", "content": "Hello!"}],
)

print(response.choices[0].message.content)
```
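
The `model` argument has to name a model that the running server actually serves; otherwise the request is expected to fail. As a hedged sketch, the names can be read from the server's `/v1/models` endpoint, assuming vLLM-Metal exposes it like the standard OpenAI-compatible server. The helper names below are hypothetical.

```python
import json
import urllib.request


def served_model_ids(models_response: dict) -> list[str]:
    """Extract model ids from a `/v1/models` response body."""
    return [entry["id"] for entry in models_response.get("data", [])]


def fetch_served_model_ids(base_url: str = "http://localhost:8000/v1") -> list[str]:
    """Ask the running server which model names it accepts."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return served_model_ids(json.load(resp))
```

Any id returned by `fetch_served_model_ids()` can be passed as `model` in the request above.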

For more details on the `vllm` CLI commands, see the [OpenAI-compatible server documentation](../../serving/openai_compatible_server.md).

--8<-- [end:set-up-using-python]
--8<-- [start:pre-built-wheels]

vLLM-Metal is distributed as a separate package rather than as a pre-built vLLM wheel. See the [Set up using vLLM-Metal](#set-up-using-vllm-metal) section above.

--8<-- [end:pre-built-wheels]
--8<-- [start:build-wheel-from-source]

For instructions on building from source, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).

--8<-- [end:build-wheel-from-source]
--8<-- [start:pre-built-images]

--8<-- [end:pre-built-images]
--8<-- [start:build-image-from-source]

--8<-- [end:build-image-from-source]
--8<-- [start:supported-features]

vLLM-Metal provides:

- Native GPU acceleration using Metal
- MLX-based compute backend optimized for Apple Silicon
- OpenAI-compatible API server
- Support for popular model architectures

For specific feature support and limitations, refer to the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal).

--8<-- [end:supported-features]
32 changes: 32 additions & 0 deletions docs/getting_started/installation/gpu.md
@@ -18,6 +18,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:installation"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:installation"

## Requirements

- OS: Linux
@@ -38,6 +42,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:requirements"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:requirements"

## Set up using Python

### Create a new Python environment
@@ -56,6 +64,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:set-up-using-python"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:set-up-using-python"

### Pre-built wheels {#pre-built-wheels}

=== "NVIDIA CUDA"
@@ -70,6 +82,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-wheels"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-wheels"

### Build wheel from source

=== "NVIDIA CUDA"
@@ -84,6 +100,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-wheel-from-source"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-wheel-from-source"

## Set up using Docker

### Pre-built images
@@ -102,6 +122,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:pre-built-images"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:pre-built-images"

--8<-- [end:pre-built-images]

### Build image from source
@@ -120,6 +144,10 @@ vLLM is a Python library that supports the following GPU variants. Select your G

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:build-image-from-source"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:build-image-from-source"

--8<-- [end:build-image-from-source]

## Supported features
@@ -135,3 +163,7 @@ vLLM is a Python library that supports the following GPU variants. Select your G
=== "Intel XPU"

    --8<-- "docs/getting_started/installation/gpu.xpu.inc.md:supported-features"

=== "Apple Silicon"

    --8<-- "docs/getting_started/installation/gpu.apple.inc.md:supported-features"
15 changes: 15 additions & 0 deletions docs/getting_started/quickstart.md
@@ -10,6 +10,9 @@ This guide will help you quickly get started with vLLM to perform:
- OS: Linux
- Python: 3.10 -- 3.13

!!! note
    vLLM also works on macOS with [vLLM-Metal](https://github.com/vllm-project/vllm-metal) for Apple Silicon GPU acceleration. See the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.

## Installation

=== "NVIDIA CUDA"
@@ -73,6 +76,18 @@ This guide will help you quickly get started with vLLM to perform:
    !!! note
        For more detailed instructions, including Docker, installing from source, and troubleshooting, please refer to the [vLLM on TPU documentation](https://docs.vllm.ai/projects/tpu/en/latest/).

=== "Apple Silicon (Mac)"

    If you are using an Apple Silicon Mac, you can use vLLM-Metal for GPU-accelerated inference via Apple's Metal framework.

    Follow the installation instructions in the [vLLM-Metal documentation](https://github.com/vllm-project/vllm-metal#installation).

    !!! note
        vLLM-Metal uses MLX instead of PyTorch as the compute backend and requires MLX-optimized models from the [mlx-community](https://huggingface.co/mlx-community) organization on Hugging Face.

    !!! tip
        For more detailed instructions, please refer to the [GPU installation guide](installation/gpu.md) and select the "Apple Silicon" tab.

!!! note
    For more detail and non-CUDA platforms, please refer to the [installation guide](installation/README.md) for specific instructions on how to install vLLM.
