Skip to content
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
53 changes: 30 additions & 23 deletions docs/advanced_features/attention_backend.md
Original file line number Diff line number Diff line change
Expand Up @@ -5,11 +5,13 @@ You can test them according to your needs.

```{important}
Selecting an optimal attention backend is crucial for maximizing your performance. Different backends excel in various scenarios, so choose based on your model, hardware, and use case. Not all backends are supported on all platforms and model architectures.

If you don't specify `--attention-backend`, SGLang makes a best effort to automatically select the most performant backend based on your hardware and model architecture.
```

## Support Matrix

The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/deepseek.md#multi-head-latent-attention-mla) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434).
The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](../basic_usage/deepseek_v3.md#multi-head-latent-attention-mla-throughput-optimizations) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434).

### MHA Backends

Expand All @@ -27,6 +29,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu
| **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ |
| **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ |
| **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ |
| **Intel AMX (CPU)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ |

### MLA Backends

Expand Down Expand Up @@ -54,7 +57,11 @@ Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" c
Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths.
```

Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).
```{tip}
Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching).
```

Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128).

MLA page-size constraints:
- FlashInfer MLA: page_size = 1.
Expand Down Expand Up @@ -140,13 +147,6 @@ python3 -m sglang.launch_server \
--trust-remote-code
```

- Torch Native
```bash
python3 -m sglang.launch_server \
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend torch_native
```

- FlashMLA
```bash
python3 -m sglang.launch_server \
Expand Down Expand Up @@ -181,6 +181,24 @@ python3 -m sglang.launch_server \
--trust-remote-code
```

- FlashAttention 4 (MHA & MLA)
```bash
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--prefill-attention-backend fa4 \
--trust-remote-code
```

- Cutlass MLA
```bash
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend cutlass_mla \
--trust-remote-code
```

- Ascend
```bash
python3 -m sglang.launch_server \
Expand Down Expand Up @@ -216,22 +234,11 @@ python3 -m sglang.launch_server \
--attention-backend dual_chunk_flash_attn
```

- Cutlass MLA
```bash
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--attention-backend cutlass_mla \
--trust-remote-code
```

- FlashAttention 4 (MHA & MLA)
- Torch Native
```bash
python3 -m sglang.launch_server \
--tp 8 \
--model deepseek-ai/DeepSeek-R1 \
--prefill-attention-backend fa4 \
--trust-remote-code
--model meta-llama/Meta-Llama-3.1-8B-Instruct \
--attention-backend torch_native
```

## Steps to add a new attention backend
Expand Down
Loading