diff --git a/docs/advanced_features/attention_backend.md b/docs/advanced_features/attention_backend.md index cbc34043e462..b290c3278afb 100644 --- a/docs/advanced_features/attention_backend.md +++ b/docs/advanced_features/attention_backend.md @@ -5,11 +5,13 @@ You can test them according to your needs. ```{important} Selecting an optimal attention backend is crucial for maximizing your performance. Different backends excel in various scenarios, so choose based on your model, hardware, and use case. Not all backends are supported on all platforms and model architectures. + +If you don't specify `--attention-backend`, SGLang makes a best effort to automatically select the most performant backend based on your hardware and model architecture. ``` ## Support Matrix -The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](https://github.com/sgl-project/sglang/blob/main/docs/basic_usage/deepseek.md#multi-head-latent-attention-mla) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434). +The support matrix is split into two parts: MHA (standard attention) and MLA (multi-head latent attention). For an explanation of the key differences between MHA and MLA, please see the [SGLang documentation on DeepSeek MLA](../basic_usage/deepseek_v3.md#multi-head-latent-attention-mla-throughput-optimizations) and the original [DeepSeek MLA paper](https://arxiv.org/pdf/2405.04434). ### MHA Backends @@ -27,6 +29,7 @@ The support matrix is split into two parts: MHA (standard attention) and MLA (mu | **Wave (ROCm)** | ✅ | ❌ | ❌ | ❌ | ❌ | ❌ | | **Ascend (NPU)** | ✅ | ❌ | ❌ | ❌ | ❌ | ✅ | | **Intel XPU** | ✅ | ❌ | ❌ | ❌ | ✅ | ❌ | +| **Intel AMX (CPU)** | ❌ | ❌ | ❌ | ❌ | ❌ | ❌ | ### MLA Backends @@ -54,7 +57,11 @@ Multimodal attention is selected by `--mm-attention-backend`. The "MultiModal" c Speculative decoding topk: `topk` is the number of draft tokens sampled per step from the draft model. `topk = 1` follows classic EAGLE; `topk > 1` explores multiple branches and requires backend support in both draft and verification paths. ``` -Note: Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128). +```{tip} +Page size controls how many tokens are grouped into a KV cache block. For the prefix cache to take effect, the number of tokens must fill at least one complete page. For example, if your prompt is only 32 tokens and `page_size = 64`, it won't fill a complete page and cannot be matched in the prefix cache (pages cannot be padded). With 65 tokens and `page_size = 64`, only the first page of 64 tokens will be cached and matched; the remaining 1 token is discarded. Use `page_size = 1` for maximum prefix reuse (token-level matching). +``` + +Many backends that do not natively operate on pages can emulate `page_size > 1` at the wrapper layer by expanding page tables to per-token indices. The "Page Size > 1 (native)" column indicates true in-kernel paging. Some backends require fixed native page sizes and cannot be reduced/emulated differently: TRTLLM MHA (16/32/64), TRTLLM MLA (32/64), FlashMLA (64), Cutlass MLA (128), Ascend (128). MLA page-size constraints: - FlashInfer MLA: page_size = 1. @@ -140,13 +147,6 @@ python3 -m sglang.launch_server \ --trust-remote-code ``` -- Torch Native -```bash -python3 -m sglang.launch_server \ - --model meta-llama/Meta-Llama-3.1-8B-Instruct \ - --attention-backend torch_native -``` - - FlashMLA ```bash python3 -m sglang.launch_server \ @@ -181,6 +181,24 @@ python3 -m sglang.launch_server \ --trust-remote-code ``` +- FlashAttention 4 (MHA & MLA) +```bash +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --prefill-attention-backend fa4 \ + --trust-remote-code +``` + +- Cutlass MLA +```bash +python3 -m sglang.launch_server \ + --tp 8 \ + --model deepseek-ai/DeepSeek-R1 \ + --attention-backend cutlass_mla \ + --trust-remote-code +``` + - Ascend ```bash python3 -m sglang.launch_server \ @@ -216,22 +234,11 @@ python3 -m sglang.launch_server \ --attention-backend dual_chunk_flash_attn ``` -- Cutlass MLA -```bash -python3 -m sglang.launch_server \ - --tp 8 \ - --model deepseek-ai/DeepSeek-R1 \ - --attention-backend cutlass_mla \ - --trust-remote-code -``` - -- FlashAttention 4 (MHA & MLA) +- Torch Native ```bash python3 -m sglang.launch_server \ - --tp 8 \ - --model deepseek-ai/DeepSeek-R1 \ - --prefill-attention-backend fa4 \ - --trust-remote-code + --model meta-llama/Meta-Llama-3.1-8B-Instruct \ + --attention-backend torch_native ``` ## Steps to add a new attention backend