Merged

47 commits
ed33f4a
initial commit of benchmarks
MatthewBonanni Oct 10, 2025
ac4cf6b
don't unnecessarily reinitialize
MatthewBonanni Oct 10, 2025
62ffea7
clean up grammar
MatthewBonanni Oct 10, 2025
057bb3a
simplify
MatthewBonanni Oct 10, 2025
558b049
add batch spec ranges
MatthewBonanni Oct 13, 2025
1e1b541
rename
MatthewBonanni Oct 13, 2025
7e3fad2
disambiguate grammar
MatthewBonanni Oct 13, 2025
6bc0f82
use metadata builders
MatthewBonanni Oct 13, 2025
53f7a0d
bugfixes
MatthewBonanni Oct 14, 2025
269c0cc
fix typo
MatthewBonanni Oct 14, 2025
0e2039b
refactor
MatthewBonanni Oct 14, 2025
9241ced
Fix attention benchmark: support decode/prefill modes, add MockKVBProj
MatthewBonanni Oct 14, 2025
2ef27cc
turn off auto
MatthewBonanni Oct 14, 2025
8a890e7
abbreviate column titles
MatthewBonanni Oct 14, 2025
3a92b2e
refactor
MatthewBonanni Oct 14, 2025
8377b02
update configurations
MatthewBonanni Oct 14, 2025
b46342c
fix tests
MatthewBonanni Oct 14, 2025
f09a963
update old batch specs
MatthewBonanni Oct 14, 2025
ce3a1ec
bugfix mla dims
MatthewBonanni Oct 14, 2025
4b6e2bf
add plotting script
MatthewBonanni Oct 14, 2025
daf4900
visualize some potential heuristics
MatthewBonanni Oct 14, 2025
ce22932
clean up plotting script
MatthewBonanni Oct 15, 2025
a888268
new policy
MatthewBonanni Oct 15, 2025
24bf31d
comments
MatthewBonanni Oct 15, 2025
07e680c
bugfixes, add model parameter sweep
MatthewBonanni Oct 21, 2025
6419a6d
don't download from HF
MatthewBonanni Oct 21, 2025
33c95b6
update specs
MatthewBonanni Oct 21, 2025
c26b90a
remove batch size > 16
MatthewBonanni Oct 21, 2025
0522e0b
update configs
MatthewBonanni Oct 22, 2025
bcc63d0
rename
MatthewBonanni Oct 22, 2025
fdc1a59
fix pre-commit
MatthewBonanni Nov 3, 2025
b60e5fc
fix pre-commit
MatthewBonanni Nov 3, 2025
241f1ce
Merge branch 'main' into benchmark_attention
MatthewBonanni Nov 21, 2025
55c00fa
Fix
MatthewBonanni Jan 27, 2026
af0578f
Cleanup
MatthewBonanni Jan 27, 2026
bf50878
Remove visualize_numsplits.py
MatthewBonanni Jan 27, 2026
0a3f987
Refactor and simplify
MatthewBonanni Jan 27, 2026
332df87
Fix MLA
MatthewBonanni Jan 27, 2026
3babc05
Merge branch 'main' into benchmark_attention
MatthewBonanni Jan 27, 2026
73de818
Fix
MatthewBonanni Jan 27, 2026
2d69eaf
Fix
MatthewBonanni Jan 27, 2026
2cdfbca
Fix
MatthewBonanni Jan 27, 2026
b9d9573
Cleanup
MatthewBonanni Jan 27, 2026
9f3a76e
Clean up
MatthewBonanni Jan 27, 2026
b632ae0
Remove unused test
MatthewBonanni Jan 27, 2026
10a68b7
Update README and sample
MatthewBonanni Jan 27, 2026
32e9280
Update README
MatthewBonanni Jan 27, 2026
266 changes: 266 additions & 0 deletions benchmarks/attention_benchmarks/README.md
@@ -0,0 +1,266 @@
# vLLM Attention Benchmarking Suite

Fast, flexible benchmarking for vLLM attention and MLA backends with an extended batch specification grammar.

## Quick Start

```bash
cd benchmarks/attention_benchmarks

# Run a pre-configured benchmark
python benchmark.py --config configs/mla_decode.yaml
python benchmark.py --config configs/mla_mixed_batch.yaml
python benchmark.py --config configs/speculative_decode.yaml
python benchmark.py --config configs/standard_attention.yaml
python benchmark.py --config configs/reorder_threshold.yaml

# Or run custom benchmarks
python benchmark.py \
--backends flash flashinfer \
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
--output-csv results.csv
```

## Simplified Batch Specification Grammar

Express workloads concisely using query length and sequence length:

```python
"q2k" # 2048-token prefill (q_len=2048, seq_len=2048)
"q1s1k" # Decode: 1 token with 1K sequence
"8q1s1k" # 8 decode requests
"q4s1k" # 4-token extend (e.g., spec decode)
"2q2k_32q1s1k" # Mixed: 2 prefills + 32 decodes
"16q4s1k" # 16 spec decode (4 tokens each)
```

### Grammar Rule

```text
Format: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?

- count: Number of identical requests (optional, default=1)
- q_len: Query length (number of new tokens)
- seq_len: Total sequence length (optional, defaults to q_len for prefill)
- 'k': Multiplies value by 1024

Mixed batches: Use _ to combine (e.g., "2q2k_32q1s1k")
```

**Note**: Decode, prefill, and spec decode are just different query lengths - no special syntax needed!
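
To make the grammar concrete, here is a minimal standalone parser sketch. It is an illustrative re-implementation, not the suite's own `parse_batch_spec` (see the Python API section below), and the `ParsedRequest` name is hypothetical.

```python
# Minimal sketch of the batch spec grammar above. Illustrative only; the
# suite ships its own parse_batch_spec, and ParsedRequest is a made-up name.
import re
from dataclasses import dataclass


@dataclass
class ParsedRequest:
    q_len: int    # number of new (query) tokens
    seq_len: int  # total sequence length (context + query)


_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k)?(?:s(\d+)(k)?)?$")


def parse_spec(spec: str) -> list[ParsedRequest]:
    """Parse a spec like '2q2k_32q1s1k' into one entry per request."""
    requests = []
    for segment in spec.split("_"):
        m = _SEGMENT.match(segment)
        if m is None:
            raise ValueError(f"invalid batch spec segment: {segment!r}")
        count, q, qk, s, sk = m.groups()
        q_len = int(q) * (1024 if qk else 1)
        # seq_len defaults to q_len (pure prefill) when the s-part is omitted
        seq_len = int(s) * (1024 if sk else 1) if s else q_len
        requests.extend(ParsedRequest(q_len, seq_len) for _ in range(int(count or 1)))
    return requests


print(len(parse_spec("2q2k_32q1s1k")))  # 34 requests: 2 prefills + 32 decodes
```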

## Pre-configured Benchmarks

The suite includes several pre-configured YAML benchmark configurations:

### MLA Decode Benchmark

Tests pure decode performance across MLA backends with varying batch sizes and sequence lengths.

```bash
python benchmark.py --config configs/mla_decode.yaml
```

### MLA Mixed Batch Benchmark

Tests chunked prefill performance with mixed prefill + decode batches.

```bash
python benchmark.py --config configs/mla_mixed_batch.yaml
```

### Speculative Decoding Benchmark

Tests speculative decode scenarios (K-token verification) and `reorder_batch_threshold` optimization.

```bash
python benchmark.py --config configs/speculative_decode.yaml
```

### Standard Attention Benchmark

Tests standard attention backends (Flash/Triton/FlashInfer) with pure prefill, decode, and mixed batches.

```bash
python benchmark.py --config configs/standard_attention.yaml
```

### Reorder Threshold Study

**Question:** At what query length does the prefill pipeline become faster than the decode pipeline?

Tests query lengths from 1 to 1024 across 9 batch sizes to find the crossover point. Uses `decode_vs_prefill` mode to compare both pipelines for each query length.

```bash
python benchmark.py --config configs/reorder_threshold.yaml
```
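
Once the sweep finishes, the crossover can be read off the saved CSV. The snippet below is a hypothetical post-processing sketch: the column names (`mode`, `q_len`, `mean_time`) are assumptions about the output layout, not the suite's documented schema.

```python
# Hypothetical post-processing sketch. Column names ("mode", "q_len",
# "mean_time") are assumptions about the CSV layout, not a documented schema.
import pandas as pd

df = pd.read_csv("reorder_threshold.csv")
pivot = df.pivot_table(index="q_len", columns="mode", values="mean_time")
# Smallest query length at which the prefill pipeline beats the decode pipeline
faster = pivot.index[pivot["prefill"] < pivot["decode"]]
print(f"Crossover query length: {faster.min() if len(faster) else 'none found'}")
```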

---

## Universal Benchmark

The `benchmark.py` script handles **all** backends - both standard attention and MLA.

### Standard Attention (Flash/Triton/FlashInfer)

```bash
python benchmark.py \
--backends flash triton flashinfer \
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
--num-layers 10 \
--repeats 5 \
--output-csv results.csv
```

### MLA Backends

```bash
# Compare all MLA backends
python benchmark.py \
--backends cutlass_mla flashinfer_mla flashattn_mla flashmla \
--batch-specs "64q1s1k" "64q1s4k" \
--output-csv mla_results.csv
```

### Parameter Sweeps

Use `--sweep-param` and `--sweep-values` to run parameter sweeps from the CLI:

#### CUTLASS MLA num-splits Optimization

**Question:** What is the optimal `num_kv_splits` for CUTLASS MLA?

```bash
python benchmark.py \
--backend cutlass_mla \
--batch-specs "64q1s1k" "64q1s4k" "64q1s16k" \
--sweep-param num_kv_splits \
--sweep-values 1 2 4 8 16 \
--output-json optimal_splits.json
```
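
To pick a winner from the sweep output, something like the following works. The JSON record fields (`batch_spec`, `num_kv_splits`, `mean_time`) are guesses based on `BenchmarkConfig`/`BenchmarkResult`, not a documented schema.

```python
# Hypothetical sketch for selecting the fastest num_kv_splits per batch spec.
# Field names ("batch_spec", "num_kv_splits", "mean_time") are assumptions.
import json

with open("optimal_splits.json") as f:
    results = json.load(f)

best: dict[str, dict] = {}
for r in results:
    spec = r["batch_spec"]
    if spec not in best or r["mean_time"] < best[spec]["mean_time"]:
        best[spec] = r

for spec, r in sorted(best.items()):
    print(f"{spec}: num_kv_splits={r['num_kv_splits']} ({r['mean_time']:.6f}s)")
```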

#### Reorder Batch Threshold Optimization

**Question:** What's the optimal `reorder_batch_threshold` for speculative decoding?

```bash
python benchmark.py \
--backend flashmla \
--batch-specs "q4s1k" "q8s2k" \
--sweep-param reorder_batch_threshold \
--sweep-values 1 4 16 64 256 512 \
--output-csv threshold_sweep.csv
```

### All Command-Line Options

```text
--config CONFIG # Path to YAML config file (overrides other args)
--backends BACKEND [BACKEND ...] # flash, triton, flashinfer, cutlass_mla,
# flashinfer_mla, flashattn_mla, flashmla
--backend BACKEND # Single backend (alternative to --backends)
--batch-specs SPEC [SPEC ...] # Batch specifications using extended grammar

# Model configuration
--num-layers N # Number of layers
--head-dim N # Head dimension
--num-q-heads N # Query heads
--num-kv-heads N # KV heads
--block-size N # Block size

# Benchmark settings
--device DEVICE # Device (default: cuda:0)
--repeats N # Repetitions
--warmup-iters N # Warmup iterations
--profile-memory # Profile memory usage

# Parameter sweeps
--sweep-param PARAM # Parameter name to sweep (e.g., num_kv_splits,
# reorder_batch_threshold)
--sweep-values N [N ...] # Values to sweep for the parameter

# Output
--output-csv FILE # Save to CSV
--output-json FILE # Save to JSON
```

## Hardware Requirements

| Backend | Hardware |
|---------|----------|
| Flash/Triton/FlashInfer | Any CUDA GPU |
| CUTLASS MLA | Blackwell (SM100+) |
| FlashAttn MLA | Hopper (SM90+) |
| FlashMLA | Hopper (SM90+) |
| FlashInfer-MLA | Any CUDA GPU |
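
As a quick pre-flight check, the table can be applied programmatically. The sketch below uses standard PyTorch calls; the minimum-SM mapping simply restates the table above, and the suite's own hardware gating may differ.

```python
# Sketch: filter backends by the current GPU's compute capability.
# The minimums restate the table above; actual gating in vLLM may differ.
import torch

MIN_SM = {
    "flash": 0, "triton": 0, "flashinfer": 0, "flashinfer_mla": 0,
    "flashattn_mla": 90, "flashmla": 90, "cutlass_mla": 100,
}

major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor
eligible = [name for name, req in MIN_SM.items() if sm >= req]
print(f"SM{sm} -> eligible backends: {eligible}")
```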

## Using MLA Runner Directly

All MLA backends are available through `mla_runner.run_mla_benchmark()`:

```python
from mla_runner import run_mla_benchmark
from common import BenchmarkConfig

config = BenchmarkConfig(
backend="cutlass_mla",
batch_spec="64q1s4k",
num_layers=10,
head_dim=576,
num_q_heads=128,
num_kv_heads=1,
block_size=128,
device="cuda:0",
repeats=5,
warmup_iters=3,
)

# CUTLASS MLA with specific num_kv_splits
result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=4)
print(f"Time: {result.mean_time:.6f}s")

# FlashInfer-MLA
result = run_mla_benchmark("flashinfer_mla", config)

# FlashAttn MLA (Hopper SM90+)
result = run_mla_benchmark("flashattn_mla", config, reorder_batch_threshold=64)

# FlashMLA (Hopper SM90+)
result = run_mla_benchmark("flashmla", config, reorder_batch_threshold=64)
```

## Python API

```python
from batch_spec import parse_batch_spec, format_batch_spec, get_batch_stats
from common import BenchmarkConfig, BenchmarkResult, ResultsFormatter

# Parse batch specs
requests = parse_batch_spec("2q2k_q4s1k_32q1s1k")
print(format_batch_spec(requests))
# "2 prefill (2x2k), 1 extend (1xq4kv1k), 32 decode (32x1k)"

# Get batch statistics
stats = get_batch_stats(requests)
print(f"Total tokens: {stats['total_tokens']}")
print(f"Num decode: {stats['num_decode']}, Num prefill: {stats['num_prefill']}")

# Format results
formatter = ResultsFormatter()
formatter.save_csv(results, "output.csv")
formatter.save_json(results, "output.json")
```

## Tips

**1. Warmup matters** - Use `--warmup-iters 10` for stable results

**2. Multiple repeats** - Use `--repeats 20` for low variance

**3. Save results** - Always use `--output-csv` or `--output-json`

**4. Test incrementally** - Start with `--num-layers 1 --repeats 1`

**5. Extended grammar** - Use the batch spec grammar to express spec decode and chunked prefill patterns

**6. Parameter sweeps** - Use `--sweep-param` and `--sweep-values` to find optimal values
44 changes: 44 additions & 0 deletions benchmarks/attention_benchmarks/__init__.py
@@ -0,0 +1,44 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""vLLM Attention Benchmarking Suite."""

from .batch_spec import (
BatchRequest,
format_batch_spec,
get_batch_stats,
parse_batch_spec,
reorder_for_flashinfer,
split_by_type,
)
from .common import (
BenchmarkConfig,
BenchmarkResult,
MockLayer,
MockModelConfig,
ResultsFormatter,
get_attention_scale,
is_mla_backend,
setup_mla_dims,
)

__all__ = [
# Batch specification
"BatchRequest",
"parse_batch_spec",
"format_batch_spec",
"reorder_for_flashinfer",
"split_by_type",
"get_batch_stats",
# Benchmarking infrastructure
"BenchmarkConfig",
"BenchmarkResult",
"ResultsFormatter",
# Mock objects
"MockLayer",
"MockModelConfig",
# Utilities
"setup_mla_dims",
"get_attention_scale",
"is_mla_backend",
]