Merged

47 commits
ed33f4a
initial commit of benchmarks
MatthewBonanni Oct 10, 2025
ac4cf6b
don't unnecessarily reinitialize
MatthewBonanni Oct 10, 2025
62ffea7
clean up grammar
MatthewBonanni Oct 10, 2025
057bb3a
simplify
MatthewBonanni Oct 10, 2025
558b049
add batch spec ranges
MatthewBonanni Oct 13, 2025
1e1b541
rename
MatthewBonanni Oct 13, 2025
7e3fad2
disambiguate grammar
MatthewBonanni Oct 13, 2025
6bc0f82
use metadata builders
MatthewBonanni Oct 13, 2025
53f7a0d
bugfixes
MatthewBonanni Oct 14, 2025
269c0cc
fix typo
MatthewBonanni Oct 14, 2025
0e2039b
refactor
MatthewBonanni Oct 14, 2025
9241ced
Fix attention benchmark: support decode/prefill modes, add MockKVBProj
MatthewBonanni Oct 14, 2025
2ef27cc
turn off auto
MatthewBonanni Oct 14, 2025
8a890e7
abbreviate column titles
MatthewBonanni Oct 14, 2025
3a92b2e
refactor
MatthewBonanni Oct 14, 2025
8377b02
update configurations
MatthewBonanni Oct 14, 2025
b46342c
fix tests
MatthewBonanni Oct 14, 2025
f09a963
update old batch specs
MatthewBonanni Oct 14, 2025
ce3a1ec
bugfix mla dims
MatthewBonanni Oct 14, 2025
4b6e2bf
add plotting script
MatthewBonanni Oct 14, 2025
daf4900
visualize some potential heuristics
MatthewBonanni Oct 14, 2025
ce22932
clean up plotting script
MatthewBonanni Oct 15, 2025
a888268
new policy
MatthewBonanni Oct 15, 2025
24bf31d
comments
MatthewBonanni Oct 15, 2025
07e680c
bugfixes, add model parameter sweep
MatthewBonanni Oct 21, 2025
6419a6d
don't download from HF
MatthewBonanni Oct 21, 2025
33c95b6
update specs
MatthewBonanni Oct 21, 2025
c26b90a
remove batch size > 16
MatthewBonanni Oct 21, 2025
0522e0b
update configs
MatthewBonanni Oct 22, 2025
bcc63d0
rename
MatthewBonanni Oct 22, 2025
fdc1a59
fix pre-commit
MatthewBonanni Nov 3, 2025
b60e5fc
fix pre-commit
MatthewBonanni Nov 3, 2025
241f1ce
Merge branch 'main' into benchmark_attention
MatthewBonanni Nov 21, 2025
55c00fa
Fix
MatthewBonanni Jan 27, 2026
af0578f
Cleanup
MatthewBonanni Jan 27, 2026
bf50878
Remove visualize_numsplits.py
MatthewBonanni Jan 27, 2026
0a3f987
Refactor and simplify
MatthewBonanni Jan 27, 2026
332df87
Fix MLA
MatthewBonanni Jan 27, 2026
3babc05
Merge branch 'main' into benchmark_attention
MatthewBonanni Jan 27, 2026
73de818
Fix
MatthewBonanni Jan 27, 2026
2d69eaf
Fix
MatthewBonanni Jan 27, 2026
2cdfbca
Fix
MatthewBonanni Jan 27, 2026
b9d9573
Cleanup
MatthewBonanni Jan 27, 2026
9f3a76e
Clean up
MatthewBonanni Jan 27, 2026
b632ae0
Remove unused test
MatthewBonanni Jan 27, 2026
10a68b7
Update README and sample
MatthewBonanni Jan 27, 2026
32e9280
Update README
MatthewBonanni Jan 27, 2026
266 changes: 266 additions & 0 deletions benchmarks/attention_benchmarks/README.md
@@ -0,0 +1,266 @@
# vLLM Attention Benchmarking Suite

Fast, flexible benchmarking for vLLM attention and MLA backends with an extended batch specification grammar.

## Quick Start

```bash
cd benchmarks/attention_benchmarks

# Run a pre-configured benchmark
python benchmark.py --config configs/mla_decode.yaml
python benchmark.py --config configs/mla_mixed_batch.yaml
python benchmark.py --config configs/speculative_decode.yaml
python benchmark.py --config configs/standard_attention.yaml
python benchmark.py --config configs/reorder_threshold.yaml

# Or run custom benchmarks
python benchmark.py \
--backends flash flashinfer \
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
--output-csv results.csv
```

## Simplified Batch Specification Grammar

Express workloads concisely using query length and sequence length:

```python
"q2k" # 2048-token prefill (q_len=2048, seq_len=2048)
"q1s1k" # Decode: 1 token with 1K sequence
"8q1s1k" # 8 decode requests
"q4s1k" # 4-token extend (e.g., spec decode)
"2q2k_32q1s1k" # Mixed: 2 prefills + 32 decodes
"16q4s1k" # 16 spec decode (4 tokens each)
```

### Grammar Rule

```text
Format: (<count>?) q<q_len>(k?) (s<seq_len>(k?))?

- count: Number of identical requests (optional, default=1)
- q_len: Query length (number of new tokens)
- seq_len: Total sequence length (optional, defaults to q_len for prefill)
- 'k': Multiplies value by 1024

Mixed batches: Use _ to combine (e.g., "2q2k_32q1s1k")
```

**Note**: Decode, prefill, and spec decode are just different query lengths - no special syntax needed!
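
To make the grammar concrete, here is a minimal standalone parser sketch. It is an illustrative re-implementation, not the suite's own `parse_batch_spec` (see the Python API section below), and the `ParsedRequest` name is hypothetical.

```python
# Minimal sketch of the batch spec grammar above. Illustrative only; the
# suite ships its own parse_batch_spec, and ParsedRequest is a made-up name.
import re
from dataclasses import dataclass


@dataclass
class ParsedRequest:
    q_len: int    # number of new (query) tokens
    seq_len: int  # total sequence length (context + query)


_SEGMENT = re.compile(r"^(\d+)?q(\d+)(k)?(?:s(\d+)(k)?)?$")


def parse_spec(spec: str) -> list[ParsedRequest]:
    """Parse a spec like '2q2k_32q1s1k' into one entry per request."""
    requests = []
    for segment in spec.split("_"):
        m = _SEGMENT.match(segment)
        if m is None:
            raise ValueError(f"invalid batch spec segment: {segment!r}")
        count, q, qk, s, sk = m.groups()
        q_len = int(q) * (1024 if qk else 1)
        # seq_len defaults to q_len (pure prefill) when the s-part is omitted
        seq_len = int(s) * (1024 if sk else 1) if s else q_len
        requests.extend(ParsedRequest(q_len, seq_len) for _ in range(int(count or 1)))
    return requests


print(len(parse_spec("2q2k_32q1s1k")))  # 34 requests: 2 prefills + 32 decodes
```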

## Pre-configured Benchmarks

The suite includes several pre-configured YAML benchmark configurations:

### MLA Decode Benchmark

Tests pure decode performance across MLA backends with varying batch sizes and sequence lengths.

```bash
python benchmark.py --config configs/mla_decode.yaml
```

### MLA Mixed Batch Benchmark

Tests chunked prefill performance with mixed prefill + decode batches.

```bash
python benchmark.py --config configs/mla_mixed_batch.yaml
```

### Speculative Decoding Benchmark

Tests speculative decode scenarios (K-token verification) and `reorder_batch_threshold` optimization.

```bash
python benchmark.py --config configs/speculative_decode.yaml
```

### Standard Attention Benchmark

Tests standard attention backends (Flash/Triton/FlashInfer) with pure prefill, decode, and mixed batches.

```bash
python benchmark.py --config configs/standard_attention.yaml
```

### Reorder Threshold Study

**Question:** At what query length does the prefill pipeline become faster than the decode pipeline?

Tests query lengths from 1 to 1024 across 9 batch sizes to find the crossover point. Uses `decode_vs_prefill` mode to compare both pipelines for each query length.

```bash
python benchmark.py --config configs/reorder_threshold.yaml
```
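
Once the sweep finishes, the crossover can be read off the saved CSV. The snippet below is a hypothetical post-processing sketch: the column names (`mode`, `q_len`, `mean_time`) are assumptions about the output layout, not the suite's documented schema.

```python
# Hypothetical post-processing sketch. Column names ("mode", "q_len",
# "mean_time") are assumptions about the CSV layout, not a documented schema.
import pandas as pd

df = pd.read_csv("reorder_threshold.csv")
pivot = df.pivot_table(index="q_len", columns="mode", values="mean_time")
# Smallest query length at which the prefill pipeline beats the decode pipeline
faster = pivot.index[pivot["prefill"] < pivot["decode"]]
print(f"Crossover query length: {faster.min() if len(faster) else 'none found'}")
```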

---

## Universal Benchmark

The `benchmark.py` script handles **all** backends - both standard attention and MLA.

### Standard Attention (Flash/Triton/FlashInfer)

```bash
python benchmark.py \
--backends flash triton flashinfer \
--batch-specs "q2k" "8q1s1k" "2q2k_32q1s1k" \
--num-layers 10 \
--repeats 5 \
--output-csv results.csv
```

### MLA Backends

```bash
# Compare all MLA backends
python benchmark.py \
--backends cutlass_mla flashinfer_mla flashattn_mla flashmla \
--batch-specs "64q1s1k" "64q1s4k" \
--output-csv mla_results.csv
```

### Parameter Sweeps

Use `--sweep-param` and `--sweep-values` to run parameter sweeps from the CLI:

#### CUTLASS MLA num-splits Optimization

**Question:** What is the optimal `num_kv_splits` for CUTLASS MLA?

```bash
python benchmark.py \
--backend cutlass_mla \
--batch-specs "64q1s1k" "64q1s4k" "64q1s16k" \
--sweep-param num_kv_splits \
--sweep-values 1 2 4 8 16 \
--output-json optimal_splits.json
```
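
To pick a winner from the sweep output, something like the following works. The JSON record fields (`batch_spec`, `num_kv_splits`, `mean_time`) are guesses based on `BenchmarkConfig`/`BenchmarkResult`, not a documented schema.

```python
# Hypothetical sketch for selecting the fastest num_kv_splits per batch spec.
# Field names ("batch_spec", "num_kv_splits", "mean_time") are assumptions.
import json

with open("optimal_splits.json") as f:
    results = json.load(f)

best: dict[str, dict] = {}
for r in results:
    spec = r["batch_spec"]
    if spec not in best or r["mean_time"] < best[spec]["mean_time"]:
        best[spec] = r

for spec, r in sorted(best.items()):
    print(f"{spec}: num_kv_splits={r['num_kv_splits']} ({r['mean_time']:.6f}s)")
```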

#### Reorder Batch Threshold Optimization

**Question:** What's the optimal `reorder_batch_threshold` for speculative decoding?

```bash
python benchmark.py \
--backend flashmla \
--batch-specs "q4s1k" "q8s2k" \
--sweep-param reorder_batch_threshold \
--sweep-values 1 4 16 64 256 512 \
--output-csv threshold_sweep.csv
```

### All Command-Line Options

```text
--config CONFIG # Path to YAML config file (overrides other args)
--backends BACKEND [BACKEND ...] # flash, triton, flashinfer, cutlass_mla,
# flashinfer_mla, flashattn_mla, flashmla
--backend BACKEND # Single backend (alternative to --backends)
--batch-specs SPEC [SPEC ...] # Batch specifications using extended grammar

# Model configuration
--num-layers N # Number of layers
--head-dim N # Head dimension
--num-q-heads N # Query heads
--num-kv-heads N # KV heads
--block-size N # Block size

# Benchmark settings
--device DEVICE # Device (default: cuda:0)
--repeats N # Repetitions
--warmup-iters N # Warmup iterations
--profile-memory # Profile memory usage

# Parameter sweeps
--sweep-param PARAM # Parameter name to sweep (e.g., num_kv_splits,
# reorder_batch_threshold)
--sweep-values N [N ...] # Values to sweep for the parameter

# Output
--output-csv FILE # Save to CSV
--output-json FILE # Save to JSON
```

## Hardware Requirements

| Backend | Hardware |
|---------|----------|
| Flash/Triton/FlashInfer | Any CUDA GPU |
| CUTLASS MLA | Blackwell (SM100+) |
| FlashAttn MLA | Hopper (SM90+) |
| FlashMLA | Hopper (SM90+) |
| FlashInfer-MLA | Any CUDA GPU |
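
As a quick pre-flight check, the table can be applied programmatically. The sketch below uses standard PyTorch calls; the minimum-SM mapping simply restates the table above, and the suite's own hardware gating may differ.

```python
# Sketch: filter backends by the current GPU's compute capability.
# The minimums restate the table above; actual gating in vLLM may differ.
import torch

MIN_SM = {
    "flash": 0, "triton": 0, "flashinfer": 0, "flashinfer_mla": 0,
    "flashattn_mla": 90, "flashmla": 90, "cutlass_mla": 100,
}

major, minor = torch.cuda.get_device_capability()
sm = major * 10 + minor
eligible = [name for name, req in MIN_SM.items() if sm >= req]
print(f"SM{sm} -> eligible backends: {eligible}")
```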

## Using MLA Runner Directly

All MLA backends are available through `mla_runner.run_mla_benchmark()`:

```python
from mla_runner import run_mla_benchmark
from common import BenchmarkConfig

config = BenchmarkConfig(
backend="cutlass_mla",
batch_spec="64q1s4k",
num_layers=10,
head_dim=576,
num_q_heads=128,
num_kv_heads=1,
block_size=128,
device="cuda:0",
repeats=5,
warmup_iters=3,
)

# CUTLASS MLA with specific num_kv_splits
result = run_mla_benchmark("cutlass_mla", config, num_kv_splits=4)
print(f"Time: {result.mean_time:.6f}s")

# FlashInfer-MLA
result = run_mla_benchmark("flashinfer_mla", config)

# FlashAttn MLA (Hopper SM90+)
result = run_mla_benchmark("flashattn_mla", config, reorder_batch_threshold=64)

# FlashMLA (Hopper SM90+)
result = run_mla_benchmark("flashmla", config, reorder_batch_threshold=64)
```

## Python API

```python
from batch_spec import parse_batch_spec, format_batch_spec, get_batch_stats
from common import BenchmarkConfig, BenchmarkResult, ResultsFormatter

# Parse batch specs
requests = parse_batch_spec("2q2k_q4s1k_32q1s1k")
print(format_batch_spec(requests))
# "2 prefill (2x2k), 1 extend (1xq4kv1k), 32 decode (32x1k)"

# Get batch statistics
stats = get_batch_stats(requests)
print(f"Total tokens: {stats['total_tokens']}")
print(f"Num decode: {stats['num_decode']}, Num prefill: {stats['num_prefill']}")

# Format results
formatter = ResultsFormatter()
formatter.save_csv(results, "output.csv")
formatter.save_json(results, "output.json")
```

## Tips

**1. Warmup matters** - Use `--warmup-iters 10` for stable results

**2. Multiple repeats** - Use `--repeats 20` for low variance

**3. Save results** - Always use `--output-csv` or `--output-json`

**4. Test incrementally** - Start with `--num-layers 1 --repeats 1`

**5. Extended grammar** - Use the batch spec grammar to express spec decode and chunked prefill patterns

**6. Parameter sweeps** - Use `--sweep-param` and `--sweep-values` to find optimal values
44 changes: 44 additions & 0 deletions benchmarks/attention_benchmarks/__init__.py
@@ -0,0 +1,44 @@
# SPDX-License-Identifier: Apache-2.0
# SPDX-FileCopyrightText: Copyright contributors to the vLLM project

"""vLLM Attention Benchmarking Suite."""

from .batch_spec import (
BatchRequest,
format_batch_spec,
get_batch_stats,
parse_batch_spec,
reorder_for_flashinfer,
split_by_type,
)
from .common import (
BenchmarkConfig,
BenchmarkResult,
MockLayer,
MockModelConfig,
ResultsFormatter,
get_attention_scale,
is_mla_backend,
setup_mla_dims,
)

__all__ = [
# Batch specification
"BatchRequest",
"parse_batch_spec",
"format_batch_spec",
"reorder_for_flashinfer",
"split_by_type",
"get_batch_stats",
# Benchmarking infrastructure
"BenchmarkConfig",
"BenchmarkResult",
"ResultsFormatter",
# Mock objects
"MockLayer",
"MockModelConfig",
# Utilities
"setup_mla_dims",
"get_attention_scale",
"is_mla_backend",
]