Merged
90 changes: 88 additions & 2 deletions benchmarks/README.md
@@ -5,11 +5,11 @@ The aim of `flashinfer_benchmark.py` is to provide a single framework for benchm
## Overview

This framework provides tools to:
- Benchmark FlashInfer's Attention, GEMM, MOE, Norm, Quantization, Sampling, and RoPE API performance from different kernel backends such as FlashAttention2/3, cuDNN, cuBLAS, CUTLASS, CuTe-DSL, and TensorRT-LLM
- Compare performance across different configurations
- Batch performance test multiple test cases

Currently supports testing attention, gemm, fused MOE, normalization, quantization, sampling, and RoPE APIs:
- Attention:
- `BatchDecodeWithPagedKVCacheWrapper` - Decode attention with paged KV cache.
- Also supports computationally similar `cudnn_batch_decode_with_kv_cache` and `trtllm_batch_decode_with_kv_cache`.
@@ -42,6 +42,31 @@
- `mxfp4_quantize` - Quantize tensor to MxFP4 format (Blackwell SM10.0+).
- `nvfp4_quantize` - Quantize tensor to NVFP4 format with configurable scale factor layout (Blackwell SM10.0+).
- `nvfp4_batched_quantize` - Batched NVFP4 quantization (Blackwell SM10.0+).
- Sampling:
- `softmax` - Softmax with optional temperature scaling.
- `sampling_from_probs` - Sample token indices from probability distributions.
- `sampling_from_logits` - Sample token indices from logits (fused softmax + sampling).
- `top_k_sampling_from_probs` - Top-K sampling from probabilities.
- `top_p_sampling_from_probs` - Top-P (nucleus) sampling from probabilities.
- `top_k_top_p_sampling_from_probs` - Combined Top-K and Top-P sampling from probabilities.
- `top_k_top_p_sampling_from_logits` - Combined Top-K and Top-P sampling from logits.
- `min_p_sampling_from_probs` - Min-P sampling from probabilities.
- `top_k_renorm_probs` - Renormalize probabilities after Top-K filtering.
- `top_p_renorm_probs` - Renormalize probabilities after Top-P filtering.
- `top_k_mask_logits` - Mask logits outside Top-K values.
- `chain_speculative_sampling` - Chain speculative sampling for speculative decoding.
- `top_k` - Radix-based Top-K selection.
- `top_k_page_table_transform` - Fused Top-K with page table lookup.
- `top_k_ragged_transform` - Fused Top-K with ragged index transform.
- RoPE (Rotary Positional Embeddings):
- `apply_rope` - Apply RoPE with indptr/offsets.
- `apply_rope_pos_ids` - Apply RoPE with position IDs.
- `apply_llama31_rope` - Apply Llama 3.1 style RoPE with indptr/offsets.
- `apply_llama31_rope_pos_ids` - Apply Llama 3.1 style RoPE with position IDs.
- `apply_rope_with_cos_sin_cache` - Apply RoPE with precomputed cos/sin cache.
- `mla_rope_quantize_fp8` - MLA RoPE with FP8 quantization (SM8.9+).
- `rope_quantize_fp8` - RoPE with FP8 quantization (SM8.9+).
- `rope_quantize_fp8_append_paged_kv_cache` - RoPE with FP8 quantization and paged KV cache append (SM8.9+).

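For intuition on the sampling routines above, top-p (nucleus) filtering — the operation that kernels such as `top_p_sampling_from_probs` fuse with the sampling step on the GPU — can be sketched in pure Python. This is an illustrative sketch of the math only, not FlashInfer's implementation; the helper name `top_p_filter` is invented here.

```python
# Illustrative sketch of top-p (nucleus) filtering: keep the smallest set of
# highest-probability tokens whose cumulative mass reaches top_p, zero out
# the rest, and renormalize. Kernels like `top_p_sampling_from_probs` fuse
# this filtering with sampling itself.

def top_p_filter(probs, top_p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# top_p=0.8 keeps the 0.5 and 0.3 tokens and renormalizes them to sum to 1.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.8))  # ≈ [0.625, 0.375, 0.0, 0.0]
```

The benchmarked kernels perform this per row of a `(batch_size, vocab_size)` tensor, which is why sorting cost dominates and fused GPU implementations matter.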
## Quick Start
### Single Test Run
@@ -316,6 +341,44 @@ mpirun -np 8 python benchmarks/flashinfer_benchmark.py \
| `--sf_vec_size` | Scale factor vector size for NVFP4 quantization. Default: 16 |
| `--backends` | Backend to test. Default: `cuda` |

### Sampling Flags

| Flag | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------------|
| `--batch_size` | Batch size (number of sequences) |
| `--vocab_size` | Vocabulary size |
| `--input_dtype` | Input data type for logits: `float32` (default), `float16`, or `bfloat16` |
| `--top_k` | Top-K value for top-k sampling. Default: 50 |
| `--top_p` | Top-P threshold for top-p (nucleus) sampling. Default: 0.9 |
| `--min_p` | Min-P threshold for min-p sampling. Default: 0.1 |
| `--temperature` | Temperature for softmax. Default: 1.0 |
| `--filter_apply_order` | Order of applying top-k and top-p filters: `top_k_first` (default) or `joint` |
| `--num_speculate_tokens` | Number of speculative tokens for chain speculative sampling. Default: 5 |
| `--max_len` | Max sequence length for `top_k_page_table_transform` and `top_k_ragged_transform`. Default: 4096 |
| `--num_rows` | Number of rows for `top_k_page_table_transform` and `top_k_ragged_transform`. Defaults to batch_size |
| `--backends` | Backend to test: `cuda` (default) |

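For intuition on the `--min_p` flag: min-p filtering keeps only tokens whose probability is at least `min_p` times the peak probability, so the cutoff adapts to how peaked the distribution is. A hedged pure-Python sketch of the math behind `min_p_sampling_from_probs` (illustrative only; `min_p_filter` is not a FlashInfer API):

```python
# Illustrative sketch of min-p filtering: the cutoff scales with the
# distribution's peak, so confident (peaked) distributions prune more
# aggressively than flat ones.

def min_p_filter(probs, min_p):
    threshold = min_p * max(probs)          # cutoff relative to the peak
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With the default min_p=0.1 the cutoff is 0.1 * 0.5 = 0.05, so only the
# 0.04 tail token is dropped before renormalization.
print(min_p_filter([0.5, 0.3, 0.16, 0.04], 0.1))
```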
### RoPE Flags

| Flag | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------------|
| `--batch_size` | Batch size (number of sequences) |
| `--seq_len` | Sequence length (qkv_len or kv_len) |
| `--num_qo_heads` | Number of query/output heads |
| `--num_kv_heads` | Number of key/value heads |
| `--head_dim` | Head dimension |
| `--rotary_dim` | Rotary dimension (defaults to head_dim if not specified) |
| `--no_rope_dim` | Number of dimensions without RoPE (for MLA). Default: 0 |
| `--input_dtype` | Input data type: `float16` (default) or `bfloat16` |
| `--quant_dtype` | Quantized data type for FP8 routines: `fp8_e4m3` (default) or `fp8_e5m2` |
| `--rope_scale` | RoPE scaling factor. Default: 1.0 |
| `--rope_theta` | RoPE theta base frequency. Default: 10000.0 |
| `--interleave` | Use interleaved rotary embedding (GPT-J style) |
| `--page_size` | Page size for paged KV cache. Default: 16 |
| `--kv_layout` | KV cache layout: `NHD` (default) or `HND` |
| `--low_freq_factor` | Low frequency factor for Llama 3.1 RoPE. Default: 1.0 |
| `--high_freq_factor` | High frequency factor for Llama 3.1 RoPE. Default: 4.0 |
| `--old_context_len` | Old context length for Llama 3.1 RoPE. Default: 8192 |
| `--backends` | Backend to test: `cuda` (default) |

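The RoPE flags above parameterize two pieces of math: the per-pair rotation itself (`--rope_theta`, `--rotary_dim`, `--interleave`) and, for the Llama 3.1 routines, a frequency-smoothing rule (`--rope_scale`, `--low_freq_factor`, `--high_freq_factor`, `--old_context_len`). The following pure-Python sketch illustrates both under stated assumptions — GPT-J-style interleaved pairing and the commonly published Llama 3.1 smoothing rule — and is not FlashInfer's kernel code; the helper names are invented here.

```python
import math

def rope_rotate(x, pos, theta=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position-dependent
    angles (GPT-J-style interleaved pairing, i.e. the --interleave layout)."""
    dim, out = len(x), list(x)
    for i in range(dim // 2):
        inv_freq = theta ** (-2.0 * i / dim)   # per-pair inverse frequency
        angle = pos * inv_freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s             # 2-D rotation of the pair
        out[2 * i + 1] = a * s + b * c
    return out

def llama31_scale_inv_freq(inv_freq, rope_scale=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, old_context_len=8192):
    """Llama 3.1 smoothing as commonly published: high-frequency components
    pass through, low-frequency components are divided by rope_scale, and
    the band in between is linearly interpolated."""
    wavelen = 2.0 * math.pi / inv_freq
    if wavelen < old_context_len / high_freq_factor:
        return inv_freq
    if wavelen > old_context_len / low_freq_factor:
        return inv_freq / rope_scale
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1.0 - smooth) * inv_freq / rope_scale + smooth * inv_freq

# Position 0 is the identity rotation, and rotation preserves the norm.
v = [1.0, 0.0, 0.0, 1.0]
assert rope_rotate(v, 0) == v
```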
Comment on lines +344 to +381
Contributor
⚠️ Potential issue | 🟡 Minor

Add blank lines around the new tables to satisfy markdownlint.
Lines 345 and 361 start tables immediately after headings; MD058 expects blank lines before/after tables.


## `flashinfer_benchmark.py` Routine & Backend Support Matrix
The following table summarizes the support surface of each routine and backend on various [CUDA Compute Capabilities](https://developer.nvidia.com/cuda-gpus).

@@ -357,6 +420,29 @@ Legend:
| **mxfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_batched_quantize** | | | | | | cuda | cuda | |
| **softmax** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **sampling_from_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_top_p_sampling_from_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **min_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_mask_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **chain_speculative_sampling** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_page_table_transform** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_ragged_transform** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope_pos_ids** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_llama31_rope** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_llama31_rope_pos_ids** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope_with_cos_sin_cache** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **mla_rope_quantize_fp8** | | | | cuda | cuda | cuda | cuda | cuda |
| **rope_quantize_fp8** | | | | cuda | cuda | cuda | cuda | cuda |
| **rope_quantize_fp8_append_paged_kv_cache** | | | | cuda | cuda | cuda | cuda | cuda |

Backend Legend:
- fa2: FlashAttention2
20 changes: 19 additions & 1 deletion benchmarks/flashinfer_benchmark.py
@@ -44,6 +44,14 @@ def run_test(args):
from routines.quantization import run_quantization_test

res = run_quantization_test(args)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import run_sampling_test

res = run_sampling_test(args)
elif args.routine in benchmark_apis["rope"]:
from routines.rope import run_rope_test

res = run_rope_test(args)
else:
raise ValueError(f"Unsupported routine: {args.routine}")

@@ -89,7 +97,9 @@ def parse_args(line=sys.argv[1:]):
+ list(benchmark_apis["moe"])
+ list(benchmark_apis["moe_comm"])
+ list(benchmark_apis["norm"])
+ list(benchmark_apis["quantization"])
+ list(benchmark_apis["sampling"])
+ list(benchmark_apis["rope"]),
)
args, _ = parser.parse_known_args(line[:])

@@ -199,6 +209,14 @@ def parse_args(line=sys.argv[1:]):
from routines.quantization import parse_quantization_args

args = parse_quantization_args(line, parser)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import parse_sampling_args

args = parse_sampling_args(line, parser)
elif args.routine in benchmark_apis["rope"]:
from routines.rope import parse_rope_args

args = parse_rope_args(line, parser)
else:
raise ValueError(f"Unsupported routine: {args.routine}")
