30 changes: 28 additions & 2 deletions benchmarks/README.md
@@ -5,11 +5,11 @@ The aim of `flashinfer_benchmark.py` is to provide a single framework for benchm
## Overview

This framework provides tools to:
- Benchmark FlashInfer's Attention, GEMM, MOE, Norm, and Quantization API performance from different kernel backends such as FlashAttention2/3, cuDNN, cuBLAS, CUTLASS, CuTe-DSL, and TensorRT-LLM
- Benchmark FlashInfer's Attention, GEMM, MOE, Norm, Quantization, and Sampling API performance from different kernel backends such as FlashAttention2/3, cuDNN, cuBLAS, CUTLASS, CuTe-DSL, and TensorRT-LLM
- Compare performance across different configurations
- Batch performance test multiple test cases

Currently supports testing attention, gemm, fused MOE, normalization, and quantization APIs:
Currently supports testing attention, gemm, fused MOE, normalization, quantization, and sampling APIs:
- Attention:
- `BatchDecodeWithPagedKVCacheWrapper` - Decode attention with paged KV cache.
- Also supports computationally similar `cudnn_batch_decode_with_kv_cache` and `trtllm_batch_decode_with_kv_cache`.
@@ -42,6 +42,14 @@ Currently supports testing attention, gemm, fused MOE, normalization, and quanti
- `mxfp4_quantize` - Quantize tensor to MxFP4 format (Blackwell SM10.0+).
- `nvfp4_quantize` - Quantize tensor to NVFP4 format with configurable scale factor layout (Blackwell SM10.0+).
- `nvfp4_batched_quantize` - Batched NVFP4 quantization (Blackwell SM10.0+).
- Sampling:
- `sampling_from_probs` - Basic categorical sampling from probability distributions.
- `top_p_sampling_from_probs` - Top-p (nucleus) sampling from probabilities.
- `top_k_sampling_from_probs` - Top-k sampling from probabilities.
- `top_k_top_p_sampling_from_probs` - Combined top-k and top-p sampling from probabilities.
- `top_k_renorm_probs` - Renormalize probabilities by top-k thresholding.
- `top_p_renorm_probs` - Renormalize probabilities by top-p thresholding.
- `top_k_mask_logits` - Mask logits by top-k thresholding.
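
The semantics of these routines can be illustrated with a small pure-Python sketch. This is illustrative only: the actual FlashInfer kernels operate on GPU tensors and use fused, deterministic implementations, and the function names below are simplified stand-ins rather than the library API.

```python
# Illustrative sketch of categorical sampling and top-k / top-p
# renormalization; the real FlashInfer kernels do this on GPU tensors.

def sample_from_probs(probs, u):
    """Inverse-CDF sampling: return the first index whose CDF exceeds u."""
    cdf = 0.0
    for i, p in enumerate(probs):
        cdf += p
        if u < cdf:
            return i
    return len(probs) - 1

def top_k_renorm(probs, k):
    """Keep the k largest probabilities, zero the rest, renormalize."""
    threshold = sorted(probs, reverse=True)[k - 1]
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

def top_p_renorm(probs, top_p):
    """Keep the smallest set of highest probabilities whose mass reaches
    top_p, zero the rest, renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [0.0] * len(probs), 0.0
    for i in order:
        kept[i] = probs[i]
        mass += probs[i]
        if mass >= top_p:
            break
    total = sum(kept)
    return [p / total for p in kept]

probs = [0.5, 0.3, 0.1, 0.1]
print(sample_from_probs(probs, u=0.6))  # index 1 (CDF reaches 0.8 there)
print(top_k_renorm(probs, 2))           # mass concentrated on indices 0 and 1
print(top_p_renorm(probs, 0.75))        # same here: 0.5 + 0.3 covers 0.75
```

The sampling variants (`top_p_sampling_from_probs`, etc.) combine such a filtering step with the draw itself in one fused kernel, which is what the benchmark measures.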

## Quick Start
### Single Test Run
@@ -316,6 +324,17 @@ mpirun -np 8 python benchmarks/flashinfer_benchmark.py \
| `--sf_vec_size` | Scale factor vector size for NVFP4 quantization. Default: 16 |
| `--backends` | Backend to test. Default: `cuda` |

### Sampling Flags
| Flag | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------------|
| `--batch_size` | Batch size (number of sequences to sample from) |
| `--vocab_size` | Vocabulary size. Default: 128256 (Llama 3 vocab size) |
| `--input_dtype` | Input data type: `float32` (default), `float16`, or `bfloat16` |
| `--top_p` | Top-p threshold for nucleus sampling. Default: 0.9 |
| `--top_k` | Top-k threshold for top-k sampling. Default: 50 |
| `--no_deterministic` | Disable deterministic sampling. Default: deterministic is enabled |
| `--backends` | Backend to test. Default: `cuda` |
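
As a rough sketch, the flags in the table above could be wired into `argparse` as follows. The real parser lives in `routines/sampling.py` (not shown in this diff), so the option wiring below is inferred from the table, not taken from the source.

```python
# Hedged sketch of the sampling-flag parser; names and defaults are taken
# from the flags table, not from routines/sampling.py.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--batch_size", type=int, default=1)
parser.add_argument("--vocab_size", type=int, default=128256)  # Llama 3 vocab
parser.add_argument("--input_dtype", default="float32",
                    choices=["float32", "float16", "bfloat16"])
parser.add_argument("--top_p", type=float, default=0.9)
parser.add_argument("--top_k", type=int, default=50)
# --no_deterministic clears the deterministic default (store_false pattern).
parser.add_argument("--no_deterministic", dest="deterministic",
                    action="store_false")
parser.add_argument("--backends", nargs="+", default=["cuda"])

args = parser.parse_args(["--top_p", "0.8", "--no_deterministic"])
print(args.top_p, args.deterministic)  # 0.8 False
```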

## `flashinfer_benchmark.py` Routine & Backend Support Matrix
The following table summarizes the support surface of each routine & backend on various [CUDA Compute Capabilities](https://developer.nvidia.com/cuda-gpus).

@@ -357,6 +376,13 @@ Legend:
| **mxfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_batched_quantize** | | | | | | cuda | cuda | |
| **sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_mask_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
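
A matrix like the one above can be queried programmatically by compute capability. The helper below is a simplified, hypothetical stand-in; the real lookup lives in `routines/flashinfer_benchmark_utils.py`, and only two routines are included here for brevity.

```python
# Hypothetical, trimmed-down support matrix:
# routine -> {compute capability: [supported backends]}
SUPPORT_MATRIX = {
    "top_p_sampling_from_probs": {
        "7.5": ["cuda"], "8.0": ["cuda"], "9.0": ["cuda"], "10.0": ["cuda"],
    },
    "nvfp4_quantize": {
        "10.0": ["cuda"], "10.3": ["cuda"],  # Blackwell SM10.0+ only
    },
}

def supported_backends(routine, compute_capability):
    """Return the backends supported for a routine on a given CC, or []."""
    return SUPPORT_MATRIX.get(routine, {}).get(compute_capability, [])

print(supported_backends("top_p_sampling_from_probs", "9.0"))  # ['cuda']
print(supported_backends("nvfp4_quantize", "9.0"))             # []
```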

Backend Legend:
- fa2: FlashAttention2
11 changes: 10 additions & 1 deletion benchmarks/flashinfer_benchmark.py
@@ -44,6 +44,10 @@ def run_test(args):
from routines.quantization import run_quantization_test

res = run_quantization_test(args)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import run_sampling_test

res = run_sampling_test(args)
else:
raise ValueError(f"Unsupported routine: {args.routine}")

@@ -89,7 +93,8 @@ def parse_args(line=sys.argv[1:]):
+ list(benchmark_apis["moe"])
+ list(benchmark_apis["moe_comm"])
+ list(benchmark_apis["norm"])
+ list(benchmark_apis["quantization"]),
+ list(benchmark_apis["quantization"])
+ list(benchmark_apis["sampling"]),
)
args, _ = parser.parse_known_args(line[:])

@@ -199,6 +204,10 @@ def parse_args(line=sys.argv[1:]):
from routines.quantization import parse_quantization_args

args = parse_quantization_args(line, parser)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import parse_sampling_args

args = parse_sampling_args(line, parser)
else:
raise ValueError(f"Unsupported routine: {args.routine}")

87 changes: 87 additions & 0 deletions benchmarks/routines/flashinfer_benchmark_utils.py
@@ -94,6 +94,12 @@
"do_shuffle",
"sf_vec_size",
],
"sampling": [
"vocab_size",
"top_p",
"top_k",
"deterministic",
],
"general": [
"batch_size",
"hidden_size",
@@ -118,6 +124,7 @@
+ output_column_dict["moe_comm"]
+ output_column_dict["norm"]
+ output_column_dict["quantization"]
+ output_column_dict["sampling"]
+ output_column_dict["general"]
)

@@ -157,6 +164,15 @@
"nvfp4_quantize",
"nvfp4_batched_quantize",
],
"sampling": [
"sampling_from_probs",
"top_p_sampling_from_probs",
"top_k_sampling_from_probs",
"top_k_top_p_sampling_from_probs",
"top_k_renorm_probs",
"top_p_renorm_probs",
"top_k_mask_logits",
],
}


@@ -431,6 +447,77 @@ def dtype_str_to_torch_dtype(dtype_str):
"10.3": ["cuda"],
"12.0": ["cuda"],
},
# SAMPLING - supported on all architectures
"sampling_from_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_p_sampling_from_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_k_sampling_from_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_k_top_p_sampling_from_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_k_renorm_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_p_renorm_probs": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
"top_k_mask_logits": {
"7.5": ["cuda"],
"8.0": ["cuda"],
"8.6": ["cuda"],
"8.9": ["cuda"],
"9.0": ["cuda"],
"10.0": ["cuda"],
"10.3": ["cuda"],
"12.0": ["cuda"],
},
}
Review comment on lines +450 to 521 (Contributor, severity: medium):

There's a lot of code duplication here: all sampling routines share the same supported-compute-capability matrix. To improve maintainability, define the support dictionary once and reuse it for every sampling routine; a dictionary comprehension makes this concise.

    # SAMPLING - supported on all architectures
    **{
        routine: {
            "7.5": ["cuda"],
            "8.0": ["cuda"],
            "8.6": ["cuda"],
            "8.9": ["cuda"],
            "9.0": ["cuda"],
            "10.0": ["cuda"],
            "10.3": ["cuda"],
            "12.0": ["cuda"],
        }
        for routine in benchmark_apis["sampling"]
    },
}


