Merged
90 changes: 88 additions & 2 deletions benchmarks/README.md
@@ -5,11 +5,11 @@ The aim of `flashinfer_benchmark.py` is to provide a single framework for benchm
## Overview

This framework provides tools to:
- Benchmark FlashInfer's Attention, GEMM, MOE, Norm, Quantization, Sampling, and RoPE API performance from different kernel backends such as FlashAttention2/3, cuDNN, cuBLAS, CUTLASS, CuTe-DSL, and TensorRT-LLM
- Compare performance across different configurations
- Batch performance test multiple test cases

Currently supports testing attention, gemm, fused MOE, normalization, quantization, sampling, and RoPE APIs:
- Attention:
- `BatchDecodeWithPagedKVCacheWrapper` - Decode attention with paged KV cache.
- Also supports computationally similar `cudnn_batch_decode_with_kv_cache` and `trtllm_batch_decode_with_kv_cache`.
@@ -42,6 +42,31 @@
- `mxfp4_quantize` - Quantize tensor to MxFP4 format (Blackwell SM10.0+).
- `nvfp4_quantize` - Quantize tensor to NVFP4 format with configurable scale factor layout (Blackwell SM10.0+).
- `nvfp4_batched_quantize` - Batched NVFP4 quantization (Blackwell SM10.0+).
- Sampling:
- `softmax` - Softmax with optional temperature scaling.
- `sampling_from_probs` - Sample token indices from probability distributions.
- `sampling_from_logits` - Sample token indices from logits (fused softmax + sampling).
- `top_k_sampling_from_probs` - Top-K sampling from probabilities.
- `top_p_sampling_from_probs` - Top-P (nucleus) sampling from probabilities.
- `top_k_top_p_sampling_from_probs` - Combined Top-K and Top-P sampling from probabilities.
- `top_k_top_p_sampling_from_logits` - Combined Top-K and Top-P sampling from logits.
- `min_p_sampling_from_probs` - Min-P sampling from probabilities.
- `top_k_renorm_probs` - Renormalize probabilities after Top-K filtering.
- `top_p_renorm_probs` - Renormalize probabilities after Top-P filtering.
- `top_k_mask_logits` - Mask logits outside Top-K values.
- `chain_speculative_sampling` - Chain speculative sampling for speculative decoding.
- `top_k` - Radix-based Top-K selection.
- `top_k_page_table_transform` - Fused Top-K with page table lookup.
- `top_k_ragged_transform` - Fused Top-K with ragged index transform.
- RoPE (Rotary Positional Embeddings):
- `apply_rope` - Apply RoPE with indptr/offsets.
- `apply_rope_pos_ids` - Apply RoPE with position IDs.
- `apply_llama31_rope` - Apply Llama 3.1 style RoPE with indptr/offsets.
- `apply_llama31_rope_pos_ids` - Apply Llama 3.1 style RoPE with position IDs.
- `apply_rope_with_cos_sin_cache` - Apply RoPE with precomputed cos/sin cache.
- `mla_rope_quantize_fp8` - MLA RoPE with FP8 quantization (SM8.9+).
- `rope_quantize_fp8` - RoPE with FP8 quantization (SM8.9+).
- `rope_quantize_fp8_append_paged_kv_cache` - RoPE with FP8 quantization and paged KV cache append (SM8.9+).

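For intuition on the sampling routines above, top-p (nucleus) filtering — the operation that kernels such as `top_p_sampling_from_probs` fuse with the sampling step on the GPU — can be sketched in pure Python. This is an illustrative sketch of the math only, not FlashInfer's implementation; the helper name `top_p_filter` is invented here.

```python
# Illustrative sketch of top-p (nucleus) filtering: keep the smallest set of
# highest-probability tokens whose cumulative mass reaches top_p, zero out
# the rest, and renormalize. Kernels like `top_p_sampling_from_probs` fuse
# this filtering with sampling itself.

def top_p_filter(probs, top_p):
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = set(), 0.0
    for i in order:
        kept.add(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    filtered = [p if i in kept else 0.0 for i, p in enumerate(probs)]
    total = sum(filtered)
    return [p / total for p in filtered]

# top_p=0.8 keeps the 0.5 and 0.3 tokens and renormalizes them to sum to 1.
print(top_p_filter([0.5, 0.3, 0.15, 0.05], 0.8))  # ≈ [0.625, 0.375, 0.0, 0.0]
```

The benchmarked kernels perform this per row of a `(batch_size, vocab_size)` tensor, which is why sorting cost dominates and fused GPU implementations matter.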
## Quick Start
### Single Test Run
@@ -316,6 +341,44 @@ mpirun -np 8 python benchmarks/flashinfer_benchmark.py \
| `--sf_vec_size` | Scale factor vector size for NVFP4 quantization. Default: 16 |
| `--backends` | Backend to test. Default: `cuda` |

### Sampling Flags

| Flag | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------------|
| `--batch_size` | Batch size (number of sequences) |
| `--vocab_size` | Vocabulary size |
| `--input_dtype` | Input data type for logits: `float32` (default), `float16`, or `bfloat16` |
| `--top_k` | Top-K value for top-k sampling. Default: 50 |
| `--top_p` | Top-P threshold for top-p (nucleus) sampling. Default: 0.9 |
| `--min_p` | Min-P threshold for min-p sampling. Default: 0.1 |
| `--temperature` | Temperature for softmax. Default: 1.0 |
| `--filter_apply_order` | Order of applying top-k and top-p filters: `top_k_first` (default) or `joint` |
| `--num_speculate_tokens` | Number of speculative tokens for chain speculative sampling. Default: 5 |
| `--max_len` | Max sequence length for `top_k_page_table_transform` and `top_k_ragged_transform`. Default: 4096 |
| `--num_rows` | Number of rows for `top_k_page_table_transform` and `top_k_ragged_transform`. Defaults to batch_size |
| `--backends` | Backend to test: `cuda` (default) |

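For intuition on the `--min_p` flag: min-p filtering keeps only tokens whose probability is at least `min_p` times the peak probability, so the cutoff adapts to how peaked the distribution is. A hedged pure-Python sketch of the math behind `min_p_sampling_from_probs` (illustrative only; `min_p_filter` is not a FlashInfer API):

```python
# Illustrative sketch of min-p filtering: the cutoff scales with the
# distribution's peak, so confident (peaked) distributions prune more
# aggressively than flat ones.

def min_p_filter(probs, min_p):
    threshold = min_p * max(probs)          # cutoff relative to the peak
    kept = [p if p >= threshold else 0.0 for p in probs]
    total = sum(kept)
    return [p / total for p in kept]

# With the default min_p=0.1 the cutoff is 0.1 * 0.5 = 0.05, so only the
# 0.04 tail token is dropped before renormalization.
print(min_p_filter([0.5, 0.3, 0.16, 0.04], 0.1))
```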
### RoPE Flags

| Flag | Description |
|--------------------------|-------------------------------------------------------------------------------------------------------------|
| `--batch_size` | Batch size (number of sequences) |
| `--seq_len` | Sequence length (qkv_len or kv_len) |
| `--num_qo_heads` | Number of query/output heads |
| `--num_kv_heads` | Number of key/value heads |
| `--head_dim` | Head dimension |
| `--rotary_dim` | Rotary dimension (defaults to head_dim if not specified) |
| `--no_rope_dim` | Number of dimensions without RoPE (for MLA). Default: 0 |
| `--input_dtype` | Input data type: `float16` (default) or `bfloat16` |
| `--quant_dtype` | Quantized data type for FP8 routines: `fp8_e4m3` (default) or `fp8_e5m2` |
| `--rope_scale` | RoPE scaling factor. Default: 1.0 |
| `--rope_theta` | RoPE theta base frequency. Default: 10000.0 |
| `--interleave` | Use interleaved rotary embedding (GPT-J style) |
| `--page_size` | Page size for paged KV cache. Default: 16 |
| `--kv_layout` | KV cache layout: `NHD` (default) or `HND` |
| `--low_freq_factor` | Low frequency factor for Llama 3.1 RoPE. Default: 1.0 |
| `--high_freq_factor` | High frequency factor for Llama 3.1 RoPE. Default: 4.0 |
| `--old_context_len` | Old context length for Llama 3.1 RoPE. Default: 8192 |
| `--backends` | Backend to test: `cuda` (default) |

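The RoPE flags above parameterize two pieces of math: the per-pair rotation itself (`--rope_theta`, `--rotary_dim`, `--interleave`) and, for the Llama 3.1 routines, a frequency-smoothing rule (`--rope_scale`, `--low_freq_factor`, `--high_freq_factor`, `--old_context_len`). The following pure-Python sketch illustrates both under stated assumptions — GPT-J-style interleaved pairing and the commonly published Llama 3.1 smoothing rule — and is not FlashInfer's kernel code; the helper names are invented here.

```python
import math

def rope_rotate(x, pos, theta=10000.0):
    """Rotate consecutive pairs (x[2i], x[2i+1]) by position-dependent
    angles (GPT-J-style interleaved pairing, i.e. the --interleave layout)."""
    dim, out = len(x), list(x)
    for i in range(dim // 2):
        inv_freq = theta ** (-2.0 * i / dim)   # per-pair inverse frequency
        angle = pos * inv_freq
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out[2 * i] = a * c - b * s             # 2-D rotation of the pair
        out[2 * i + 1] = a * s + b * c
    return out

def llama31_scale_inv_freq(inv_freq, rope_scale=8.0, low_freq_factor=1.0,
                           high_freq_factor=4.0, old_context_len=8192):
    """Llama 3.1 smoothing as commonly published: high-frequency components
    pass through, low-frequency components are divided by rope_scale, and
    the band in between is linearly interpolated."""
    wavelen = 2.0 * math.pi / inv_freq
    if wavelen < old_context_len / high_freq_factor:
        return inv_freq
    if wavelen > old_context_len / low_freq_factor:
        return inv_freq / rope_scale
    smooth = (old_context_len / wavelen - low_freq_factor) / (
        high_freq_factor - low_freq_factor)
    return (1.0 - smooth) * inv_freq / rope_scale + smooth * inv_freq

# Position 0 is the identity rotation, and rotation preserves the norm.
v = [1.0, 0.0, 0.0, 1.0]
assert rope_rotate(v, 0) == v
```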
Comment on lines +344 to +381
Contributor
⚠️ Potential issue | 🟡 Minor

Add blank lines around the new tables to satisfy markdownlint.
Lines 345 and 361 start tables immediately after headings; MD058 expects blank lines before/after tables.


## `flashinfer_benchmark.py` Routine & Backend Support Matrix
The following table summarizes the support surface of each routine and backend on various [CUDA Compute Capabilities](https://developer.nvidia.com/cuda-gpus).

@@ -357,6 +420,29 @@ Legend:
| **mxfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_quantize** | | | | | | cuda | cuda | |
| **nvfp4_batched_quantize** | | | | | | cuda | cuda | |
| **softmax** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **sampling_from_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_top_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_top_p_sampling_from_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **min_p_sampling_from_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_p_renorm_probs** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_mask_logits** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **chain_speculative_sampling** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_page_table_transform** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **top_k_ragged_transform** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope_pos_ids** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_llama31_rope** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_llama31_rope_pos_ids** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **apply_rope_with_cos_sin_cache** | cuda | cuda | cuda | cuda | cuda | cuda | cuda | cuda |
| **mla_rope_quantize_fp8** | | | | cuda | cuda | cuda | cuda | cuda |
| **rope_quantize_fp8** | | | | cuda | cuda | cuda | cuda | cuda |
| **rope_quantize_fp8_append_paged_kv_cache** | | | | cuda | cuda | cuda | cuda | cuda |

Backend Legend:
- fa2: FlashAttention2
20 changes: 19 additions & 1 deletion benchmarks/flashinfer_benchmark.py
@@ -44,6 +44,14 @@ def run_test(args):
from routines.quantization import run_quantization_test

res = run_quantization_test(args)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import run_sampling_test

res = run_sampling_test(args)
elif args.routine in benchmark_apis["rope"]:
from routines.rope import run_rope_test

res = run_rope_test(args)
else:
raise ValueError(f"Unsupported routine: {args.routine}")

@@ -89,7 +97,9 @@ def parse_args(line=sys.argv[1:]):
+ list(benchmark_apis["moe"])
+ list(benchmark_apis["moe_comm"])
+ list(benchmark_apis["norm"])
+ list(benchmark_apis["quantization"])
+ list(benchmark_apis["sampling"])
+ list(benchmark_apis["rope"]),
)
args, _ = parser.parse_known_args(line[:])

@@ -199,6 +209,14 @@ def parse_args(line=sys.argv[1:]):
from routines.quantization import parse_quantization_args

args = parse_quantization_args(line, parser)
elif args.routine in benchmark_apis["sampling"]:
from routines.sampling import parse_sampling_args

args = parse_sampling_args(line, parser)
elif args.routine in benchmark_apis["rope"]:
from routines.rope import parse_rope_args

args = parse_rope_args(line, parser)
else:
raise ValueError(f"Unsupported routine: {args.routine}")
