---
name: add-cuda-kernel
description: Step-by-step tutorial for adding new CUDA kernels to FlashInfer
---

# Tutorial: Adding a New Kernel to FlashInfer

This tutorial walks through adding a simple element-wise scale operation to FlashInfer. We'll implement `scale(x, factor) = x * factor` to demonstrate the complete workflow.
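As a point of reference, the semantics we are targeting can be written in one line of PyTorch (this is just the mathematical definition, not the CUDA implementation; the helper name is illustrative):

```python
import torch

def scale_reference(x: torch.Tensor, factor: float) -> torch.Tensor:
    """Reference semantics for the kernel: multiply every element by factor."""
    return x * factor
```

The CUDA kernel built in the steps below must match this output elementwise.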
```bash
pytest tests/test_scale.py -v
pytest tests/test_scale.py::test_scale_correctness[float16-128] -v
```

## Step 10: Add Benchmark

**All new kernels should have benchmarks.** This helps track performance regressions and allows users to compare against other implementations.

Create a benchmark file in `benchmarks/` (e.g., `benchmarks/bench_scale.py`):

```python
import torch
import numpy as np
from flashinfer.testing import bench_gpu_time

def bench_scale():
    """Benchmark scale kernel."""
    import flashinfer

    sizes = [1024, 4096, 16384, 65536, 262144]
    dtypes = [torch.float16, torch.bfloat16]

    print("Scale Kernel Benchmark")
    print("-" * 60)
    print(f"{'Size':>10} {'Dtype':>10} {'Time (us)':>12} {'Std (us)':>10}")
    print("-" * 60)

    for size in sizes:
        for dtype in dtypes:
            x = torch.randn(size, dtype=dtype, device="cuda")

            # Benchmark with CUPTI (auto-fallback to CUDA events)
            times_ms = bench_gpu_time(
                flashinfer.scale,
                args=(x, 2.0),
                enable_cupti=True,
                dry_run_iters=10,
                repeat_iters=100,
            )
            # bench_gpu_time returns a list of per-iteration times in milliseconds
            median_time_us = np.median(times_ms) * 1000
            std_time_us = np.std(times_ms) * 1000

            print(f"{size:>10} {str(dtype):>10} {median_time_us:>12.2f} {std_time_us:>10.2f}")

if __name__ == "__main__":
    bench_scale()
```

**For more complex kernels**, consider:

- Adding comparisons against reference implementations (e.g., PyTorch native, cuBLAS, cuDNN)
- Using the unified benchmarking framework in `benchmarks/flashinfer_benchmark.py` if applicable
- Testing across different problem sizes and configurations
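When comparing against a reference implementation, a small helper along these lines keeps benchmark scripts honest (the helper name and tolerances are illustrative, not a FlashInfer API):

```python
import torch

def check_against_reference(kernel_fn, ref_fn, x, rtol=1e-3, atol=1e-3):
    """Run both implementations on the same input and compare outputs.

    kernel_fn: the kernel under test (e.g., flashinfer.scale partially applied)
    ref_fn:    a trusted reference (e.g., PyTorch native)
    """
    out = kernel_fn(x)
    ref = ref_fn(x)
    torch.testing.assert_close(out, ref, rtol=rtol, atol=atol)
    return True
```

Calling this once before timing catches a kernel that is fast but wrong.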

→ **For the complete benchmarking guide, see [`.claude/skills/benchmark-kernel/SKILL.md`](../benchmark-kernel/SKILL.md)**

## Summary of Files Created/Modified

```
flashinfer/scale.py # NEW: Python API
flashinfer/__init__.py # MODIFIED: Export API
flashinfer/aot.py # MODIFIED: Register AOT
tests/test_scale.py # NEW: Unit tests
benchmarks/bench_scale.py # NEW: Benchmark script
```
---
name: benchmark-kernel
description: Guide for benchmarking FlashInfer kernels with CUPTI timing
---

# Tutorial: Benchmarking FlashInfer Kernels

This tutorial shows you how to accurately benchmark FlashInfer kernels.
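As a baseline, kernels can be timed with raw CUDA events before reaching for CUPTI. The sketch below uses hypothetical helper names, and the warmup/iteration counts are arbitrary; it records per-iteration times in milliseconds, the same unit `bench_gpu_time` reports:

```python
import statistics

def time_gpu(fn, warmup=10, iters=100):
    """Time a GPU callable with CUDA events; returns per-iteration times in ms."""
    # torch is imported lazily so the statistics helper below stays usable
    # on machines without a GPU stack.
    import torch

    for _ in range(warmup):
        fn()
    torch.cuda.synchronize()
    times_ms = []
    for _ in range(iters):
        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        fn()
        end.record()
        end.synchronize()
        times_ms.append(start.elapsed_time(end))  # elapsed_time returns ms
    return times_ms

def summarize(times_ms):
    """Median and sample standard deviation, converted from ms to us."""
    return statistics.median(times_ms) * 1000, statistics.stdev(times_ms) * 1000
```

CUDA-event timing includes some launch overhead and host-side jitter, which is why CUPTI-based timing is preferred for short kernels.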
---
name: debug-cuda-crash
description: Tutorial for debugging CUDA crashes using API logging
---

# Tutorial: Debugging CUDA Crashes with API Logging

This tutorial shows you how to debug CUDA crashes and errors in FlashInfer using the `@flashinfer_api` logging decorator.
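The decorator's core idea can be sketched in a few lines of plain Python (an illustrative reimplementation, not FlashInfer's actual code): log every call on entry, so that when the process dies inside a CUDA kernel, the last logged entry names the offending call and its arguments.

```python
import functools
import logging

logging.basicConfig(level=logging.DEBUG)
logger = logging.getLogger("flashinfer.api")

def flashinfer_api(fn):
    """Log each API call before it runs; if the process crashes mid-call,
    the last 'CALL' line without a matching 'DONE' line is the culprit."""
    @functools.wraps(fn)
    def wrapper(*args, **kwargs):
        logger.debug("CALL %s args=%r kwargs=%r", fn.__name__, args, kwargs)
        result = fn(*args, **kwargs)
        logger.debug("DONE %s", fn.__name__)
        return result
    return wrapper
```

Because a hard CUDA crash kills the process before the `DONE` line is emitted, the log effectively acts as a breadcrumb trail back to the failing call.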