FlashInfer is a library and kernel generator for Large Language Models that provides high-performance implementations of LLM GPU kernels such as FlashAttention, SparseAttention, PageAttention, Sampling, and more. FlashInfer focuses on LLM serving and inference, and delivers state-of-the-art performance across diverse scenarios.
Check our [v0.2 release blog](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) for new features!
The core features of FlashInfer include:
1. **Efficient Sparse/Dense Attention Kernels**: Efficient single- and batch-request attention for sparse (paged) and dense KV storage, on both CUDA Cores and Tensor Cores (FA2 and FA3 templates). The vector-sparse attention kernels can achieve 90% of the bandwidth of dense kernels at the same problem size.
2. **Load-Balanced Scheduling**: FlashInfer decouples the `plan` and `run` stages of attention computation: variable-length inputs are scheduled in the `plan` stage to alleviate load-imbalance issues at `run` time (see the sketch after this list).
3. **Memory Efficiency**: FlashInfer offers [Cascade Attention](https://docs.flashinfer.ai/api/cascade.html#flashinfer.cascade.MultiLevelCascadeAttentionWrapper) for hierarchical KV-Cache, implements Head-Query fusion to accelerate Grouped-Query Attention, and provides efficient kernels for low-precision attention and fused-RoPE attention on compressed KV-Cache.
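
As a rough sketch of how the `plan`/`run` split is used in practice, the snippet below follows FlashInfer's PyTorch `BatchDecodeWithPagedKVCacheWrapper`. The tensor shapes, page-table contents, and workspace size are made-up toy values, and argument names may differ slightly between releases, so treat it as illustrative and consult the API docs before copying.

```python
import torch
import flashinfer

# Toy problem: 2 requests sharing a paged KV-cache of 8 pages of 16 tokens each.
batch_size, page_size, num_pages = 2, 16, 8
num_qo_heads, num_kv_heads, head_dim = 32, 8, 128

# Paged KV-cache in "NHD" layout: [num_pages, 2 (K/V), page_size, num_kv_heads, head_dim].
kv_cache = torch.randn(num_pages, 2, page_size, num_kv_heads, head_dim,
                       dtype=torch.float16, device="cuda")
# CSR-style page table: request 0 owns pages [0, 1, 2], request 1 owns pages [3, 4].
kv_indptr = torch.tensor([0, 3, 5], dtype=torch.int32, device="cuda")
kv_indices = torch.tensor([0, 1, 2, 3, 4], dtype=torch.int32, device="cuda")
kv_last_page_len = torch.tensor([7, 16], dtype=torch.int32, device="cuda")

workspace = torch.empty(128 * 1024 * 1024, dtype=torch.uint8, device="cuda")
wrapper = flashinfer.BatchDecodeWithPagedKVCacheWrapper(workspace, "NHD")

# plan() sees only the variable-length metadata and builds a load-balanced schedule;
# run() then executes the decode kernel with that schedule, once per decoding step.
wrapper.plan(kv_indptr, kv_indices, kv_last_page_len,
             num_qo_heads, num_kv_heads, head_dim, page_size,
             data_type=torch.float16)
q = torch.randn(batch_size, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
out = wrapper.run(q, kv_cache)  # [batch_size, num_qo_heads, head_dim]
```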
FlashInfer supports PyTorch, TVM and C++ (header-only) APIs, and can be easily integrated into existing projects.
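
For example, through the PyTorch API a single-request prefill attention call is a one-liner. The snippet below uses `flashinfer.single_prefill_with_kv_cache` with illustrative shapes (a grouped-query setup with 32 query heads over 8 KV heads) and is meant as a sketch rather than a definitive reference.

```python
import torch
import flashinfer

num_qo_heads, num_kv_heads, head_dim = 32, 8, 128
q = torch.randn(2048, num_qo_heads, head_dim, dtype=torch.float16, device="cuda")
k = torch.randn(4096, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")
v = torch.randn(4096, num_kv_heads, head_dim, dtype=torch.float16, device="cuda")

# Dense causal prefill attention; the grouped-query mismatch between 32 query
# heads and 8 KV heads is resolved inside the kernel.
out = flashinfer.single_prefill_with_kv_cache(q, k, v, causal=True)
print(out.shape)  # torch.Size([2048, 32, 128])
```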
## News
- [Mar 10, 2025] [Blog Post](https://flashinfer.ai/2025/03/10/sampling.html) Sorting-Free GPU Kernels for LLM Sampling, which explains the design of the sampling kernels in FlashInfer.
- [Mar 1, 2025] Check out FlashInfer's [intra-kernel profiler](https://github.com/flashinfer-ai/flashinfer/tree/main/profiler) for visualizing the timeline of each threadblock in GPU kernels.
- [Dec 16, 2024] [Blog Post](https://flashinfer.ai/2024/12/16/flashinfer-v02-release.html) FlashInfer 0.2 - Efficient and Customizable Kernels for LLM Inference Serving
## Installation

```bash
pip install flashinfer-python
```
**Package Options:**
- **flashinfer-python**: Core package that compiles/downloads kernels on first use
- **flashinfer-cubin**: Pre-compiled kernel binaries for all supported GPU architectures
- **flashinfer-jit-cache**: Pre-built kernel cache for specific CUDA versions
**For faster initialization and offline usage**, install the optional packages to have most kernels pre-compiled:
```bash
pip install flashinfer-python flashinfer-cubin
# JIT cache package (replace cu129 with your CUDA version: cu128, cu129, or cu130)
pip install flashinfer-jit-cache --index-url https://flashinfer.ai/whl/cu129
```

Logging is configured through environment variables:

```bash
# Set log destination (stdout (default), stderr, or file path)
export FLASHINFER_LOGDEST=stdout
```
For detailed information about logging levels, configuration, and advanced features, see [LOGGING.md](LOGGING.md).
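
As a small illustration, logging can also be configured from Python by setting the environment variable in-process; doing so before the first `flashinfer` import is a conservative assumption here, and LOGGING.md is the authoritative reference for when each variable is read.

```python
import os

# Route FlashInfer logs to a file instead of stdout. Setting the variable before
# the first flashinfer import is a conservative choice; see LOGGING.md for the
# exact point at which each logging variable is consumed.
os.environ["FLASHINFER_LOGDEST"] = "/tmp/flashinfer.log"

import flashinfer  # noqa: E402  (imported after configuring the environment)
```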
## Custom Attention Variants
Starting from FlashInfer v0.2, users can customize their own attention variants with additional parameters. For more details, refer to our [JIT examples](https://github.com/flashinfer-ai/flashinfer/blob/main/tests/utils/test_jit_example.py).
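
To make the idea of an attention variant with extra parameters concrete, here is a plain PyTorch reference sketch of one such variant, tanh logits soft-capping, computed outside of FlashInfer. The function and its `soft_cap` parameter are purely illustrative; the linked JIT examples show how a transform like this is expressed as a FlashInfer variant so that it runs inside the fused kernel.

```python
import torch

def softcap_attention(q, k, v, soft_cap: float = 30.0):
    """Reference (non-fused) attention with one extra parameter: a tanh soft cap
    applied to the attention logits. FlashInfer's JIT variants let transforms
    like this run inside the fused kernel instead."""
    scale = q.shape[-1] ** -0.5
    logits = torch.einsum("qhd,khd->hqk", q.float(), k.float()) * scale
    logits = soft_cap * torch.tanh(logits / soft_cap)  # the "additional parameter"
    probs = torch.softmax(logits, dim=-1)
    return torch.einsum("hqk,khd->qhd", probs, v.float()).to(v.dtype)

q = torch.randn(128, 8, 64, dtype=torch.float16, device="cuda")
k = torch.randn(256, 8, 64, dtype=torch.float16, device="cuda")
v = torch.randn(256, 8, 64, dtype=torch.float16, device="cuda")
out = softcap_attention(q, k, v, soft_cap=50.0)  # [128, 8, 64]
```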
FlashInfer currently provides support for NVIDIA SM architectures 75 and higher.
## Adoption
We are thrilled to share that FlashInfer is being adopted by many cutting-edge projects, including but not limited to: