agent: add CLAUDE.md and claude skills #2240
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a CUDA element-wise tensor scaling feature: templated CUDA kernel and launcher, TVM-FFI bindings, JIT generation plus AOT pre-generation, a cached Python API and tests, and three new documentation guides (benchmarking, CUDA crash debugging, developer guide).
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PyAPI as Python API
    participant Cache as Module Cache
    participant JIT as JIT Generator
    participant Compiler as TVM Compiler
    participant TVMFFI as TVM-FFI Binding
    participant CUDA as CUDA Launcher
    participant GPU
    User->>PyAPI: scale(input, factor, out?)
    PyAPI->>PyAPI: validate input (dtype, device, shape)
    alt module cached
        PyAPI->>Cache: get(module for dtype pair)
        Cache-->>PyAPI: module
    else compile
        PyAPI->>JIT: gen_scale_module(dtype_in, dtype_out)
        JIT-->>PyAPI: JitSpec (sources, URI)
        PyAPI->>Compiler: compile(JitSpec)
        Compiler-->>PyAPI: compiled module
        PyAPI->>Cache: cache(module)
    end
    PyAPI->>PyAPI: prepare/allocate output
    PyAPI->>TVMFFI: run(input, output, factor)
    TVMFFI->>CUDA: launch ScaleLauncher<T> (with stream)
    CUDA->>GPU: execute ScaleKernel (element-wise multiply)
    GPU-->>CUDA: done
    CUDA-->>TVMFFI: completion
    TVMFFI-->>PyAPI: return
    PyAPI-->>User: return scaled tensor
```
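The "module cached" branch of this flow can be sketched in plain Python. This is a minimal stand-in, not FlashInfer's actual code: `get_scale_module` and the returned string stand in for the real JIT generator and compiled TVM-FFI module, but the caching mechanism (`functools.cache` keyed on the dtype pair) is the same idea.

```python
import functools

COMPILE_CALLS = []  # records each time the "compiler" actually runs


@functools.cache
def get_scale_module(dtype_in: str, dtype_out: str) -> str:
    # Stand-in for gen_scale_module(...) + TVM compilation; in FlashInfer the
    # cached object would be a compiled module, not a string.
    COMPILE_CALLS.append((dtype_in, dtype_out))
    return f"module<{dtype_in}->{dtype_out}>"


def scale(values, factor, dtype="float16"):
    module = get_scale_module(dtype, dtype)  # cache hit after the first call
    assert module.startswith("module<")
    # Stand-in for module.run(input, output, factor) on the GPU:
    return [v * factor for v in values]


out1 = scale([1.0, 2.0], 3.0)  # first call for this dtype pair: compiles
out2 = scale([4.0], 2.0)       # same dtype pair: served from the cache
```

Because the cache key is the dtype pair, only the first call per pair pays compilation latency; every later call is a dictionary lookup.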
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the developer experience for FlashInfer by introducing a new, comprehensive contribution guide and a suite of practical tutorials. These resources aim to streamline the process of adding new CUDA operators, accurately benchmarking kernel performance, and effectively debugging common CUDA-related issues, thereby empowering contributors and improving code quality.
Code Review
This pull request introduces comprehensive documentation for developers and AI agents, including a main CLAUDE.md guide and several detailed "skills" in Markdown format. These documents cover adding new kernels, benchmarking, and debugging within the flashinfer library. The guides are well-structured and highly informative. My review focuses on ensuring the accuracy and clarity of the code examples and technical details. I've identified a few minor inconsistencies in the code examples that could lead to errors if followed directly, and a small contradiction in the architecture support documentation. The suggested changes aim to correct these issues and enhance the overall quality of these excellent new guides.
| # Your kernel call here | ||
| return output |
The example my_kernel_wrapper function returns an undefined variable output, which would cause a NameError. To make the example runnable and clearer for the user, this suggestion defines output as a placeholder for the actual kernel's result.
| # Your kernel call here | |
| return output | |
| # Your kernel call here, for example: | |
| # output = my_flashinfer_kernel(q, k, v) | |
| output = torch.empty_like(q) # Placeholder for the actual kernel output | |
| return output |
| my_kernel, | ||
| args=(x, y), |
This code example has two inconsistencies that will cause errors:

- It calls a function named `my_kernel`, but the function defined earlier in the tutorial is `my_kernel_wrapper`.
- The arguments passed are `(x, y)`, which do not match the `(q, k, v)` signature of `my_kernel_wrapper`.
| my_kernel, | |
| args=(x, y), | |
| my_kernel_wrapper, | |
| args=(q, k, v), |
| my_kernel, | ||
| args=(x, y), |
CLAUDE.md
Outdated
| # Blackwell FMHA: Blackwell only | ||
| def gen_fmhav2_blackwell_module(...): | ||
| nvcc_flags = current_compilation_context.get_nvcc_flags_list( | ||
| supported_major_versions=[12] # SM120 only |
The comment "Blackwell FMHA: Blackwell only" on line 333 contradicts this code line, which specifies supported_major_versions=[12] for SM120. Blackwell architecture is SM10x (major version 10). This discrepancy is confusing. Assuming the function gen_fmhav2_blackwell_module is for Blackwell, the supported_major_versions should be [10].
| supported_major_versions=[12] # SM120 only | |
| supported_major_versions=[10] # SM100 only |
SM12x is also considered Blackwell
Actionable comments posted: 0
🧹 Nitpick comments (4)
.claude/skills/debug-cuda-crash/skill.md (1)
109-111: Add language specifiers to fenced code blocks. Code blocks on lines 109–111, 202–203, and 208–209 are missing language identifiers. Based on context, these should be `bash`.

🔎 Proposed fixes

Line 109-111:

-```
+```bash
 RuntimeError: CUDA error: an illegal memory access was encountered

Line 202-203:

-```
+```bash
 RuntimeError: Function ... returned nan or inf

Line 208-209:

-```
+```bash
 RuntimeError: CUDA out of memory

Also applies to: 202-203, 208-209
CLAUDE.md (2)

415-445: Add language specifier to directory structure code block. The directory listing on line 415 should include a language specifier for clarity. This is a structured text block that would benefit from the `text` specifier.

🔎 Proposed fix

-```
+```text
 flashinfer/
 ├── include/flashinfer/   # Header-only CUDA kernel templates
709-710: Capitalize "Markdown" as proper noun. On lines 709 and 715, "markdown" should be "Markdown" (proper noun for the markup format).

🔎 Proposed fixes

Line 709:

- - **Tip**: Add `.md` to get markdown format: <https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl.html.md>
+ - **Tip**: Add `.md` to get Markdown format: <https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl.html.md>

Line 715:

- - **Tip**: Add `.md` to any page URL to get markdown format
+ - **Tip**: Add `.md` to any page URL to get Markdown format

Also applies to: 715-716
.claude/skills/add-cuda-kernel/skill.md (1)

466-475: Add language specifier to file listing code block. The file listing on line 466 should include a language specifier. Since it shows file paths and operations, `text` is most appropriate.

🔎 Proposed fix

-```
+```text
 include/flashinfer/scale.cuh   # NEW: CUDA kernel definition
 csrc/scale.cu                  # NEW: PyTorch launcher
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .claude/skills/add-cuda-kernel/skill.md (1 hunks)
- .claude/skills/benchmark-kernel/skill.md (1 hunks)
- .claude/skills/debug-cuda-crash/skill.md (1 hunks)
- CLAUDE.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
CLAUDE.md
[uncategorized] ~709-~709: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...dsl.html> - Tip: Add .md to get markdown format: <https://docs.nvidia.com/cutlas...
(MARKDOWN_NNP)
[uncategorized] ~715-~715: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Tip**: Add .md to any page URL to get markdown format - Use for: Low-level instructi...
(MARKDOWN_NNP)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
415-415: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/debug-cuda-crash/skill.md
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
202-202: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
208-208: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/add-cuda-kernel/skill.md
466-466: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (4)
.claude/skills/debug-cuda-crash/skill.md (1)
575-578: Cross-reference consistency verified. References to `CLAUDE.md` and `flashinfer/api_logging.py` are consistent with the broader documentation structure and implementation references provided in the PR. ✅

CLAUDE.md (1)
1-726: Comprehensive and well-structured developer guide. CLAUDE.md provides excellent breadth and depth for developers working with FlashInfer. The layout progression (quick start → concepts → architecture → debugging) is logical, and the cross-references to skills and external docs support self-service learning.
Key strengths:
- Quick reference table is practical for daily workflows
- JIT architecture explanation demystifies compilation
- Testing section with architecture checks prevents common gotchas
- Environment variable reference is complete and well-organized
- External documentation section acknowledges dependencies transparently
Minor suggestion: Consider adding a "Common Mistakes" section to preempt frequent issues (e.g., forgetting `--recursive` during clone, or confusion around when to use `--no-build-isolation`), though current content is thorough.

.claude/skills/benchmark-kernel/skill.md (1)
1-413: Excellent benchmarking tutorial with clear dual-method approach. The tutorial effectively balances comprehensiveness with accessibility. Strengths:
- Clear method separation: flashinfer_benchmark.py (unified CLI) vs. bench_gpu_time() (programmatic) serve different user needs
- Realistic expectations: Upfront note that CUPTI is optional but recommended reduces barrier-to-entry
- Practical troubleshooting: "CUPTI Warning", "Inconsistent Results", and "Reference Check Failures" sections preempt common issues
- Copy-paste examples: Quick examples for decode, prefill, FP8 GEMM, MOE provide instant starting points
- Metrics explanation: min/max/mean/TFLOPS/TB/s breakdown helps interpret results
The comparison table at line 394–402 is especially valuable for decision-making.
To confirm correctness, please verify:

- CUPTI requirement (CUDA 13+) against `cupti-python` package specs
- Routine names in examples (e.g., `BatchDecodeWithPagedKVCacheWrapper`) match the actual benchmark implementation
- `bench_gpu_time()` function signature and parameters match the flashinfer.testing module API

.claude/skills/add-cuda-kernel/skill.md (1)
1-475: Exemplary step-by-step tutorial with complete working example. This is a high-quality developer guide that walks through a realistic workflow. Strengths:
- Progressive complexity: Steps build logically from kernel → launcher → binding → JIT → Python → tests → AOT → export
- Framework separation principle: Clear explanation of why `include/` is framework-agnostic and `csrc/` handles PyTorch bindings (lines 447–452)
- TVM-FFI error handling: Concrete pattern with `TVM_FFI_THROW(ErrorType) << "message"` and examples of `ValueError`, `TypeError` (lines 166–171)
- Architecture specialization guidance: Optional section (lines 228–273) shows common patterns without overwhelming the simple case
- Test completeness: Covers correctness, in-place operations, error cases, and dtype variations
- End-to-end perspective: References section (464–475) lists all files created/modified for clarity
The use of element-wise scale as the running example is appropriate—simple enough to follow, complex enough to be realistic.
Actionable comments posted: 5
🧹 Nitpick comments (2)
.claude/skills/add-cuda-kernel/skill.md (2)
208-212: Add validation when copying source files in JIT generator. The code assumes `scale.cu` and `scale_jit_binding.cu` exist at `jit_env.FLASHINFER_CSRC_DIR`. If these files are missing or moved, the copy silently fails. Add existence checks or error messages to help users debug issues.

🔎 Proposed defensive addition

 # Copy source files (no Jinja needed for this simple case)
 sources = []
 for fname in ["scale.cu", "scale_jit_binding.cu"]:
     src_path = jit_env.FLASHINFER_CSRC_DIR / fname
+    if not src_path.exists():
+        raise FileNotFoundError(f"Source file not found: {src_path}")
     dest_path = gen_directory / fname
     shutil.copy(src_path, dest_path)
     sources.append(dest_path)
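A standalone version of that defensive copy loop can be run outside the repo. The directories below are temp-dir stand-ins for `jit_env.FLASHINFER_CSRC_DIR` and the JIT generation directory; the function name `copy_sources` is illustrative, not FlashInfer's API.

```python
import shutil
import tempfile
from pathlib import Path


def copy_sources(csrc_dir: Path, gen_directory: Path, filenames):
    """Copy kernel sources into the JIT gen directory, failing loudly on gaps."""
    sources = []
    for fname in filenames:
        src_path = csrc_dir / fname
        if not src_path.exists():  # fail loudly instead of silently skipping
            raise FileNotFoundError(f"Source file not found: {src_path}")
        dest_path = gen_directory / fname
        shutil.copy(src_path, dest_path)
        sources.append(dest_path)
    return sources


# Demo in a throwaway directory:
tmp = Path(tempfile.mkdtemp())
src, gen = tmp / "csrc", tmp / "gen"
src.mkdir()
gen.mkdir()
(src / "scale.cu").write_text("// kernel launcher")
copied = copy_sources(src, gen, ["scale.cu"])
```

With the existence check in place, a missing or relocated source file surfaces immediately as a `FileNotFoundError` naming the bad path, rather than as a confusing compile error later.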
228-374: Consider separating CUDA architecture guidance into a reference doc. The "Specifying Supported CUDA Architectures" section (lines 228–374) is comprehensive and valuable but interrupts the step-by-step flow of the main tutorial. Readers new to FlashInfer may find this overwhelming mid-tutorial. Consider moving this to a separate reference document (e.g., `.claude/skills/cuda-architecture-reference.md`) and linking to it from Step 4 with a note like: "⚠️ Advanced: If your kernel only supports specific GPU architectures, see CUDA Architecture Reference." This keeps the main tutorial focused on the happy path while preserving detailed guidance for advanced users.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .claude/skills/add-cuda-kernel/skill.md (1 hunks)
- CLAUDE.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
CLAUDE.md
[uncategorized] ~508-~508: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...dsl.html> - Tip: Add .md to get markdown format: <https://docs.nvidia.com/cutlas...
(MARKDOWN_NNP)
[uncategorized] ~514-~514: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Tip**: Add .md to any page URL to get markdown format - Use for: Low-level instructi...
(MARKDOWN_NNP)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/add-cuda-kernel/skill.md
566-566: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (6)
CLAUDE.md (6)
31-54: Installation and quick-start section is clear and actionable. The instructions correctly emphasize the importance of the `--recursive` flag, the role of `--no-build-isolation`, and JIT's convenience for development. The section successfully communicates that no manual rebuild is needed after kernel changes.
174-240: JIT compilation architecture explanation is thorough and well-structured. The three-layer breakdown (JitSpec, Code Generation, Compilation and Loading) with concrete code examples and comments is excellent pedagogical content. The note about optional Jinja templates and the pattern template in Layer 2 clearly guide developers on when templating is necessary.
303-322: Adding a New Operation section provides good guidance with concrete examples. References to reference implementations (RMSNorm, sampling, decode) at different complexity levels help developers find appropriate starting points. The 9-step overview is clear and comprehensive.
459-475: TVM-FFI section clearly explains cross-language capabilities and current constraints. The explanation that FlashInfer currently provides PyTorch bindings while the underlying kernels are framework-agnostic is important context. This section successfully sets expectations for future multi-framework support.
490-524: External documentation resources section is well-curated and actionable. The inclusion of specific tips (e.g., "Read source code directly" for CUTLASS, "Add `.md` to get Markdown format" for PTX ISA) demonstrates domain knowledge. The "When to Consult These Docs" subsection provides practical guidance for developers on which resource to use for different tasks.
1-30: Request verification: Confirm that referenced skill.md files exist in this PR. This document references three external skill files at lines 158, 195, and 384:

- .claude/skills/benchmark-kernel/skill.md
- .claude/skills/add-cuda-kernel/skill.md
- .claude/skills/debug-cuda-crash/skill.md

These are important reference points for developers. Verify that these files are included in this PR and are accessible at the specified paths.
|
|
||
| ## Summary of Files Created/Modified | ||
|
|
||
| ``` |
Add language identifier to code block.
The fenced code block listing file paths is missing a language specification. Add `bash`, `text`, or `diff` as appropriate.
-```
+```text
include/flashinfer/scale.cuh # NEW: CUDA kernel definition
csrc/scale.cu # NEW: PyTorch launcher
csrc/scale_jit_binding.cu # NEW: TVM-FFI binding
flashinfer/jit/scale.py # NEW: JIT generator
flashinfer/scale.py # NEW: Python API
flashinfer/__init__.py # MODIFIED: Export API
flashinfer/aot.py # MODIFIED: Register AOT
tests/test_scale.py # NEW: Unit tests
-```
+```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
566-566: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In .claude/skills/add-cuda-kernel/skill.md around line 566, the fenced code
block that lists file paths is missing a language identifier; update the opening
fence from ``` to ```text (or ```bash/```diff if preferred) so the block becomes
a proper fenced code block with a language specifier, leaving the file list
contents and closing fence unchanged.
| ``` | ||
| flashinfer/ | ||
| ├── include/flashinfer/ # Header-only CUDA kernel templates | ||
| │ ├── attention/ # Attention kernels | ||
| │ ├── gemm/ # GEMM kernels | ||
| │ ├── comm/ # Communication kernels | ||
| │ ├── mma.cuh # Matrix multiply utilities | ||
| │ ├── utils.cuh # Common utilities | ||
| │ └── [...] | ||
| │ | ||
| ├── csrc/ # Framework bindings (via TVM-FFI) | ||
| │ ├── *.cu # Kernel launcher implementations | ||
| │ ├── *_jit_binding.cu # TVM-FFI exports | ||
| │ ├── *_customize_config.jinja # Type config templates (optional) | ||
| │ └── [...] | ||
| │ | ||
| ├── flashinfer/ # Python package | ||
| │ ├── jit/ | ||
| │ │ ├── core.py # JitSpec, compilation infrastructure | ||
| │ │ ├── cpp_ext.py # Ninja build generation | ||
| │ │ ├── env.py # Workspace paths | ||
| │ │ ├── attention/ # Attention module generators | ||
| │ │ ├── gemm/ # GEMM module generators | ||
| │ │ ├── fused_moe/ # MOE module generators | ||
| │ │ └── [...] | ||
| │ ├── *.py # High-level Python APIs | ||
| │ ├── aot.py # AOT compilation for pre-built packages | ||
| │ └── [...] | ||
| │ | ||
| └── build_backend.py # PEP 517 build backend | ||
| ``` |
Add language specifier to directory structure code block.
The fenced code block at line 264 lacks a language specifier. While the tree structure is readable, adding a language identifier improves rendering consistency.
🔎 Proposed fix
-```
+```
flashinfer/
├── include/flashinfer/ # Header-only CUDA kernel templates
│   ├── attention/          # Attention kernels

Note: The language specifier for a plain tree structure may be left empty or use `text`, but consistency with other code blocks is recommended.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In CLAUDE.md around lines 264 to 294, the fenced code block containing the
repository tree lacks a language specifier; update the opening fence to include
a language (e.g., ```text) so the block renders consistently (use the same
specifier as other code blocks in the file for consistency).
| torch.cuda.synchronize() # ← Flushes printf output | ||
| ``` | ||
|
|
||
| ### ⚠️ Warp-Specialized Kernels: Use `lane_id == 0` |
could this suggestion be too narrow? i think often warp-specialization is across a warp group with four warps, but lane id is the local id to one warp? and in theory there could also be more than one warp group involved in the same specialization group/task?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
left inline suggestions and some comments, otherwise LGTM
bkryu
left a comment
Left some minor comments. This seems exciting!
Co-authored-by: Alex Yang <aleozlx@gmail.com>
| - Set appropriate tolerances for each dtype | ||
| - Test error cases | ||
|
|
||
| ## Step 7: Register in AOT |
Should we mark this as optional?
|
|
||
| ## Step 3: Create TVM-FFI Binding in `csrc/` | ||
|
|
||
| Create `csrc/scale_jit_binding.cu`: |
Why not have this in the same file as the launcher?
Separating them can accelerate compilation time (in terms of latency), however, if kernel compilation time is long enough, it doesn't matter.
| Run all tests: | ||
|
|
||
| ```bash | ||
| pytest tests/ |
Not recommended: some tests adjust state that will negatively impact subsequent tests (e.g., global variables). Other tests may forget to clean up space, leading to IMA's or segfaults in subsequent tests.
setting aside IMA, for other reasons that sounds like the tests would be written incorrectly tho
this should be supported from least surprising principle, except IMA is an inevitable edge case if it occurs
I agree about the root cause, but we want Claude to know that this is a current issue and may misinform it with lots of false negatives.
CLAUDE.md
Outdated
| 8. Register in `flashinfer/aot.py` for AOT compilation | ||
| 9. Export in `flashinfer/__init__.py` | ||
|
|
||
| **Reference implementations:** |
Example implementations.
Reference sounds like it could be used for accuracy testing.
Actionable comments posted: 3
♻️ Duplicate comments (4)
.claude/skills/add-cuda-kernel/skill.md (4)
463-463: Update Python API description to mention @flashinfer_api. The key points section should mention that adding the `@flashinfer_api` decorator enables logging and sets it apart from helper functions. Update the bullet point to include this guidance.

🔎 Proposed fix

 **Key points:**
 - Uses `@functools.cache` to cache compiled modules
-- Clean Python API with docstring
+- Clean Python API with docstring and `@flashinfer_api` decorator (enables logging and signals public API)
 - Handles output allocation
 - Validates inputs using `@backend_requirement` decorator
414-419: Add @flashinfer_api decorator to public API function. The `scale` function is a public API that users are expected to call directly, but it's missing the `@flashinfer_api` decorator. This decorator serves two purposes: (a) it signals that this is a public API (not a helper function), and (b) it enables function logging when logging mode is on. This was flagged in previous reviews and should be added.

🔎 Proposed fix to add decorator

 @backend_requirement(
     backend_checks={},  # No backend choices for this simple kernel
     common_check=_check_scale_problem_size,
 )
+@flashinfer_api
 def scale(input: torch.Tensor, factor: float, out: Optional[torch.Tensor] = None) -> torch.Tensor:
82-92: Add error handling to DISPATCH_DTYPE macro for unsupported dtypes. The macro silently does nothing if an unsupported dtype is passed. Tutorial readers will encounter mysterious silent failures. Add an else clause that throws an error to provide immediate, actionable feedback (as noted in previous reviews).

🔎 Proposed fix for error handling

 #define DISPATCH_DTYPE(dtype, DType, ...) \
   if (dtype == torch::kFloat16) {         \
     using DType = half;                   \
     __VA_ARGS__                           \
   } else if (dtype == torch::kBFloat16) { \
     using DType = __nv_bfloat16;          \
     __VA_ARGS__                           \
   } else if (dtype == torch::kFloat32) {  \
     using DType = float;                  \
     __VA_ARGS__                           \
-  }
+  } else {                                \
+    throw std::runtime_error("Unsupported dtype"); \
+  }
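The fail-loudly behavior requested for the C++ macro can be illustrated with a small Python sketch of the same dispatch pattern. Everything here is a hypothetical stand-in (the dtype names, the `dispatch_dtype` helper, and the "launch" string); the point is only that an unsupported dtype raises instead of silently doing nothing.

```python
def dispatch_dtype(dtype: str, kernel):
    # Map a runtime dtype name to a concrete element type, mirroring
    # DISPATCH_DTYPE; unsupported dtypes raise instead of silently no-opping.
    supported = {
        "float16": "half",
        "bfloat16": "__nv_bfloat16",
        "float32": "float",
    }
    if dtype not in supported:
        raise ValueError(f"Unsupported dtype: {dtype}")
    return kernel(supported[dtype])


# Supported dtype: dispatch proceeds to the (stand-in) kernel launch.
result = dispatch_dtype("float32", lambda ctype: f"launched ScaleKernel<{ctype}>")

# An unsupported dtype now fails loudly at the dispatch site:
# dispatch_dtype("int8", ...) raises ValueError("Unsupported dtype: int8")
```

The error message names the offending dtype, so a tutorial reader sees an actionable failure at the call site rather than an unmodified output tensor.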
730-730: Add language identifier to fenced code block. The code block listing file paths is missing a language specification. Add `text` as the language identifier.

🔎 Proposed fix

-```
+```text
 include/flashinfer/scale.cuh   # NEW: CUDA kernel definition
 csrc/scale.cu                  # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu      # NEW: TVM-FFI binding
 flashinfer/jit/scale.py        # NEW: JIT generator
 flashinfer/scale.py            # NEW: Python API
 flashinfer/__init__.py         # MODIFIED: Export API
 flashinfer/aot.py              # MODIFIED: Register AOT
 tests/test_scale.py            # NEW: Unit tests
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.claude/skills/add-cuda-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
489-489: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
512-512: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
545-545: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
730-730: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
1-20: Overall structure and scope are well-designed. The tutorial provides excellent step-by-step guidance for adding a CUDA kernel to FlashInfer, with clear examples, proper separation of concerns (CUDA kernel, launchers, bindings, JIT, Python API), comprehensive decorator documentation, and good test coverage. The architecture specification section (lines 228–356) is particularly thorough and valuable for readers. Once the error handling, decorator, and markdown formatting issues are addressed, this will be a high-quality teaching resource.
|
|
||
| Enforces backend and problem size requirements at runtime. There are three usage patterns: | ||
|
|
||
| **Pattern 1: Single Backend (No Backend Choices)** |
Use proper markdown heading instead of emphasis.
Line 489 uses bold text (**Pattern 1: ...**) as a section heading. Use a proper markdown heading (e.g., #### Pattern 1: ...) for better document structure and accessibility.
🔎 Proposed fix
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 1: Single Backend (No Backend Choices)** | |
| #### Pattern 1: Single Backend (No Backend Choices) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
489-489: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In .claude/skills/add-cuda-kernel/skill.md around line 489, the section title is
formatted using bold text instead of a Markdown heading; replace the bolded line
with an appropriate heading level (for example "#### Pattern 1: Single Backend
(No Backend Choices)") to improve document structure and accessibility, ensuring
the heading matches surrounding hierarchy and removing the bold-only styling.
| pass | ||
| ``` | ||
|
|
||
| **Pattern 2: Multiple Backends** |
Use proper markdown heading instead of emphasis.
Line 512 uses bold text (**Pattern 2: ...**) as a section heading. Use a proper markdown heading instead.
🔎 Proposed fix
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 2: Multiple Backends** | |
| #### Pattern 2: Multiple Backends |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
512-512: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In file .claude/skills/add-cuda-kernel/skill.md around line 512, the section
title is formatted as bold text ("**Pattern 2: Multiple Backends**") instead of
a markdown heading; replace the bold emphasis with an appropriate markdown
heading (e.g., prepend one or two hash marks like "# Pattern 2: Multiple
Backends" or "## Pattern 2: Multiple Backends" depending on document hierarchy)
so the section becomes a proper heading and renders correctly in markdown.
| pass | ||
| ``` | ||
|
|
||
| **Pattern 3: Auto Backend Selection** |
Use proper markdown heading instead of emphasis.
Line 545 uses bold text (**Pattern 3: ...**) as a section heading. Use a proper markdown heading instead.
🔎 Proposed fix
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 3: Auto Backend Selection** | |
| #### Pattern 3: Auto Backend Selection |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
545-545: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
.claude/skills/add-cuda-kernel/skill.md around line 545: the section title is
written using bold emphasis (**Pattern 3: Auto Backend Selection**) instead of a
proper markdown heading; replace the bold text with an appropriate markdown
heading (e.g., prefix with one or more # characters depending on desired level,
such as "### Pattern 3: Auto Backend Selection") to make it a real section
header and update surrounding spacing to conform to markdown heading
conventions.
- Add documentation explaining when to use each error macro:
  - TVM_FFI_THROW for normal runtime errors
  - TVM_FFI_LOG_AND_THROW for construction-time errors that may not propagate
- Add reminder to keep docs in sync with infrastructure changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
♻️ Duplicate comments (6)
.claude/skills/add-cuda-kernel/skill.md (6)
82-92: Else clause still missing from DISPATCH_DTYPE macro. This issue was raised in a prior review and marked as addressed, but the else clause that throws an error for unsupported dtypes is still absent. Readers following this tutorial will encounter silent failures if they pass an unsupported dtype.
🔎 Proposed fix
```diff
 #define DISPATCH_DTYPE(dtype, DType, ...)        \
   if (dtype == torch::kFloat16) {                \
     using DType = half;                          \
     __VA_ARGS__                                  \
   } else if (dtype == torch::kBFloat16) {        \
     using DType = __nv_bfloat16;                 \
     __VA_ARGS__                                  \
   } else if (dtype == torch::kFloat32) {         \
     using DType = float;                         \
     __VA_ARGS__                                  \
-  }
+  } else {                                             \
+    throw std::runtime_error("Unsupported dtype");     \
+  }
```
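As a side note for readers of this thread, the fail-loudly principle behind this fix can be sketched framework-free in plain Python (names hypothetical, not FlashInfer's actual code): a dispatch table that raises on unsupported dtypes avoids the silent fall-through the macro had.

```python
# Hypothetical sketch: map dtype names to the C type used in the kernel,
# raising on anything unsupported instead of silently doing nothing.
_CTYPE_FOR_DTYPE = {
    "float16": "half",
    "bfloat16": "__nv_bfloat16",
    "float32": "float",
}

def ctype_for(dtype: str) -> str:
    try:
        return _CTYPE_FOR_DTYPE[dtype]
    except KeyError:
        # Mirrors the else clause added to DISPATCH_DTYPE above.
        raise ValueError(f"Unsupported dtype: {dtype}") from None
```

The same design choice applies in both languages: an unsupported dtype should be an immediate, descriptive error, never a no-op.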
450-455: Add `@flashinfer_api` decorator to public scale function. The `scale()` function is a public API that users will call directly. Add the `@flashinfer_api` decorator to mark it as public and enable logging support.

🔎 Proposed fix

```diff
 @backend_requirement(
     backend_checks={},  # No backend choices for this simple kernel
     common_check=_check_scale_problem_size,
 )
+@flashinfer_api
 def scale(input: torch.Tensor, factor: float, out: Optional[torch.Tensor] = None) -> torch.Tensor:
```
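For context, the destination-passing convention this API follows (optional `out` buffer, allocated only when absent) can be illustrated without torch; this is a hedged, pure-Python sketch, not FlashInfer's implementation:

```python
from typing import List, Optional

def scale(inp: List[float], factor: float,
          out: Optional[List[float]] = None) -> List[float]:
    # Destination-passing style: allocate only when the caller
    # did not supply an output buffer, otherwise validate and reuse it.
    if out is None:
        out = [0.0] * len(inp)
    elif len(out) != len(inp):
        raise ValueError("out must match input shape")
    for i, x in enumerate(inp):
        out[i] = x * factor
    return out
```

Callers that manage their own memory pass `out` and get it back; casual callers omit it and let the function allocate.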
525-526: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)
```
548-549: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends
```
581-582: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection
```
765-775: Add language identifier to code block. The fenced code block listing file paths is missing a language specification. Add `text` as the language identifier.

🔎 Proposed fix

````diff
-```
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu           # NEW: TVM-FFI binding
 flashinfer/jit/scale.py             # NEW: JIT generator
 flashinfer/scale.py                 # NEW: Python API
 flashinfer/__init__.py              # MODIFIED: Export API
 flashinfer/aot.py                   # MODIFIED: Register AOT
 tests/test_scale.py                 # NEW: Unit tests
 ```
````
🧹 Nitpick comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
295-302: Clarify GPU architecture terminology. References to "SM90+" in this section should be more precise. Instead of "SM90+", specify exact supported architectures (e.g., "Hopper SM90, Blackwell SM100, and newer") to avoid implying unsupported future architectures. This aligns with FlashInfer's backwards compatibility approach.

🔎 Proposed fix

````diff
 #### How CompilationContext Works

 **Automatic Detection** (default):

 ```python
 from flashinfer.compilation_context import CompilationContext

 ctx = CompilationContext()
-# Automatically detects all GPUs in the system
-# For SM90+, adds 'a' suffix (e.g., 9.0a for Hopper)
+# Automatically detects all GPUs in the system
+# For Hopper (SM90) and newer, adds 'a' suffix (e.g., 9.0a for Hopper)
 # Result: ctx.TARGET_CUDA_ARCHS = {(9, '0a'), (10, '0a'), ...}
````
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `.claude/skills/add-cuda-kernel/skill.md`
- `CLAUDE.md`
🚧 Files skipped from review as they are similar to previous changes (1)
- CLAUDE.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~180-~180: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
- Add @flashinfer_api decorator to scale example
- Add nvidia-cutlass-dsl pip package mention for CuTe DSL
- Add SM12x mention for Blackwell architecture
- Add else clause to DISPATCH_DTYPE with TVM_FFI_THROW
- Simplify warp specialization guidance (no assumed group size)
- Add destination passing style API design guidance
- Clarify AOT registration purpose
- Add note about OOM handling in tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…mbreughe

- Update rotate_buffers to cold_l2_cache=True
- Add CUDA events accuracy note for fast kernels (5-50 us)
- Add multi-GPU test command
- Fix SM90+ statements to not imply future compatibility
- Add @backend_requirement decorated APIs info
- Add Python API subdirectories to directory structure
- Add tests/ directory structure
- Rename 'Reference implementations' to 'Example implementations'
- Rename scale_run to run in binding example
- Update test location guidance to use specific subdirs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
175-210: Simplify phrase on line 182. The phrase "almost never fails in practice" is slightly wordy. Consider "rarely fails in practice" for conciseness.

```diff
- 3. The error condition almost never fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
+ 3. The error condition rarely fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.claude/skills/add-cuda-kernel/skill.md`
- `.claude/skills/benchmark-kernel/skill.md`
- `.claude/skills/debug-cuda-crash/skill.md`
- `CLAUDE.md`
🚧 Files skipped from review as they are similar to previous changes (1)
- .claude/skills/benchmark-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~182-~182: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
202-202: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
208-208: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/debug-cuda-crash/skill.md
531-531: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
554-554: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (14)
CLAUDE.md (6)
1-65: LGTM on quick start sections. Clear, practical guidance for installation and initial setup. Commands are accurate and the emphasis on submodule initialization is appropriate.
105-119: Add language specifier to fenced code block. Line 105 code block is missing a language identifier for proper syntax highlighting.

🔎 Proposed fix

````diff
-**Example:**
-```python
+**Example:**
+```python
 from flashinfer.utils import is_sm90a_supported
````
131-165: LGTM on benchmarking section. Clear guidance with practical examples and appropriate cross-references to skill documentation.
202-234: Add language specifiers to fenced code blocks. Lines 202 and 208 code blocks are missing language identifiers for syntax highlighting.

🔎 Proposed fix

````diff
-### Layer 2: Code Generation
-
-Every `gen_*_module()` function in `flashinfer/jit/` follows this pattern:
-
-```
+### Layer 2: Code Generation
+
+Every `gen_*_module()` function in `flashinfer/jit/` follows this pattern:
+
+```python
 def gen_some_module(dtype_in, dtype_out, ...):
     # 1. Compute unique identifier from parameters
````

Also fix line 254:

````diff
-```jinja
+```jinja
 // Input template
 using DTypeIn = {{ dtype_in }};
````
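For readers unfamiliar with the `gen_*_module()` pattern referenced here (compute a unique URI from the parameters, then render templated sources), it can be sketched without the real JIT machinery; all names below are hypothetical stand-ins, with `str.format` in place of a jinja template:

```python
def gen_scale_module_spec(dtype_in: str, dtype_out: str) -> dict:
    # 1. Compute a unique identifier from the parameters so each
    #    dtype combination gets its own cached build artifact.
    uri = f"scale_{dtype_in}_{dtype_out}"
    # 2. Render the source template with concrete dtypes
    #    (stand-in for the jinja template used by real generators).
    template = "using DTypeIn = {dtype_in};\nusing DTypeOut = {dtype_out};\n"
    source = template.format(dtype_in=dtype_in, dtype_out=dtype_out)
    return {"uri": uri, "source": source}
```

The key property is that the URI is a pure function of the parameters, so identical requests resolve to the same build.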
268-334: LGTM on structure and adding operations sections. Clear directory organization with proper emphasis on framework separation. Good cross-references to step-by-step tutorials.
340-549: LGTM on architectural patterns and final sections. Comprehensive coverage of caching, dispatch macros, API logging, and development workflows. The closing section on AI agent best practices ties together the guidance well.
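The module-caching pattern those sections cover — compile once per parameter combination, reuse thereafter — can be sketched with `functools.lru_cache` (hypothetical names; the real cache lives in FlashInfer's JIT layer, not here):

```python
import functools

@functools.lru_cache(maxsize=None)
def get_scale_module(dtype_in: str, dtype_out: str) -> dict:
    # Stand-in for an expensive JIT compile: the body runs once per
    # (dtype_in, dtype_out) pair; later calls return the cached object.
    return {"uri": f"scale_{dtype_in}_{dtype_out}"}
```

Because the cache key is the full parameter tuple, two calls with the same dtypes share one compiled module while distinct dtype pairs compile independently.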
.claude/skills/debug-cuda-crash/skill.md (3)
1-100: LGTM on tutorial structure and initial sections. Clear step-by-step progression with practical examples. Environment variables and logging levels are well-documented.
300-400: LGTM on kernel debugging and advanced tools sections. Good coverage of compute-sanitizer, cuda-gdb, and kernel printf debugging. Warp-specialization guidance is clear.
400-569: LGTM on final sections and example. End-to-end debug session example is concrete and actionable. Summary effectively recaps the workflow.
.claude/skills/add-cuda-kernel/skill.md (5)
211-344: LGTM on JIT generator and architecture specification. Comprehensive guidance on module generation, architecture targeting, and CompilationContext usage. Real examples from the codebase are helpful.
414-508: LGTM on Python API implementation. Good coverage of decorators (@flashinfer_api, @backend_requirement), destination-passing pattern, and detailed decorator feature documentation. Docstring example is clear.
655-716: LGTM on testing section. Good parametrization strategy with multiple dtypes and sizes. Appropriate tolerance handling for different precisions and comprehensive error case coverage.
725-760: LGTM on AOT and export sections. Clear explanation of AOT purpose and practical examples for both registration and API export.
762-783: Add language specifier to file listing code block. Line 774 code block is missing a language identifier for proper rendering.

````diff
-## Summary of Files Created/Modified
-
-```
+## Summary of Files Created/Modified
+
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
````
@aleozlx @bkryu @nvmbreughe updated according to your suggestions!
Updated CUDA kernel implementation and TVM-FFI bindings for scaling functionality. Improved error handling and type dispatching.
Actionable comments posted: 0
🧹 Nitpick comments (6)
.claude/skills/add-cuda-kernel/skill.md (6)
498-498: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading for correct document structure.

```diff
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)
```
521-521: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading.

```diff
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends
```
554-554: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading.

```diff
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection
```
741-741: Add language identifier to code block. The fenced code block listing file paths is missing a language specification for proper syntax highlighting.

````diff
-```
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu           # NEW: TVM-FFI binding
 flashinfer/jit/scale.py             # NEW: JIT generator
 flashinfer/scale.py                 # NEW: Python API
 flashinfer/__init__.py              # MODIFIED: Export API
 flashinfer/aot.py                   # MODIFIED: Register AOT
 tests/test_scale.py                 # NEW: Unit tests
 ```
````
129-129: Simplify wordy phrase. Replace "almost never" with a more concise alternative.

```diff
- 3. The error condition almost never fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
+ 3. The error condition rarely fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
```
240-247: Add language identifier to Python code block. The CompilationContext example is missing a language identifier for proper syntax highlighting.

````diff
-```python
+```python
 from flashinfer.compilation_context import CompilationContext

 ctx = CompilationContext()
````
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.claude/skills/add-cuda-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~129-~129: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
498-498: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
521-521: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
554-554: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
741-741: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
.claude/skills/add-cuda-kernel/skill.md (3)
421-425: Verify decorator usage matches FlashInfer API. The example uses `@flashinfer_api` and `@backend_requirement` decorators. Confirm these:
- Match the actual API surface in the codebase
- Accept the parameters shown (backend_checks, common_check)
- Are documented consistently across other examples
75-101: Verify TVM-FFI macro names and patterns. The launcher uses `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP32_FP16` and `TVM_FFI_ICHECK` utilities. Confirm these macros:

- Match actual definitions in `csrc/tvm_ffi_utils.h`
- Support the dtypes claimed (FP32, FP16, BF16 if applicable)
- Are the recommended pattern for similar operations
162-171: Verify TVM-FFI export syntax. Line 170 uses `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, scale_launcher)`. Confirm:
- This macro exists and is the current recommended export pattern
- The function signature and types are correctly handled
- Whether renaming the export to match the function name (as suggested in past comments) was considered
📌 Description
Add CLAUDE.md as contribution guide to agents (and human).
Add several skills (adding an CUDA operator to flashinfer, debug, profiling), this list will grow in the future.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed (`unittest`, etc.).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: aleozlx <aleyang@nvidia.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Co-Authored-By: jimmyzho <jimmzhou@nvidia.com>
Co-Authored-By: cyx-6 <yaxingc@nvidia.com>
Summary by CodeRabbit
New Features
Tests
Chores
Documentation