agent: add CLAUDE.md and claude skills #2240
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Adds a CUDA element-wise tensor scaling feature: templated CUDA kernel and launcher, TVM-FFI bindings, JIT generation plus AOT pre-generation, a cached Python API and tests, and three new documentation guides (benchmarking, CUDA crash debugging, developer guide).
Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant PyAPI as Python API
    participant Cache as Module Cache
    participant JIT as JIT Generator
    participant Compiler as TVM Compiler
    participant TVMFFI as TVM-FFI Binding
    participant CUDA as CUDA Launcher
    participant GPU
    User->>PyAPI: scale(input, factor, out?)
    PyAPI->>PyAPI: validate input (dtype, device, shape)
    alt module cached
        PyAPI->>Cache: get(module for dtype pair)
        Cache-->>PyAPI: module
    else compile
        PyAPI->>JIT: gen_scale_module(dtype_in, dtype_out)
        JIT-->>PyAPI: JitSpec (sources, URI)
        PyAPI->>Compiler: compile(JitSpec)
        Compiler-->>PyAPI: compiled module
        PyAPI->>Cache: cache(module)
    end
    PyAPI->>PyAPI: prepare/allocate output
    PyAPI->>TVMFFI: run(input, output, factor)
    TVMFFI->>CUDA: launch ScaleLauncher<T> (with stream)
    CUDA->>GPU: execute ScaleKernel (element-wise multiply)
    GPU-->>CUDA: done
    CUDA-->>TVMFFI: completion
    TVMFFI-->>PyAPI: return
    PyAPI-->>User: return scaled tensor
```
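The "module cached" branch of this flow can be sketched in plain Python. This is a minimal stand-in, not FlashInfer's actual code: `get_scale_module` and the returned string stand in for the real JIT generator and compiled TVM-FFI module, but the caching mechanism (`functools.cache` keyed on the dtype pair) is the same idea.

```python
import functools

COMPILE_CALLS = []  # records each time the "compiler" actually runs


@functools.cache
def get_scale_module(dtype_in: str, dtype_out: str) -> str:
    # Stand-in for gen_scale_module(...) + TVM compilation; in FlashInfer the
    # cached object would be a compiled module, not a string.
    COMPILE_CALLS.append((dtype_in, dtype_out))
    return f"module<{dtype_in}->{dtype_out}>"


def scale(values, factor, dtype="float16"):
    module = get_scale_module(dtype, dtype)  # cache hit after the first call
    assert module.startswith("module<")
    # Stand-in for module.run(input, output, factor) on the GPU:
    return [v * factor for v in values]


out1 = scale([1.0, 2.0], 3.0)  # first call for this dtype pair: compiles
out2 = scale([4.0], 2.0)       # same dtype pair: served from the cache
```

Because the cache key is the dtype pair, only the first call per pair pays compilation latency; every later call is a dictionary lookup.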
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes
Pre-merge checks: ✅ Passed checks (3 passed)
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request significantly enhances the developer experience for FlashInfer by introducing a new, comprehensive contribution guide and a suite of practical tutorials. These resources aim to streamline the process of adding new CUDA operators, accurately benchmarking kernel performance, and effectively debugging common CUDA-related issues, thereby empowering contributors and improving code quality.
Code Review
This pull request introduces comprehensive documentation for developers and AI agents, including a main CLAUDE.md guide and several detailed "skills" in Markdown format. These documents cover adding new kernels, benchmarking, and debugging within the flashinfer library. The guides are well-structured and highly informative. My review focuses on ensuring the accuracy and clarity of the code examples and technical details. I've identified a few minor inconsistencies in the code examples that could lead to errors if followed directly, and a small contradiction in the architecture support documentation. The suggested changes aim to correct these issues and enhance the overall quality of these excellent new guides.
| # Your kernel call here | ||
| return output |
The example my_kernel_wrapper function returns an undefined variable output, which would cause a NameError. To make the example runnable and clearer for the user, this suggestion defines output as a placeholder for the actual kernel's result.
| # Your kernel call here | |
| return output | |
| # Your kernel call here, for example: | |
| # output = my_flashinfer_kernel(q, k, v) | |
| output = torch.empty_like(q) # Placeholder for the actual kernel output | |
| return output |
| my_kernel, | ||
| args=(x, y), |
This code example has two inconsistencies that will cause errors:

- It calls a function named `my_kernel`, but the function defined earlier in the tutorial is `my_kernel_wrapper`.
- The arguments passed are `(x, y)`, which do not match the `(q, k, v)` signature of `my_kernel_wrapper`.
| my_kernel, | |
| args=(x, y), | |
| my_kernel_wrapper, | |
| args=(q, k, v), |
| my_kernel, | ||
| args=(x, y), |
CLAUDE.md
Outdated
| # Blackwell FMHA: Blackwell only | ||
| def gen_fmhav2_blackwell_module(...): | ||
| nvcc_flags = current_compilation_context.get_nvcc_flags_list( | ||
| supported_major_versions=[12] # SM120 only |
The comment "Blackwell FMHA: Blackwell only" on line 333 contradicts this code line, which specifies supported_major_versions=[12] for SM120. Blackwell architecture is SM10x (major version 10). This discrepancy is confusing. Assuming the function gen_fmhav2_blackwell_module is for Blackwell, the supported_major_versions should be [10].
| supported_major_versions=[12] # SM120 only | |
| supported_major_versions=[10] # SM100 only |
SM12x is also considered Blackwell
Actionable comments posted: 0
🧹 Nitpick comments (4)
.claude/skills/debug-cuda-crash/skill.md (1)
109-111: Add language specifiers to fenced code blocks. Code blocks on lines 109–111, 202–203, and 208–209 are missing language identifiers. Based on context, these should be `bash`.

🔎 Proposed fixes

Line 109-111:

-```
+```bash
 RuntimeError: CUDA error: an illegal memory access was encountered

Line 202-203:

-```
+```bash
 RuntimeError: Function ... returned nan or inf

Line 208-209:

-```
+```bash
 RuntimeError: CUDA out of memory

Also applies to: 202-203, 208-209
CLAUDE.md (2)

415-445: Add language specifier to directory structure code block. The directory listing on line 415 should include a language specifier for clarity. This is a structured text block that would benefit from the `text` specifier.

🔎 Proposed fix

-```
+```text
 flashinfer/
 ├── include/flashinfer/   # Header-only CUDA kernel templates
709-710: Capitalize "Markdown" as proper noun. On lines 709 and 715, "markdown" should be "Markdown" (proper noun for the markup format).

🔎 Proposed fixes

Line 709:

- - **Tip**: Add `.md` to get markdown format: <https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl.html.md>
+ - **Tip**: Add `.md` to get Markdown format: <https://docs.nvidia.com/cutlass/media/docs/pythonDSL/cute_dsl.html.md>

Line 715:

- - **Tip**: Add `.md` to any page URL to get markdown format
+ - **Tip**: Add `.md` to any page URL to get Markdown format

Also applies to: 715-716
.claude/skills/add-cuda-kernel/skill.md (1)

466-475: Add language specifier to file listing code block. The file listing on line 466 should include a language specifier. Since it shows file paths and operations, `text` is most appropriate.

🔎 Proposed fix

-```
+```text
 include/flashinfer/scale.cuh   # NEW: CUDA kernel definition
 csrc/scale.cu                  # NEW: PyTorch launcher
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- .claude/skills/add-cuda-kernel/skill.md (1 hunks)
- .claude/skills/benchmark-kernel/skill.md (1 hunks)
- .claude/skills/debug-cuda-crash/skill.md (1 hunks)
- CLAUDE.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
CLAUDE.md
[uncategorized] ~709-~709: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...dsl.html> - Tip: Add .md to get markdown format: <https://docs.nvidia.com/cutlas...
(MARKDOWN_NNP)
[uncategorized] ~715-~715: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Tip**: Add .md to any page URL to get markdown format - Use for: Low-level instructi...
(MARKDOWN_NNP)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
415-415: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/debug-cuda-crash/skill.md
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
202-202: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
208-208: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/add-cuda-kernel/skill.md
466-466: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (4)
.claude/skills/debug-cuda-crash/skill.md (1)
575-578: Cross-reference consistency verified. References to `CLAUDE.md` and `flashinfer/api_logging.py` are consistent with the broader documentation structure and implementation references provided in the PR. ✅

CLAUDE.md (1)
1-726: Comprehensive and well-structured developer guide. CLAUDE.md provides excellent breadth and depth for developers working with FlashInfer. The layout progression (quick start → concepts → architecture → debugging) is logical, and the cross-references to skills and external docs support self-service learning.
Key strengths:
- Quick reference table is practical for daily workflows
- JIT architecture explanation demystifies compilation
- Testing section with architecture checks prevents common gotchas
- Environment variable reference is complete and well-organized
- External documentation section acknowledges dependencies transparently
Minor suggestion: Consider adding a "Common Mistakes" section to preempt frequent issues (e.g., forgetting `--recursive` during clone, or confusion around when to use `--no-build-isolation`), though current content is thorough.

.claude/skills/benchmark-kernel/skill.md (1)
1-413: Excellent benchmarking tutorial with clear dual-method approach. The tutorial effectively balances comprehensiveness with accessibility. Strengths:
- Clear method separation: flashinfer_benchmark.py (unified CLI) vs. bench_gpu_time() (programmatic) serve different user needs
- Realistic expectations: Upfront note that CUPTI is optional but recommended reduces barrier-to-entry
- Practical troubleshooting: "CUPTI Warning", "Inconsistent Results", and "Reference Check Failures" sections preempt common issues
- Copy-paste examples: Quick examples for decode, prefill, FP8 GEMM, MOE provide instant starting points
- Metrics explanation: min/max/mean/TFLOPS/TB/s breakdown helps interpret results
The comparison table at line 394–402 is especially valuable for decision-making.
To confirm correctness, please verify:

- CUPTI requirement (CUDA 13+) against `cupti-python` package specs
- Routine names in examples (e.g., `BatchDecodeWithPagedKVCacheWrapper`) match the actual benchmark implementation
- `bench_gpu_time()` function signature and parameters match the flashinfer.testing module API

.claude/skills/add-cuda-kernel/skill.md (1)
1-475: Exemplary step-by-step tutorial with complete working example. This is a high-quality developer guide that walks through a realistic workflow. Strengths:
- Progressive complexity: Steps build logically from kernel → launcher → binding → JIT → Python → tests → AOT → export
- Framework separation principle: Clear explanation of why `include/` is framework-agnostic and `csrc/` handles PyTorch bindings (lines 447–452)
- TVM-FFI error handling: Concrete pattern with `TVM_FFI_THROW(ErrorType) << "message"` and examples of `ValueError`, `TypeError` (lines 166–171)
- Architecture specialization guidance: Optional section (lines 228–273) shows common patterns without overwhelming the simple case
- Test completeness: Covers correctness, in-place operations, error cases, and dtype variations
- End-to-end perspective: References section (464–475) lists all files created/modified for clarity
The use of element-wise scale as the running example is appropriate—simple enough to follow, complex enough to be realistic.
Actionable comments posted: 5
🧹 Nitpick comments (2)
.claude/skills/add-cuda-kernel/skill.md (2)
208-212: Add validation when copying source files in JIT generator. The code assumes `scale.cu` and `scale_jit_binding.cu` exist at `jit_env.FLASHINFER_CSRC_DIR`. If these files are missing or moved, the copy silently fails. Add existence checks or error messages to help users debug issues.

🔎 Proposed defensive addition

 # Copy source files (no Jinja needed for this simple case)
 sources = []
 for fname in ["scale.cu", "scale_jit_binding.cu"]:
     src_path = jit_env.FLASHINFER_CSRC_DIR / fname
+    if not src_path.exists():
+        raise FileNotFoundError(f"Source file not found: {src_path}")
     dest_path = gen_directory / fname
     shutil.copy(src_path, dest_path)
     sources.append(dest_path)
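A standalone version of that defensive copy loop can be run outside the repo. The directories below are temp-dir stand-ins for `jit_env.FLASHINFER_CSRC_DIR` and the JIT generation directory; the function name `copy_sources` is illustrative, not FlashInfer's API.

```python
import shutil
import tempfile
from pathlib import Path


def copy_sources(csrc_dir: Path, gen_directory: Path, filenames):
    """Copy kernel sources into the JIT gen directory, failing loudly on gaps."""
    sources = []
    for fname in filenames:
        src_path = csrc_dir / fname
        if not src_path.exists():  # fail loudly instead of silently skipping
            raise FileNotFoundError(f"Source file not found: {src_path}")
        dest_path = gen_directory / fname
        shutil.copy(src_path, dest_path)
        sources.append(dest_path)
    return sources


# Demo in a throwaway directory:
tmp = Path(tempfile.mkdtemp())
src, gen = tmp / "csrc", tmp / "gen"
src.mkdir()
gen.mkdir()
(src / "scale.cu").write_text("// kernel launcher")
copied = copy_sources(src, gen, ["scale.cu"])
```

With the existence check in place, a missing or relocated source file surfaces immediately as a `FileNotFoundError` naming the bad path, rather than as a confusing compile error later.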
228-374: Consider separating CUDA architecture guidance into a reference doc. The "Specifying Supported CUDA Architectures" section (lines 228–374) is comprehensive and valuable but interrupts the step-by-step flow of the main tutorial. Readers new to FlashInfer may find this overwhelming mid-tutorial. Consider moving this to a separate reference document (e.g., `.claude/skills/cuda-architecture-reference.md`) and linking to it from Step 4 with a note like: "⚠️ Advanced: If your kernel only supports specific GPU architectures, see CUDA Architecture Reference." This keeps the main tutorial focused on the happy path while preserving detailed guidance for advanced users.
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- .claude/skills/add-cuda-kernel/skill.md (1 hunks)
- CLAUDE.md (1 hunks)
🧰 Additional context used
🪛 LanguageTool
CLAUDE.md
[uncategorized] ~508-~508: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...dsl.html> - Tip: Add .md to get markdown format: <https://docs.nvidia.com/cutlas...
(MARKDOWN_NNP)
[uncategorized] ~514-~514: Did you mean the formatting language “Markdown” (= proper noun)?
Context: ...Tip**: Add .md to any page URL to get markdown format - Use for: Low-level instructi...
(MARKDOWN_NNP)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/add-cuda-kernel/skill.md
566-566: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (6)
CLAUDE.md (6)
31-54: Installation and quick-start section is clear and actionable. The instructions correctly emphasize the importance of the `--recursive` flag, the role of `--no-build-isolation`, and JIT's convenience for development. The section successfully communicates that no manual rebuild is needed after kernel changes.
174-240: JIT compilation architecture explanation is thorough and well-structured. The three-layer breakdown (JitSpec, Code Generation, Compilation and Loading) with concrete code examples and comments is excellent pedagogical content. The note about optional Jinja templates and the pattern template in Layer 2 clearly guide developers on when templating is necessary.
303-322: Adding a New Operation section provides good guidance with concrete examples. References to reference implementations (RMSNorm, sampling, decode) at different complexity levels help developers find appropriate starting points. The 9-step overview is clear and comprehensive.
459-475: TVM-FFI section clearly explains cross-language capabilities and current constraints. The explanation that FlashInfer currently provides PyTorch bindings while the underlying kernels are framework-agnostic is important context. This section successfully sets expectations for future multi-framework support.
490-524: External documentation resources section is well-curated and actionable. The inclusion of specific tips (e.g., "Read source code directly" for CUTLASS, "Add `.md` to get Markdown format" for PTX ISA) demonstrates domain knowledge. The "When to Consult These Docs" subsection provides practical guidance for developers on which resource to use for different tasks.
1-30: Request verification: Confirm that referenced skill.md files exist in this PR. This document references three external skill files at lines 158, 195, and 384:

- .claude/skills/benchmark-kernel/skill.md
- .claude/skills/add-cuda-kernel/skill.md
- .claude/skills/debug-cuda-crash/skill.md

These are important reference points for developers. Verify that these files are included in this PR and are accessible at the specified paths.
|
|
||
| ## Summary of Files Created/Modified | ||
|
|
||
| ``` |
Add language identifier to code block.
The fenced code block listing file paths is missing a language specification. Add `bash`, `text`, or `diff` as appropriate.
-```
+```text
include/flashinfer/scale.cuh # NEW: CUDA kernel definition
csrc/scale.cu # NEW: PyTorch launcher
csrc/scale_jit_binding.cu # NEW: TVM-FFI binding
flashinfer/jit/scale.py # NEW: JIT generator
flashinfer/scale.py # NEW: Python API
flashinfer/__init__.py # MODIFIED: Export API
flashinfer/aot.py # MODIFIED: Register AOT
tests/test_scale.py # NEW: Unit tests
-```
+```

🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
566-566: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In .claude/skills/add-cuda-kernel/skill.md around line 566, the fenced code
block that lists file paths is missing a language identifier; update the opening
fence from ``` to ```text (or ```bash/```diff if preferred) so the block becomes
a proper fenced code block with a language specifier, leaving the file list
contents and closing fence unchanged.
| ``` | ||
| flashinfer/ | ||
| ├── include/flashinfer/ # Header-only CUDA kernel templates | ||
| │ ├── attention/ # Attention kernels | ||
| │ ├── gemm/ # GEMM kernels | ||
| │ ├── comm/ # Communication kernels | ||
| │ ├── mma.cuh # Matrix multiply utilities | ||
| │ ├── utils.cuh # Common utilities | ||
| │ └── [...] | ||
| │ | ||
| ├── csrc/ # Framework bindings (via TVM-FFI) | ||
| │ ├── *.cu # Kernel launcher implementations | ||
| │ ├── *_jit_binding.cu # TVM-FFI exports | ||
| │ ├── *_customize_config.jinja # Type config templates (optional) | ||
| │ └── [...] | ||
| │ | ||
| ├── flashinfer/ # Python package | ||
| │ ├── jit/ | ||
| │ │ ├── core.py # JitSpec, compilation infrastructure | ||
| │ │ ├── cpp_ext.py # Ninja build generation | ||
| │ │ ├── env.py # Workspace paths | ||
| │ │ ├── attention/ # Attention module generators | ||
| │ │ ├── gemm/ # GEMM module generators | ||
| │ │ ├── fused_moe/ # MOE module generators | ||
| │ │ └── [...] | ||
| │ ├── *.py # High-level Python APIs | ||
| │ ├── aot.py # AOT compilation for pre-built packages | ||
| │ └── [...] | ||
| │ | ||
| └── build_backend.py # PEP 517 build backend | ||
| ``` |
Add language specifier to directory structure code block.
The fenced code block at line 264 lacks a language specifier. While the tree structure is readable, adding a language identifier improves rendering consistency.
🔎 Proposed fix
-```
+```
flashinfer/
├── include/flashinfer/ # Header-only CUDA kernel templates
│   ├── attention/          # Attention kernels

Note: The language specifier for a plain tree structure may be left empty or use `text`, but consistency with other code blocks is recommended.
Committable suggestion skipped: line range outside the PR's diff.
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
🤖 Prompt for AI Agents
In CLAUDE.md around lines 264 to 294, the fenced code block containing the
repository tree lacks a language specifier; update the opening fence to include
a language (e.g., ```text) so the block renders consistently (use the same
specifier as other code blocks in the file for consistency).
| torch.cuda.synchronize() # ← Flushes printf output | ||
| ``` | ||
|
|
||
| ### ⚠️ Warp-Specialized Kernels: Use `lane_id == 0` |
could this suggestion be too narrow? i think often warp-specialization is across a warp group with four warps, but lane id is the local id to one warp? and in theory there could also be more than one warp group involved in the same specialization group/task?
Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
|
left inline suggestions and some comments, otherwise LGTM
bkryu
left a comment
Left some minor comments. This seems exciting!
Co-authored-by: Alex Yang <aleozlx@gmail.com>
| - Set appropriate tolerances for each dtype | ||
| - Test error cases | ||
|
|
||
| ## Step 7: Register in AOT |
Should we mark this as optional?
|
|
||
| ## Step 3: Create TVM-FFI Binding in `csrc/` | ||
|
|
||
| Create `csrc/scale_jit_binding.cu`: |
Why not have this in the same file as the launcher?
Separating them can accelerate compilation time (in terms of latency), however, if kernel compilation time is long enough, it doesn't matter.
| Run all tests: | ||
|
|
||
| ```bash | ||
| pytest tests/ |
Not recommended: some tests adjust state that will negatively impact subsequent tests (e.g., global variables). Other tests may forget to clean up space, leading to IMA's or segfaults in subsequent tests.
setting aside IMA, for other reasons that sounds like the tests would be written incorrectly tho
this should be supported from least surprising principle, except IMA is an inevitable edge case if it occurs
I agree about the root cause, but we want Claude to know that this is a current issue and may misinform it with lots of false negatives.
CLAUDE.md
Outdated
| 8. Register in `flashinfer/aot.py` for AOT compilation | ||
| 9. Export in `flashinfer/__init__.py` | ||
|
|
||
| **Reference implementations:** |
Example implementations.
Reference sounds like it could be used for accuracy testing.
Actionable comments posted: 3
♻️ Duplicate comments (4)
.claude/skills/add-cuda-kernel/skill.md (4)
463-463: Update Python API description to mention @flashinfer_api. The key points section should mention that adding the `@flashinfer_api` decorator enables logging and sets it apart from helper functions. Update the bullet point to include this guidance.

🔎 Proposed fix

 **Key points:**
 - Uses `@functools.cache` to cache compiled modules
-- Clean Python API with docstring
+- Clean Python API with docstring and `@flashinfer_api` decorator (enables logging and signals public API)
 - Handles output allocation
 - Validates inputs using `@backend_requirement` decorator
414-419: Add @flashinfer_api decorator to public API function. The `scale` function is a public API that users are expected to call directly, but it's missing the `@flashinfer_api` decorator. This decorator serves two purposes: (a) it signals that this is a public API (not a helper function), and (b) it enables function logging when logging mode is on. This was flagged in previous reviews and should be added.

🔎 Proposed fix to add decorator

 @backend_requirement(
     backend_checks={},  # No backend choices for this simple kernel
     common_check=_check_scale_problem_size,
 )
+@flashinfer_api
 def scale(input: torch.Tensor, factor: float, out: Optional[torch.Tensor] = None) -> torch.Tensor:
82-92: Add error handling to DISPATCH_DTYPE macro for unsupported dtypes. The macro silently does nothing if an unsupported dtype is passed. Tutorial readers will encounter mysterious silent failures. Add an else clause that throws an error to provide immediate, actionable feedback (as noted in previous reviews).

🔎 Proposed fix for error handling

 #define DISPATCH_DTYPE(dtype, DType, ...) \
   if (dtype == torch::kFloat16) {         \
     using DType = half;                   \
     __VA_ARGS__                           \
   } else if (dtype == torch::kBFloat16) { \
     using DType = __nv_bfloat16;          \
     __VA_ARGS__                           \
   } else if (dtype == torch::kFloat32) {  \
     using DType = float;                  \
     __VA_ARGS__                           \
-  }
+  } else {                                \
+    throw std::runtime_error("Unsupported dtype"); \
+  }
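The fail-loudly behavior requested for the C++ macro can be illustrated with a small Python sketch of the same dispatch pattern. Everything here is a hypothetical stand-in (the dtype names, the `dispatch_dtype` helper, and the "launch" string); the point is only that an unsupported dtype raises instead of silently doing nothing.

```python
def dispatch_dtype(dtype: str, kernel):
    # Map a runtime dtype name to a concrete element type, mirroring
    # DISPATCH_DTYPE; unsupported dtypes raise instead of silently no-opping.
    supported = {
        "float16": "half",
        "bfloat16": "__nv_bfloat16",
        "float32": "float",
    }
    if dtype not in supported:
        raise ValueError(f"Unsupported dtype: {dtype}")
    return kernel(supported[dtype])


# Supported dtype: dispatch proceeds to the (stand-in) kernel launch.
result = dispatch_dtype("float32", lambda ctype: f"launched ScaleKernel<{ctype}>")

# An unsupported dtype now fails loudly at the dispatch site:
# dispatch_dtype("int8", ...) raises ValueError("Unsupported dtype: int8")
```

The error message names the offending dtype, so a tutorial reader sees an actionable failure at the call site rather than an unmodified output tensor.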
730-730: Add language identifier to fenced code block. The code block listing file paths is missing a language specification. Add `text` as the language identifier.

🔎 Proposed fix

-```
+```text
 include/flashinfer/scale.cuh   # NEW: CUDA kernel definition
 csrc/scale.cu                  # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu      # NEW: TVM-FFI binding
 flashinfer/jit/scale.py        # NEW: JIT generator
 flashinfer/scale.py            # NEW: Python API
 flashinfer/__init__.py         # MODIFIED: Export API
 flashinfer/aot.py              # MODIFIED: Register AOT
 tests/test_scale.py            # NEW: Unit tests
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.claude/skills/add-cuda-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
489-489: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
512-512: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
545-545: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
730-730: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
1-20: Overall structure and scope are well-designed. The tutorial provides excellent step-by-step guidance for adding a CUDA kernel to FlashInfer, with clear examples, proper separation of concerns (CUDA kernel, launchers, bindings, JIT, Python API), comprehensive decorator documentation, and good test coverage. The architecture specification section (lines 228–356) is particularly thorough and valuable for readers. Once the error handling, decorator, and markdown formatting issues are addressed, this will be a high-quality teaching resource.
|
|
||
| Enforces backend and problem size requirements at runtime. There are three usage patterns: | ||
|
|
||
| **Pattern 1: Single Backend (No Backend Choices)** |
Use proper markdown heading instead of emphasis.
Line 489 uses bold text (**Pattern 1: ...**) as a section heading. Use a proper markdown heading (e.g., #### Pattern 1: ...) for better document structure and accessibility.
🔎 Proposed fix
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 1: Single Backend (No Backend Choices)** | |
| #### Pattern 1: Single Backend (No Backend Choices) |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
489-489: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In .claude/skills/add-cuda-kernel/skill.md around line 489, the section title is
formatted using bold text instead of a Markdown heading; replace the bolded line
with an appropriate heading level (for example "#### Pattern 1: Single Backend
(No Backend Choices)") to improve document structure and accessibility, ensuring
the heading matches surrounding hierarchy and removing the bold-only styling.
| pass | ||
| ``` | ||
|
|
||
| **Pattern 2: Multiple Backends** |
Use proper markdown heading instead of emphasis.
Line 512 uses bold text (**Pattern 2: ...**) as a section heading. Use a proper markdown heading instead.
🔎 Proposed fix
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 2: Multiple Backends** | |
| #### Pattern 2: Multiple Backends |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
512-512: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
In file .claude/skills/add-cuda-kernel/skill.md around line 512, the section
title is formatted as bold text ("**Pattern 2: Multiple Backends**") instead of
a markdown heading; replace the bold emphasis with an appropriate markdown
heading (e.g., prepend one or two hash marks like "# Pattern 2: Multiple
Backends" or "## Pattern 2: Multiple Backends" depending on document hierarchy)
so the section becomes a proper heading and renders correctly in markdown.
| pass | ||
| ``` | ||
|
|
||
| **Pattern 3: Auto Backend Selection** |
Use proper markdown heading instead of emphasis.
Line 545 uses bold text (**Pattern 3: ...**) as a section heading. Use a proper markdown heading instead.
🔎 Proposed fix
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
| **Pattern 3: Auto Backend Selection** | |
| #### Pattern 3: Auto Backend Selection |
🧰 Tools
🪛 markdownlint-cli2 (0.18.1)
545-545: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
🤖 Prompt for AI Agents
.claude/skills/add-cuda-kernel/skill.md around line 545: the section title is
written using bold emphasis (**Pattern 3: Auto Backend Selection**) instead of a
proper markdown heading; replace the bold text with an appropriate markdown
heading (e.g., prefix with one or more # characters depending on desired level,
such as "### Pattern 3: Auto Backend Selection") to make it a real section
header and update surrounding spacing to conform to markdown heading
conventions.
- Add documentation explaining when to use each error macro:
  - TVM_FFI_THROW for normal runtime errors
  - TVM_FFI_LOG_AND_THROW for construction-time errors that may not propagate
- Add reminder to keep docs in sync with infrastructure changes

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Actionable comments posted: 2
♻️ Duplicate comments (6)
.claude/skills/add-cuda-kernel/skill.md (6)
82-92: Else clause still missing from DISPATCH_DTYPE macro. This issue was raised in a prior review and marked as addressed, but the else clause that throws an error for unsupported dtypes is still absent. Readers following this tutorial will encounter silent failures if they pass an unsupported dtype.
🔎 Proposed fix
```diff
 #define DISPATCH_DTYPE(dtype, DType, ...)        \
   if (dtype == torch::kFloat16) {                \
     using DType = half;                          \
     __VA_ARGS__                                  \
   } else if (dtype == torch::kBFloat16) {        \
     using DType = __nv_bfloat16;                 \
     __VA_ARGS__                                  \
   } else if (dtype == torch::kFloat32) {         \
     using DType = float;                         \
     __VA_ARGS__                                  \
-  }
+  } else {                                             \
+    throw std::runtime_error("Unsupported dtype");     \
+  }
```
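As a side note for readers of this thread, the fail-loudly principle behind this fix can be sketched framework-free in plain Python (names hypothetical, not FlashInfer's actual code): a dispatch table that raises on unsupported dtypes avoids the silent fall-through the macro had.

```python
# Hypothetical sketch: map dtype names to the C type used in the kernel,
# raising on anything unsupported instead of silently doing nothing.
_CTYPE_FOR_DTYPE = {
    "float16": "half",
    "bfloat16": "__nv_bfloat16",
    "float32": "float",
}

def ctype_for(dtype: str) -> str:
    try:
        return _CTYPE_FOR_DTYPE[dtype]
    except KeyError:
        # Mirrors the else clause added to DISPATCH_DTYPE above.
        raise ValueError(f"Unsupported dtype: {dtype}") from None
```

The same design choice applies in both languages: an unsupported dtype should be an immediate, descriptive error, never a no-op.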
450-455: Add `@flashinfer_api` decorator to public scale function. The `scale()` function is a public API that users will call directly. Add the `@flashinfer_api` decorator to mark it as public and enable logging support.

🔎 Proposed fix

```diff
 @backend_requirement(
     backend_checks={},  # No backend choices for this simple kernel
     common_check=_check_scale_problem_size,
 )
+@flashinfer_api
 def scale(input: torch.Tensor, factor: float, out: Optional[torch.Tensor] = None) -> torch.Tensor:
```
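For context, the destination-passing convention this API follows (optional `out` buffer, allocated only when absent) can be illustrated without torch; this is a hedged, pure-Python sketch, not FlashInfer's implementation:

```python
from typing import List, Optional

def scale(inp: List[float], factor: float,
          out: Optional[List[float]] = None) -> List[float]:
    # Destination-passing style: allocate only when the caller
    # did not supply an output buffer, otherwise validate and reuse it.
    if out is None:
        out = [0.0] * len(inp)
    elif len(out) != len(inp):
        raise ValueError("out must match input shape")
    for i, x in enumerate(inp):
        out[i] = x * factor
    return out
```

Callers that manage their own memory pass `out` and get it back; casual callers omit it and let the function allocate.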
525-526: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)
```
548-549: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends
```
581-582: Convert bold text to proper markdown heading. Replace the bold text with a markdown heading (`####`) for proper document structure and accessibility.

🔎 Proposed fix

```diff
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection
```
765-775: Add language identifier to code block. The fenced code block listing file paths is missing a language specification. Add `text` as the language identifier.

🔎 Proposed fix

````diff
-```
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu           # NEW: TVM-FFI binding
 flashinfer/jit/scale.py             # NEW: JIT generator
 flashinfer/scale.py                 # NEW: Python API
 flashinfer/__init__.py              # MODIFIED: Export API
 flashinfer/aot.py                   # MODIFIED: Register AOT
 tests/test_scale.py                 # NEW: Unit tests
 ```
````
🧹 Nitpick comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
295-302: Clarify GPU architecture terminology. References to "SM90+" in this section should be more precise. Instead of "SM90+", specify exact supported architectures (e.g., "Hopper SM90, Blackwell SM100, and newer") to avoid implying unsupported future architectures. This aligns with FlashInfer's backwards compatibility approach.

🔎 Proposed fix

````diff
 #### How CompilationContext Works

 **Automatic Detection** (default):

 ```python
 from flashinfer.compilation_context import CompilationContext

 ctx = CompilationContext()
-# Automatically detects all GPUs in the system
-# For SM90+, adds 'a' suffix (e.g., 9.0a for Hopper)
+# Automatically detects all GPUs in the system
+# For Hopper (SM90) and newer, adds 'a' suffix (e.g., 9.0a for Hopper)
 # Result: ctx.TARGET_CUDA_ARCHS = {(9, '0a'), (10, '0a'), ...}
````
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (2)
- `.claude/skills/add-cuda-kernel/skill.md`
- `CLAUDE.md`
🚧 Files skipped from review as they are similar to previous changes (1)
- CLAUDE.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~180-~180: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
264-264: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
- Add @flashinfer_api decorator to scale example
- Add nvidia-cutlass-dsl pip package mention for CuTe DSL
- Add SM12x mention for Blackwell architecture
- Add else clause to DISPATCH_DTYPE with TVM_FFI_THROW
- Simplify warp specialization guidance (no assumed group size)
- Add destination passing style API design guidance
- Clarify AOT registration purpose
- Add note about OOM handling in tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…mbreughe

- Update rotate_buffers to cold_l2_cache=True
- Add CUDA events accuracy note for fast kernels (5-50 us)
- Add multi-GPU test command
- Fix SM90+ statements to not imply future compatibility
- Add @backend_requirement decorated APIs info
- Add Python API subdirectories to directory structure
- Add tests/ directory structure
- Rename 'Reference implementations' to 'Example implementations'
- Rename scale_run to run in binding example
- Update test location guidance to use specific subdirs

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Actionable comments posted: 0
🧹 Nitpick comments (1)
.claude/skills/add-cuda-kernel/skill.md (1)
175-210: Simplify phrase on line 182. The phrase "almost never fails in practice" is slightly wordy. Consider "rarely fails in practice" for conciseness.

```diff
- 3. The error condition almost never fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
+ 3. The error condition rarely fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (4)
- `.claude/skills/add-cuda-kernel/skill.md`
- `.claude/skills/benchmark-kernel/skill.md`
- `.claude/skills/debug-cuda-crash/skill.md`
- `CLAUDE.md`
🚧 Files skipped from review as they are similar to previous changes (1)
- .claude/skills/benchmark-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~182-~182: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
CLAUDE.md
105-105: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
202-202: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
208-208: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
.claude/skills/debug-cuda-crash/skill.md
531-531: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
554-554: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (14)
CLAUDE.md (6)
1-65: LGTM on quick start sections. Clear, practical guidance for installation and initial setup. Commands are accurate and the emphasis on submodule initialization is appropriate.
105-119: Add language specifier to fenced code block. Line 105 code block is missing a language identifier for proper syntax highlighting.

🔎 Proposed fix

````diff
-**Example:**
-```python
+**Example:**
+```python
 from flashinfer.utils import is_sm90a_supported
````
131-165: LGTM on benchmarking section. Clear guidance with practical examples and appropriate cross-references to skill documentation.
202-234: Add language specifiers to fenced code blocks. Lines 202 and 208 code blocks are missing language identifiers for syntax highlighting.

🔎 Proposed fix

````diff
-### Layer 2: Code Generation
-
-Every `gen_*_module()` function in `flashinfer/jit/` follows this pattern:
-
-```
+### Layer 2: Code Generation
+
+Every `gen_*_module()` function in `flashinfer/jit/` follows this pattern:
+
+```python
 def gen_some_module(dtype_in, dtype_out, ...):
     # 1. Compute unique identifier from parameters
````

Also fix line 254:

````diff
-```jinja
+```jinja
 // Input template
 using DTypeIn = {{ dtype_in }};
````
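For readers unfamiliar with the `gen_*_module()` pattern referenced here (compute a unique URI from the parameters, then render templated sources), it can be sketched without the real JIT machinery; all names below are hypothetical stand-ins, with `str.format` in place of a jinja template:

```python
def gen_scale_module_spec(dtype_in: str, dtype_out: str) -> dict:
    # 1. Compute a unique identifier from the parameters so each
    #    dtype combination gets its own cached build artifact.
    uri = f"scale_{dtype_in}_{dtype_out}"
    # 2. Render the source template with concrete dtypes
    #    (stand-in for the jinja template used by real generators).
    template = "using DTypeIn = {dtype_in};\nusing DTypeOut = {dtype_out};\n"
    source = template.format(dtype_in=dtype_in, dtype_out=dtype_out)
    return {"uri": uri, "source": source}
```

The key property is that the URI is a pure function of the parameters, so identical requests resolve to the same build.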
268-334: LGTM on structure and adding operations sections. Clear directory organization with proper emphasis on framework separation. Good cross-references to step-by-step tutorials.
340-549: LGTM on architectural patterns and final sections. Comprehensive coverage of caching, dispatch macros, API logging, and development workflows. The closing section on AI agent best practices ties together the guidance well.
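The module-caching pattern those sections cover — compile once per parameter combination, reuse thereafter — can be sketched with `functools.lru_cache` (hypothetical names; the real cache lives in FlashInfer's JIT layer, not here):

```python
import functools

@functools.lru_cache(maxsize=None)
def get_scale_module(dtype_in: str, dtype_out: str) -> dict:
    # Stand-in for an expensive JIT compile: the body runs once per
    # (dtype_in, dtype_out) pair; later calls return the cached object.
    return {"uri": f"scale_{dtype_in}_{dtype_out}"}
```

Because the cache key is the full parameter tuple, two calls with the same dtypes share one compiled module while distinct dtype pairs compile independently.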
.claude/skills/debug-cuda-crash/skill.md (3)
1-100: LGTM on tutorial structure and initial sections. Clear step-by-step progression with practical examples. Environment variables and logging levels are well-documented.
300-400: LGTM on kernel debugging and advanced tools sections. Good coverage of compute-sanitizer, cuda-gdb, and kernel printf debugging. Warp-specialization guidance is clear.
400-569: LGTM on final sections and example. End-to-end debug session example is concrete and actionable. Summary effectively recaps the workflow.
.claude/skills/add-cuda-kernel/skill.md (5)
211-344: LGTM on JIT generator and architecture specification. Comprehensive guidance on module generation, architecture targeting, and CompilationContext usage. Real examples from the codebase are helpful.
414-508: LGTM on Python API implementation. Good coverage of decorators (@flashinfer_api, @backend_requirement), destination-passing pattern, and detailed decorator feature documentation. Docstring example is clear.
655-716: LGTM on testing section. Good parametrization strategy with multiple dtypes and sizes. Appropriate tolerance handling for different precisions and comprehensive error case coverage.
725-760: LGTM on AOT and export sections. Clear explanation of AOT purpose and practical examples for both registration and API export.
762-783: Add language specifier to file listing code block. Line 774 code block is missing a language identifier for proper rendering.

````diff
-## Summary of Files Created/Modified
-
-```
+## Summary of Files Created/Modified
+
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
````
@aleozlx @bkryu @nvmbreughe updated according to your suggestions!
Updated CUDA kernel implementation and TVM-FFI bindings for scaling functionality. Improved error handling and type dispatching.
Actionable comments posted: 0
🧹 Nitpick comments (6)
.claude/skills/add-cuda-kernel/skill.md (6)
498-498: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading for correct document structure.

```diff
-**Pattern 1: Single Backend (No Backend Choices)**
+#### Pattern 1: Single Backend (No Backend Choices)
```
521-521: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading.

```diff
-**Pattern 2: Multiple Backends**
+#### Pattern 2: Multiple Backends
```
554-554: Use proper markdown heading instead of emphasis. Convert bold emphasis to a level 4 heading.

```diff
-**Pattern 3: Auto Backend Selection**
+#### Pattern 3: Auto Backend Selection
```
741-741: Add language identifier to code block. The fenced code block listing file paths is missing a language specification for proper syntax highlighting.

````diff
-```
+```text
 include/flashinfer/scale.cuh        # NEW: CUDA kernel definition
 csrc/scale.cu                       # NEW: PyTorch launcher
 csrc/scale_jit_binding.cu           # NEW: TVM-FFI binding
 flashinfer/jit/scale.py             # NEW: JIT generator
 flashinfer/scale.py                 # NEW: Python API
 flashinfer/__init__.py              # MODIFIED: Export API
 flashinfer/aot.py                   # MODIFIED: Register AOT
 tests/test_scale.py                 # NEW: Unit tests
 ```
````
129-129: Simplify wordy phrase. Replace "almost never" with a more concise alternative.

```diff
- 3. The error condition almost never fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
+ 3. The error condition rarely fails in practice (e.g., internal errors, unsupported dtype combinations in dispatch macros)
```
240-247: Add language identifier to Python code block. The CompilationContext example is missing a language identifier for proper syntax highlighting.

````diff
-```python
+```python
 from flashinfer.compilation_context import CompilationContext

 ctx = CompilationContext()
````
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
.claude/skills/add-cuda-kernel/skill.md
🧰 Additional context used
🧠 Learnings (1)
📚 Learning: 2025-11-12T03:35:17.583Z
Learnt from: raayandhar
Repo: flashinfer-ai/flashinfer PR: 2070
File: include/flashinfer/gemm/bf16_gemm_cutlass_template.h:145-160
Timestamp: 2025-11-12T03:35:17.583Z
Learning: In flashinfer GEMM implementations (e.g., include/flashinfer/gemm/bf16_gemm_cutlass_template.h, fp8_gemm_cutlass_template.h), it is acceptable to catch and silently ignore std::runtime_error exceptions in getWorkspaceSizeImpl when probing multiple GEMM configurations, as some configurations may legitimately fail due to SMEM constraints. This pattern should include a comment like "// Swallow errors when SMEM exceeds maximum allowed" to document the rationale.
Applied to files:
.claude/skills/add-cuda-kernel/skill.md
🪛 LanguageTool
.claude/skills/add-cuda-kernel/skill.md
[style] ~129-~129: ‘almost never’ might be wordy. Consider a shorter alternative.
Context: ...nitialization) 3. The error condition almost never fails in practice (e.g., internal error...
(EN_WORDINESS_PREMIUM_ALMOST_NEVER)
🪛 markdownlint-cli2 (0.18.1)
.claude/skills/add-cuda-kernel/skill.md
498-498: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
521-521: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
554-554: Emphasis used instead of a heading
(MD036, no-emphasis-as-heading)
741-741: Fenced code blocks should have a language specified
(MD040, fenced-code-language)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
- GitHub Check: Deploy Docs
🔇 Additional comments (3)
.claude/skills/add-cuda-kernel/skill.md (3)
421-425: Verify decorator usage matches FlashInfer API. The example uses `@flashinfer_api` and `@backend_requirement` decorators. Confirm these:
- Match the actual API surface in the codebase
- Accept the parameters shown (backend_checks, common_check)
- Are documented consistently across other examples
75-101: Verify TVM-FFI macro names and patterns. The launcher uses `DISPATCH_DLPACK_DTYPE_TO_CTYPE_FP32_FP16` and `TVM_FFI_ICHECK` utilities. Confirm these macros:

- Match actual definitions in `csrc/tvm_ffi_utils.h`
- Support the dtypes claimed (FP32, FP16, BF16 if applicable)
- Are the recommended pattern for similar operations
162-171: Verify TVM-FFI export syntax. Line 170 uses `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, scale_launcher)`. Confirm:
- This macro exists and is the current recommended export pattern
- The function signature and types are correctly handled
- Whether renaming the export to match the function name (as suggested in past comments) was considered
📌 Description
Add CLAUDE.md as contribution guide to agents (and human).
Add several skills (adding an CUDA operator to flashinfer, debug, profiling), this list will grow in the future.
🔍 Related Issues
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- I have installed the hooks with `pre-commit install`.
- I have run `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- Tests have been added or updated as needed (`unittest`, etc.).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: aleozlx <aleyang@nvidia.com>
Co-Authored-By: bkryu <bkryu@nvidia.com>
Co-Authored-By: nvmbreughe <nvmbreughe@nvidia.com>
Co-Authored-By: jimmyzho <jimmzhou@nvidia.com>
Co-Authored-By: cyx-6 <yaxingc@nvidia.com>
Summary by CodeRabbit
New Features
Tests
Chores
Documentation