bugfix: fix claude skills #2275

Merged
yzh119 merged 1 commit into flashinfer-ai:main from yzh119:make-skills-effective
Dec 31, 2025
Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Dec 31, 2025

📌 Description

Skills defined in #2240 do not take effect because of missing metadata and an incorrect file name.
This PR fixes the issue.

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • New Features

    • Scale kernel now available as a public API for end-users
    • New benchmarking guide and tools for kernel performance measurement
  • Documentation

    • Updated tutorial documentation with structured metadata
    • Added comprehensive benchmarking guidance with examples
  • Tests

    • Implemented unit tests to validate kernel functionality


@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

This change introduces a new CUDA scale kernel to the flashinfer public API, including tutorial documentation with benchmarking guidance, a Python API wrapper, test coverage, and a benchmark script demonstrating performance measurement across multiple sizes and data types.

Changes

| Cohort / File(s) | Summary |
|---|---|
| Tutorial Documentation: `.claude/skills/add-cuda-kernel/SKILL.md`, `.claude/skills/benchmark-kernel/SKILL.md`, `.claude/skills/debug-cuda-crash/SKILL.md` | Added new "Step 10: Add Benchmark" tutorial section with `benchmarks/bench_scale.py` guidance; added YAML front matter metadata headers to skill files for better organization and discoverability. |
| Core API & Registration: `flashinfer/scale.py`, `flashinfer/__init__.py`, `flashinfer/aot.py` | Introduced new `scale.py` module as public Python API for the CUDA scale kernel; updated `__init__.py` to export the new API; modified `aot.py` to register AOT components. |
| Tests & Benchmarks: `tests/test_scale.py`, `benchmarks/bench_scale.py` | Added unit tests for scale kernel validation; created benchmark script measuring `flashinfer.scale` performance across multiple sizes and data types using CUPTI fallback timing. |

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Poem

🐰 A kernel scales so bright and new,
With benchmarks charting what it can do,
CUPTI whispers timings true,
From tiny ops to massive crews—
The scale kernel hops right through! ⚡

Pre-merge checks

❌ Failed checks (1 inconclusive)
| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Title check | ❓ Inconclusive | The title 'bugfix: fix claude skills' is vague and generic, using non-descriptive terms that don't convey meaningful information about what specific skills are being fixed or what the actual changes accomplish. | Use a more specific title that describes the actual changes, such as 'Add YAML metadata to skill definition files' or 'Fix skill definitions with required metadata and correct filenames'. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description check | ✅ Passed | The description explains the issue (missing metadata and wrong filename in skills) and references PR #2240, meeting the basic requirements of the template with a clear explanation and completed checklists. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 835a015 and 48b6e4e.

📒 Files selected for processing (3)
  • .claude/skills/add-cuda-kernel/SKILL.md
  • .claude/skills/benchmark-kernel/SKILL.md
  • .claude/skills/debug-cuda-crash/SKILL.md
🧰 Additional context used
🧠 Learnings (8)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Keep documentation in CLAUDE.md and `.claude/skills/` files in sync with code changes, including infrastructure changes, new patterns, deprecated approaches, and new error handling utilities
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Use `FLASHINFER_CUDA_ARCH_LIST` environment variable to specify target GPU architectures (e.g., '8.0 9.0a') and `FLASHINFER_NVCC_THREADS` to control parallel compilation threads
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Keep documentation in CLAUDE.md and `.claude/skills/` files in sync with code changes, including infrastructure changes, new patterns, deprecated approaches, and new error handling utilities

Applied to files:

  • .claude/skills/benchmark-kernel/SKILL.md
  • .claude/skills/debug-cuda-crash/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • .claude/skills/benchmark-kernel/SKILL.md
  • .claude/skills/add-cuda-kernel/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • .claude/skills/add-cuda-kernel/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • .claude/skills/add-cuda-kernel/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Applied to files:

  • .claude/skills/add-cuda-kernel/SKILL.md
  • .claude/skills/debug-cuda-crash/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures

Applied to files:

  • .claude/skills/add-cuda-kernel/SKILL.md
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation

Applied to files:

  • .claude/skills/add-cuda-kernel/SKILL.md
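
The `functools.cache` learning above can be illustrated with a minimal sketch. The names `get_scale_module` and `scale` are hypothetical stand-ins, not FlashInfer's actual functions, and the "module" is a placeholder dictionary rather than a compiled kernel. The sketch only shows that a cached loader runs its expensive body once, however often the public function is called.

```python
import functools


@functools.cache
def get_scale_module():
    """Hypothetical loader standing in for an expensive JIT build-and-load step."""
    print("building module (this should be printed only once)")
    return {"name": "scale"}  # placeholder for a compiled module object


def scale(values, factor):
    """Hypothetical public API: fetch the cached module, then do the work."""
    get_scale_module()  # first call builds; later calls hit the cache
    return [v * factor for v in values]


if __name__ == "__main__":
    scale([1.0, 2.0], 2.0)
    scale([3.0], 0.5)
    # cache_info() reports how many calls were served from the cache
    print(get_scale_module.cache_info())
```

Because the loader takes no arguments, `functools.cache` stores a single entry; a loader parameterized by dtype or architecture would get one cached entry per distinct argument tuple.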
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs
🔇 Additional comments (4)
.claude/skills/benchmark-kernel/SKILL.md (1)

1-4: YAML front matter correctly adds metadata to activate skill.

The standard YAML format with name and description fields properly addresses the missing metadata issue preventing skill activation. Format is consistent with expected Claude skill system conventions.

.claude/skills/debug-cuda-crash/SKILL.md (1)

1-4: YAML front matter correctly formats skill metadata.

Consistent with other skill files, the front matter properly activates the skill with standardized name and description fields.

.claude/skills/add-cuda-kernel/SKILL.md (2)

1-4: YAML front matter properly activates the skill.

Metadata format is consistent across all three skill files and correctly addresses the missing metadata issue from PR objectives.


744-792: Step 10 benchmark guidance appropriately completes the kernel development workflow.

The new section:

  • Establishes benchmark requirement as best practice (line 746)
  • Provides a practical, complete example for simple kernels using bench_gpu_time (lines 750-783)
  • References the detailed benchmarking skill for complex cases (line 791)
  • Cross-references improve discoverability and prevent documentation fragmentation

The guidance is well-calibrated: simple kernels get a standalone example, complex kernels are directed to the unified flashinfer_benchmark.py framework and the dedicated benchmark-kernel skill tutorial.



@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request addresses a bug where Claude skills were not functioning as intended due to inconsistencies in file naming conventions and a lack of necessary metadata. By standardizing skill file names to SKILL.md and embedding descriptive YAML front matter, the PR ensures that these skills are properly recognized and activated. Additionally, it enhances the add-cuda-kernel tutorial by integrating a new section on benchmarking, providing a more comprehensive guide for developers.

Highlights

  • Skill File Renaming: All skill definition files within the .claude/skills directory have been renamed from lowercase skill.md to the uppercase file name SKILL.md to ensure proper recognition and functionality.
  • Metadata Addition to Skill Files: YAML front matter, including name and description fields, has been added to each skill's SKILL.md file. This provides essential metadata for the Claude skills to be correctly identified and utilized.
  • Enhanced CUDA Kernel Tutorial: The add-cuda-kernel skill tutorial has been updated to include a new 'Step 10: Add Benchmark' section. This section provides guidance and a code example for creating benchmark scripts for new kernels, promoting performance tracking.
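
The two fixes summarized above (the exact file name `SKILL.md` and front matter with `name` and `description`) can be checked mechanically. The following is a minimal sketch, not part of this PR: the helper names `parse_front_matter` and `check_skill_dir` are hypothetical, and the parser only understands flat `key: value` lines, so it is not a full YAML parser.

```python
from pathlib import Path

REQUIRED_KEYS = ("name", "description")


def parse_front_matter(text):
    """Return top-level `key: value` pairs from a leading '---' block, or None."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != "---":
        return None
    fields = {}
    for line in lines[1:]:
        if line.strip() == "---":
            return fields  # closing delimiter reached
        key, sep, value = line.partition(":")
        if sep and not line[0].isspace():  # ignore indented continuation lines
            fields[key.strip()] = value.strip()
    return None  # opening delimiter without a closing one


def check_skill_dir(skill_dir):
    """List problems for one skill directory; an empty list means it looks valid."""
    skill_dir = Path(skill_dir)
    # Compare directory entries exactly, so a lowercase skill.md is reported
    # even on case-insensitive filesystems.
    names = {p.name for p in skill_dir.iterdir()}
    if "SKILL.md" not in names:
        return ["missing SKILL.md (file name is case-sensitive)"]
    fields = parse_front_matter((skill_dir / "SKILL.md").read_text(encoding="utf-8"))
    if fields is None:
        return ["missing YAML front matter"]
    return [f"missing or empty field: {k}" for k in REQUIRED_KEYS if not fields.get(k)]


if __name__ == "__main__":
    root = Path(".claude/skills")  # directory named in this PR
    if root.is_dir():
        for skill_dir in sorted(p for p in root.iterdir() if p.is_dir()):
            print(skill_dir.name, check_skill_dir(skill_dir) or "ok")
```

Run from the repository root, this prints one line per skill directory; anything other than `ok` points at the kind of problem this PR fixes.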


Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request fixes the Claude skills by renaming the skill files to SKILL.md and adding the required metadata. It also enhances the add-cuda-kernel skill by adding a new step for benchmarking. The changes are correct and align with the PR's goal. I've found a minor issue in the example code for benchmarking and provided a suggestion to fix it.

Comment on lines +751 to +782
import torch
from flashinfer.testing import bench_gpu_time

def bench_scale():
    """Benchmark scale kernel."""
    import flashinfer

    sizes = [1024, 4096, 16384, 65536, 262144]
    dtypes = [torch.float16, torch.bfloat16]

    print("Scale Kernel Benchmark")
    print("-" * 60)
    print(f"{'Size':>10} {'Dtype':>10} {'Time (us)':>12} {'Std (us)':>10}")
    print("-" * 60)

    for size in sizes:
        for dtype in dtypes:
            x = torch.randn(size, dtype=dtype, device="cuda")

            # Benchmark with CUPTI (auto-fallback to CUDA events)
            median_time, std_time = bench_gpu_time(
                flashinfer.scale,
                args=(x, 2.0),
                enable_cupti=True,
                dry_run_iters=10,
                repeat_iters=100,
            )

            print(f"{size:>10} {str(dtype):>10} {median_time*1e6:>12.2f} {std_time*1e6:>10.2f}")

if __name__ == "__main__":
    bench_scale()

medium

The example benchmark script has a few issues that would prevent it from running correctly:

  1. It's missing an import for numpy, which is needed to calculate np.median and np.std.
  2. The bench_gpu_time function returns a list of execution times in milliseconds, not the median and standard deviation directly. The example code should be updated to calculate these statistics from the returned list.
  3. The conversion to microseconds (us) should be from milliseconds, so the multiplication factor should be 1000, not 1e6 (which would be for seconds-to-microseconds).

Here is a corrected version of the script.

import torch
import numpy as np
from flashinfer.testing import bench_gpu_time

def bench_scale():
    """Benchmark scale kernel."""
    import flashinfer

    sizes = [1024, 4096, 16384, 65536, 262144]
    dtypes = [torch.float16, torch.bfloat16]

    print("Scale Kernel Benchmark")
    print("-" * 60)
    print(f"{'Size':>10} {'Dtype':>10} {'Time (us)':>12} {'Std (us)':>10}")
    print("-" * 60)

    for size in sizes:
        for dtype in dtypes:
            x = torch.randn(size, dtype=dtype, device="cuda")

            # Benchmark with CUPTI (auto-fallback to CUDA events)
            times_ms = bench_gpu_time(
                flashinfer.scale,
                args=(x, 2.0),
                enable_cupti=True,
                dry_run_iters=10,
                repeat_iters=100,
            )
            # bench_gpu_time returns a list of times in milliseconds
            median_time_us = np.median(times_ms) * 1000
            std_time_us = np.std(times_ms) * 1000

            print(f"{size:>10} {str(dtype):>10} {median_time_us:>12.2f} {std_time_us:>10.2f}")

if __name__ == "__main__":
    bench_scale()
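
The statistics and unit conversion in points 2 and 3 can be checked without a GPU. The sketch below assumes, as the review comment states, that the timing function returns a list of per-iteration times in milliseconds. It does not call FlashInfer; `summarize_ms` is a hypothetical helper, and the sample timings are made up for illustration rather than measured.

```python
import statistics


def summarize_ms(times_ms):
    """Return (median_us, std_us) for per-iteration times given in milliseconds."""
    # 1 ms = 1000 us, so the factor is 1000 (1e6 would apply to seconds).
    median_us = statistics.median(times_ms) * 1000.0
    # pstdev is the population standard deviation, matching np.std's default (ddof=0).
    std_us = statistics.pstdev(times_ms) * 1000.0
    return median_us, std_us


if __name__ == "__main__":
    sample_ms = [0.010, 0.012, 0.011, 0.013, 0.009]  # made-up values, not measurements
    median_us, std_us = summarize_ms(sample_ms)
    print(f"{median_us:.2f} us (std {std_us:.2f} us)")
```

Using the standard-library `statistics` module keeps the check free of third-party dependencies; inside the benchmark script itself, `np.median` and `np.std` as in the corrected version give the same values.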

@yzh119 yzh119 merged commit 747b0cb into flashinfer-ai:main Dec 31, 2025
4 checks passed