Agent FFI integration #119
base: main
Conversation
Note: Other AI code review bot(s) detected. CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

Walkthrough

This pull request introduces comprehensive TVM FFI (Foreign Function Interface) support for CUDA kernel generation, compilation, and testing. It includes benchmark definitions, documentation, end-to-end examples, agent prompting infrastructure, and automated test utilities for validating LLM-generated CUDA kernels with TVM FFI bindings.

Changes
Sequence Diagram(s)

```mermaid
sequenceDiagram
participant User
participant TestPrompt as test_prompt.py
participant LLM as LLM API<br/>(OpenAI/Anthropic)
participant Compiler as tvm_ffi<br/>(Compiler)
participant TestKernel as CUDA Kernel
participant FileSystem as Output File
User->>TestPrompt: main()
loop For each model
TestPrompt->>LLM: call_*_model(prompt)
LLM-->>TestPrompt: CUDA code response
TestPrompt->>TestPrompt: extract_cuda_code()
TestPrompt->>Compiler: load_inline(cuda_sources)
Compiler-->>TestPrompt: compiled module
TestPrompt->>TestKernel: test_kernel(small_tensor)
TestKernel-->>TestPrompt: result
TestPrompt->>TestKernel: test_kernel(large_tensor)
TestKernel-->>TestPrompt: result
TestPrompt->>FileSystem: write results.txt<br/>(model, code, tests)
end
TestPrompt-->>User: summary report
```

```mermaid
sequenceDiagram
participant User
participant E2E as e2e_kernel.py
participant Benchmark as Benchmark<br/>(FlashInfer)
participant Agent as Agent Solution
participant TVM as TVM FFI<br/>(apply)
participant Torch as PyTorch
User->>E2E: main()
E2E->>Benchmark: load_traceset(path)
E2E->>Benchmark: run_all()
Benchmark-->>E2E: results (speedup)
E2E->>Torch: create A[1024×4096] FP16
E2E->>Torch: create B[4096×4096] FP16
E2E->>Torch: reference = A @ B.T
E2E->>TVM: apply(kernel_name,<br/>A, B,<br/>fallback=reference)
alt kernel execution
TVM->>Agent: execute gemm_n4096_k4096
Agent-->>TVM: result
TVM-->>E2E: kernel_result
else fallback
TVM-->>E2E: reference
end
E2E->>E2E: compare results<br/>vs reference
E2E-->>User: pass/tolerance verdict
```

Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~50 minutes
Pre-merge checks and finishing touches
❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
Summary of Changes

Hello @zanderjiang, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances FlashInfer-Bench by integrating agent-based kernel generation capabilities. It provides a structured framework for agents to produce optimized CUDA kernels with TVM FFI bindings, complete with detailed prompts and validation mechanisms. This integration aims to streamline the process of developing and benchmarking high-performance kernels by leveraging automated code generation, making it easier to explore and validate different kernel implementations.

Highlights
Code Review
This pull request introduces a new agent module for kernel generation using TVM FFI, which is a great addition. The documentation and examples are very comprehensive. I've found a few potential issues in the example code snippets related to integer types and kernel launch configurations. These could lead to incorrect code generation by agents or bugs in the examples themselves, especially for large inputs. Addressing these will improve the robustness and correctness of the provided agent prompts and examples.
```cpp
int64_t threads = 256;
int64_t blocks = (n + threads - 1) / threads;
AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
The `blocks` and `threads` variables are `int64_t` but are passed directly to the kernel launch configuration. This is incorrect, as the grid and block dimensions for a kernel launch expect `dim3` or `unsigned int`. Using `int64_t` will lead to truncation and incorrect behavior for large inputs.

```diff
-int64_t threads = 256;
-int64_t blocks = (n + threads - 1) / threads;
-AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
+const int threads = 256;
+const dim3 blocks((n + threads - 1) / threads);
+AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
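For context, a minimal self-contained sketch of what the corrected pattern looks like end to end — keeping the element count as `int64_t` in the kernel while narrowing only the launch configuration. The function names mirror the example under review, but this is an illustration, not the PR's code:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Element count stays int64_t so very large tensors never truncate;
// each thread guards its own index.
__global__ void AddOneKernel(const float* x, float* y, int64_t n) {
  int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  if (i < n) {
    y[i] = x[i] + 1.0f;
  }
}

// Host side: the block count is computed from the 64-bit size but handed to
// <<<...>>> as dim3 / unsigned int, which is what the launch syntax expects.
void LaunchAddOne(const float* x, float* y, int64_t n, cudaStream_t stream) {
  if (n == 0) return;  // avoid an invalid zero-block launch
  const int threads = 256;
  const dim3 blocks(static_cast<unsigned int>((n + threads - 1) / threads));
  AddOneKernel<<<blocks, threads, 0, stream>>>(x, y, n);
}
```

For sizes so large that even the block count overflows the grid limit, a grid-stride loop (sketched later in this review) removes the dependence on `n` entirely.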
```cpp
int64_t threads = 256;
int64_t blocks = (n + threads - 1) / threads;
AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);
```
The `blocks` and `threads` variables are `int64_t` but are passed directly to the kernel launch configuration. This is incorrect, as the grid and block dimensions for a kernel launch expect `dim3` or `unsigned int`. Using `int64_t` will lead to truncation and incorrect behavior for large inputs.

```diff
-int64_t threads = 256;
-int64_t blocks = (n + threads - 1) / threads;
-AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);
+const int threads = 256;
+const dim3 blocks((n + threads - 1) / threads);
+AddKernel<<<blocks, threads, 0, stream>>>(a_data, b_data, c_data, n);
```
```cpp
int64_t threads = 256;
int64_t blocks = (n + threads - 1) / threads;
AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
The `blocks` and `threads` variables are `int64_t` but are passed directly to the kernel launch configuration. This is incorrect, as the grid and block dimensions for a kernel launch expect `dim3` or `unsigned int`. Using `int64_t` will lead to truncation and incorrect behavior for large inputs.

```diff
-int64_t threads = 256;
-int64_t blocks = (n + threads - 1) / threads;
-AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
+const int threads = 256;
+const dim3 blocks((n + threads - 1) / threads);
+AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
```cpp
int64_t threads = 256;
int64_t blocks = (n + threads - 1) / threads;
cudaStream_t stream =
    static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
The `blocks` and `threads` variables are `int64_t` but are passed directly to the kernel launch configuration. This is incorrect, as the grid and block dimensions for a kernel launch expect `dim3` or `unsigned int`. Using `int64_t` will lead to truncation and incorrect behavior for large inputs.

```diff
-int64_t threads = 256;
-int64_t blocks = (n + threads - 1) / threads;
-cudaStream_t stream =
-    static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
-AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
+const int threads = 256;
+const dim3 blocks((n + threads - 1) / threads);
+cudaStream_t stream =
+    static_cast<cudaStream_t>(TVMFFIEnvGetStream(x.device().device_type, x.device().device_id));
+AddOneKernel<<<blocks, threads, 0, stream>>>(x_data, y_data, n);
```
```cpp
    const __half* A,
    const __half* B,
    __half* C,
    int M,
```
The use of `int` for dimension `M` in `gemm_n4096_k4096_launch` can lead to truncation if `M` exceeds `INT_MAX`. Since `M` is a variable dimension, it's safer to use `int64_t`. This also applies to the kernel implementation (`gemm_kernel` on line 273) and the host launcher function (`gemm_n4096_k4096_launch` on line 289). Consequently, the cast in `main.cpp` (line 353) would no longer be necessary.

```diff
-    int M,
+    int64_t M,
```
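To show how an `int64_t` `M` can flow through unchanged, here is a naive (non-tiled, non-Tensor-Core) sketch of the kernel and launcher for `C = A @ B.T` with the fixed N = K = 4096 from this definition. It is illustrative only — the constants follow the skeleton in `agent.md`, and a real submission would use the tiled approach the instructions describe:

```cuda
#include <cuda_fp16.h>
#include <cuda_runtime.h>
#include <cstdint>

constexpr int GEMM_N_CONST = 4096;
constexpr int GEMM_K_CONST = 4096;

// Naive reference: one thread per element of C = A @ B.T.
// The row index uses int64_t so it never truncates for large M.
__global__ void gemm_kernel(const __half* __restrict__ A, const __half* __restrict__ B,
                            __half* __restrict__ C, int64_t M) {
  int col = blockIdx.x * blockDim.x + threadIdx.x;                       // N dimension
  int64_t row = static_cast<int64_t>(blockIdx.y) * blockDim.y + threadIdx.y;  // M dimension
  if (row >= M || col >= GEMM_N_CONST) return;
  float acc = 0.0f;
  for (int k = 0; k < GEMM_K_CONST; ++k) {
    acc += __half2float(A[row * GEMM_K_CONST + k]) *
           __half2float(B[static_cast<int64_t>(col) * GEMM_K_CONST + k]);
  }
  C[row * GEMM_N_CONST + col] = __float2half(acc);
}

// M stays int64_t end to end; only the grid dimensions are narrowed.
// grid.y is M/16, which stays below the 65,535 y-limit for the M values
// in this benchmark's workloads (up to 8192).
void gemm_n4096_k4096_launch(const __half* A, const __half* B, __half* C,
                             int64_t M, cudaStream_t stream) {
  if (M <= 0) return;
  dim3 block(16, 16);
  dim3 grid((GEMM_N_CONST + block.x - 1) / block.x,
            static_cast<unsigned int>((M + block.y - 1) / block.y));
  gemm_kernel<<<grid, block, 0, stream>>>(A, B, C, M);
}
```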
|
|
```cpp
namespace my_kernels {

__global__ void AddOneKernel(float* x, float* y, int n) {
```
The kernel parameter `n` is of type `int`, but it's used to represent the size of a tensor, which can exceed `INT_MAX`. This can lead to truncation and incorrect behavior for large tensors. It should be `int64_t` to match the type of `x.size(0)`.

```diff
-__global__ void AddOneKernel(float* x, float* y, int n) {
+__global__ void AddOneKernel(float* x, float* y, int64_t n) {
```
|
|
```cpp
namespace my_kernels {

__global__ void AddOneKernel(float* x, float* y, int n) {
```
The kernel parameter `n` is of type `int`, but it's used to represent the size of a tensor, which can exceed `INT_MAX`. This can lead to truncation and incorrect behavior for large tensors. It should be `int64_t` to match the type of `x.size(0)`.

```diff
-__global__ void AddOneKernel(float* x, float* y, int n) {
+__global__ void AddOneKernel(float* x, float* y, int64_t n) {
```
|
|
```cpp
namespace tvm_ffi_example_cuda {

__global__ void AddOneKernel(float* x, float* y, int n) {
```
The kernel parameter `n` is of type `int`, but it's used to represent the size of a tensor, which can exceed `INT_MAX`. This can lead to truncation and incorrect behavior for large tensors. It should be `int64_t` to match the type of `x.size(0)`.

```diff
-__global__ void AddOneKernel(float* x, float* y, int n) {
+__global__ void AddOneKernel(float* x, float* y, int64_t n) {
```
Actionable comments posted: 6
♻️ Duplicate comments (1)
flashinfer_bench/agents/test_prompt.py (1)
98-132: Shared `test_kernel` helper to avoid duplication with `test_existing_prompt_sol.py`

The testing logic here matches the `test_kernel` implementation in `flashinfer_bench/agents/test_existing_prompt_sol.py` (same CUDA device assumption, tensor sizes, and assertions). As noted in the other file, factoring this into a shared helper would reduce duplication and keep your kernel tests in sync.
🧹 Nitpick comments (7)
flashinfer_bench/agents/load_inline.py (1)

39-41: Consider using or removing the `mod` variable.

The variable `mod` is assigned but never used. If this is a demonstration script focused on compilation success, consider either:
- Using the module (e.g., calling a function from it), or
- Removing the assignment: `tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)`

Apply this diff if the module isn't needed:

```diff
 def main():
-    mod: Module = tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)
+    tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)
     print("Compilation successful")
```
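If the first option (actually exercising the module) is preferred, a minimal sketch could look like the following. It assumes the module-level `cuda_source` string from the script and an exported function reachable as `mod.add_one`; the export name and the attribute-style access are assumptions based on how the PR's other test scripts appear to call compiled modules:

```python
import torch
import tvm_ffi

def main():
    mod = tvm_ffi.cpp.load_inline(name="add_one_cuda", cuda_sources=cuda_source)

    # Smoke-test the exported kernel instead of only checking compilation.
    x = torch.arange(8, dtype=torch.float32, device="cuda")
    y = torch.empty_like(x)
    mod.add_one(x, y)  # hypothetical export name; match the FFI binding in cuda_source
    torch.testing.assert_close(y, x + 1)
    print("Compilation and smoke test successful")
```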
flashinfer_bench/agents/test_results/o3.txt (1)

28-121: ElementwiseAdd TVM FFI wrapper looks solid; the only edge case is very large tensor sizes.

Validation (ndim, dtype, size, device, contiguity), stream acquisition via `TVMFFIEnvGetStream`, and the `n == 0` early return are all well-structured and align with TVM FFI usage. The only theoretical edge case is the cast to `int` when computing `blocks` from `int64_t n`:

```cpp
const int threads = 256;
const int blocks = static_cast<int>((n + threads - 1) / threads);
```

For realistic benchmark sizes this is fine, but if you ever push to extremely large 1D tensors (beyond what a single CUDA grid dimension can hold in `int`), you may want to guard or clamp `blocks` against device limits.
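One way to make the launch robust to arbitrarily large `n`, without depending on the grid dimension at all, is a grid-stride loop. The following is an illustrative sketch, not code from the generated solutions:

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Grid-stride loop: the grid is capped, and each thread walks the array in
// steps of gridDim.x * blockDim.x, so any int64_t n is covered correctly.
__global__ void ElementwiseAddKernel(const float* a, const float* b, float* c, int64_t n) {
  int64_t stride = static_cast<int64_t>(gridDim.x) * blockDim.x;
  for (int64_t i = static_cast<int64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
       i < n; i += stride) {
    c[i] = a[i] + b[i];
  }
}

void LaunchElementwiseAdd(const float* a, const float* b, float* c,
                          int64_t n, cudaStream_t stream) {
  if (n == 0) return;
  const int threads = 256;
  // Conservative cap on the block count; the stride loop picks up the remainder.
  int64_t want = (n + threads - 1) / threads;
  const int blocks = static_cast<int>(want > 65535 ? 65535 : want);
  ElementwiseAddKernel<<<blocks, threads, 0, stream>>>(a, b, c, n);
}
```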
flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt (1)

6-19: Consider explicitly including `<cuda_runtime.h>` for CUDA types/functions.

This code uses CUDA runtime types and functions (`cudaStream_t`, `cudaError_t`, `cudaGetLastError`, `cudaGetErrorString`) but only includes TVM FFI headers. Given it already compiles, those headers are probably pulling in the CUDA headers transitively, but adding an explicit:

```cpp
#include <cuda_runtime.h>
```

near the top (as in the `o3` variant) would make the dependency clearer and less fragile to upstream header changes.

flashinfer_bench/agents/test_existing_prompt_sol.py (2)
50-84: Deduplicate `test_kernel` between this script and `test_prompt.py`

`test_kernel` here is functionally identical to the one in `flashinfer_bench/agents/test_prompt.py` (same tensor shapes, CUDA device usage, and assertions). Keeping two copies invites drift if you change test logic later (e.g., different tensor sizes or tolerances).

Consider moving `test_kernel` into a small shared helper module (e.g., `agents/test_utils.py`) and importing it from both scripts.
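A minimal sketch of what such a shared helper could look like. The module path, tensor sizes, and exported function name are assumptions based on the descriptions in this review, not the actual code in the PR:

```python
# flashinfer_bench/agents/test_utils.py (hypothetical shared module)
import torch

def test_kernel(mod, small_size: int = 1024, large_size: int = 1 << 20) -> bool:
    """Run the compiled elementwise-add module on a small and a large tensor."""
    device = torch.device("cuda")
    for n in (small_size, large_size):
        a = torch.randn(n, dtype=torch.float32, device=device)
        b = torch.randn(n, dtype=torch.float32, device=device)
        c = torch.empty_like(a)
        mod.elementwise_add(a, b, c)  # hypothetical export name
        torch.testing.assert_close(c, a + b)
    return True
```

Both scripts would then import it with `from flashinfer_bench.agents.test_utils import test_kernel`, keeping any future change to sizes or tolerances in one place.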
110-183: Tighten exception handling in `test_saved_code` for clearer failure modes

`test_saved_code` wraps both compilation and high-level file processing in broad `except Exception` blocks. For a CLI this is workable, but it makes it harder to distinguish parse errors, compilation issues, and runtime test failures.

You could:
- Keep the outermost `except Exception` as a final safety net.
- Narrow the inner `except` to known failure types (e.g., `RuntimeError` or specific exceptions from `tvm_ffi.cpp.load_inline`/`torch`), or at least log more structured context (file name, stage) before returning.

This keeps the user-friendly summary while improving debuggability and aligns better with the Ruff hints (BLE001/TRY300).
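A rough sketch of the staged structure being suggested. The helper names (`extract_code_from_results`, `test_kernel`), the result-dict keys, and the exception types caught per stage are assumptions to be adapted to the real `test_saved_code`:

```python
from pathlib import Path
import tvm_ffi

def test_saved_code(path: str) -> dict:
    results = {"file": path, "compilation": "FAILED"}
    try:
        text = Path(path).read_text()
        code = extract_code_from_results(text)       # stage 1: parse the saved results file
    except (OSError, ValueError) as e:
        results["error"] = f"parse: {e}"
        return results
    try:
        mod = tvm_ffi.cpp.load_inline(name="saved_kernel", cuda_sources=code)  # stage 2: compile
        results["compilation"] = "SUCCESS"
    except Exception as e:                            # narrow to load_inline's real failure type
        results["error"] = f"compile: {e}"
        return results
    try:
        passed = test_kernel(mod)                     # stage 3: run the small/large tensor checks
        results["test_small"] = results["test_large"] = passed
    except Exception as e:                            # final safety net, tagged with its stage
        results["error"] = f"runtime: {e}"
    return results
```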
flashinfer_bench/agents/test_prompt.py (2)
31-83: Make prompt handling consistent between `test_model`, `call_openai_model`, and `call_anthropic_model`

Right now `test_model` builds `full_prompt`, but:
- `call_openai_model` uses `prompt` only for `["o3", "o4-mini-2025-04-16"]`; other OpenAI models ignore the `prompt` parameter and instead hardcode `ELEMENTWISE_ADD_PROMPT`/`FFI_PROMPT_SIMPLE`.
- `call_anthropic_model` ignores its `prompt` parameter entirely and also hardcodes `ELEMENTWISE_ADD_PROMPT`/`FFI_PROMPT_SIMPLE`.

For clarity and easier experimentation with different prompts, it would be cleaner to have both helpers accept and actually use a fully-formed `prompt` (or `(system, user)` pair) from `test_model`, and drop unused parameters. That also resolves the Ruff ARG001 warning on the unused `prompt` argument.
test_modeluses broadexcept Exceptionblocks both around the main body and around theload_inlinecall. For a CLI harness this is acceptable, but you might get better diagnostics by:
- Catching expected failures explicitly (e.g., missing API keys, HTTP/API errors,
tvm_ffi.cpp.load_inlinecompilation problems) and emitting more specific messages.- Keeping a final
except Exceptionas a last-resort catch.Also, minor non-blocking cleanups:
- At line 183,
f"Compilation: SUCCESS\n"doesn’t need to be an f-string.- Similarly, other f-strings without placeholders (e.g., the
"Compilation: FAILED\n"in the failure path) can be plain strings.These are small polish items but would align with Ruff’s hints and make error handling more intentional.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (14)
- .gitignore (1 hunks)
- examples/ffi/Example-FlashInfer-Trace/definitions/gemm_n4096_k4096.json (1 hunks)
- examples/ffi/Example-FlashInfer-Trace/workloads/gemm_n4096_k4096.jsonl (1 hunks)
- examples/ffi/agent.md (1 hunks)
- examples/ffi/e2e_kernel.py (1 hunks)
- flashinfer_bench/agents/ffi_prompt.py (1 hunks)
- flashinfer_bench/agents/load_inline.py (1 hunks)
- flashinfer_bench/agents/test_existing_prompt_sol.py (1 hunks)
- flashinfer_bench/agents/test_prompt.py (1 hunks)
- flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt (1 hunks)
- flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt (1 hunks)
- flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt (1 hunks)
- flashinfer_bench/agents/test_results/o3.txt (1 hunks)
- flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
flashinfer_bench/agents/test_prompt.py (2)
- flashinfer_bench/agents/test_existing_prompt_sol.py (2)
  - test_kernel (50-84)
  - main (186-233)
- flashinfer_bench/agents/load_inline.py (1)
  - main (39-41)

flashinfer_bench/agents/test_existing_prompt_sol.py (2)
- flashinfer_bench/agents/test_prompt.py (2)
  - test_kernel (98-132)
  - main (219-257)
- flashinfer_bench/agents/load_inline.py (1)
  - main (39-41)

examples/ffi/e2e_kernel.py (4)
- flashinfer_bench/bench/benchmark.py (2)
  - Benchmark (16-181)
  - run_all (61-181)
- flashinfer_bench/bench/config.py (1)
  - BenchmarkConfig (8-64)
- flashinfer_bench/data/trace_set.py (2)
  - TraceSet (23-477)
  - from_path (85-145)
- flashinfer_bench/apply/apply_api.py (2)
  - enable_apply (142-180)
  - disable_apply (183-192)
🪛 GitHub Actions: .github/workflows/linting.yaml
examples/ffi/agent.md
[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.
[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.
[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.
examples/ffi/e2e_kernel.py
[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.
[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.
[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.
[error] 1-1: pre-commit isort formatting check failed. File was reformatted by this hook.
🪛 LanguageTool
flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt
[style] ~83-~83: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4073 characters long)
Context: ...dev = a.device(); if (dev.device_type != kDLCUDA) { TVM_FFI_THROW(RuntimeError) << "elementwise_add: only CUDA device supported"; } if (b.device().device_type != dev.device_type || b.device().device_id != dev.device_id) { TVM_FFI_THROW(ValueError) << "elementwise_add: b must be on the same CUDA device as a"; } if (c.device().device_type != dev.device_type || c.device().device_id != dev.device_id) { TVM_FFI_THROW(Val...
(EN_EXCESSIVE_EXCLAMATION)
flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt
[style] ~71-~71: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~97-~97: Using many exclamation marks might seem excessive (in this case: 13 exclamation marks for a text that’s 3750 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
flashinfer_bench/agents/test_results/o3.txt
[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~123-~123: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4501 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt
[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...
(EN_EXCESSIVE_EXCLAMATION)
flashinfer_bench/agents/test_results/claude-opus-4-1-20250805.txt
[style] ~110-~110: Using many exclamation marks might seem excessive (in this case: 23 exclamation marks for a text that’s 4049 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
🪛 Ruff (0.14.4)
flashinfer_bench/agents/test_prompt.py
43-43: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Unused function argument: prompt
(ARG001)
130-130: Do not catch blind exception: Exception
(BLE001)
151-151: Abstract raise to an inner function
(TRY301)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
183-183: f-string without any placeholders
Remove extraneous f prefix
(F541)
191-197: Consider moving this statement to an else block
(TRY300)
199-199: Do not catch blind exception: Exception
(BLE001)
204-204: f-string without any placeholders
Remove extraneous f prefix
(F541)
205-205: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Do not catch blind exception: Exception
(BLE001)
flashinfer_bench/agents/load_inline.py
40-40: Local variable mod is assigned to but never used
Remove assignment to unused variable mod
(F841)
flashinfer_bench/agents/test_existing_prompt_sol.py
45-45: Avoid specifying long messages outside the exception class
(TRY003)
82-82: Do not catch blind exception: Exception
(BLE001)
151-158: Consider moving this statement to an else block
(TRY300)
160-160: Do not catch blind exception: Exception
(BLE001)
176-176: Do not catch blind exception: Exception
(BLE001)
examples/ffi/e2e_kernel.py
17-17: f-string without any placeholders
Remove extraneous f prefix
(F541)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: Run unit tests on ubuntu-latest and Python 3.12
- GitHub Check: Run unit tests on ubuntu-latest and Python 3.10
- GitHub Check: Run unit tests on ubuntu-latest and Python 3.13
- GitHub Check: Run unit tests on ubuntu-latest and Python 3.11
🔇 Additional comments (7)
.gitignore (1)
50-51: LGTM! Adding `.env` to `.gitignore` is a security best practice to prevent accidental commits of environment variables and secrets. Including `AGENTS.md` aligns with the PR's agent-focused changes. Both entries are appropriately formatted and necessary.

flashinfer_bench/agents/load_inline.py (1)

8-36: LGTM! The inline CUDA source is well-structured with proper TVM FFI headers, correct stream management via `TVMFFIEnvGetStream`, and an appropriate FFI export binding.

examples/ffi/Example-FlashInfer-Trace/definitions/gemm_n4096_k4096.json (1)

1-48: LGTM! The GEMM definition is well-structured with correct shapes, data types, and a clear reference implementation. The operation `C = A @ B.T` is properly specified.

examples/ffi/Example-FlashInfer-Trace/workloads/gemm_n4096_k4096.jsonl (1)

1-43: LGTM! The workload file provides comprehensive test coverage with M values ranging from 1 to 8192, including edge cases and realistic sizes. All entries properly reference the `gemm_n4096_k4096` definition.

flashinfer_bench/agents/test_results/gpt-5-2025-08-07.txt (1)

1-111: LGTM! The generated elementwise addition kernel demonstrates proper TVM FFI usage with comprehensive input validation, correct error handling, and appropriate CUDA stream management. The test results confirm successful compilation and execution.

flashinfer_bench/agents/test_results/o4-mini-2025-04-16.txt (1)

1-122: LGTM! The generated code demonstrates correct TVM FFI patterns with thorough validation using `TVM_FFI_ICHECK` and `TVM_FFI_THROW`, proper stream retrieval, and successful test execution.

flashinfer_bench/agents/ffi_prompt.py (1)

1-1868: LGTM! The FFI prompt definitions provide comprehensive TVM FFI documentation for agent-based kernel generation. The prompts include:
- Essential API reference with TensorView, error handling, and stream management
- Complete examples demonstrating proper FFI usage patterns
- Multiple levels of detail (SIMPLE, FULL, and standard PROMPT)

These prompts will effectively guide LLM agents to generate correct TVM FFI-compatible CUDA kernels.
examples/ffi/agent.md
Outdated
| # Agent Instructions: CUDA GEMM Implementation with TVM FFI | ||
|
|
||
| ## Task Overview | ||
|
|
||
| Write a complete CUDA implementation solving the GEMM definition `gemm_n4096_k4096` and output it as a JSON file to: | ||
|
|
||
| **Output Path**: `Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json` | ||
|
|
||
| The implementation must use TVM FFI bindings and conform to the Solution JSON schema. | ||
|
|
||
| ## Target Operation | ||
|
|
||
| **Operation**: General Matrix Multiply (GEMM) | ||
| **Formula**: `C = A @ B.T` | ||
| **Shapes**: | ||
| - A: `[M, K]` where M is variable, K = 4096 | ||
| - B: `[N, K]` where N = 4096, K = 4096 | ||
| - C: `[M, N]` (output) | ||
| - **Data type**: `float16` (FP16) | ||
|
|
||
| **Note**: This is computing `A @ B.T` (transpose of B), not `A @ B`. | ||
|
|
||
| ## Solution Structure Requirements | ||
|
|
||
| Your solution **must** include exactly 3 source files with these names: | ||
|
|
||
| 1. **`kernel.h`**: Header file with function declarations and shared definitions | ||
| 2. **`kernel.cu`**: CUDA kernel device code implementation | ||
| 3. **`main.cpp`**: TVM FFI host code with bindings | ||
|
|
||
| ## TVM FFI Requirements | ||
|
|
||
| ### Required Headers in main.cpp | ||
| ```cpp | ||
| #include <tvm/ffi/container/tensor.h> // TensorView: tensor arguments | ||
| #include <tvm/ffi/function.h> // TVM_FFI_DLL_EXPORT_TYPED_FUNC | ||
| #include <tvm/ffi/error.h> // TVM_FFI_ICHECK, TVM_FFI_THROW | ||
| #include <tvm/ffi/extra/c_env_api.h> // TVMFFIEnvGetStream | ||
| #include <cuda_fp16.h> | ||
| #include "kernel.h" | ||
| ``` | ||
|
|
||
| ### Function Signature | ||
| The exported function **must** be named `run` and match the definition's input/output names: | ||
|
|
||
| ```cpp | ||
| void run(tvm::ffi::TensorView A, tvm::ffi::TensorView B, tvm::ffi::TensorView C); | ||
| ``` | ||
|
|
||
| **Important**: The function takes A, B, and C as parameters. C is pre-allocated by the caller. | ||
|
|
||
| ### TVM FFI Binding | ||
| Use TVM_FFI_DLL_EXPORT_TYPED_FUNC to expose the function: | ||
|
|
||
| ```cpp | ||
| TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, run); | ||
| ``` | ||
|
|
||
| ### Input Validation | ||
| Validate inputs using TVM FFI error handling: | ||
|
|
||
| ```cpp | ||
| // Check dimensions | ||
| TVM_FFI_ICHECK_EQ(A.ndim(), 2) << "A must be 2D"; | ||
| TVM_FFI_ICHECK_EQ(B.ndim(), 2) << "B must be 2D"; | ||
| TVM_FFI_ICHECK_EQ(C.ndim(), 2) << "C must be 2D"; | ||
|
|
||
| // Check shapes | ||
| TVM_FFI_ICHECK_EQ(A.size(1), 4096) << "A.shape[1] must be 4096 (K)"; | ||
| TVM_FFI_ICHECK_EQ(B.size(0), 4096) << "B.shape[0] must be 4096 (N)"; | ||
| TVM_FFI_ICHECK_EQ(B.size(1), 4096) << "B.shape[1] must be 4096 (K)"; | ||
|
|
||
| // Check shape compatibility | ||
| TVM_FFI_ICHECK_EQ(A.size(1), B.size(1)) << "K dimension mismatch"; | ||
| TVM_FFI_ICHECK_EQ(C.size(0), A.size(0)) << "M dimension mismatch"; | ||
| TVM_FFI_ICHECK_EQ(C.size(1), B.size(0)) << "N dimension mismatch"; | ||
|
|
||
| // Check data types (float16) | ||
| TVM_FFI_ICHECK_EQ(A.dtype().code, kDLFloat) << "A must be float type"; | ||
| TVM_FFI_ICHECK_EQ(A.dtype().bits, 16) << "A must be float16"; | ||
| TVM_FFI_ICHECK_EQ(B.dtype().code, kDLFloat) << "B must be float type"; | ||
| TVM_FFI_ICHECK_EQ(B.dtype().bits, 16) << "B must be float16"; | ||
| TVM_FFI_ICHECK_EQ(C.dtype().code, kDLFloat) << "C must be float type"; | ||
| TVM_FFI_ICHECK_EQ(C.dtype().bits, 16) << "C must be float16"; | ||
|
|
||
| // Check device (must be CUDA) | ||
| TVM_FFI_ICHECK_EQ(A.device().device_type, kDLCUDA) << "A must be on CUDA"; | ||
| TVM_FFI_ICHECK_EQ(B.device().device_type, kDLCUDA) << "B must be on CUDA"; | ||
| TVM_FFI_ICHECK_EQ(C.device().device_type, kDLCUDA) << "C must be on CUDA"; | ||
| ``` | ||
|
|
||
| ### CUDA Stream Management | ||
| Get the CUDA stream from TVM FFI environment: | ||
|
|
||
| ```cpp | ||
| DLDevice dev = A.device(); | ||
| cudaStream_t stream = static_cast<cudaStream_t>( | ||
| TVMFFIEnvGetStream(dev.device_type, dev.device_id)); | ||
|
|
||
| // Launch kernel on the stream | ||
| kernel_launch<<<grid, block, 0, stream>>>(args...); | ||
| ``` | ||
|
|
||
| ### Memory Access | ||
| Access tensor data through TensorView API: | ||
|
|
||
| ```cpp | ||
| const __half* A_data = static_cast<const __half*>(A.data_ptr()); | ||
| const __half* B_data = static_cast<const __half*>(B.data_ptr()); | ||
| __half* C_data = static_cast<__half*>(C.data_ptr()); | ||
|
|
||
| int64_t M = A.size(0); | ||
| int64_t K = A.size(1); | ||
| int64_t N = B.size(0); | ||
| ``` | ||
|
|
||
| ## CUDA Kernel Implementation Guidelines | ||
|
|
||
| ### Recommended Approach | ||
| Implement a tiled GEMM kernel optimized for float16: | ||
|
|
||
| 1. **Use shared memory** for tile caching | ||
| 2. **Leverage Tensor Cores** if targeting modern GPUs (use `__half` or `half2`) | ||
| 3. **Thread block tiling**: Typical tile sizes like 128×128 or 256×128 | ||
| 4. **Handle transposition**: Since we compute `A @ B.T`, adjust memory access patterns | ||
|
|
||
| ### Kernel Signature Example | ||
| ```cpp | ||
| __global__ void gemm_kernel_device( | ||
| const half* __restrict__ A, | ||
| const half* __restrict__ B, | ||
| half* __restrict__ C, | ||
| int M, int N, int K | ||
| ); | ||
| ``` | ||
|
|
||
| ### Performance Considerations | ||
| - Use `__half` or `half2` types for FP16 operations | ||
| - Ensure coalesced memory access | ||
| - Minimize bank conflicts in shared memory | ||
| - Consider using warp-level primitives for reductions | ||
|
|
||
| ## File Organization | ||
|
|
||
| ### Required File Structure | ||
|
|
||
| **File 1: `kernel.h`** | ||
| - CUDA kernel function declarations | ||
| - Host launcher function declarations | ||
| - Shared constants and type definitions | ||
| - Include guards | ||
|
|
||
| **File 2: `kernel.cu`** | ||
| - `__global__` kernel implementations | ||
| - `__device__` helper functions | ||
| - Host-side kernel launcher function | ||
| - CUDA-specific optimizations (shared memory, tensor cores, etc.) | ||
|
|
||
| **File 3: `main.cpp`** | ||
| - TVM FFI bindings | ||
| - `run` function that matches definition signature: `void run(TensorView A, TensorView B, TensorView C)` | ||
| - Input validation using `TVM_FFI_ICHECK_*` macros | ||
| - Stream management via `TVMFFIEnvGetStream()` | ||
| - `TVM_FFI_DLL_EXPORT_TYPED_FUNC` for function export | ||
|
|
||
| ## JSON Schema Format | ||
|
|
||
| The output JSON must conform to the Solution schema and be written to: | ||
|
|
||
| **`Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json`** | ||
|
|
||
| ### JSON Structure | ||
|
|
||
| ```json | ||
| { | ||
| "name": "agent_vibecode_gemm", | ||
| "definition": "gemm_n4096_k4096", | ||
| "description": "High-performance CUDA GEMM implementation for C = A @ B.T using TVM FFI bindings", | ||
| "author": "vibecode-agent", | ||
| "spec": { | ||
| "language": "cuda", | ||
| "target_hardware": [ | ||
| "NVIDIA_H100", | ||
| "NVIDIA_A100" | ||
| ], | ||
| "dependencies": [], | ||
| "entry_point": "main.cpp::run" | ||
| }, | ||
| "sources": [ | ||
| { | ||
| "path": "kernel.h", | ||
| "content": "... complete header file content as string ..." | ||
| }, | ||
| { | ||
| "path": "kernel.cu", | ||
| "content": "... complete CUDA kernel code as string ..." | ||
| }, | ||
| { | ||
| "path": "main.cpp", | ||
| "content": "... complete TVM FFI binding code as string ..." | ||
| } | ||
| ] | ||
| } | ||
| ``` | ||
|
|
||
| ### Critical Schema Fields | ||
|
|
||
| | Field | Value | Notes | | ||
| |-------|-------|-------| | ||
| | `name` | `"agent_vibecode_gemm"` | Unique identifier for this solution | | ||
| | `definition` | `"gemm_n4096_k4096"` | **Must** match the definition name exactly | | ||
| | `language` | `"cuda"` | Lowercase, primary language | | ||
| | `target_hardware` | Array of strings | e.g., `["NVIDIA_H100", "NVIDIA_A100"]` | | ||
| | `entry_point` | `"main.cpp::run"` | Format: `{filename}::{function_name}` | | ||
| | `sources` | Array of 3 file objects | Each with `path` and `content` fields | | ||
|
|
||
| ### Entry Point Convention | ||
|
|
||
| The entry point specifies which function the benchmarker will call: | ||
| - Format: `"main.cpp::run"` | ||
| - The function `run` must be exposed via `TVM_FFI_DLL_EXPORT_TYPED_FUNC` | ||
| - The benchmarker will: | ||
| 1. Compile all source files into a TVM FFI shared library | ||
| 2. Load the compiled module using TVM FFI | ||
| 3. Call the `run` function with test inputs `A`, `B`, and pre-allocated `C` | ||
| 4. Validate the output C against the reference | ||
|
|
||
| ## Complete Implementation Example | ||
|
|
||
| Below is a skeleton showing the structure of all three files: | ||
|
|
||
| ### kernel.h | ||
| ```cpp | ||
| #ifndef GEMM_N4096_K4096_KERNEL_H | ||
| #define GEMM_N4096_K4096_KERNEL_H | ||
|
|
||
| #include <cuda_runtime.h> | ||
| #include <cuda_fp16.h> | ||
| #include <cstdint> | ||
|
|
||
| // Constants from definition | ||
| constexpr int GEMM_N_CONST = 4096; | ||
| constexpr int GEMM_K_CONST = 4096; | ||
|
|
||
| // Kernel launcher function | ||
| void gemm_n4096_k4096_launch( | ||
| const __half* A, | ||
| const __half* B, | ||
| __half* C, | ||
| int M, | ||
| cudaStream_t stream | ||
| ); | ||
|
|
||
| #endif // GEMM_N4096_K4096_KERNEL_H | ||
| ``` | ||
|
|
||
| ### kernel.cu | ||
| ```cpp | ||
| #include "kernel.h" | ||
| #include <mma.h> // For tensor cores | ||
|
|
||
| using namespace nvcuda; | ||
|
|
||
| // Kernel configuration | ||
| constexpr int BLOCK_M = 128; | ||
| constexpr int BLOCK_N = 256; | ||
| constexpr int BLOCK_K = 64; | ||
|
|
||
| __global__ void gemm_kernel( | ||
| const __half* __restrict__ A, | ||
| const __half* __restrict__ B, | ||
| __half* __restrict__ C, | ||
| int M | ||
| ) { | ||
| // Implement optimized GEMM with: | ||
| // - Shared memory tiling | ||
| // - WMMA/Tensor Core operations | ||
| // - Coalesced memory access | ||
| // - Proper synchronization | ||
|
|
||
| // C = A @ B.T | ||
| // A is [M, 4096], B is [4096, 4096], C is [M, 4096] | ||
| } | ||
|
|
||
| void gemm_n4096_k4096_launch( | ||
| const __half* A, | ||
| const __half* B, | ||
| __half* C, | ||
| int M, | ||
| cudaStream_t stream | ||
| ) { | ||
| if (M <= 0) return; | ||
|
|
||
| dim3 block(256); | ||
| dim3 grid((GEMM_N_CONST + BLOCK_N - 1) / BLOCK_N, | ||
| (M + BLOCK_M - 1) / BLOCK_M); | ||
|
|
||
| gemm_kernel<<<grid, block, 0, stream>>>(A, B, C, M); | ||
| } | ||
| ``` | ||
|
|
||
| ### main.cpp | ||
| ```cpp | ||
| #include <tvm/ffi/container/tensor.h> | ||
| #include <tvm/ffi/function.h> | ||
| #include <tvm/ffi/error.h> | ||
| #include <tvm/ffi/extra/c_env_api.h> | ||
| #include <cuda_fp16.h> | ||
| #include "kernel.h" | ||
|
|
||
| void run(tvm::ffi::TensorView A, tvm::ffi::TensorView B, tvm::ffi::TensorView C) { | ||
| // Input validation - dimensions | ||
| TVM_FFI_ICHECK_EQ(A.ndim(), 2) << "A must be 2D"; | ||
| TVM_FFI_ICHECK_EQ(B.ndim(), 2) << "B must be 2D"; | ||
| TVM_FFI_ICHECK_EQ(C.ndim(), 2) << "C must be 2D"; | ||
|
|
||
| // Get dimensions | ||
| int64_t M = A.size(0); | ||
| int64_t K = A.size(1); | ||
| int64_t N = B.size(0); | ||
|
|
||
| // Check shapes | ||
| TVM_FFI_ICHECK_EQ(K, 4096) << "A.shape[1] must be 4096 (K)"; | ||
| TVM_FFI_ICHECK_EQ(N, 4096) << "B.shape[0] must be 4096 (N)"; | ||
| TVM_FFI_ICHECK_EQ(B.size(1), 4096) << "B.shape[1] must be 4096 (K)"; | ||
| TVM_FFI_ICHECK_EQ(C.size(0), M) << "C.shape[0] must match A.shape[0] (M)"; | ||
| TVM_FFI_ICHECK_EQ(C.size(1), N) << "C.shape[1] must be 4096 (N)"; | ||
|
|
||
| // Check data types (float16) | ||
| TVM_FFI_ICHECK_EQ(A.dtype().code, kDLFloat) << "A must be float type"; | ||
| TVM_FFI_ICHECK_EQ(A.dtype().bits, 16) << "A must be float16"; | ||
| TVM_FFI_ICHECK_EQ(B.dtype().code, kDLFloat) << "B must be float type"; | ||
| TVM_FFI_ICHECK_EQ(B.dtype().bits, 16) << "B must be float16"; | ||
| TVM_FFI_ICHECK_EQ(C.dtype().code, kDLFloat) << "C must be float type"; | ||
| TVM_FFI_ICHECK_EQ(C.dtype().bits, 16) << "C must be float16"; | ||
|
|
||
| // Check device (must be CUDA) | ||
| TVM_FFI_ICHECK_EQ(A.device().device_type, kDLCUDA) << "A must be on CUDA"; | ||
| TVM_FFI_ICHECK_EQ(B.device().device_type, kDLCUDA) << "B must be on CUDA"; | ||
| TVM_FFI_ICHECK_EQ(C.device().device_type, kDLCUDA) << "C must be on CUDA"; | ||
|
|
||
| // Get data pointers | ||
| const __half* A_data = static_cast<const __half*>(A.data_ptr()); | ||
| const __half* B_data = static_cast<const __half*>(B.data_ptr()); | ||
| __half* C_data = static_cast<__half*>(C.data_ptr()); | ||
|
|
||
| // Get CUDA stream from TVM FFI environment | ||
| DLDevice dev = A.device(); | ||
| cudaStream_t stream = static_cast<cudaStream_t>( | ||
| TVMFFIEnvGetStream(dev.device_type, dev.device_id)); | ||
|
|
||
| // Launch kernel | ||
| gemm_n4096_k4096_launch(A_data, B_data, C_data, static_cast<int>(M), stream); | ||
| } | ||
|
|
||
| // Export the function with TVM FFI | ||
| TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, run); | ||
| ``` | ||
|
|
||
| ## Performance Optimization Guidelines | ||
|
|
||
| Your CUDA kernel should include: | ||
|
|
||
| 1. **Tensor Core Usage (WMMA)**: Use `nvcuda::wmma` for 16x16x16 matrix operations | ||
| 2. **Shared Memory Tiling**: Cache tiles of A and B in shared memory | ||
| 3. **Memory Coalescing**: Ensure threads access consecutive memory addresses | ||
| 4. **Bank Conflict Avoidance**: Add padding to shared memory arrays | ||
| 5. **Compute Intensity**: Maximize compute-to-memory-access ratio | ||
| 6. **Register Optimization**: Minimize register usage for higher occupancy | ||
| 7. **Stream Pipelining**: Overlap compute and memory operations | ||
|
|
||
| ## Output Format | ||
|
|
||
| Write the complete JSON solution to: | ||
| **`Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json`** | ||
|
|
||
| The JSON must be valid and contain: | ||
| - All required schema fields | ||
| - Complete source code for all 3 files in the `content` fields | ||
| - Properly escaped strings (use JSON encoding) | ||
|
|
||
| ## Validation Checklist | ||
|
|
||
| Before finalizing, verify: | ||
| - [ ] File names are exactly: `kernel.h`, `kernel.cu`, `main.cpp` | ||
| - [ ] Entry point is `"main.cpp::run"` | ||
| - [ ] Function signature: `void run(tvm::ffi::TensorView A, tvm::ffi::TensorView B, tvm::ffi::TensorView C)` | ||
| - [ ] TVM_FFI_DLL_EXPORT_TYPED_FUNC exposes the `run` function | ||
| - [ ] All three files included in `sources` array | ||
| - [ ] Input validation with `TVM_FFI_ICHECK_*` macros | ||
| - [ ] Kernel implements `C = A @ B.T` (transpose of B) | ||
| - [ ] Data type is `__half` (float16) | ||
| - [ ] CUDA stream from `TVMFFIEnvGetStream()` | ||
| - [ ] Checks that all tensors are on CUDA device | ||
| - [ ] JSON is valid and properly formatted | ||
| - [ ] All TVM FFI headers included correctly | ||
|
|
||
| ## Expected Agent Behavior | ||
|
|
||
| 1. **Read** the GEMM definition from `definitions/gemm_n4096_k4096.json` | ||
| 2. **Understand** the operation: `C = A @ B.T` with shapes [M,K] × [N,K] → [M,N] | ||
| 3. **Implement** a high-performance CUDA kernel with tiling and tensor cores | ||
| 4. **Create** TVM FFI bindings following the API guidelines | ||
| 5. **Package** all source code into the Solution JSON format | ||
| 6. **Write** the JSON to `Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json` | ||
|
|
||
| The JSON file should be ready to be consumed by the flashinfer-bench benchmarking system. | ||
|
|
||
| ## Summary | ||
|
|
||
| This agent.md provides complete instructions for generating a CUDA GEMM kernel implementation using TVM FFI bindings. The key points are: | ||
|
|
||
| - **3 files required**: `kernel.h`, `kernel.cu`, `main.cpp` | ||
| - **Entry point**: `main.cpp::run` with signature `void run(TensorView A, TensorView B, TensorView C)` | ||
| - **TVM FFI export**: Use `TVM_FFI_DLL_EXPORT_TYPED_FUNC(run, run)` | ||
| - **Validation**: Use `TVM_FFI_ICHECK_*` macros for input validation | ||
| - **Stream management**: Get stream via `TVMFFIEnvGetStream()` | ||
| - **Output**: Write complete JSON to `Example-FlashInfer-Trace/solutions/agent_vibecode_gemm.json` | ||
|
|
Fix formatting issues flagged by pre-commit hooks.
The linting pipeline has detected formatting issues in this file:
- Missing newline at end of file
- Trailing whitespace
- Code blocks need reformatting
These should be resolved to pass CI checks.
Run the following commands to fix:
```bash
#!/bin/bash
# Fix formatting issues automatically
pre-commit run --files examples/ffi/agent.md
```

🧰 Tools
🪛 GitHub Actions: .github/workflows/linting.yaml
[error] 1-1: pre-commit end-of-file-fixer failed. Files were modified by this hook.
[error] 1-1: pre-commit trailing-whitespace failed. Files were modified by this hook.
[error] 1-1: pre-commit black formatting check failed. File was reformatted by this hook.
🤖 Prompt for AI Agents
In examples/ffi/agent.md around lines 1 to 419, the file fails pre-commit
formatting: add a final newline at EOF, remove all trailing whitespace, and
reformat the fenced code blocks (ensure language tags and indentation are
correct); run the repository's pre-commit hooks or the provided command
(pre-commit run --files examples/ffi/agent.md) or your formatter (e.g.,
markdownlint/clang-format/prettier as configured) to automatically apply these
fixes and re-check CI.
examples/ffi/e2e_kernel.py
Outdated
```python
for def_name, traces in result_traceset.traces.items():
    print(f"\n{def_name}:")
    for trace in traces:
        print(f"  Solution: {trace.solution.name}")
        print(f"  Status: {trace.evaluation.status.value}")
        if trace.evaluation.performance:
            print(f"  Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")
```
|
Fix attribute access on solution.
Based on the Trace class definition, trace.solution is a string (the solution name), not an object with a .name attribute. Line 29 will raise an AttributeError at runtime.
Apply this diff:

```diff
 print("\nBenchmark Complete")
 for def_name, traces in result_traceset.traces.items():
     print(f"\n{def_name}:")
     for trace in traces:
-        print(f"  Solution: {trace.solution.name}")
+        print(f"  Solution: {trace.solution}")
         print(f"  Status: {trace.evaluation.status.value}")
         if trace.evaluation.performance:
             print(f"  Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
 for def_name, traces in result_traceset.traces.items():
     print(f"\n{def_name}:")
     for trace in traces:
-        print(f"  Solution: {trace.solution.name}")
+        print(f"  Solution: {trace.solution}")
         print(f"  Status: {trace.evaluation.status.value}")
         if trace.evaluation.performance:
             print(f"  Speedup: {trace.evaluation.performance.speedup_factor:.2f}x")
```
🤖 Prompt for AI Agents
In examples/ffi/e2e_kernel.py around lines 26 to 33, trace.solution is a string
(the solution name) not an object with a .name attribute, so accessing
trace.solution.name raises AttributeError; replace trace.solution.name with
trace.solution (e.g., print(f" Solution: {trace.solution}")) and keep the rest
of the printing logic unchanged.
```python
import re
from pathlib import Path
```
Remove unused imports.
The imports re and Path are not used anywhere in this file.
Apply this diff:

```diff
-import re
-from pathlib import Path
-
 import torch
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-import re
-from pathlib import Path
 import torch
```
🤖 Prompt for AI Agents
In flashinfer_bench/agents/load_inline.py lines 1-2 the imports "import re" and
"from pathlib import Path" are unused; remove these two import lines so the file
only imports what it actually uses (delete both lines or remove the unused
names), then run a quick lint/flake8 to confirm no unused-import warnings
remain.
```cpp
  // shape / size validation
  int64_t n = a.size(0);
  if (b.size(0) != n) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: size mismatch between 'a' and 'b' ("
                              << n << " vs " << b.size(0) << ")";
  }
  if (c.size(0) != n) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: output 'c' size must match inputs (expected "
                              << n << ", got " << c.size(0) << ")";
  }

  // Ensure tensors are contiguous for simple indexing
  if (!a.IsContiguous()) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: input 'a' must be contiguous";
  }
  if (!b.IsContiguous()) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: input 'b' must be contiguous";
  }
  if (!c.IsContiguous()) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: output 'c' must be contiguous";
  }

  // Ensure all tensors are on the same device
  DLDevice da_dev = a.device();
  DLDevice db_dev = b.device();
  DLDevice dc_dev = c.device();
  if (!(da_dev.device_type == db_dev.device_type && da_dev.device_id == db_dev.device_id &&
        da_dev.device_type == dc_dev.device_type && da_dev.device_id == dc_dev.device_id)) {
    TVM_FFI_THROW(ValueError) << "elementwise_add: tensors must be on the same device";
  }

  // Pointers to data
  float* a_ptr = static_cast<float*>(a.data_ptr());
  float* b_ptr = static_cast<float*>(b.data_ptr());
  float* c_ptr = static_cast<float*>(c.data_ptr());

  // Dispatch by device type
  if (da_dev.device_type == kDLCUDA) {
    // GPU path: launch CUDA kernel on environment-managed stream
    TVMFFIStreamHandle raw_stream = TVMFFIEnvGetStream(da_dev.device_type, da_dev.device_id);
    cudaStream_t stream = static_cast<cudaStream_t>(raw_stream);

    const int threads = 256;
    const int64_t blocks = (n + threads - 1) / threads;
    ElementwiseAddKernel<<<static_cast<int>(blocks), threads, 0, stream>>>(a_ptr, b_ptr, c_ptr, n);

    // Check kernel launch error
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
      TVM_FFI_THROW(RuntimeError) << "elementwise_add: CUDA kernel launch failed: "
                                  << cudaGetErrorString(err);
    }
```
🧩 Analysis chain
Handle n == 0 explicitly to avoid invalid CUDA launches
In this variant there’s no early-return for n == 0, so the CUDA path computes:
```cpp
const int threads = 256;
const int64_t blocks = (n + threads - 1) / threads;  // blocks == 0 when n == 0
ElementwiseAddKernel<<<static_cast<int>(blocks), threads, 0, stream>>>(...);
```

Launching a kernel with 0 blocks is invalid and will surface as a CUDA error instead of treating zero-length tensors as a no-op (the CPU path is fine because the loop simply doesn't run).
You can mirror the behavior of the o3 implementation and avoid spurious errors with:
```cpp
int64_t n = a.size(0);
// ... existing shape checks ...
if (n == 0) {
  return;
}
```

placed before the device-type dispatch.
---
🏁 Script executed:
```shell
#!/bin/bash
# Verify file existence and locate elementwise_add implementations
echo "=== Checking for the test results file ==="
find . -name "gpt-5-mini-2025-08-07.txt" -o -name "*elementwise_add*" | head -20
echo ""
echo "=== Searching for elementwise_add implementations ==="
rg -l "ElementwiseAddKernel" --type cpp --type cu
echo ""
echo "=== Looking for 'o3' variant references ==="
rg -n "o3.*elementwise\|elementwise.*o3" --type cpp --type cu
```

Length of output: 265
🌐 Web query:
CUDA kernel launch with zero blocks grid size invalid behavior
💡 Result:
Short answer: launching a kernel with any grid or block dimension equal to zero is invalid — the CUDA runtime returns an "invalid configuration" / CUDA_ERROR_INVALID_VALUE (or similar) and you should guard against it and skip the launch. See vendor docs for launch return codes and community discussion showing zero-dimension launches produce "invalid configuration" errors. [1][2][3]
Details and recommendations
- cuLaunchKernel and the runtime list invalid-value / invalid-configuration among possible return/error codes for improper launch parameters. Check launch return values (cudaGetLastError / return status) to detect this. [2]
- Community reports and Q&A show kernel<<<0, ...>>> or kernel<<<...,0>>> leads to "invalid configuration argument" and interferes with proper error checking; the recommended pattern is to test the dimensions in host code and skip the kernel call when gridDim or blockDim is zero. [1][3]
Minimal guard example (host-side):
- if (blocks > 0 && threads > 0) kernel<<<blocks, threads>>>(...);
References
- StackOverflow discussion about running zero blocks in CUDA (reports "invalid configuration argument"). [1]
- NVIDIA cuLaunchKernel documentation (lists CUDA_ERROR_INVALID_VALUE / invalid launch among returns). [2]
- CUDA C++ Programming Guide (general kernel launch rules / errors). [3]
---
🏁 Script executed:
```shell
#!/bin/bash
# Examine the test results file to understand its nature
echo "=== Content of test results file (first 20 lines) ==="
head -20 ./git/flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt
echo ""
echo "=== Search for elementwise_add in all files (broader search) ==="
rg "elementwise_add|ElementwiseAdd" --type-list | head -5
rg "elementwise_add|ElementwiseAdd" -g "!test_results" | head -30
echo ""
echo "=== Check agents directory structure ==="
find ./git/flashinfer_bench/agents -type f -name "*.txt" -o -name "*.cc" -o -name "*.h" | head -20
```

Length of output: 1460
🏁 Script executed:
```shell
#!/bin/bash
# Find repository root and structure
echo "=== Repository root and top-level structure ==="
pwd
ls -la | head -15
echo ""
echo "=== Find any .txt files with 'gpt' in name ==="
find . -name "*gpt*.txt" 2>/dev/null
echo ""
echo "=== List flashinfer_bench/agents directory structure ==="
find flashinfer_bench/agents -type d 2>/dev/null | head -20
echo ""
echo "=== Check if test_results directory exists under agents ==="
ls -la flashinfer_bench/agents/test_results 2>/dev/null || echo "test_results directory not found"
echo ""
echo "=== Full path of any file matching the review file name ==="
find . -name "*2025-08-07*" 2>/dev/null
```

Length of output: 1928
🏁 Script executed:
```shell
#!/bin/bash
# Examine the specific lines in the gpt-5-mini file
echo "=== Lines 56-107 of gpt-5-mini-2025-08-07.txt ==="
sed -n '56,107p' ./flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt
echo ""
echo "=== Full content of o3.txt (for comparison) ==="
cat ./flashinfer_bench/agents/test_results/o3.txt
echo ""
echo "=== Check if o3.txt has n==0 early return ==="
grep -n "n == 0\|n==0" ./flashinfer_bench/agents/test_results/o3.txt || echo "No n==0 check found in o3.txt"
```

Length of output: 7122
Add early-return guard for zero-length tensors to prevent invalid CUDA launches
The gpt-5-mini implementation lacks an n == 0 check before kernel dispatch. When n=0, the code computes blocks = (0 + 255) / 256 = 0 and attempts ElementwiseAddKernel<<<0, ...>>>, which is invalid and raises a CUDA error instead of treating empty tensors as a no-op.
The o3 implementation correctly includes this guard at line 106:
```cpp
if (n == 0) return;  // nothing to do
```

Add this check immediately after shape validation and before the device-type dispatch to align with the o3 pattern:
```cpp
int64_t n = a.size(0);
// ... existing shape and contiguity checks ...
if (n == 0) {
  return;
}

// Dispatch by device type
if (da_dev.device_type == kDLCUDA) {
  // ...
}
```

🧰 Tools
🪛 LanguageTool
[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...
(EN_EXCESSIVE_EXCLAMATION)
🤖 Prompt for AI Agents
In flashinfer_bench/agents/test_results/gpt-5-mini-2025-08-07.txt around lines
56 to 107, the implementation dispatches a CUDA kernel even when n == 0 which
yields a zero-block launch error; add an early-return guard immediately after
the shape/contiguity/device checks (and before the device-type dispatch) that
returns when n == 0 so empty tensors become a no-op and no kernel is launched.
Actionable comments posted: 1
♻️ Duplicate comments (7)
tests/agent/load_inline.py (2)
13-13: Previous review comment about kernel parameter type still applies. The kernel parameter `n` is `int`, which can truncate when the tensor size exceeds `INT_MAX`. As noted in a previous review, it should be `int64_t` to match the type of `x.size(0)`.

24-28: Previous review comment about launch configuration types still applies. The variables `blocks` and `threads` are declared as `int64_t` but are passed directly to the kernel launch configuration, which expects `dim3` or `unsigned int`. As noted in a previous review, this can lead to truncation and incorrect behavior for large inputs.

examples/ffi/e2e_kernel.py (2)
20-20: Previous review comment about unnecessary f-string still applies. The f-string has no placeholders, making the `f` prefix unnecessary. This was flagged in a previous review.

32-32: Previous review comment about attribute access still applies. As noted in a previous review, `trace.solution` is a string (the solution name), not an object with a `.name` attribute. This line will raise an `AttributeError` at runtime.

examples/ffi/agent.md (2)
250-250: Previous review comment about parameter type still applies.
As noted in a previous review, using int for the variable dimension M can lead to truncation if M exceeds INT_MAX. Using int64_t would be safer for this instructional example.
1-418: Previous review comment about formatting issues still applies.
As noted in a previous review, the file has formatting issues flagged by pre-commit hooks (missing newline at EOF, trailing whitespace). These should be resolved to pass CI checks.
tests/agent/test_results/gpt-5-mini-2025-08-07.txt (1)
56-107: Previous review comment about zero-length tensor guard still applies.
As noted in a previous review, this implementation lacks an n == 0 check before kernel dispatch, which can lead to invalid CUDA launches when n = 0. The recommendation to add an early-return guard before the device-type dispatch remains valid.
🧹 Nitpick comments (11)
tests/agent/test_existing_prompt_sol.py (4)
9-36: Regex-based extraction looks correct but is tightly coupled to the file format
The two-stage regex (code_pattern with the Test Results: sentinel, then alt_pattern) matches the current layout from test_prompt.py and gives a decent fallback when the results section is missing. Just be aware this is tightly coupled to the exact header strings ("Generated Code:", 80 '=', "Test Results:"); any future format drift will break parsing. If you expect multiple formats long-term, consider centralizing these constants or using a small parser instead of hard-coding the patterns in multiple places, as in the sketch below.
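A minimal sketch of that centralization, assuming the header strings quoted above; the module name result_format.py and the helper extract_generated_code are hypothetical:
# result_format.py - hypothetical shared module owning the results-file layout
import re
from typing import Optional

GENERATED_CODE_HEADER = "Generated Code:"
SEPARATOR = "=" * 80
TEST_RESULTS_HEADER = "Test Results:"

def extract_generated_code(text: str) -> Optional[str]:
    # Primary pattern: code sits between the generated-code header and the results header.
    primary = re.compile(
        re.escape(GENERATED_CODE_HEADER) + r"\s*" + re.escape(SEPARATOR)
        + r"\n(.*?)\n" + re.escape(SEPARATOR) + r"\s*" + re.escape(TEST_RESULTS_HEADER),
        re.DOTALL,
    )
    match = primary.search(text)
    if match:
        return match.group(1).strip()
    # Fallback: no results section yet, so take everything after the header block.
    fallback = re.compile(
        re.escape(GENERATED_CODE_HEADER) + r"\s*" + re.escape(SEPARATOR) + r"\n(.*)",
        re.DOTALL,
    )
    match = fallback.search(text)
    return match.group(1).strip() if match else None
Both scripts could then import these constants, so a format change only has to be made in one place.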
39-73: Consolidate test_kernel and consider narrowing the exception
test_kernel here is effectively the same as in tests/agent/test_prompt.py, which means any future change to the test semantics must be done in two places. Consider moving this helper to a shared module (e.g., tests/agent/utils.py) and importing it in both scripts.
Also, except Exception will swallow unexpected issues (e.g., configuration bugs) and only surface a boolean. If you care about distinguishing test failures from infrastructure errors, you might narrow the exception type or re-raise after logging in the outer orchestration. A sketch combining both ideas follows.
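A minimal sketch of such a shared helper, assuming the elementwise_add contract described in the prompt; the module path tests/agent/utils.py and the KernelTestError name are hypothetical:
# tests/agent/utils.py - hypothetical shared helper imported by both scripts
import torch

class KernelTestError(RuntimeError):
    """Wrong numerical results, as opposed to compilation or infrastructure failures."""

def test_kernel(module, size: int = 1024, atol: float = 1e-5) -> bool:
    a = torch.rand(size, dtype=torch.float32, device="cuda")
    b = torch.rand(size, dtype=torch.float32, device="cuda")
    c = torch.empty_like(a)
    module.elementwise_add(a, b, c)  # destination-passing call, per the prompt contract
    torch.cuda.synchronize()
    if not torch.allclose(c, a + b, atol=atol):
        max_err = (c - (a + b)).abs().max().item()
        raise KernelTestError(f"elementwise_add mismatch, max abs error {max_err:.3e}")
    return True
Unexpected errors (bad configuration, missing CUDA, etc.) then propagate instead of being folded into a False return.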
76-97: Result-shape assumptions in update_test_results_in_file
update_test_results_in_file assumes results always has a 'compilation' key and either 'error' or both 'test_small' and 'test_large'. That matches current call sites, but it's easy to break accidentally. If you expect to extend the result schema (e.g., add more tests), consider one of the following (a sketch follows the list):
- validating the presence of required keys up front, or
- making this function accept a structured object (dataclass/dict with defaults) rather than ad‑hoc keys.
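A minimal sketch of the structured-object option, assuming the fields currently written to the results files; the KernelTestResults name is hypothetical:
# Hypothetical structured result replacing the ad-hoc dict
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class KernelTestResults:
    compilation: bool = False
    error: Optional[str] = None
    test_small: Optional[bool] = None
    test_large: Optional[bool] = None

    def summary_lines(self) -> List[str]:
        lines = ["Compilation: " + ("success" if self.compilation else "failed")]
        if self.error is not None:
            lines.append(f"Error: {self.error}")
        else:
            lines.append("Small test: " + ("pass" if self.test_small else "fail"))
            lines.append("Large test: " + ("pass" if self.test_large else "fail"))
        return lines
The writer then only has to consume summary_lines(), and new fields can be added with defaults without touching every call site.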
99-172: Nested try/except blocks and broad exception handling
test_saved_code has an outer try/except Exception and an inner try/except Exception around compilation. This works, but it makes control flow harder to follow and can hide unexpected failures under generic "error" statuses.
Two possible simplifications:
- Factor compilation + test execution into a helper that returns a structured result, and let test_saved_code just orchestrate and write results.
- Narrow the inner except (e.g., to compilation-related errors) and let truly unexpected exceptions propagate to the outer handler or even to the top level when running as a script.
This would make debugging failures easier without changing current behavior.
tests/agent/test_prompt.py (7)
19-28: Prompt definition is clear but could be centralized with FFI prompt usage
ELEMENTWISE_ADD_PROMPT clearly states the kernel contract. Since you also have FFI_PROMPT_SIMPLE providing FFI-side requirements, consider documenting (in a comment near here) that these two prompts must stay in sync with the expectations in examples/ffi/agent.md and the FFI templates. That will help future contributors avoid drifting the human-readable description from the FFI constraints.
31-43: Model configuration is fine; consider validating API keys explicitly
get_model_config cleanly maps known model names to providers. One possible improvement is to check that the relevant API key is non-empty and raise a clearer error (or skip the model) rather than letting the downstream client fail with a less obvious message when keys are missing. A minimal sketch of such a check follows.
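A small sketch of that validation, assuming the usual OPENAI_API_KEY / ANTHROPIC_API_KEY environment variables; the helper name require_api_key is hypothetical:
import os

_PROVIDER_ENV = {"openai": "OPENAI_API_KEY", "anthropic": "ANTHROPIC_API_KEY"}

def require_api_key(provider: str) -> str:
    # Fail early with a clear message instead of a confusing client-side error.
    env_var = _PROVIDER_ENV[provider]
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set; cannot call {provider} models")
    return key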
46-68: Unify prompt handling for OpenAI models
Right now:
- For "o3"/"o4-mini-2025-04-16", call_openai_model uses the passed prompt as the sole user message.
- For other models, it ignores the prompt parameter and reconstructs messages from ELEMENTWISE_ADD_PROMPT and FFI_PROMPT_SIMPLE.
This works but makes test_model's full_prompt slightly misleading and couples prompting strategy to the caller and callee in different ways.
A cleaner pattern would be one of:
- Have test_model construct a single user_prompt and let call_openai_model always use that (with a per-model system message if needed), or
- Let call_openai_model fully own the prompt structure and drop the prompt argument from the signature.
Either way, keeping a single source of truth for prompt composition will make future prompt tweaks less error-prone. A sketch of the first option follows.
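A minimal sketch of the first option, assuming the openai Python client and the existing prompt constants; the exact message layout per model is an assumption, not the PR's implementation:
# Hypothetical unified wiring: test_model composes the prompt once,
# call_openai_model always forwards it unchanged.
def build_user_prompt() -> str:
    return ELEMENTWISE_ADD_PROMPT + "\n\n" + FFI_PROMPT_SIMPLE

def call_openai_model(model_name: str, prompt: str) -> str:
    from openai import OpenAI

    client = OpenAI()
    response = client.chat.completions.create(
        model=model_name,
        messages=[{"role": "user", "content": prompt}],  # single source of truth
    )
    return response.choices[0].message.content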
70-82: prompt parameter is unused in call_anthropic_model
call_anthropic_model ignores its prompt argument and always sends FFI_PROMPT_SIMPLE as the user content. Given test_model currently passes full_prompt, this is surprising.
Two options:
- Use the prompt parameter in the Anthropic call (e.g., messages=[{"role": "user", "content": prompt}]), or
- Remove the parameter and adjust the caller to make it clear Anthropic uses a fixed prompt layout.
Either change would make the interface less confusing and align with how you intend to manage prompts across providers.
85-95: CUDA code extraction heuristic may skip valid kernels or over-include text
The heuristic prefers fenced code blocks containing TVM_FFI or TensorView, otherwise it falls back to returning the entire response. That's reasonable for current prompts, but:
- If a valid kernel omits these tokens (e.g., uses different macros or pure CUDA), you'll end up sending the whole response, including prose, to the compiler.
- If multiple code blocks exist, you always pick the first that matches those tokens, which may not be the actual kernel.
Consider an additional fallback: if no matching block is found, use the first fenced block, and only as a last resort return the full response. That keeps you resilient to prompt or model changes while still prioritizing FFI-aware code. A sketch of this tiered fallback follows.
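A minimal sketch of the tiered fallback; the regex for fenced blocks is an assumption about the response format:
import re

def extract_cuda_code(response: str) -> str:
    # Tier 1: fenced block that looks FFI-aware; Tier 2: any fenced block; Tier 3: full response.
    blocks = re.findall(r"```(?:\w+)?\n(.*?)```", response, re.DOTALL)
    for block in blocks:
        if "TVM_FFI" in block or "TensorView" in block:
            return block.strip()
    if blocks:
        return blocks[0].strip()
    return response.strip()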
98-132: Duplicate test_kernel helper across files
As in test_existing_prompt_sol.py, test_kernel is duplicated here. Since this is core to validating LLM-generated kernels, having a single shared implementation (e.g., imported from a small tests/agent/kernel_tests.py module) would reduce the risk of the two scripts drifting apart over time.
135-217: test_model orchestration is solid; consider simplifying error handling and prompt wiring
The high-level flow in test_model (get config → call model → extract code → write file → compile → run tests) is clear and the returned result dict is easy to consume.
Two optional improvements:
- Error handling: there are three layers of try/except Exception. You might keep the inner compilation try/except (to always write results), but consider narrowing exception types or letting truly unexpected exceptions bubble out of the outermost block so they fail the script loudly during development.
- Prompt wiring: given the comments above on call_openai_model/call_anthropic_model, you could move prompt composition entirely into those helpers and drop full_prompt here, reducing duplication and making it clearer where to update prompts in the future.
Neither change is required for correctness, but they would make the script easier to extend.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (10)
examples/ffi/agent.md (1 hunks)
examples/ffi/e2e_kernel.py (1 hunks)
tests/agent/load_inline.py (1 hunks)
tests/agent/test_existing_prompt_sol.py (1 hunks)
tests/agent/test_prompt.py (1 hunks)
tests/agent/test_results/claude-opus-4-1-20250805.txt (1 hunks)
tests/agent/test_results/gpt-5-2025-08-07.txt (1 hunks)
tests/agent/test_results/gpt-5-mini-2025-08-07.txt (1 hunks)
tests/agent/test_results/o3.txt (1 hunks)
tests/agent/test_results/o4-mini-2025-04-16.txt (1 hunks)
🧰 Additional context used
🧬 Code graph analysis (3)
tests/agent/test_existing_prompt_sol.py (2)
  tests/agent/test_prompt.py (1)
    main (219-257)
  tests/agent/load_inline.py (1)
    main (36-44)
tests/agent/test_prompt.py (2)
  tests/agent/test_existing_prompt_sol.py (2)
    test_kernel (39-73)
    main (175-222)
  tests/agent/load_inline.py (1)
    main (36-44)
examples/ffi/e2e_kernel.py (4)
  flashinfer_bench/bench/benchmark.py (2)
    Benchmark (16-181)
    run_all (61-181)
  flashinfer_bench/bench/config.py (1)
    BenchmarkConfig (8-64)
  flashinfer_bench/data/trace_set.py (2)
    TraceSet (23-477)
    from_path (85-145)
  flashinfer_bench/apply/apply_api.py (2)
    disable_apply (183-192)
    enable_apply (142-180)
🪛 LanguageTool
tests/agent/test_results/gpt-5-2025-08-07.txt
[style] ~71-~71: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~97-~97: Using many exclamation marks might seem excessive (in this case: 13 exclamation marks for a text that’s 3750 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
tests/agent/test_results/gpt-5-mini-2025-08-07.txt
[style] ~82-~82: Using many exclamation marks might seem excessive (in this case: 12 exclamation marks for a text that’s 4202 characters long)
Context: ...; DLDevice dc_dev = c.device(); if (!(da_dev.device_type == db_dev.device_typ...
(EN_EXCESSIVE_EXCLAMATION)
tests/agent/test_results/o3.txt
[style] ~101-~101: Three successive sentences begin with the same word. Consider rewording the sentence or use a thesaurus to find a synonym.
Context: ...Contiguous() || !b.IsContiguous() || !c.IsContiguous()) { TVM_FFI_THROW(ValueError) << "...
(ENGLISH_WORD_REPEAT_BEGINNING_RULE)
[style] ~123-~123: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4501 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
tests/agent/test_results/o4-mini-2025-04-16.txt
[style] ~83-~83: Using many exclamation marks might seem excessive (in this case: 19 exclamation marks for a text that’s 4073 characters long)
Context: ...dev = a.device(); if (dev.device_type != kDLCUDA) { TVM_FFI_THROW(RuntimeError) << "elementwise_add: only CUDA device supported"; } if (b.device().device_type != dev.device_type || b.device().device_id != dev.device_id) { TVM_FFI_THROW(ValueError) << "elementwise_add: b must be on the same CUDA device as a"; } if (c.device().device_type != dev.device_type || c.device().device_id != dev.device_id) { TVM_FFI_THROW(Val...
(EN_EXCESSIVE_EXCLAMATION)
tests/agent/test_results/claude-opus-4-1-20250805.txt
[style] ~110-~110: Using many exclamation marks might seem excessive (in this case: 23 exclamation marks for a text that’s 4049 characters long)
Context: ...r_t err = cudaGetLastError(); if (err != cudaSuccess) { TVM_FFI_THROW(Runti...
(EN_EXCESSIVE_EXCLAMATION)
🪛 Ruff (0.14.4)
tests/agent/test_existing_prompt_sol.py
34-34: Avoid specifying long messages outside the exception class
(TRY003)
71-71: Do not catch blind exception: Exception
(BLE001)
140-147: Consider moving this statement to an else block
(TRY300)
149-149: Do not catch blind exception: Exception
(BLE001)
165-165: Do not catch blind exception: Exception
(BLE001)
tests/agent/test_prompt.py
43-43: Avoid specifying long messages outside the exception class
(TRY003)
70-70: Unused function argument: prompt
(ARG001)
130-130: Do not catch blind exception: Exception
(BLE001)
151-151: Abstract raise to an inner function
(TRY301)
151-151: Avoid specifying long messages outside the exception class
(TRY003)
191-197: Consider moving this statement to an else block
(TRY300)
199-199: Do not catch blind exception: Exception
(BLE001)
205-205: Use explicit conversion flag
Replace with conversion flag
(RUF010)
214-214: Do not catch blind exception: Exception
(BLE001)
examples/ffi/e2e_kernel.py
20-20: f-string without any placeholders
Remove extraneous f prefix
(F541)
🔇 Additional comments (5)
tests/agent/test_results/o3.txt (1)
1-141: Excellent reference implementation.
This generated code demonstrates comprehensive best practices:
- Thorough input validation (dimensionality, dtype, shape, device, contiguity)
- Zero-length tensor guard (line 106) to avoid invalid CUDA launches
- Proper CUDA stream management via TVM FFI
- Debug-mode kernel launch error checking
- Clear structure and documentation
This serves as a strong reference for evaluating other model outputs.
tests/agent/test_results/claude-opus-4-1-20250805.txt (1)
1-124: Solid implementation with proper validation.
This implementation correctly handles:
- Input validation for dimensions, dtype, shapes, device placement, and contiguity
- Empty tensor edge case (lines 92-94)
- CUDA stream retrieval from TVM environment
- Kernel launch error detection
The code follows TVM FFI patterns appropriately and includes comprehensive error reporting.
tests/agent/test_results/gpt-5-2025-08-07.txt (1)
1-111: Well-optimized implementation.
This implementation includes several optimizations and best practices:
- Use of
__restrict__ pointers (lines 16-18) for potential compiler optimizations
- Proper type casting to
unsigned int for block count (line 89)
- Zero-length tensor guard (line 81)
- Comprehensive input validation
The code correctly implements TVM FFI patterns and includes appropriate error checking.
tests/agent/test_existing_prompt_sol.py (1)
175-222: Main re-test loop is straightforward and readable
The main() function's discovery of test_results/*.txt and the final summary output are clear and match the expectations set by the file format. The truncation of long error messages is also a nice touch for CLI readability.
tests/agent/test_prompt.py (1)
219-257: Main driver is clear and matches the companion re-test script
The main loop over models, creation of the test_results directory, and the final summary output mirror the workflow in test_existing_prompt_sol.main(). This symmetry makes it straightforward to understand how new runs produce .txt artifacts that the re-test script later consumes.
// 6) Get CUDA stream from TVM runtime
cudaStream_t stream =
    static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,
                                                 dev.device_id));

// 7) Launch kernel
constexpr int THREADS = 256;
int blocks = static_cast<int>((n + THREADS - 1) / THREADS);
ElementwiseAddKernel<<<blocks, THREADS, 0, stream>>>(
    a_ptr, b_ptr, c_ptr, n);
Add zero-length tensor guard to prevent invalid CUDA launches.
When n == 0, the code computes blocks = (0 + 255) / 256 = 0 and attempts to launch a kernel with 0 blocks, which is invalid and will raise a CUDA error. The reference implementation in o3.txt (line 106) correctly includes an early-return guard for this case.
Add this check before the CUDA stream retrieval:
// 6) Get CUDA stream from TVM runtime
+ if (n == 0) {
+ return; // nothing to do
+ }
+
cudaStream_t stream =
static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,
                                                 dev.device_id));
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// 6) Get CUDA stream from TVM runtime
if (n == 0) {
  return;  // nothing to do
}
cudaStream_t stream =
    static_cast<cudaStream_t>(TVMFFIEnvGetStream(dev.device_type,
                                                 dev.device_id));

// 7) Launch kernel
constexpr int THREADS = 256;
int blocks = static_cast<int>((n + THREADS - 1) / THREADS);
ElementwiseAddKernel<<<blocks, THREADS, 0, stream>>>(
    a_ptr, b_ptr, c_ptr, n);
🤖 Prompt for AI Agents
In tests/agent/test_results/o4-mini-2025-04-16.txt around lines 103 to 112, add
a guard to return early when n == 0 before computing blocks or retrieving the
CUDA stream; this prevents computing blocks == 0 and launching a kernel with
zero blocks. Modify the function so it checks if n == 0 and returns (or no-ops)
immediately, then proceed to get the stream and launch the kernel only when n >
0. Ensure the check is placed before any CUDA-related calls to avoid invalid
CUDA launches or errors.
This PR introduces the FlashInfer-Bench agents module, a library of agent tools to aid kernel generation. The first of these is a set of TVM-FFI instruction prompts, available in two versions:
FFI_PROMPT_SIMPLE: provides instructions on the minimum modules agents need to generate correct TVM FFI bindings and a host-side kernel launch function, including TensorView objects, env-stream management, assertion checks via TVM_FFI_ICHECK, and the FFI binding macro.
FFI_PROMPT_FULL: provides all kernel-side C++ methods inside TVM FFI for more advanced use cases.
Model testing results can be found here: https://gist.github.com/zanderjiang/0c2255bfdac6f4090f0f41e8c80368c3
All models were able to write correct FFI bindings with the prompts.
In addition, the PR contains a vibe-coding example: an agent markdown prompt that allows coding agents (Cursor, ClaudeCode, Codex, etc.) to generate a complete solution with FFI bindings, as well as a minimal script to benchmark the generated kernel and apply it to a test input.
Summary by CodeRabbit
New Features
Documentation
Tests
Chores