Tiny fix bench tgv gemm #2277

Merged
yzh119 merged 3 commits into flashinfer-ai:main from bzhng-development:vz/erfactor-bpmm on Jan 5, 2026

Conversation

@vincentzed
Contributor

@vincentzed vincentzed commented Dec 31, 2025

📌 Description

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

flashinfer ❯ FLASHINFER_DISABLE_VERSION_CHECK=1 python benchmarks/bench_tgv_gemm.py
Starting BF16 TGV GEMM SM100 Tests
==================================================

=== Testing correctness ===
Cosine similarity: 1.000000
Max difference: 1.000000
Mean difference: 0.036133
✓ Correctness test PASSED

=== Testing tgv_gemm_bf16_sm100 with different sizes ===

--- deepseekv3, o_proj, tp=8: M=1, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006327 ms, 4.640 TFLOPS
2025-12-31 20:13:54,908 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:54,938 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005773 ms, 5.086 TFLOPS, speedup: 1.10x

Testing with PDL...
PDL average time: 0.004935 ms, 5.949 TFLOPS, speedup: 1.28x

--- deepseekv3, o_proj, tp=8: M=4, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005939 ms, 19.773 TFLOPS
2025-12-31 20:13:56,853 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:56,882 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005521 ms, 21.273 TFLOPS, speedup: 1.08x

Testing with PDL...
PDL average time: 0.004974 ms, 23.609 TFLOPS, speedup: 1.19x

--- deepseekv3, o_proj, tp=8: M=8, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005265 ms, 44.609 TFLOPS
2025-12-31 20:13:58,645 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:13:58,661 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005214 ms, 45.047 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.004740 ms, 49.548 TFLOPS, speedup: 1.11x

--- deepseekv3, o_proj, tp=8: M=16, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.005853 ms, 80.256 TFLOPS
2025-12-31 20:14:00,615 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:00,631 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005446 ms, 86.265 TFLOPS, speedup: 1.07x

Testing with PDL...
PDL average time: 0.004678 ms, 100.424 TFLOPS, speedup: 1.25x

--- deepseekv3, o_proj, tp=8: M=32, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006024 ms, 155.956 TFLOPS
2025-12-31 20:14:02,470 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:02,488 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005762 ms, 163.056 TFLOPS, speedup: 1.05x

Testing with PDL...
PDL average time: 0.004952 ms, 189.740 TFLOPS, speedup: 1.22x

--- deepseekv3, o_proj, tp=8: M=64, N=7168, K=2048, has_bias=False ---
CUBLAS average time: 0.006218 ms, 302.190 TFLOPS
2025-12-31 20:14:04,326 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:04,345 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006346 ms, 296.077 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL average time: 0.005899 ms, 318.527 TFLOPS, speedup: 1.05x

--- deepseekv3, q_b_proj, tp=8: M=1, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.006919 ms, 1.364 TFLOPS
2025-12-31 20:14:06,119 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:06,134 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004602 ms, 2.051 TFLOPS, speedup: 1.50x

Testing with PDL...
PDL average time: 0.004133 ms, 2.283 TFLOPS, speedup: 1.67x

--- deepseekv3, q_b_proj, tp=8: M=4, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.005842 ms, 6.461 TFLOPS
2025-12-31 20:14:08,004 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:08,032 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004786 ms, 7.887 TFLOPS, speedup: 1.22x

Testing with PDL...
PDL average time: 0.003971 ms, 9.506 TFLOPS, speedup: 1.47x

--- deepseekv3, q_b_proj, tp=8: M=8, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.005846 ms, 12.915 TFLOPS
2025-12-31 20:14:09,741 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:09,757 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004584 ms, 16.471 TFLOPS, speedup: 1.28x

Testing with PDL...
PDL average time: 0.004153 ms, 18.178 TFLOPS, speedup: 1.41x

--- deepseekv3, q_b_proj, tp=8: M=16, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004388 ms, 34.412 TFLOPS
2025-12-31 20:14:11,529 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:11,545 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004785 ms, 31.557 TFLOPS, speedup: 0.92x

Testing with PDL...
PDL average time: 0.003980 ms, 37.934 TFLOPS, speedup: 1.10x

--- deepseekv3, q_b_proj, tp=8: M=32, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004577 ms, 65.983 TFLOPS
2025-12-31 20:14:13,403 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:13,419 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004798 ms, 62.940 TFLOPS, speedup: 0.95x

Testing with PDL...
PDL average time: 0.004133 ms, 73.075 TFLOPS, speedup: 1.11x

--- deepseekv3, q_b_proj, tp=8: M=64, N=3072, K=1536, has_bias=False ---
CUBLAS average time: 0.004911 ms, 122.992 TFLOPS
2025-12-31 20:14:15,211 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:15,228 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004883 ms, 123.690 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.004382 ms, 137.827 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=1, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007843 ms, 0.940 TFLOPS
2025-12-31 20:14:17,051 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:17,067 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006012 ms, 1.226 TFLOPS, speedup: 1.30x

Testing with PDL...
PDL average time: 0.005520 ms, 1.336 TFLOPS, speedup: 1.42x

--- gpt-oss-120b, qkv_proj, tp=4: M=4, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007246 ms, 4.070 TFLOPS
2025-12-31 20:14:18,836 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:18,865 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006200 ms, 4.757 TFLOPS, speedup: 1.17x

Testing with PDL...
PDL average time: 0.005407 ms, 5.455 TFLOPS, speedup: 1.34x

--- gpt-oss-120b, qkv_proj, tp=4: M=8, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.007241 ms, 8.145 TFLOPS
2025-12-31 20:14:20,748 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:20,764 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006087 ms, 9.690 TFLOPS, speedup: 1.19x

Testing with PDL...
PDL average time: 0.005399 ms, 10.925 TFLOPS, speedup: 1.34x

--- gpt-oss-120b, qkv_proj, tp=4: M=16, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006055 ms, 19.483 TFLOPS
2025-12-31 20:14:22,591 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:22,607 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.005997 ms, 19.670 TFLOPS, speedup: 1.01x

Testing with PDL...
PDL average time: 0.005415 ms, 21.786 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=32, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006202 ms, 38.039 TFLOPS
2025-12-31 20:14:24,459 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:24,476 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006008 ms, 39.268 TFLOPS, speedup: 1.03x

Testing with PDL...
PDL average time: 0.005514 ms, 42.785 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, qkv_proj, tp=4: M=64, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006204 ms, 76.059 TFLOPS
2025-12-31 20:14:26,203 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:26,221 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006421 ms, 73.485 TFLOPS, speedup: 0.97x

Testing with PDL...
PDL average time: 0.005615 ms, 84.033 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, qkv_proj, tp=4: M=128, N=1280, K=2880, has_bias=True ---
CUBLAS average time: 0.006359 ms, 148.408 TFLOPS
2025-12-31 20:14:28,057 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:28,075 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.006428 ms, 146.816 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL average time: 0.005672 ms, 166.388 TFLOPS, speedup: 1.12x

--- gpt-oss-120b, o_proj, tp=4: M=1, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.005422 ms, 1.088 TFLOPS
2025-12-31 20:14:29,911 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:29,925 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004073 ms, 1.448 TFLOPS, speedup: 1.33x

Testing with PDL...
PDL average time: 0.003560 ms, 1.657 TFLOPS, speedup: 1.52x

--- gpt-oss-120b, o_proj, tp=4: M=4, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004809 ms, 4.906 TFLOPS
2025-12-31 20:14:31,788 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:31,816 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.003976 ms, 5.934 TFLOPS, speedup: 1.21x

Testing with PDL...
PDL average time: 0.003555 ms, 6.636 TFLOPS, speedup: 1.35x

--- gpt-oss-120b, o_proj, tp=4: M=8, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004791 ms, 9.848 TFLOPS
2025-12-31 20:14:33,632 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:33,648 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004183 ms, 11.280 TFLOPS, speedup: 1.15x

Testing with PDL...
PDL average time: 0.003485 ms, 13.541 TFLOPS, speedup: 1.37x

--- gpt-oss-120b, o_proj, tp=4: M=16, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.003918 ms, 24.084 TFLOPS
2025-12-31 20:14:35,393 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:35,409 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004205 ms, 22.442 TFLOPS, speedup: 0.93x

Testing with PDL...
PDL average time: 0.003552 ms, 26.572 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, o_proj, tp=4: M=32, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004081 ms, 46.244 TFLOPS
2025-12-31 20:14:37,317 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:37,333 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004162 ms, 45.344 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL average time: 0.003576 ms, 52.782 TFLOPS, speedup: 1.14x

--- gpt-oss-120b, o_proj, tp=4: M=64, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004178 ms, 90.358 TFLOPS
2025-12-31 20:14:39,160 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:39,178 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004539 ms, 83.172 TFLOPS, speedup: 0.92x

Testing with PDL...
PDL average time: 0.003768 ms, 100.177 TFLOPS, speedup: 1.11x

--- gpt-oss-120b, o_proj, tp=4: M=128, N=2880, K=1024, has_bias=True ---
CUBLAS average time: 0.004576 ms, 164.985 TFLOPS
2025-12-31 20:14:40,987 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2025-12-31 20:14:41,006 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV average time: 0.004792 ms, 157.552 TFLOPS, speedup: 0.95x

Testing with PDL...
PDL average time: 0.004384 ms, 172.222 TFLOPS, speedup: 1.04x

=== Writing results to bf16_tgv_gemm_benchmark_results.csv ===
Benchmark results saved to bf16_tgv_gemm_benchmark_results.csv
Total test cases: 26

==================================================
All BF16 TGV GEMM SM100 tests completed successfully!
  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores

    • Standardized GPU timing across benchmark variants using a unified timing utility.
    • Switched reported metrics from average to median and updated speedup reporting.
    • Updated benchmark logs and CSV output to include new timing fields for all paths.
  • Refactor

    • Improved benchmark profiling logic for associating kernel launches with iterations, making profiling more efficient and robust.


Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

Note

Other AI code review bot(s) detected

CodeRabbit has detected other AI code review bot(s) in this pull request and will avoid duplicating their findings in the review comments. This may lead to a less comprehensive review.

📝 Walkthrough

Replaces manual per-iteration CUDA-graph timing in the GEMM benchmark with centralized bench_gpu_time calls (median GPU timings) and refactors iteration-to-launch mapping in the GPU profiling utility to use binary search and precomputed mappings for performance.

Changes

  • Benchmark timing update (benchmarks/bench_tgv_gemm.py): Replaces explicit CUDA-graph timing and manual time measurement with bench_gpu_time (imported from flashinfer.testing.utils); uses median GPU times for CUBLAS, TGV, and PDL; updates logging and speedup calculations to use medians; writes cublas_time_ms, tgv_time_ms, pdl_time_ms to CSV; adds numpy import and removes direct time usage.
  • Profiling utility optimization (flashinfer/testing/utils.py): Replaces the linear scan for associating launches to iterations with a bisect-based approach: sorts launches, builds launch_starts and corr_id_to_kernels mappings, and uses binary search plus lookup to assemble per-iteration kernels (O(log M + R) per iteration). Preserves kernel-name checks and error behavior; adds a local type annotation.
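
For orientation, here is a minimal sketch of the new timing path. The bench_gpu_time_with_cudagraph call and its arguments mirror the snippets shown in the review comments below; the report_median_tflops helper, the tensor setup, and the M/N/K values are illustrative placeholders rather than the benchmark's exact code.

import numpy as np
import torch
import torch.nn.functional as F
from flashinfer.testing.utils import bench_gpu_time_with_cudagraph

def report_median_tflops(times_ms, m, n, k):
    # bench_gpu_time_with_cudagraph returns a list of per-measurement times in ms;
    # the benchmark now reports the median of these instead of an average.
    median_ms = float(np.median(times_ms))
    tflops = 2 * m * n * k / (median_ms * 1e-3) / 1e12
    return median_ms, tflops

M, N, K = 8, 7168, 2048  # illustrative shape from the deepseekv3 o_proj case
A = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
B = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")
bias = None

cublas_times = bench_gpu_time_with_cudagraph(
    lambda: F.linear(A, B.T, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)
cublas_ms, cublas_tflops = report_median_tflops(cublas_times, M, N, K)
print(f"CUBLAS median time: {cublas_ms:.6f} ms, {cublas_tflops:.3f} TFLOPS")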

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers

  • cyx-6
  • nvmbreughe
  • Anerudhan

Poem

🐇 I hopped through logs and kernels deep,
Swapped ticking clocks for medians I keep.
Bisect and map made my paths concise,
Faster lookups, fewer hops to splice.
Hop, bench, bloom — performance feels sweet! 🥕

Pre-merge checks and finishing touches

❌ Failed checks (2 warnings, 1 inconclusive)
  • Description check (⚠️ Warning): The description is the template with only a subset of checklist items marked; the Description section lacks any actual explanation of changes, and Related Issues is empty. Resolution: fill in the Description section with details about replacing CUDA graph timing with bench_gpu_time, optimizing kernel association logic, and updating benchmark reporting from average to median times. Link any related issues.
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 66.67%, which is insufficient; the required threshold is 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check (❓ Inconclusive): The title "Tiny fix bench tgv gemm" is vague and generic; "Tiny fix" does not convey meaningful information about the actual changes (refactoring the benchmark timing methodology). Resolution: replace it with a more specific title describing the main change, such as "Refactor TGV GEMM benchmark to use bench_gpu_time utility" or "Replace manual CUDA graph timing with bench_gpu_time in TGV GEMM benchmark".
✨ Finishing touches
  • 📝 Generate docstrings


@gemini-code-assist
Contributor

Summary of Changes

Hello @vincentzed, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request refactors the TGV GEMM benchmarking script by integrating a new, centralized utility function for GPU time measurement. The change aims to standardize the benchmarking process, improve code readability, and ensure consistent performance evaluation across CUBLAS, TGV, and PDL GEMM implementations.

Highlights

  • Benchmarking Utility Integration: Replaced manual CUDA graph benchmarking logic with a dedicated utility function, bench_gpu_time_with_cudagraph, to standardize and streamline performance measurements for GEMM operations.
  • Code Simplification and Maintainability: Removed redundant manual CUDA graph setup, warmup, and timing code, resulting in a cleaner, more concise, and easier-to-maintain benchmark script.


Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request refactors the bench_tgv_gemm.py benchmark script to use the bench_gpu_time_with_cudagraph utility function, which simplifies the code and improves maintainability by removing boilerplate CUDA graph benchmarking logic.

While the refactoring is a good improvement, I've noticed a change in the benchmarking methodology. The number of iterations captured within the CUDA graph has been implicitly changed from 100 to the default of 10. This can affect the benchmark results by changing how kernel launch overhead is amortized. I've added comments with suggestions to restore the original number of iterations to ensure benchmark consistency.

Comment on lines +83 to +88
cublas_times = bench_gpu_time_with_cudagraph(
    lambda: F.linear(A, B.T, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)
Contributor


Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
 cublas_times = bench_gpu_time_with_cudagraph(
     lambda: F.linear(A, B.T, bias),
     dry_run_time_ms=100,
     repeat_time_ms=500,
     cold_l2_cache=False,
+    num_iters_within_graph=100,
 )

Comment on lines +101 to +106
tgv_times = bench_gpu_time_with_cudagraph(
    lambda: tgv_gemm_sm100(A, B, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)
Contributor


Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
 tgv_times = bench_gpu_time_with_cudagraph(
     lambda: tgv_gemm_sm100(A, B, bias),
     dry_run_time_ms=100,
     repeat_time_ms=500,
     cold_l2_cache=False,
+    num_iters_within_graph=100,
 )

Comment on lines +114 to +119
pdl_times = bench_gpu_time_with_cudagraph(
    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    cold_l2_cache=False,
)
Contributor


Severity: high

The previous implementation captured 100 iterations within the CUDA graph to amortize launch overhead. The bench_gpu_time_with_cudagraph function defaults to num_iters_within_graph=10. To maintain consistency with the previous benchmarking methodology and ensure better amortization of kernel launch overhead, it's recommended to explicitly set num_iters_within_graph=100.

Suggested change
 pdl_times = bench_gpu_time_with_cudagraph(
     lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
     dry_run_time_ms=100,
     repeat_time_ms=500,
     cold_l2_cache=False,
+    num_iters_within_graph=100,
 )

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (3)
benchmarks/bench_tgv_gemm.py (3)

83-89: Consider using input_args to avoid capturing loop variables in lambda.

The lambda captures A, B, and bias from the loop scope. While this works because bench_gpu_time_with_cudagraph executes immediately, using input_args would be more explicit and eliminate the static analysis warning.

🔎 Suggested refactor using input_args

Per the bench_gpu_time_with_cudagraph docstring, you can pass arguments explicitly:

-cublas_times = bench_gpu_time_with_cudagraph(
-    lambda: F.linear(A, B.T, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+cublas_times = bench_gpu_time_with_cudagraph(
+    F.linear,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B.T, bias),
+)

101-107: Consider using input_args to avoid capturing loop variables in lambda.

Same pattern as the CUBLAS benchmark: the lambda captures loop-scoped variables. Using input_args would eliminate the static analysis warning.

🔎 Suggested refactor using input_args
-tgv_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+tgv_times = bench_gpu_time_with_cudagraph(
+    tgv_gemm_sm100,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B, bias),
+)

114-120: Consider using input_args and input_kwargs to avoid capturing loop variables in lambda.

Same lambda closure pattern, but with a keyword argument. Using input_args and input_kwargs would eliminate the static analysis warning.

🔎 Suggested refactor using input_args and input_kwargs
-pdl_times = bench_gpu_time_with_cudagraph(
-    lambda: tgv_gemm_sm100(A, B, bias, pdl=True),
-    dry_run_time_ms=100,
-    repeat_time_ms=500,
-    cold_l2_cache=False,
-)
+pdl_times = bench_gpu_time_with_cudagraph(
+    tgv_gemm_sm100,
+    dry_run_time_ms=100,
+    repeat_time_ms=500,
+    cold_l2_cache=False,
+    input_args=(A, B, bias),
+    input_kwargs={"pdl": True},
+)
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 747b0cb and 46f87f8.

📒 Files selected for processing (1)
  • benchmarks/bench_tgv_gemm.py
🧰 Additional context used
🧬 Code graph analysis (1)
benchmarks/bench_tgv_gemm.py (1)
flashinfer/testing/utils.py (1)
  • bench_gpu_time_with_cudagraph (1259-1481)
🪛 Ruff (0.14.10)
benchmarks/bench_tgv_gemm.py

  • 84-84: Function definition does not bind loop variable A (B023)
  • 84-84: Function definition does not bind loop variable B (B023)
  • 84-84: Function definition does not bind loop variable bias (B023)
  • 102-102: Function definition does not bind loop variable A (B023)
  • 102-102: Function definition does not bind loop variable B (B023)
  • 102-102: Function definition does not bind loop variable bias (B023)
  • 115-115: Function definition does not bind loop variable A (B023)
  • 115-115: Function definition does not bind loop variable B (B023)
  • 115-115: Function definition does not bind loop variable bias (B023)

🔇 Additional comments (1)
benchmarks/bench_tgv_gemm.py (1)

10-10: LGTM!

The import of bench_gpu_time_with_cudagraph enables cleaner timing logic by replacing manual CUDA graph capture and replay.

Collaborator

@yzh119 yzh119 left a comment


Shall we switch to bench_gpu_time_with_cupti? I suppose the motivation is to get kernel duration close to nsys measured results in end-to-end serving.

cc @bkryu @vadiklyutiy

@bkryu
Collaborator

bkryu commented Jan 2, 2026

> Shall we switch to bench_gpu_time_with_cupti? I suppose the motivation is to get kernel duration close to nsys measured results in end-to-end serving.
>
> cc @bkryu @vadiklyutiy

Yes, bench_gpu_time_with_cupti (or bench_gpu_time with enable_cupti=True and cold_l2_cache=True) should be the recommended way. Please see the docs and example usage.
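
A minimal sketch of what that recommended call could look like for the TGV path; only enable_cupti=True and cold_l2_cache=True come from the comment above, while the dry_run_time_ms/repeat_time_ms values, the tensor setup, and the tgv_gemm_sm100 import path are assumptions carried over from the rest of this PR.

import numpy as np
import torch
from flashinfer import tgv_gemm_sm100  # import path assumed; see benchmarks/bench_tgv_gemm.py for the actual import
from flashinfer.testing.utils import bench_gpu_time

M, N, K = 8, 7168, 2048  # illustrative shape
A = torch.randn(M, K, dtype=torch.bfloat16, device="cuda")
B = torch.randn(K, N, dtype=torch.bfloat16, device="cuda")
bias = None

# CUPTI-based kernel timing with the L2 cache flushed between runs.
tgv_times = bench_gpu_time(
    lambda: tgv_gemm_sm100(A, B, bias),
    dry_run_time_ms=100,
    repeat_time_ms=500,
    enable_cupti=True,
    cold_l2_cache=True,
)
print(f"TGV median time: {float(np.median(tgv_times)):.6f} ms")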

Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
Signed-off-by: vincentzed <207368749+vincentzed@users.noreply.github.com>
@vincentzed
Contributor Author

vincentzed commented Jan 5, 2026

TGV + cupti ~ cold l2

flashinfer ❯ FLASHINFER_DISABLE_VERSION_CHECK=1 python benchmarks/bench_tgv_gemm.py
Starting BF16 TGV GEMM SM100 Tests
==================================================

=== Testing correctness ===
Cosine similarity: 1.007812
Max difference: 1.000000
Mean difference: 0.035889
✓ Correctness test PASSED

=== Testing tgv_gemm_bf16_sm100 with different sizes ===

--- deepseekv3, o_proj, tp=8: M=1, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.009024 ms, 3.254 TFLOPS
2026-01-05 02:16:03,383 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:16:03,415 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009024 ms, 3.254 TFLOPS, speedup: 1.00x

Testing with PDL...
PDL median time: 0.009056 ms, 3.242 TFLOPS, speedup: 1.00x

--- deepseekv3, o_proj, tp=8: M=4, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.009184 ms, 12.788 TFLOPS
2026-01-05 02:16:15,586 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:16:15,616 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009024 ms, 13.014 TFLOPS, speedup: 1.02x

Testing with PDL...
PDL median time: 0.009024 ms, 13.014 TFLOPS, speedup: 1.02x

--- deepseekv3, o_proj, tp=8: M=8, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.008928 ms, 26.308 TFLOPS
2026-01-05 02:16:27,436 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:16:27,454 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009088 ms, 25.845 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL median time: 0.009118 ms, 25.760 TFLOPS, speedup: 0.98x

--- deepseekv3, o_proj, tp=8: M=16, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.008928 ms, 52.617 TFLOPS
2026-01-05 02:16:39,315 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:16:39,331 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009050 ms, 51.907 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL median time: 0.009051 ms, 51.902 TFLOPS, speedup: 0.99x

--- deepseekv3, o_proj, tp=8: M=32, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.009248 ms, 101.592 TFLOPS
2026-01-05 02:16:51,386 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:16:51,405 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009344 ms, 100.548 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL median time: 0.009310 ms, 100.916 TFLOPS, speedup: 0.99x

--- deepseekv3, o_proj, tp=8: M=64, N=7168, K=2048, has_bias=False ---
CUBLAS median time: 0.009312 ms, 201.788 TFLOPS
2026-01-05 02:17:02,874 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:17:02,894 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.009665 ms, 194.418 TFLOPS, speedup: 0.96x

Testing with PDL...
PDL median time: 0.009664 ms, 194.438 TFLOPS, speedup: 0.96x

--- deepseekv3, q_b_proj, tp=8: M=1, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.007296 ms, 1.293 TFLOPS
2026-01-05 02:17:15,155 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:17:15,170 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006144 ms, 1.536 TFLOPS, speedup: 1.19x

Testing with PDL...
PDL median time: 0.006139 ms, 1.537 TFLOPS, speedup: 1.19x

--- deepseekv3, q_b_proj, tp=8: M=4, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.006333 ms, 5.961 TFLOPS
2026-01-05 02:17:27,390 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:17:27,419 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006016 ms, 6.275 TFLOPS, speedup: 1.05x

Testing with PDL...
PDL median time: 0.005983 ms, 6.309 TFLOPS, speedup: 1.06x

--- deepseekv3, q_b_proj, tp=8: M=8, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.006272 ms, 12.037 TFLOPS
2026-01-05 02:17:39,742 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:17:39,758 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006304 ms, 11.976 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL median time: 0.006304 ms, 11.976 TFLOPS, speedup: 0.99x

--- deepseekv3, q_b_proj, tp=8: M=16, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.005824 ms, 25.926 TFLOPS
2026-01-05 02:17:51,659 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:17:51,677 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005984 ms, 25.233 TFLOPS, speedup: 0.97x

Testing with PDL...
PDL median time: 0.005978 ms, 25.258 TFLOPS, speedup: 0.97x

--- deepseekv3, q_b_proj, tp=8: M=32, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.005888 ms, 51.289 TFLOPS
2026-01-05 02:18:04,116 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:18:04,134 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006144 ms, 49.152 TFLOPS, speedup: 0.96x

Testing with PDL...
PDL median time: 0.006110 ms, 49.426 TFLOPS, speedup: 0.96x

--- deepseekv3, q_b_proj, tp=8: M=64, N=3072, K=1536, has_bias=False ---
CUBLAS median time: 0.005990 ms, 100.831 TFLOPS
2026-01-05 02:18:16,494 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:18:16,511 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006336 ms, 95.325 TFLOPS, speedup: 0.95x

Testing with PDL...
PDL median time: 0.006336 ms, 95.325 TFLOPS, speedup: 0.95x

--- gpt-oss-120b, qkv_proj, tp=4: M=1, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.008000 ms, 0.922 TFLOPS
2026-01-05 02:18:28,769 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:18:28,785 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.007744 ms, 0.952 TFLOPS, speedup: 1.03x

Testing with PDL...
PDL median time: 0.007715 ms, 0.956 TFLOPS, speedup: 1.04x

--- gpt-oss-120b, qkv_proj, tp=4: M=4, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.007520 ms, 3.922 TFLOPS
2026-01-05 02:18:41,012 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:18:41,042 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006848 ms, 4.307 TFLOPS, speedup: 1.10x

Testing with PDL...
PDL median time: 0.006848 ms, 4.307 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, qkv_proj, tp=4: M=8, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.007488 ms, 7.877 TFLOPS
2026-01-05 02:18:53,429 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:18:53,445 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006784 ms, 8.694 TFLOPS, speedup: 1.10x

Testing with PDL...
PDL median time: 0.006784 ms, 8.694 TFLOPS, speedup: 1.10x

--- gpt-oss-120b, qkv_proj, tp=4: M=16, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.007008 ms, 16.833 TFLOPS
2026-01-05 02:19:05,056 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:19:05,072 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.007712 ms, 15.296 TFLOPS, speedup: 0.91x

Testing with PDL...
PDL median time: 0.007712 ms, 15.296 TFLOPS, speedup: 0.91x

--- gpt-oss-120b, qkv_proj, tp=4: M=32, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.006976 ms, 33.820 TFLOPS
2026-01-05 02:19:17,077 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:19:17,095 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.007072 ms, 33.361 TFLOPS, speedup: 0.99x

Testing with PDL...
PDL median time: 0.007072 ms, 33.361 TFLOPS, speedup: 0.99x

--- gpt-oss-120b, qkv_proj, tp=4: M=64, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.007167 ms, 65.838 TFLOPS
2026-01-05 02:19:29,507 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:19:29,525 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006973 ms, 67.669 TFLOPS, speedup: 1.03x

Testing with PDL...
PDL median time: 0.006975 ms, 67.650 TFLOPS, speedup: 1.03x

--- gpt-oss-120b, qkv_proj, tp=4: M=128, N=1280, K=2880, has_bias=True ---
CUBLAS median time: 0.007295 ms, 129.365 TFLOPS
2026-01-05 02:19:41,394 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:19:41,412 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.008512 ms, 110.869 TFLOPS, speedup: 0.86x

Testing with PDL...
PDL median time: 0.008544 ms, 110.454 TFLOPS, speedup: 0.85x

--- gpt-oss-120b, o_proj, tp=4: M=1, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005760 ms, 1.024 TFLOPS
2026-01-05 02:19:53,477 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:19:53,492 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005440 ms, 1.084 TFLOPS, speedup: 1.06x

Testing with PDL...
PDL median time: 0.005440 ms, 1.084 TFLOPS, speedup: 1.06x

--- gpt-oss-120b, o_proj, tp=4: M=4, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005312 ms, 4.441 TFLOPS
2026-01-05 02:20:05,062 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:20:05,091 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005091 ms, 4.634 TFLOPS, speedup: 1.04x

Testing with PDL...
PDL median time: 0.005120 ms, 4.608 TFLOPS, speedup: 1.04x

--- gpt-oss-120b, o_proj, tp=4: M=8, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005472 ms, 8.623 TFLOPS
2026-01-05 02:20:17,432 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:20:17,448 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005120 ms, 9.216 TFLOPS, speedup: 1.07x

Testing with PDL...
PDL median time: 0.005120 ms, 9.216 TFLOPS, speedup: 1.07x

--- gpt-oss-120b, o_proj, tp=4: M=16, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005184 ms, 18.204 TFLOPS
2026-01-05 02:20:29,292 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:20:29,309 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005311 ms, 17.769 TFLOPS, speedup: 0.98x

Testing with PDL...
PDL median time: 0.005313 ms, 17.762 TFLOPS, speedup: 0.98x

--- gpt-oss-120b, o_proj, tp=4: M=32, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005216 ms, 36.186 TFLOPS
2026-01-05 02:20:41,542 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:20:41,560 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005216 ms, 36.186 TFLOPS, speedup: 1.00x

Testing with PDL...
PDL median time: 0.005216 ms, 36.186 TFLOPS, speedup: 1.00x

--- gpt-oss-120b, o_proj, tp=4: M=64, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005312 ms, 71.063 TFLOPS
2026-01-05 02:20:53,486 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:20:53,503 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.005504 ms, 68.584 TFLOPS, speedup: 0.97x

Testing with PDL...
PDL median time: 0.005504 ms, 68.584 TFLOPS, speedup: 0.97x

--- gpt-oss-120b, o_proj, tp=4: M=128, N=2880, K=1024, has_bias=True ---
CUBLAS median time: 0.005664 ms, 133.294 TFLOPS
2026-01-05 02:21:05,181 - INFO - autotuner.py:256 - flashinfer.jit: [Autotuner]: Autotuning process starts ...
2026-01-05 02:21:05,199 - INFO - autotuner.py:262 - flashinfer.jit: [Autotuner]: Autotuning process ends
TGV median time: 0.006016 ms, 125.494 TFLOPS, speedup: 0.94x

Testing with PDL...
PDL median time: 0.006016 ms, 125.494 TFLOPS, speedup: 0.94x

=== Writing results to bf16_tgv_gemm_benchmark_results.csv ===
Benchmark results saved to bf16_tgv_gemm_benchmark_results.csv
Total test cases: 26

==================================================
All BF16 TGV GEMM SM100 tests completed successfully!

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
flashinfer/testing/utils.py (1)

1232-1247: Minor: Ambiguous variable name and import location.

Per the static analysis hint (Ruff E741), the variable l in the lambda and list comprehension can be confused with 1 (one) in some fonts. Consider using a more descriptive name like launch.

Additionally, the bisect import inside the function is unconventional for a standard library module. Consider moving it to the top of the file with other imports for consistency.

🔎 Proposed fix

Move the import to the top of the file (around line 17-20):

import bisect

Then update lines 1235-1237:

-    import bisect
-
     # Step 1: Sort launches by start timestamp - O(M log M)
-    sorted_launches = sorted(launches, key=lambda l: l[0])
-    launch_starts = [l[0] for l in sorted_launches]
+    sorted_launches = sorted(launches, key=lambda launch: launch[0])
+    launch_starts = [launch[0] for launch in sorted_launches]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between fff4ca6 and 17a478d.

📒 Files selected for processing (1)
  • flashinfer/testing/utils.py
🧰 Additional context used
📓 Path-based instructions (1)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/testing/utils.py
🪛 Ruff (0.14.10)
flashinfer/testing/utils.py

  • 1236-1236: Ambiguous variable name: l (E741)
  • 1237-1237: Ambiguous variable name: l (E741)

🔇 Additional comments (1)
flashinfer/testing/utils.py (1)

1249-1264: LGTM! Binary search optimization is correct.

The algorithm correctly uses:

  • bisect_left to find the first launch with start >= start_cpu
  • bisect_right to find the position after the last launch with start <= end_cpu

This gives O(log M) lookup per iteration instead of O(M) linear scan, which addresses the performance concern from the commit message ("basically scan all cupti... too slow").
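
As a rough, self-contained illustration of the pattern described above (not the exact code in flashinfer/testing/utils.py; the data layout and names below are assumptions):

import bisect

# Assume each launch is a (start_timestamp, correlation_id) tuple collected from CUPTI.
launches = [(330, "c3"), (105, "c1"), (210, "c2")]

# Precompute once: sort by start timestamp and keep a parallel list of start times.
sorted_launches = sorted(launches, key=lambda launch: launch[0])
launch_starts = [launch[0] for launch in sorted_launches]

def launches_in_window(start_cpu, end_cpu):
    # Index of the first launch whose start timestamp is >= start_cpu ...
    lo = bisect.bisect_left(launch_starts, start_cpu)
    # ... and one past the last launch whose start timestamp is <= end_cpu.
    hi = bisect.bisect_right(launch_starts, end_cpu)
    return sorted_launches[lo:hi]

print(launches_in_window(200, 340))  # [(210, 'c2'), (330, 'c3')]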

Collaborator

@yzh119 yzh119 left a comment


LGTM

@yzh119 yzh119 enabled auto-merge (squash) January 5, 2026 06:55
@yzh119 yzh119 merged commit ff41a8f into flashinfer-ai:main Jan 5, 2026
4 checks passed