
Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" #2737

Merged
bkryu merged 3 commits into main from revert-2716-fix/sm120-fp4-pdl-gdc-flags
Mar 12, 2026

Conversation

@aleozlx
Collaborator

@aleozlx aleozlx commented Mar 10, 2026

Proposing to revert #2716 in order to unblock the 0.6.6 release.

#2716 seems to have broken AOT packages

https://github.com/flashinfer-ai/flashinfer/actions/runs/22870567870/job/66353637447?pr=2730

Summary by CodeRabbit

  • Bug Fixes
    • Removed legacy GPU compilation flags related to GDC enablement for certain GPU tiers during JIT GEMM generation, reducing extra compile flags and build noise; GDC-related flags for the latest GPU tier remain enabled where still applicable.

@aleozlx aleozlx added the v0.6.6 label (release blocker label for 0.6.6) Mar 10, 2026
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reverts a previous change that introduced specific CUTLASS GDC compilation flags to several GEMM kernel generation functions. The original change, intended to fix NaN issues under concurrency, inadvertently broke Ahead-Of-Time (AOT) packages. By reverting these flags, the PR aims to restore the stability of AOT packages and unblock the ongoing 0.6.6 release.

Highlights

  • Reversion of CUTLASS GDC Flags: The pull request reverts the addition of -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compilation flags from various GEMM kernel generation functions.
  • Fix for AOT Package Breakage: This reversion addresses issues where the previously added flags caused breakage in Ahead-Of-Time (AOT) packages, as evidenced by failed CI runs.
  • Unblocking Release: The primary motivation for this change is to unblock the 0.6.6 release, which was hindered by the AOT package problems.


Changelog
  • flashinfer/jit/gemm/core.py
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_fp4.
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm103_module_cutlass_fp4.
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm120_module_cutlass_fp4.
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_fp8.
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_mxfp8.
    • Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm120_module.
Activity
  • No human activity has been recorded on this pull request yet.
Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@coderabbitai
Contributor

coderabbitai bot commented Mar 10, 2026

No actionable comments were generated in the recent review. 🎉

📥 Commits

Reviewing files that changed from the base of the PR and between 95992b1 and b7cca53.

📒 Files selected for processing (1)
  • flashinfer/jit/gemm/core.py

📝 Walkthrough

Walkthrough

Removed the -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compile flags from multiple GEMM JIT generator configurations in flashinfer/jit/gemm/core.py. No other functional or control-flow changes.
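
As a rough illustration of what the revert amounts to mechanically, the sketch below filters the two GDC defines out of a flag list. The helper `strip_gdc_flags` and the sample list are hypothetical; the actual change edits the `extra_cuda_cflags` literals in `flashinfer/jit/gemm/core.py` directly.

```python
# Hypothetical helper, not part of flashinfer: models the effect of the revert
# on one generator's extra_cuda_cflags list.
GDC_FLAGS = (
    "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",
    "-DCUTLASS_ENABLE_GDC_FOR_SM90=1",
)


def strip_gdc_flags(cflags):
    """Return a copy of cflags without the two CUTLASS GDC defines."""
    return [flag for flag in cflags if flag not in GDC_FLAGS]


# Sample list for illustration only; real generators pass additional flags.
before = [
    "--expt-relaxed-constexpr",
    "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",
    "-DCUTLASS_ENABLE_GDC_FOR_SM90=1",
]
after = strip_gdc_flags(before)
print(after)
```

Any flag other than the two defines passes through unchanged, which matches the walkthrough's statement that there are no other functional changes.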

Changes

| Cohort / File(s) | Summary |
|---|---|
| GEMM JIT GDC flag edits: flashinfer/jit/gemm/core.py | Removed -DCUTLASS_ENABLE_GDC_FOR_SM90=1 and -DCUTLASS_ENABLE_GDC_FOR_SM100=1 occurrences from extra_cuda_cflags across multiple gen_gemm_* generator variants (fp4, fp8, mxfp8, bf16, various SM variants). Note: SM100 GDC flag still exists in other contexts outside these generators. Review CUDA compile-flag sets and generator-specific flag lists. |

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested reviewers

  • yzh119
  • jimmyzho
  • jiahanc
  • nv-yunzheq

Poem

🐰 I hopped through flags at break of dawn,
I nudged two flags and quietly moved on,
SM90 took a tiny leap,
SM100 keeps some things to keep,
Builds hum softer now — let carrots spawn! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
| Description check | ❓ Inconclusive | The PR description provides context for the revert (unblocking 0.6.6 release) and references a breaking issue in PR #2716, but is missing key template sections like checklist confirmations and detailed explanation of what the revert accomplishes. | Complete the PR template by adding sections for pre-commit checks, test status, and a more detailed description of how reverting this commit resolves the AOT package breakage. |

✅ Passed checks (1 passed)

| Check name | Status | Explanation |
|---|---|---|
| Title check | ✅ Passed | The title clearly and specifically describes the main change: reverting a prior fix commit related to GEMM kernels and GDC flags. |


@claude

claude bot commented Mar 10, 2026

Code Review: Revert of GDC Flags (#2737)

Context: This PR reverts #2716 to unblock the v0.6.6 release. The change itself is minimal and mechanically correct — it removes -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 from 6 JIT module generators in flashinfer/jit/gemm/core.py.


Critical Concern: Restoring a Known Race Condition

The core trade-off here deserves explicit acknowledgment before merging.

PR #2716 fixed a real, verified, production-impacting bug:

  • CUTLASS FP4/FP8 GEMM kernels use enablePDL=true but without GDC flags, wait_on_dependent_grids() and launch_dependent_grids() in cutlass/arch/grid_dependency_control.h compile as empty no-ops
  • This creates a device-side race condition where output tiles are silently not written (NaN/garbage data)
  • Verified on SM120 (RTX 5090 / RTX PRO 6000) with 64+ concurrent requests in SGLang with TP=8
  • Confirmed to be a race condition via CUDA_LAUNCH_BLOCKING=1 eliminating the bug
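
The failure mode described in the bullets above can be modeled on the CPU. The sketch is a toy simulation, not CUDA: `producer` and `consumer` stand in for two dependent kernels, the interleaving is fixed by hand to mimic an early dependent launch, and `wait_enabled` plays the role of the GDC compile flag. All names are hypothetical.

```python
import math


def run_pipeline(wait_enabled):
    """Toy model of a PDL-launched consumer reading a producer's output."""
    buffer = [math.nan] * 4          # output tiles, NaN until written
    producer_done = False

    def producer():
        nonlocal producer_done
        for i in range(len(buffer)):
            buffer[i] = float(i)
        producer_done = True

    def consumer():
        # With the barrier compiled in, the consumer cannot observe the
        # buffer before the producer has finished.
        if wait_enabled and not producer_done:
            producer()               # stands in for blocking until completion
        return list(buffer)

    # PDL lets the dependent kernel start early, so the consumer runs first
    # in this hand-picked schedule.
    seen = consumer()
    if not producer_done:
        producer()
    return seen


print(run_pipeline(wait_enabled=True))
print(run_pipeline(wait_enabled=False))
```

With `wait_enabled=False` the consumer returns only NaN values even though the producer eventually runs, which is the shape of the silent corruption reported here. Real GPU scheduling is nondeterministic, so the actual bug appears only under load.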

This revert restores that race condition for SM90 and SM100/SM103/SM120 users. This is a data-correctness issue, not just a performance one.


Missing Information

The PR description states the flags "seem to have broken AOT packages" with a link to a CI job. Before merging, it would be valuable to understand:

  1. What exactly failed in the AOT build? A compile error, link error, or test failure? The GDC flags are standard preprocessor defines — it's not immediately obvious why they would break AOT builds but not JIT builds.

  2. Is the AOT failure specific to certain SM versions or all targets? If it only affects certain targets (e.g., SM90 on older CUDA), a more targeted fix might be possible (e.g., conditionally apply the flag based on CUDA version).

  3. Was there any investigation of why the flags caused a build failure? Understanding the root cause would enable a fix that keeps the correctness benefit without the AOT breakage.


Inconsistency After This Revert

Note that gen_tgv_gemm_sm10x_module() (line 641) will still have -DCUTLASS_ENABLE_GDC_FOR_SM100=1 after this revert — that flag was already there before PR #2716. This creates an asymmetry where tgv_gemm modules get correct PDL synchronization but the FP4/FP8/MXFP8 GEMM modules do not.


Recommendation

If the 0.6.6 release is blocked and this revert is necessary to ship:

  1. Document the known regression in the release notes — users running SM120/SM100 hardware under high concurrency with TP should be aware of the NaN risk.

  2. Add a follow-up issue immediately to re-add the GDC flags with a proper investigation of the AOT build failure.

  3. Consider a targeted workaround: if PDL can be disabled at the kernel launch level for the affected modules (without the GDC flags), that would avoid both the compile failure and the race condition — though this may require kernel-level changes.

  4. Consider keeping the flag only for SM120 modules (gen_gemm_sm120_module_cutlass_fp4, gen_gemm_sm120_module) since those were the hardware where the bug was most clearly observed, if the AOT build failure was only on SM90/SM100 targets.


Summary

| Aspect | Assessment |
|---|---|
| Code correctness | The revert is mechanically correct |
| Safety | Restores a known race condition (silent NaN outputs) |
| Root cause documented | Not in this PR; missing analysis of AOT failure |
| Justification | Unblocking release is a valid operational reason |
| Mitigation needed | Yes: follow-up issue + release note warning |

Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request reverts a previous fix for a concurrency issue in GEMM kernels that could lead to NaN results, aiming to unblock the 0.6.6 release by resolving a build failure in AOT packages. However, a security audit identified several high-severity issues in the codebase, including Command Injection (e.g., FLASHINFER_CUDA_ARCH_LIST, nvcc_flags, CXX, FLASHINFER_NVCC_THREADS used without sanitization in flashinfer/jit/gemm/core.py), Path Traversal (artifact download in flashinfer/artifacts.py and flashinfer/jit/cubin_loader.py using unsanitized filenames), and Path Traversal via Workspace Directory (FLASHINFER_CUDA_ARCH_LIST used unsanitized for workspace path in flashinfer/jit/env.py). While the revert is a valid short-term solution, it re-introduces a known correctness bug. It is critical to address the AOT build issue and re-instate GDC flags for GEMM operations. Additionally, strict validation for all environment variables and sanitization of all filenames used in path construction are strongly recommended to mitigate the identified security vulnerabilities.

I am having trouble creating individual review comments. Click here to see my feedback.

flashinfer/jit/gemm/core.py (94-95)

critical

Removing these GDC (Grid Dependency Control) flags re-introduces a known concurrency bug that can cause GEMM kernels to produce NaN values. The original fix ensured that PDL (Programmatic Dependent Launch) synchronization barriers are not compiled as no-ops. While I understand this revert is necessary to unblock the release, please ensure a high-priority task is created to investigate the AOT build failure and re-apply this fix. The correctness of concurrent GEMM operations depends on it. This feedback applies to all similar removals of these flags in this file.

flashinfer/jit/gemm/core.py (519)

security-high high

The nvcc_flags variable used here is constructed from the FLASHINFER_CUDA_ARCH_LIST environment variable without sufficient sanitization. An attacker who can control this environment variable can inject arbitrary shell commands or malicious compiler flags (e.g., by including shell metacharacters in the version string), leading to potential Remote Code Execution (RCE) when the JIT compilation is triggered. While this usage is not new, it is part of the modified code in this PR. It is recommended to validate the environment variable in flashinfer/compilation_context.py to ensure it only contains valid numeric versions.
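
The validation recommended above might look like the following sketch. It is not flashinfer code: `parse_arch_list` is a hypothetical helper, and the accepted token grammar (major.minor with an optional one-letter suffix, such as `9.0a` or `12.1a`) is an assumption based on the values quoted in this thread.

```python
import re

# Assumed token grammar: major.minor with an optional one-letter suffix.
_ARCH_TOKEN = re.compile(r"\d{1,2}\.\d[a-z]?")


def parse_arch_list(value):
    """Split a space-separated arch list and reject anything unexpected."""
    tokens = value.split()
    if not tokens:
        raise ValueError("arch list is empty")
    for token in tokens:
        if _ARCH_TOKEN.fullmatch(token) is None:
            raise ValueError(f"invalid CUDA arch entry: {token!r}")
    return tokens


print(parse_arch_list("9.0a 10.0a 12.0a"))
```

Because every token must match the whole pattern, shell metacharacters or extra compiler flags in the variable raise `ValueError` before any command line is built.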

@aleozlx aleozlx added the run-ci label Mar 10, 2026
@yzh119
Collaborator

yzh119 commented Mar 10, 2026

It's weird to me that this change would break compilation; ideally it should only affect functionality relevant to PDL.

Let me double check.

@claude

claude bot commented Mar 10, 2026

Code Review

This is a partial revert of #2716, removing only the -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags while keeping -DCUTLASS_ENABLE_GDC_FOR_SM100=1. Here is my analysis.


Critical Concern: Potential NaN Bug Regression for SM120

The original PR #2716 explicitly called out a key detail in its root cause analysis:

Note: DCUTLASS_ENABLE_GDC_FOR_SM90 is needed because the SM120 CUTLASS kernel (sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp) guards launch_dependent_grids() with #ifdef CUTLASS_ENABLE_GDC_FOR_SM90 instead of SM100 (upstream CUTLASS bug).

If this is accurate, removing CUTLASS_ENABLE_GDC_FOR_SM90=1 from:

  • gen_gemm_sm120_module_cutlass_fp4()
  • gen_gemm_sm120_module()

…re-enables the race condition for SM120 kernels. These modules still have enablePDL=true at the host level (confirmed by fp4_gemm_template_sm120.h:188 and fp4_gemm_template_sm100.h:276), but the device-side griddepcontrol.wait barrier would again compile as a no-op, causing silent data corruption under concurrency.

CUTLASS_ENABLE_GDC_FOR_SM100=1 alone is insufficient for SM120 targets that use the SM90-guarded code path in CUTLASS.


Missing Root Cause Explanation

The PR states that #2716 "broke AOT packages" and links to a CI failure. However, it does not document:

  1. What specific failure occurred (compile error vs. runtime error)
  2. Which module(s) triggered the failure
  3. Whether the failure was on SM90-specific code paths or something else entirely

Without this context, it is difficult to assess whether removing the flag is the right fix or just a workaround that silently re-introduces the original correctness bug.


Suggested Targeted Fix

If the issue is that CUTLASS_ENABLE_GDC_FOR_SM90=1 causes problems only for SM100/SM103 modules (perhaps AOT packages are built on non-SM90 hardware where the flag triggers unexpected behavior), a more targeted fix would be:

  • Remove -DCUTLASS_ENABLE_GDC_FOR_SM90=1 from gen_gemm_sm100_module_cutlass_fp4(), gen_gemm_sm103_module_cutlass_fp4(), gen_gemm_sm100_module_cutlass_fp8(), and gen_gemm_sm100_module_cutlass_mxfp8() — where only SM100 GDC semantics matter
  • Keep -DCUTLASS_ENABLE_GDC_FOR_SM90=1 in gen_gemm_sm120_module_cutlass_fp4() and gen_gemm_sm120_module() — where the SM90-guarded CUTLASS code path is needed for correctness

This preserves the NaN fix for SM120 while addressing whatever AOT issue the SM90 flag introduced in SM100-only modules.


Minor Notes

  • gen_gemm_sm90_module() has no GDC flags at all. It's worth verifying whether the SM90 group GEMM kernels use PDL, and if so, whether they also need this flag.
  • The tgv_gemm module still has CUTLASS_ENABLE_GDC_FOR_SM100=1 and uses cutlass::arch::launch_dependent_grids() / wait_on_dependent_grids() directly via the TRT-LLM extension header — those inline PTX calls don't go through the CUTLASS_GDC_ENABLED macro, so they are unaffected by this change.

Summary

The concern is correctness, not style. Before merging, please confirm whether SM120 CUTLASS FP4 kernels (and gen_gemm_sm120_module()) still correctly execute PDL synchronization after this change. If they do not — i.e., griddepcontrol.wait is compiled out — the original high-concurrency NaN bug on RTX 5090 / RTX PRO 6000 class hardware is silently reintroduced by this PR.

AI-assisted review using Claude Code

This reverts commit 95992b1.
@aleozlx aleozlx removed the v0.6.6 label (release blocker label for 0.6.6) Mar 10, 2026
@claude

claude bot commented Mar 10, 2026

PR Review: Revert of #2716 (GDC Flags Fix)

Thank you for the quick action to unblock the 0.6.6 release. However, this revert deserves careful consideration before merging, as it re-introduces a known data correctness bug. Here is a detailed analysis:


What This Revert Does

This PR removes the -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compile flags from six GEMM JIT modules:

  • fp4_gemm_cutlass (SM100)
  • fp4_gemm_cutlass_sm103 (SM103)
  • fp4_gemm_cutlass_sm120 (SM120)
  • fp8_gemm_cutlass (SM100)
  • mxfp8_gemm_cutlass (SM100)
  • gemm_sm120 (SM120)

Concern 1: Re-introduces a Correctness Bug (High Severity)

The original fix (#2716) addressed a real race condition — without CUTLASS_GDC_ENABLED, the wait_on_dependent_grids() / launch_dependent_grids() barriers in CUTLASS grid_dependency_control.h compile as empty no-ops, while PDL launches still happen at the host level via cudaLaunchAttributeProgrammaticStreamSerialization. The result is silent NaN output under high concurrency, confirmed on SM120 hardware.

Reverting this fix means users with SM90/SM100/SM120 GPUs running concurrent workloads are vulnerable to returning incorrect (NaN/garbage) results from GEMM operations, with no error raised. This is a production correctness regression.

Concern 2: Inconsistency — tgv_gemm Keeps Its GDC Flag

After this revert, gen_tgv_gemm_sm10x_module() (line 641 in core.py) still retains -DCUTLASS_ENABLE_GDC_FOR_SM100=1:

extra_cuda_cflags=[
    "--expt-relaxed-constexpr",
    "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",  # Still present
]

This creates an inconsistency: tgv_gemm is protected against the race condition, but the FP4/FP8/MXFP8 GEMM modules are not. The fix in #2716 was symmetric by design.

Concern 3: Root Cause of AOT Failure Not Documented

The PR description states that #2716 "broke AOT packages" and links to a CI job, but does not explain why adding compile flags broke AOT builds. Understanding the root cause is important before choosing whether a full revert is the right approach, or whether a more targeted fix (e.g., a flag guarded by CUDA version, or fixing the AOT pipeline) is preferable.

Some possible root causes worth investigating:

  • Do the GDC flags require a minimum CUDA version that the AOT build environment doesn't meet?
  • Does the AOT path compile for SM versions where these flags are not supported?
  • Is there a linking or symbol conflict when building the pre-compiled packages?

Suggested Alternative Approach

Rather than a full revert, consider:

  1. Keep the JIT fix, fix the AOT pipeline separately — if the AOT build fails because of the flags, investigate and fix the AOT build (e.g., conditional inclusion based on CUDA version), rather than removing the fix from JIT users who have the bug today.

  2. Disable PDL for affected modules as a short-term workaround — this avoids removing the GDC flags while still providing a safe fallback for the release, without exposing users to silent NaN bugs.

  3. Add CUTLASS_ENABLE_GDC_FOR_SM90 conditionally for SM120 only — since the SM120 kernel uses this flag due to an upstream CUTLASS guard (#ifdef CUTLASS_ENABLE_GDC_FOR_SM90), at a minimum the SM120 modules should keep this flag.


Summary

| Issue | Assessment |
|---|---|
| Re-introduces NaN race condition under concurrency | High severity regression |
| Inconsistency with tgv_gemm keeping GDC flag | Medium; creates confusing asymmetry |
| Root cause of AOT failure unexplained | Needs investigation before merging |
| Release urgency | Understandable motivation, but correctness should take priority |

Recommendation: Before merging, please investigate and document why the AOT build fails with the GDC flags. If the AOT issue cannot be quickly resolved, consider disabling PDL rather than removing the synchronization barriers, to avoid shipping a release with a known silent correctness bug on SM90+ hardware.

If this revert is merged anyway, please open a follow-up issue to re-add the fix with an AOT-compatible approach, and document the known risk in the release notes.

@bkryu
Collaborator

bkryu commented Mar 11, 2026

/bot run

@flashinfer-bot
Collaborator

GitLab MR !399 has been created, and the CI pipeline #45837015 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45837015: 7/20 passed

@bkryu bkryu added the ready label Mar 11, 2026
@bkryu bkryu mentioned this pull request Mar 12, 2026
5 tasks
@bkryu bkryu enabled auto-merge (squash) March 12, 2026 00:43
@bkryu bkryu merged commit f07faf9 into main Mar 12, 2026
103 of 139 checks passed
@bkryu bkryu deleted the revert-2716-fix/sm120-fp4-pdl-gdc-flags branch March 12, 2026 01:34
voipmonitor added a commit to voipmonitor/flashinfer that referenced this pull request Mar 13, 2026
Re-applies the fix from flashinfer-ai#2716 (reverted in flashinfer-ai#2737) using only
-DCUTLASS_ENABLE_GDC_FOR_SM100=1, without -DCUTLASS_ENABLE_GDC_FOR_SM90=1.

The SM90 flag was the cause of the AOT build failure: it triggers a
direct #ifdef in sm90_gemm_tma_warpspecialized_cooperative.hpp (line 794)
that calls scheduler.is_last_tile() — but SM100+/SM120 schedulers
(PersistentTileSchedulerSm100StreamK) don't have that method.

The SM100 flag alone is sufficient because CUTLASS 4.2.1's
grid_dependency_control.h defines CUTLASS_GDC_ENABLED for the entire
SM100 family (SM100, SM101, SM103, SM120, SM121) when
CUTLASS_ENABLE_GDC_FOR_SM100 is set.
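
A minimal sketch of that rule, assuming the family membership stated in the paragraph above is correct. `gdc_cflags` and `SM100_FAMILY` are hypothetical names and do not exist in flashinfer.

```python
# Assumption taken from the commit message above: with CUTLASS 4.2.1 the single
# SM100 define covers these compute capabilities.
SM100_FAMILY = {(10, 0), (10, 1), (10, 3), (12, 0), (12, 1)}


def gdc_cflags(major, minor):
    """Hypothetical helper: GDC defines to pass for a given target."""
    if (major, minor) in SM100_FAMILY:
        # The SM90 define is deliberately left out; it caused the AOT failure.
        return ["-DCUTLASS_ENABLE_GDC_FOR_SM100=1"]
    return []


for target in [(9, 0), (10, 0), (12, 0), (12, 1)]:
    print(target, gdc_cflags(*target))
```
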

All affected GEMM kernels use enablePDL=true, so the device-side
GDC barriers (griddepcontrol.wait / griddepcontrol.launch_dependents)
must be compiled in — otherwise PDL enables host-side kernel overlap
but device-side synchronization is compiled out as no-ops, causing
a race condition (NaN/garbage in output tiles under concurrency).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@aleozlx aleozlx mentioned this pull request Mar 18, 2026
5 tasks
frankwang28 pushed a commit to frankwang28/flashinfer that referenced this pull request Mar 18, 2026
…g GDC flags cause PDL synchronization barriers to compile as no-ops" (flashinfer-ai#2737)

Proposing to revert flashinfer-ai#2716 in order to unblock
0.6.6 release

flashinfer-ai#2716 seems to have
broken AOT packages


https://github.com/flashinfer-ai/flashinfer/actions/runs/22870567870/job/66353637447?pr=2730


<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

* **Bug Fixes**
* Removed legacy GPU compilation flags related to GDC enablement for
certain GPU tiers during JIT GEMM generation, reducing extra compile
flags and build noise; GDC-related flags for the latest GPU tier remain
enabled where still applicable.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: yzh119 <zihaoy@nvidia.com>
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
aleozlx pushed a commit that referenced this pull request Mar 20, 2026
## Summary

Re-applies #2716 (reverted in #2737) with the fix for the AOT build
failure.

**Only `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`** is added. The
`-DCUTLASS_ENABLE_GDC_FOR_SM90=1` flag that broke AOT builds is
intentionally omitted.

## Why the original PR broke AOT

`sm90_gemm_tma_warpspecialized_cooperative.hpp:794` has a direct `#ifdef
CUTLASS_ENABLE_GDC_FOR_SM90` guard (not `CUTLASS_GDC_ENABLED`) that
calls `scheduler.is_last_tile()`. When compiling SM120 kernels with that
flag, the SM120 scheduler (`PersistentTileSchedulerSm100StreamK`)
doesn't have `is_last_tile()` → compilation error.
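
The mechanism can be mimicked in Python as a loose analogy. Both classes are stand-ins rather than CUTLASS types, `mainloop_tail` is hypothetical, and what is a compile-time error in C++ shows up here as a runtime `AttributeError`.

```python
class Sm90Scheduler:
    """Stand-in for a scheduler that provides is_last_tile()."""

    def is_last_tile(self):
        return True


class Sm100StreamKScheduler:
    """Stand-in for a scheduler without is_last_tile()."""


def mainloop_tail(scheduler, gdc_for_sm90):
    # Models a block guarded directly by the SM90 define rather than by the
    # architecture-aware CUTLASS_GDC_ENABLED macro.
    if gdc_for_sm90:
        return scheduler.is_last_tile()
    return None


print(mainloop_tail(Sm90Scheduler(), gdc_for_sm90=True))
print(mainloop_tail(Sm100StreamKScheduler(), gdc_for_sm90=False))
try:
    mainloop_tail(Sm100StreamKScheduler(), gdc_for_sm90=True)
except AttributeError as exc:
    print("fails:", exc)
```

Only the combination of the SM90 define with a scheduler lacking the method fails, which is consistent with the breakage appearing once the flag was added to SM120 modules.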

## Why SM100 flag alone is sufficient

CUTLASS 4.2.1 `grid_dependency_control.h` defines `CUTLASS_GDC_ENABLED`
for the entire SM100 family (SM100/101/103/120/121) when
`CUTLASS_ENABLE_GDC_FOR_SM100` is set. This enables
`griddepcontrol.wait` and `griddepcontrol.launch_dependents` device-side
barriers for all affected architectures.

## Why this is needed

All affected GEMM kernels hardcode `enablePDL=true`, which enables
host-side kernel overlap. Without the GDC compile flag, the device-side
synchronization barriers compile as no-ops → race condition →
NaN/garbage output tiles under concurrency.

## Affected modules

- `fp4_gemm_cutlass` (SM100)
- `fp4_gemm_cutlass_sm103` (SM103)
- `fp4_gemm_cutlass_sm120` (SM120)
- `fp8_gemm_cutlass` (SM100)
- `mxfp8_gemm_cutlass` (SM100)
- `gemm_sm120` (SM120 FP8 groupwise)

(`tgv_gemm` already had the SM100 flag.)

## Test plan

- [ ] AOT build with `FLASHINFER_CUDA_ARCH_LIST="12.1a"` (the exact
config that broke before)
- [ ] AOT build with full arch list `"7.5 8.0 8.9 9.0a 10.0a 12.0a"`
- [ ] FP4 GEMM correctness under concurrent streams on SM120

🤖 Generated with [Claude Code](https://claude.com/claude-code)

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->

## Summary by CodeRabbit

* **Chores**
* Updated CUDA compilation configurations for matrix multiplication
kernels across multiple data format variants (FP4, FP8, MXFP8, BF16)
supporting additional GPU architectures.

<!-- end of auto-generated comment: release notes by coderabbit.ai -->

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>

4 participants