Revert "fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops" by aleozlx · Pull Request #2737 · flashinfer-ai/flashinfer

aleozlx · 2026-03-10T01:32:02Z

Proposing to revert #2716 in order to unblock 0.6.6 release

#2716 seems to have broken AOT packages

https://github.com/flashinfer-ai/flashinfer/actions/runs/22870567870/job/66353637447?pr=2730

Summary by CodeRabbit

Bug Fixes
- Removed legacy GPU compilation flags related to GDC enablement for certain GPU tiers during JIT GEMM generation, reducing extra compile flags and build noise; GDC-related flags for the latest GPU tier remain enabled where still applicable.

…g GDC fl…" This reverts commit 4c4013b.

gemini-code-assist · 2026-03-10T01:32:16Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request reverts a previous change that introduced specific CUTLASS GDC compilation flags to several GEMM kernel generation functions. The original change, intended to fix NaN issues under concurrency, inadvertently broke Ahead-Of-Time (AOT) packages. By reverting these flags, the PR aims to restore the stability of AOT packages and unblock the ongoing 0.6.6 release.

Highlights

Reversion of CUTLASS GDC Flags: The pull request reverts the addition of -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compilation flags from various GEMM kernel generation functions.
Fix for AOT Package Breakage: This reversion addresses issues where the previously added flags caused breakage in Ahead-Of-Time (AOT) packages, as evidenced by failed CI runs.
Unblocking Release: The primary motivation for this change is to unblock the 0.6.6 release, which was hindered by the AOT package problems.

Changelog

flashinfer/jit/gemm/core.py
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_fp4.
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm103_module_cutlass_fp4.
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm120_module_cutlass_fp4.
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_fp8.
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm100_module_cutlass_mxfp8.
- Removed -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags from gen_gemm_sm120_module.

Activity

No human activity has been recorded on this pull request yet.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

coderabbitai · 2026-03-10T01:32:46Z

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info

⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 027233c6-974b-4991-aed0-51e66705bdf7

📥 Commits

Reviewing files that changed from the base of the PR and between 95992b1 and b7cca53.

📒 Files selected for processing (1)

flashinfer/jit/gemm/core.py

📝 Walkthrough

Walkthrough

Removed the -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compile flags from multiple GEMM JIT generator configurations in flashinfer/jit/gemm/core.py. No other functional or control-flow changes.

Changes

Cohort / File(s)	Summary
GEMM JIT GDC flag edits `flashinfer/jit/gemm/core.py`	Removed `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` and `-DCUTLASS_ENABLE_GDC_FOR_SM100=1` occurrences from `extra_cuda_cflags` across multiple `gen_gemm_*` generator variants (fp4, fp8, mxfp8, bf16, various SM variants). Note: SM100 GDC flag still exists in other contexts outside these generators. Review CUDA compile-flag sets and generator-specific flag lists.

Sequence Diagram(s)

(omitted)

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

fix(jit): GEMM kernels produce NaN under concurrency — missing GDC flags cause PDL synchronization barriers to compile as no-ops #2716: Modifies CUTLASS GDC compile flags in the same file — adds SM90/SM100 GDC flags while this PR removes SM90/SM100 occurrences.

Suggested reviewers

yzh119
jimmyzho
jiahanc
nv-yunzheq

Poem

🐰 I hopped through flags at break of dawn,
I nudged two flags and quietly moved on,
SM90 took a tiny leap,
SM100 keeps some things to keep,
Builds hum softer now — let carrots spawn! 🥕

🚥 Pre-merge checks | ✅ 1 | ❌ 2

❌ Failed checks (1 warning, 1 inconclusive)

Check name	Status	Explanation	Resolution
Docstring Coverage	⚠️ Warning	Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%.	Write docstrings for the functions missing them to satisfy the coverage threshold.
Description check	❓ Inconclusive	The PR description provides context for the revert (unblocking 0.6.6 release) and references a breaking issue in PR `#2716`, but is missing key template sections like checklist confirmations and detailed explanation of what the revert accomplishes.	Complete the PR template by adding sections for pre-commit checks, test status, and a more detailed description of how reverting this commit resolves the AOT package breakage.

✅ Passed checks (1 passed)

Check name	Status	Explanation
Title check	✅ Passed	The title clearly and specifically describes the main change: reverting a prior fix commit related to GEMM kernels and GDC flags.


claude · 2026-03-10T01:33:58Z

Code Review: Revert of GDC Flags (#2737)

Context: This PR reverts #2716 to unblock the v0.6.6 release. The change itself is minimal and mechanically correct — it removes -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 from 6 JIT module generators in flashinfer/jit/gemm/core.py.

Critical Concern: Restoring a Known Race Condition

The core trade-off here deserves explicit acknowledgment before merging.

PR #2716 fixed a real, verified, production-impacting bug:

CUTLASS FP4/FP8 GEMM kernels use enablePDL=true but without GDC flags, wait_on_dependent_grids() and launch_dependent_grids() in cutlass/arch/grid_dependency_control.h compile as empty no-ops
This creates a device-side race condition where output tiles are silently not written (NaN/garbage data)
Verified on SM120 (RTX 5090 / RTX PRO 6000) with 64+ concurrent requests in SGLang with TP=8
Confirmed to be a race condition via CUDA_LAUNCH_BLOCKING=1 eliminating the bug

This revert restores that race condition for SM90 and SM100/SM103/SM120 users. This is a data-correctness issue, not just a performance one.

Missing Information

The PR description states the flags "seem to have broken AOT packages" with a link to a CI job. Before merging, it would be valuable to understand:

What exactly failed in the AOT build? A compile error, link error, or test failure? The GDC flags are standard preprocessor defines — it's not immediately obvious why they would break AOT builds but not JIT builds.
Is the AOT failure specific to certain SM versions or all targets? If it only affects certain targets (e.g., SM90 on older CUDA), a more targeted fix might be possible (e.g., conditionally apply the flag based on CUDA version).
Was there any investigation of why the flags caused a build failure? Understanding the root cause would enable a fix that keeps the correctness benefit without the AOT breakage.

Inconsistency After This Revert

Note that gen_tgv_gemm_sm10x_module() (line 641) will still have -DCUTLASS_ENABLE_GDC_FOR_SM100=1 after this revert — that flag was already there before PR #2716. This creates an asymmetry where tgv_gemm modules get correct PDL synchronization but the FP4/FP8/MXFP8 GEMM modules do not.
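
The asymmetry described above can be checked mechanically. The following is a minimal sketch, not code from the repository: `modules_missing_flag` is a hypothetical helper, and the `post_revert` table is an illustrative stand-in for the real `extra_cuda_cflags` lists (the empty lists are placeholders, since the other flags of those modules are not shown in this thread).

```python
# Hypothetical audit helper; names and data are illustrative assumptions.
GDC_SM100_FLAG = "-DCUTLASS_ENABLE_GDC_FOR_SM100=1"


def modules_missing_flag(
    module_flags: dict[str, list[str]], flag: str = GDC_SM100_FLAG
) -> list[str]:
    """Return the sorted names of modules whose flag list lacks `flag`."""
    return sorted(name for name, flags in module_flags.items() if flag not in flags)


# Stand-in for the state after this revert: only tgv_gemm keeps the flag.
post_revert = {
    "tgv_gemm": ["--expt-relaxed-constexpr", GDC_SM100_FLAG],
    "fp4_gemm_cutlass": [],        # placeholder flag lists
    "fp4_gemm_cutlass_sm103": [],
    "fp4_gemm_cutlass_sm120": [],
    "fp8_gemm_cutlass": [],
    "mxfp8_gemm_cutlass": [],
    "gemm_sm120": [],
}

for name in modules_missing_flag(post_revert):
    print(f"{name}: no GDC flag")
```

A check of this shape could run in CI so that a future revert or refactor cannot silently drop the flag from a subset of generators.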

Recommendation

If the 0.6.6 release is blocked and this revert is necessary to ship:

Document the known regression in the release notes — users running SM120/SM100 hardware under high concurrency with TP should be aware of the NaN risk.
Add a follow-up issue immediately to re-add the GDC flags with a proper investigation of the AOT build failure.
Consider a targeted workaround: if PDL can be disabled at the kernel launch level for the affected modules (without the GDC flags), that would avoid both the compile failure and the race condition — though this may require kernel-level changes.
Consider keeping the flag only for SM120 modules (gen_gemm_sm120_module_cutlass_fp4, gen_gemm_sm120_module) since those were the hardware where the bug was most clearly observed, if the AOT build failure was only on SM90/SM100 targets.

Summary

Aspect	Assessment
Code correctness	The revert is mechanically correct
Safety	Restores a known race condition (silent NaN outputs)
Root cause documented	Not in this PR — missing analysis of AOT failure
Justification	Unblocking release is a valid operational reason
Mitigation needed	Yes — follow-up issue + release note warning

gemini-code-assist

Code Review

This pull request reverts a previous fix for a concurrency issue in GEMM kernels that could lead to NaN results, aiming to unblock the 0.6.6 release by resolving a build failure in AOT packages. However, a security audit identified several high-severity issues in the codebase, including Command Injection (e.g., FLASHINFER_CUDA_ARCH_LIST, nvcc_flags, CXX, FLASHINFER_NVCC_THREADS used without sanitization in flashinfer/jit/gemm/core.py), Path Traversal (artifact download in flashinfer/artifacts.py and flashinfer/jit/cubin_loader.py using unsanitized filenames), and Path Traversal via Workspace Directory (FLASHINFER_CUDA_ARCH_LIST used unsanitized for workspace path in flashinfer/jit/env.py). While the revert is a valid short-term solution, it re-introduces a known correctness bug. It is critical to address the AOT build issue and re-instate GDC flags for GEMM operations. Additionally, strict validation for all environment variables and sanitization of all filenames used in path construction are strongly recommended to mitigate the identified security vulnerabilities.

flashinfer/jit/gemm/core.py (94-95)

Removing these GDC (Grid Dependency Control) flags re-introduces a known concurrency bug that can cause GEMM kernels to produce NaN values. The original fix ensured that PDL (Programmatic Dependent Launch) synchronization barriers are not compiled as no-ops. While I understand this revert is necessary to unblock the release, please ensure a high-priority task is created to investigate the AOT build failure and re-apply this fix. The correctness of concurrent GEMM operations depends on it. This feedback applies to all similar removals of these flags in this file.

flashinfer/jit/gemm/core.py (519)

The nvcc_flags variable used here is constructed from the FLASHINFER_CUDA_ARCH_LIST environment variable without sufficient sanitization. An attacker who can control this environment variable can inject arbitrary shell commands or malicious compiler flags (e.g., by including shell metacharacters in the version string), leading to potential Remote Code Execution (RCE) when the JIT compilation is triggered. While this usage is not new, it is part of the modified code in this PR. It is recommended to validate the environment variable in flashinfer/compilation_context.py to ensure it only contains valid numeric versions.
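
The validation recommended above could look roughly like the sketch below. It is an assumption-laden illustration, not the project's code: `parse_arch_list` is a hypothetical helper, and the accepted token shape (`major.minor` with an optional one-letter suffix) is inferred only from the arch strings quoted elsewhere in this thread, such as `9.0a` and `12.1a`.

```python
import re

# Assumed token shape: "<major>.<minor>" plus an optional one-letter suffix.
_ARCH_TOKEN = re.compile(r"\d{1,2}\.\d[a-z]?")


def parse_arch_list(raw: str) -> list[str]:
    """Split a space-separated arch list, rejecting anything that is not a plain version."""
    tokens = raw.split()
    if not tokens:
        raise ValueError("arch list is empty")
    for token in tokens:
        # fullmatch rejects shell metacharacters and stray compiler flags.
        if _ARCH_TOKEN.fullmatch(token) is None:
            raise ValueError(f"invalid CUDA arch entry: {token!r}")
    return tokens


print(parse_arch_list("7.5 8.0 8.9 9.0a 10.0a 12.0a"))
```

Rejecting the whole list on the first bad token, rather than dropping it, keeps a malformed environment variable from silently changing the set of compiled targets.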

yzh119 · 2026-03-10T03:09:05Z

It's weird to me that this change would break compilation; ideally it should only affect functionality relevant to PDL.

Let me double check.

claude · 2026-03-10T22:10:20Z

Code Review

This is a partial revert of #2716, removing only the -DCUTLASS_ENABLE_GDC_FOR_SM90=1 flags while keeping -DCUTLASS_ENABLE_GDC_FOR_SM100=1. Here is my analysis.

Critical Concern: Potential NaN Bug Regression for SM120

The original PR #2716 explicitly called out a key detail in its root cause analysis:

Note: -DCUTLASS_ENABLE_GDC_FOR_SM90 is needed because the SM120 CUTLASS kernel (sm120_gemm_tma_warpspecialized_cooperative_asymmetric_dma.hpp) guards launch_dependent_grids() with #ifdef CUTLASS_ENABLE_GDC_FOR_SM90 instead of SM100 (upstream CUTLASS bug).

If this is accurate, removing CUTLASS_ENABLE_GDC_FOR_SM90=1 from:

gen_gemm_sm120_module_cutlass_fp4()
gen_gemm_sm120_module()

…re-enables the race condition for SM120 kernels. These modules still have enablePDL=true at the host level (confirmed by fp4_gemm_template_sm120.h:188 and fp4_gemm_template_sm100.h:276), but the device-side griddepcontrol.wait barrier would again compile as a no-op, causing silent data corruption under concurrency.

CUTLASS_ENABLE_GDC_FOR_SM100=1 alone is insufficient for SM120 targets that use the SM90-guarded code path in CUTLASS.

Missing Root Cause Explanation

The PR states that #2716 "broke AOT packages" and links to a CI failure. However, it does not document:

What specific failure occurred (compile error vs. runtime error)
Which module(s) triggered the failure
Whether the failure was on SM90-specific code paths or something else entirely

Without this context, it is difficult to assess whether removing the flag is the right fix or just a workaround that silently re-introduces the original correctness bug.

Suggested Targeted Fix

If the issue is that CUTLASS_ENABLE_GDC_FOR_SM90=1 causes problems only for SM100/SM103 modules (perhaps AOT packages are built on non-SM90 hardware where the flag triggers unexpected behavior), a more targeted fix would be:

Remove -DCUTLASS_ENABLE_GDC_FOR_SM90=1 from gen_gemm_sm100_module_cutlass_fp4(), gen_gemm_sm103_module_cutlass_fp4(), gen_gemm_sm100_module_cutlass_fp8(), and gen_gemm_sm100_module_cutlass_mxfp8() — where only SM100 GDC semantics matter
Keep -DCUTLASS_ENABLE_GDC_FOR_SM90=1 in gen_gemm_sm120_module_cutlass_fp4() and gen_gemm_sm120_module() — where the SM90-guarded CUTLASS code path is needed for correctness

This preserves the NaN fix for SM120 while addressing whatever AOT issue the SM90 flag introduced in SM100-only modules.

Minor Notes

gen_gemm_sm90_module() has no GDC flags at all. It's worth verifying whether the SM90 group GEMM kernels use PDL, and if so, whether they also need this flag.
The tgv_gemm module still has CUTLASS_ENABLE_GDC_FOR_SM100=1 and uses cutlass::arch::launch_dependent_grids() / wait_on_dependent_grids() directly via the TRT-LLM extension header — those inline PTX calls don't go through the CUTLASS_GDC_ENABLED macro, so they are unaffected by this change.

Summary

The concern is correctness, not style. Before merging, please confirm whether SM120 CUTLASS FP4 kernels (and gen_gemm_sm120_module()) still correctly execute PDL synchronization after this change. If they do not — i.e., griddepcontrol.wait is compiled out — the original high-concurrency NaN bug on RTX 5090 / RTX PRO 6000 class hardware is silently reintroduced by this PR.

AI-assisted review using Claude Code

This reverts commit 95992b1.

claude · 2026-03-10T23:58:51Z

PR Review: Revert of #2716 (GDC Flags Fix)

Thank you for the quick action to unblock the 0.6.6 release. However, this revert deserves careful consideration before merging, as it re-introduces a known data correctness bug. Here is a detailed analysis:

What This Revert Does

This PR removes the -DCUTLASS_ENABLE_GDC_FOR_SM100=1 and -DCUTLASS_ENABLE_GDC_FOR_SM90=1 compile flags from six GEMM JIT modules:

fp4_gemm_cutlass (SM100)
fp4_gemm_cutlass_sm103 (SM103)
fp4_gemm_cutlass_sm120 (SM120)
fp8_gemm_cutlass (SM100)
mxfp8_gemm_cutlass (SM100)
gemm_sm120 (SM120)

Concern 1: Re-introduces a Correctness Bug (High Severity)

The original fix (#2716) addressed a real race condition — without CUTLASS_GDC_ENABLED, the wait_on_dependent_grids() / launch_dependent_grids() barriers in CUTLASS grid_dependency_control.h compile as empty no-ops, while PDL launches still happen at the host level via cudaLaunchAttributeProgrammaticStreamSerialization. The result is silent NaN output under high concurrency, confirmed on SM120 hardware.

Reverting this fix means users with SM90/SM100/SM120 GPUs running concurrent workloads are vulnerable to returning incorrect (NaN/garbage) results from GEMM operations, with no error raised. This is a production correctness regression.

Concern 2: Inconsistency — `tgv_gemm` Keeps Its GDC Flag

After this revert, gen_tgv_gemm_sm10x_module() (line 641 in core.py) still retains -DCUTLASS_ENABLE_GDC_FOR_SM100=1:

extra_cuda_cflags=[
    "--expt-relaxed-constexpr",
    "-DCUTLASS_ENABLE_GDC_FOR_SM100=1",  # Still present
]

This creates an inconsistency: tgv_gemm is protected against the race condition, but the FP4/FP8/MXFP8 GEMM modules are not. The fix in #2716 was symmetric by design.

Concern 3: Root Cause of AOT Failure Not Documented

The PR description states that #2716 "broke AOT packages" and links to a CI job, but does not explain why adding compile flags broke AOT builds. Understanding the root cause is important before choosing whether a full revert is the right approach, or whether a more targeted fix (e.g., a flag guarded by CUDA version, or fixing the AOT pipeline) is preferable.

Some possible root causes worth investigating:

Do the GDC flags require a minimum CUDA version that the AOT build environment doesn't meet?
Does the AOT path compile for SM versions where these flags are not supported?
Is there a linking or symbol conflict when building the pre-compiled packages?

Suggested Alternative Approach

Rather than a full revert, consider:

Keep the JIT fix, fix the AOT pipeline separately — if the AOT build fails because of the flags, investigate and fix the AOT build (e.g., conditional inclusion based on CUDA version), rather than removing the fix from JIT users who have the bug today.
Disable PDL for affected modules as a short-term workaround — this avoids removing the GDC flags while still providing a safe fallback for the release, without exposing users to silent NaN bugs.
Add CUTLASS_ENABLE_GDC_FOR_SM90 conditionally for SM120 only — since the SM120 kernel uses this flag due to an upstream CUTLASS guard (#ifdef CUTLASS_ENABLE_GDC_FOR_SM90), at a minimum the SM120 modules should keep this flag.
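
The "disable PDL" fallback in the list above amounts to tying the host-side PDL switch to the presence of the compile-time GDC define. A minimal sketch of that idea follows. It is hypothetical: `effective_enable_pdl` does not exist in the codebase, the thread says `enablePDL=true` is currently hardcoded in the C++ templates, and which define a given kernel actually needs is an assumption passed in by the caller.

```python
# Hypothetical gate; the default define is an assumption, not a confirmed requirement.
DEFAULT_GDC_DEFINE = "-DCUTLASS_ENABLE_GDC_FOR_SM100=1"


def effective_enable_pdl(
    requested: bool,
    extra_cuda_cflags: list[str],
    required_define: str = DEFAULT_GDC_DEFINE,
) -> bool:
    """Allow PDL only if the device-side barriers were compiled in.

    Without the define, the barriers are no-ops, so overlapping kernels
    on the host side would race; fall back to ordinary serialization.
    """
    return requested and required_define in extra_cuda_cflags


print(effective_enable_pdl(True, ["--expt-relaxed-constexpr"]))
```

The trade-off is performance for safety: kernels built without the define lose launch overlap but cannot read tiles that a predecessor has not finished writing.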

Summary

Issue	Assessment
Re-introduces NaN race condition under concurrency	High severity regression
Inconsistency with `tgv_gemm` keeping GDC flag	Medium — creates confusing asymmetry
Root cause of AOT failure unexplained	Needs investigation before merging
Release urgency	Understandable motivation, but correctness should take priority

Recommendation: Before merging, please investigate and document why the AOT build fails with the GDC flags. If the AOT issue cannot be quickly resolved, consider disabling PDL rather than removing the synchronization barriers, to avoid shipping a release with a known silent correctness bug on SM90+ hardware.

If this revert is merged anyway, please open a follow-up issue to re-add the fix with an AOT-compatible approach, and document the known risk in the release notes.

bkryu · 2026-03-11T00:01:43Z

/bot run

flashinfer-bot · 2026-03-11T00:02:40Z

GitLab MR !399 has been created, and the CI pipeline #45837015 is currently running. I'll report back once the pipeline job completes.

flashinfer-bot · 2026-03-11T12:07:10Z

[FAILED] Pipeline #45837015: 7/20 passed

Re-applies the fix from flashinfer-ai#2716 (reverted in flashinfer-ai#2737) using only -DCUTLASS_ENABLE_GDC_FOR_SM100=1, without -DCUTLASS_ENABLE_GDC_FOR_SM90=1. The SM90 flag was the cause of the AOT build failure: it triggers a direct #ifdef in sm90_gemm_tma_warpspecialized_cooperative.hpp (line 794) that calls scheduler.is_last_tile() — but SM100+/SM120 schedulers (PersistentTileSchedulerSm100StreamK) don't have that method. The SM100 flag alone is sufficient because CUTLASS 4.2.1's grid_dependency_control.h defines CUTLASS_GDC_ENABLED for the entire SM100 family (SM100, SM101, SM103, SM120, SM121) when CUTLASS_ENABLE_GDC_FOR_SM100 is set. All affected GEMM kernels use enablePDL=true, so the device-side GDC barriers (griddepcontrol.wait / griddepcontrol.launch_dependents) must be compiled in — otherwise PDL enables host-side kernel overlap but device-side synchronization is compiled out as no-ops, causing a race condition (NaN/garbage in output tiles under concurrency). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…g GDC flags cause PDL synchronization barriers to compile as no-ops" (flashinfer-ai#2737) Proposing to revert flashinfer-ai#2716 in order to unblock 0.6.6 release flashinfer-ai#2716 seems to have broken AOT packages https://github.com/flashinfer-ai/flashinfer/actions/runs/22870567870/job/66353637447?pr=2730  ## Summary by CodeRabbit * **Bug Fixes** * Removed legacy GPU compilation flags related to GDC enablement for certain GPU tiers during JIT GEMM generation, reducing extra compile flags and build noise; GDC-related flags for the latest GPU tier remain enabled where still applicable.  --------- Co-authored-by: yzh119 <zihaoy@nvidia.com>

## Summary

Re-applies #2716 (reverted in #2737) with the fix for the AOT build failure. **Only `-DCUTLASS_ENABLE_GDC_FOR_SM100=1`** is added. The `-DCUTLASS_ENABLE_GDC_FOR_SM90=1` flag that broke AOT builds is intentionally omitted.

## Why the original PR broke AOT

`sm90_gemm_tma_warpspecialized_cooperative.hpp:794` has a direct `#ifdef CUTLASS_ENABLE_GDC_FOR_SM90` guard (not `CUTLASS_GDC_ENABLED`) that calls `scheduler.is_last_tile()`. When compiling SM120 kernels with that flag, the SM120 scheduler (`PersistentTileSchedulerSm100StreamK`) doesn't have `is_last_tile()` → compilation error.

## Why SM100 flag alone is sufficient

CUTLASS 4.2.1 `grid_dependency_control.h` defines `CUTLASS_GDC_ENABLED` for the entire SM100 family (SM100/101/103/120/121) when `CUTLASS_ENABLE_GDC_FOR_SM100` is set. This enables `griddepcontrol.wait` and `griddepcontrol.launch_dependents` device-side barriers for all affected architectures.

## Why this is needed

All affected GEMM kernels hardcode `enablePDL=true`, which enables host-side kernel overlap. Without the GDC compile flag, the device-side synchronization barriers compile as no-ops → race condition → NaN/garbage output tiles under concurrency.

## Affected modules

- `fp4_gemm_cutlass` (SM100)
- `fp4_gemm_cutlass_sm103` (SM103)
- `fp4_gemm_cutlass_sm120` (SM120)
- `fp8_gemm_cutlass` (SM100)
- `mxfp8_gemm_cutlass` (SM100)
- `gemm_sm120` (SM120 FP8 groupwise)

(`tgv_gemm` already had the SM100 flag.)

## Test plan

- [ ] AOT build with `FLASHINFER_CUDA_ARCH_LIST="12.1a"` (the exact config that broke before)
- [ ] AOT build with full arch list `"7.5 8.0 8.9 9.0a 10.0a 12.0a"`
- [ ] FP4 GEMM correctness under concurrent streams on SM120

🤖 Generated with [Claude Code](https://claude.com/claude-code)

## Summary by CodeRabbit

* **Chores**
  * Updated CUDA compilation configurations for matrix multiplication kernels across multiple data format variants (FP4, FP8, MXFP8, BF16) supporting additional GPU architectures.

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
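
The reasoning in the follow-up description above can be restated as a small truth-table model. This is only a sketch of the claims made in this thread, not a transcription of the CUTLASS headers: both function names are hypothetical, the SM100 family set is copied from the description, and the real preprocessor conditions may include further checks (CUDA version, `__CUDA_ARCH__` features) that are not modeled here.

```python
# Model of the thread's claims only; real CUTLASS guards may differ.
SM100_FAMILY = frozenset({100, 101, 103, 120, 121})


def gdc_barriers_compiled_in(defines: set[str], sm: int) -> bool:
    """True if, per the description, CUTLASS_GDC_ENABLED would be defined for `sm`."""
    return "CUTLASS_ENABLE_GDC_FOR_SM100" in defines and sm in SM100_FAMILY


def sm90_flag_breaks_build(defines: set[str], scheduler_has_is_last_tile: bool) -> bool:
    """True if the direct SM90 #ifdef would call a method the scheduler lacks."""
    return "CUTLASS_ENABLE_GDC_FOR_SM90" in defines and not scheduler_has_is_last_tile


original_2716 = {"CUTLASS_ENABLE_GDC_FOR_SM100", "CUTLASS_ENABLE_GDC_FOR_SM90"}
follow_up = {"CUTLASS_ENABLE_GDC_FOR_SM100"}

for label, defines in [("#2716", original_2716), ("follow-up", follow_up)]:
    print(
        label,
        "barriers on SM120:", gdc_barriers_compiled_in(defines, 120),
        "| build breaks:", sm90_flag_breaks_build(defines, scheduler_has_is_last_tile=False),
    )
```

Under this model the follow-up keeps the barriers for SM120 while avoiding the compile error, which matches the stated rationale for dropping only the SM90 flag.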

Revert "fix(jit): GEMM kernels produce NaN under concurrency — missin…

a5e3c16

…g GDC fl…" This reverts commit 4c4013b.

aleozlx added the v0.6.6 label (release blocker for 0.6.6) Mar 10, 2026

aleozlx requested review from jiahanc, jimmyzho, nv-yunzheq and yzh119 as code owners March 10, 2026 01:32

gemini-code-assist bot reviewed Mar 10, 2026

aleozlx added the run-ci label Mar 10, 2026

fix

95992b1

Revert "fix"

b7cca53

This reverts commit 95992b1.

aleozlx removed the v0.6.6 label (release blocker for 0.6.6) Mar 10, 2026

bkryu approved these changes Mar 11, 2026

bkryu added the ready label Mar 11, 2026

yzh119 approved these changes Mar 11, 2026

bkryu mentioned this pull request Mar 12, 2026

Added missing padding #2726

Merged

5 tasks

bkryu enabled auto-merge (squash) March 12, 2026 00:43

bkryu merged commit f07faf9 into main Mar 12, 2026
103 of 139 checks passed

bkryu deleted the revert-2716-fix/sm120-fp4-pdl-gdc-flags branch March 12, 2026 01:34

voipmonitor mentioned this pull request Mar 13, 2026

fix(jit): enable GDC for CUTLASS GEMM PDL — SM100 flag only #2780

Merged

3 tasks

aleozlx mentioned this pull request Mar 18, 2026

Upgrade cutlass 4.2.1 -> 4.4.2 #2798

Merged

5 tasks

4 participants
