
Cutlass dsl 4.5 bump #3246

Merged
kahyunnam merged 11 commits into flashinfer-ai:main from kahyunnam:knam/cutlass-dsl-4.5-bump on May 7, 2026

Conversation

@kahyunnam (Member) commented May 6, 2026

📌 Description

Bump cutlass-dsl to 4.5 so sm121 is supported in BlockScaledMmaOp

🔍 Related Issues

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores

    • Updated packaging/install flow to upgrade Python tooling early, added preparation steps for CUDA/cu13 runtimes, and bumped cutlass dependency to >=4.5.0; minor project metadata ordering adjusted.
  • Tests

    • Broadened GPU gating to include SM120/SM121 and added CUDA‑13+ detection for gated tests.
    • Test harness now installs branch-specific Python requirements during setup.

kahyunnam and others added 3 commits May 6, 2026 17:16
CUTLASS DSL 4.5 adds sm_121a to MmaSM120BlockScaledOp.admissible_archs,
fixing FP4 block-scaled MMA on DGX Spark (SM121) without a monkey-patch.
Also update the cute-dsl test skip condition to run on SM120/SM121.

Co-authored-by: Cursor <cursoragent@cursor.com>
Update requirements.txt and Docker install script to match the
pyproject.toml bump so CI picks up the new version.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai (Bot, Contributor) commented May 6, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough

Bumps nvidia-cutlass-dsl to >=4.5.0 in installer, requirements, and pyproject optionals; upgrades pip/setuptools early in the installer and adds cu13 prep steps; installs branch requirements during test setup; relaxes SM12x test gating to accept SM120/SM121 and adds CUDA‑13 detection for relevant tests.

Changes

Dependency, Packaging, and Test Gating

  • Installer prep (docker/install/install_python_packages.sh): Upgrades setuptools and pip early; in the cu13 path upgrades cuda-python and nvidia-cudnn-cu13; updates the nvidia-cutlass-dsl[cu13] constraint to >=4.5.0.
  • Dependency constraints (requirements.txt): nvidia-cutlass-dsl constraint bumped from >=4.4.2 to >=4.5.0.
  • Packaging metadata / optionals (pyproject.toml): Reordered dynamic fields and updated [project.optional-dependencies] entries for cu12 and cu13 to require >=4.5.0.
  • Test utils wiring (scripts/test_utils.sh): install_and_verify now runs pip install -r requirements.txt from the branch after install_precompiled_kernels and before the editable local install.
  • FP4 GEMM test gating (tests/gemm/test_mm_fp4.py): Relaxed the SM12x compute-capability check to accept any 12.x device (SM120/SM121) rather than only 12.0; updated the skip message.
  • MoE SM12x test gating (tests/moe/test_b12x_fused_moe.py): Added a _cuda_13_or_newer() helper; removed _is_sm121() and the @not_sm121 skip; replaced them with composite decorators requiring CuteDSL availability, SM120/SM121 support, and CUDA 13+ across the affected tests.
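The broadened gating described above can be sketched as two small predicates. This is an illustrative sketch, not the PR's exact code: the helpers here take the capability tuple and CUDA version string as parameters so they run without a GPU, whereas in the tests those values would come from torch.cuda.get_device_capability() and torch.version.cuda.

```python
# Hedged sketch of the broadened test gating (names and signatures are
# illustrative; see tests/gemm/test_mm_fp4.py and
# tests/moe/test_b12x_fused_moe.py for the actual helpers).

def is_sm12x(capability):
    """True for any 12.x device (SM120 or SM121), not just 12.0."""
    major, _minor = capability
    return major == 12

def cuda_13_or_newer(cuda_version):
    """True when the CUDA runtime version string is 13 or newer."""
    if cuda_version is None:  # e.g. a CPU-only torch build
        return False
    return int(cuda_version.split(".")[0]) >= 13
```

Parameterizing on plain values rather than calling into torch directly keeps the predicates unit-testable on machines without a 12.x GPU.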

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

Suggested labels

cute-dsl

Suggested reviewers

  • sricketts
  • yzh119
  • cyx-6
  • jiahanc
  • jimmyzho
  • nv-yunzheq
  • samuellees

Poem

I’m a rabbit in the CI tree,
I nudge cutlass up to four point five,
pip and setuptools brushed and bright,
SM120 and SM121 say hi,
installs, tests, and builds—hop, delight. 🐇✨

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)
  • Docstring Coverage (⚠️ Warning): Docstring coverage is 50.00%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (4 passed)
  • Title check: The title 'Cutlass dsl 4.5 bump' clearly and directly describes the main change: upgrading the nvidia-cutlass-dsl dependency to version 4.5.
  • Description check: The PR description includes a clear explanation of the changes (bumping cutlass-dsl to 4.5 for SM121 support) and completes all checklist items, though the 'Related Issues' section is left empty.
  • Linked Issues check: Skipped because no linked issues were found for this pull request.
  • Out of Scope Changes check: Skipped because no linked issues were found for this pull request.


@gemini-code-assist (Bot, Contributor) left a comment


Code Review

This pull request updates the nvidia-cutlass-dsl dependency to version 4.5.0 across the project and expands the test suite to include SM12.x architectures for the cute-dsl backend. However, the changes to the test file are incomplete because the underlying requirement function in flashinfer/gemm/gemm_base.py still restricts the backend to SM10.x architectures, which will cause tests to be skipped on SM12.x devices despite the updates in this PR.

I am having trouble creating individual review comments; my feedback is below.

tests/gemm/test_mm_fp4.py (lines 35-36) — severity: high

This change to allow testing cute-dsl on SM12.x architectures appears to be incomplete.

The check mm_fp4.is_backend_supported('cute-dsl', ...) at lines 22-25 will return False on SM12.x devices. This is because the underlying requirement function, _cute_dsl_gemm_fp4_requirement in flashinfer/gemm/gemm_base.py, is still restricted to SM10.x architectures via the @supported_compute_capability([100, 103]) decorator.

As a result, the test will be skipped before this new condition is evaluated, rendering this change ineffective for enabling tests on SM12.x.

To complete this change, the @supported_compute_capability decorator for _cute_dsl_gemm_fp4_requirement should be updated to include the new architectures (e.g., [100, 103, 120, 121]).

kahyunnam and others added 2 commits May 6, 2026 17:48
The CI test runner uses `pip install -e . --no-deps`, which skips
dependency resolution. Add an explicit `pip install -r requirements.txt`
before the editable install so that dependency bumps on the branch
(e.g. nvidia-cutlass-dsl) are picked up without rebuilding the Docker image.

Co-authored-by: Cursor <cursoragent@cursor.com>
nvidia-cutlass-dsl 4.5.0 adds sm_121a to MmaSM120BlockScaledOp.admissible_archs,
enabling warp-level MMA on SM121 (DGX Spark). Remove the SM121 test skips that
were guarding against the missing arch in 4.4.2.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kahyunnam kahyunnam added the v0.6.11 label (release blocker for 0.6.11) May 6, 2026
@kahyunnam kahyunnam self-assigned this May 6, 2026
@kahyunnam (Member, Author) commented:

/bot run

1 similar comment
@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been created, and the CI pipeline #50455625 is currently running. I'll report back once the pipeline job completes.

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been created, and the CI pipeline #50455636 is currently running. I'll report back once the pipeline job completes.

@aleozlx aleozlx added the run-ci label May 6, 2026
The base Docker image may ship older setuptools that can't properly
build the package. This matches the build-system requirement already
declared in pyproject.toml.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kahyunnam kahyunnam force-pushed the knam/cutlass-dsl-4.5-bump branch from 419ad00 to 4fa728c on May 6, 2026 18:18
@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been updated with latest changes, and the CI pipeline #50458106 is currently running. I'll report back once the pipeline job completes.

@coderabbitai (Bot, Contributor) left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@pyproject.toml`:
- Around line 23-27: The pyproject.toml is missing a PEP 621-compliant version
declaration in the [project] table which breaks metadata generation; update the
[project] table to include a concrete version entry (e.g., version = "0.0.0") or
add dynamic = ["version", ...] in [project] to match what
[tool.setuptools.dynamic] provides so pip can build metadata; edit the [project]
section in pyproject.toml (and ensure consistency with
[tool.setuptools.dynamic]) to restore the version metadata required for pip
install -e ..

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: c3e855bd-bcb1-4655-9509-4c2fe62e9f9c

📥 Commits

Reviewing files that changed from the base of the PR and between d885b71 and 4fa728c.

📒 Files selected for processing (6)
  • docker/install/install_python_packages.sh
  • pyproject.toml
  • requirements.txt
  • scripts/test_utils.sh
  • tests/gemm/test_mm_fp4.py
  • tests/moe/test_b12x_fused_moe.py
💤 Files with no reviewable changes (1)
  • tests/moe/test_b12x_fused_moe.py

Comment thread: pyproject.toml
setuptools 82.x has a regression with dynamic version fields and
custom build backends, and torch 2.11.0+cu129 requires setuptools<82.
Pin to >=77,<82 in both build-system requires and Docker setup.

Co-authored-by: Cursor <cursoragent@cursor.com>
@coderabbitai (Bot, Contributor) left a comment


♻️ Duplicate comments (1)
pyproject.toml (1)

15-24: ⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

Restore [project].dynamic for version and dependencies to keep metadata generation valid.

[tool.setuptools.dynamic] is configured, but [project] does not declare those fields as dynamic. That can break pip install -e . / metadata builds.

🔧 Minimal fix
 [project]
 name = "flashinfer-python"
 description = "FlashInfer: Kernel Library for LLM Serving"
 requires-python = ">=3.10,<4.0"
 authors = [{ name = "FlashInfer team" }]
 license = "Apache-2.0"
 readme = "README.md"
 urls = { Homepage = "https://github.com/flashinfer-ai/flashinfer" }
 license-files = ["LICENSE", "LICENSE*.txt"]
+dynamic = ["version", "dependencies"]
#!/bin/bash
python - <<'PY'
import sys, tomllib
with open("pyproject.toml", "rb") as f:
    data = tomllib.load(f)

project = data.get("project", {})
dynamic = set(project.get("dynamic", []))
missing = []
if "version" not in project and "version" not in dynamic:
    missing.append("version")
if "dependencies" not in project and "dependencies" not in dynamic:
    missing.append("dependencies")

if missing:
    print("FAIL: missing [project].dynamic declarations for:", ", ".join(missing))
    sys.exit(1)

print("PASS: [project] declares required dynamic metadata fields.")
PY
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` around lines 15 - 24, pyproject.toml is missing a
[project].dynamic entry for the fields declared in [tool.setuptools.dynamic];
add a dynamic array under the [project] table that includes "version" and
"dependencies" (e.g., project.dynamic = ["version","dependencies"]) so metadata
generation works with the existing [tool.setuptools.dynamic] configuration;
update the [project] table where name/description are defined to include this
dynamic declaration.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pyproject.toml`:
- Around line 15-24: pyproject.toml is missing a [project].dynamic entry for the
fields declared in [tool.setuptools.dynamic]; add a dynamic array under the
[project] table that includes "version" and "dependencies" (e.g.,
project.dynamic = ["version","dependencies"]) so metadata generation works with
the existing [tool.setuptools.dynamic] configuration; update the [project] table
where name/description are defined to include this dynamic declaration.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 334df8f7-7ad9-4d17-946b-8122522eecb5

📥 Commits

Reviewing files that changed from the base of the PR and between 4fa728c and b379ce3.

📒 Files selected for processing (2)
  • docker/install/install_python_packages.sh
  • pyproject.toml

…mpliance

setuptools 82.x strictly enforces that dynamically-provided fields must
be declared in the [project] dynamic list. Without this, editable installs
fail with "project must contain ['version'] properties". This is the
proper fix; the setuptools<82 cap is no longer needed.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been updated with latest changes, and the CI pipeline #50460049 is currently running. I'll report back once the pipeline job completes.

@coderabbitai (Bot, Contributor) left a comment


♻️ Duplicate comments (1)
pyproject.toml (1)

26-28: ⚠️ Potential issue | 🟠 Major | 🏗️ Heavy lift

Same >=4.5.0 version-floor concern applies here as in the Docker install script.

Both optional-dependency entries (cu12 and cu13) adopt the >=4.5.0 floor. If the stable 4.5.0 wheel is not yet live on the public PyPI index, any pip install flashinfer-python[cu12] or pip install flashinfer-python[cu13] invocation will fail to resolve the dependency. See the corresponding comment on docker/install/install_python_packages.sh for the verification steps.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@pyproject.toml` around lines 26 - 28, The optional-dependencies entries cu12
and cu13 in pyproject.toml pin nvidia-cutlass-dsl with a floor of ">=4.5.0",
which will fail if the 4.5.0 wheel isn't available; update those entries (cu12
and cu13) to reference a concrete, available package spec—either a released
exact version (e.g., nvidia-cutlass-dsl==<released-version>) or a constrained
range that matches a published wheel, or add a documented alternative
index/source for fetching the 4.5.0 wheel; ensure the change targets the
nvidia-cutlass-dsl spec for both cu12 and cu13.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Duplicate comments:
In `@pyproject.toml`:
- Around line 26-28: The optional-dependencies entries cu12 and cu13 in
pyproject.toml pin nvidia-cutlass-dsl with a floor of ">=4.5.0", which will fail
if the 4.5.0 wheel isn't available; update those entries (cu12 and cu13) to
reference a concrete, available package spec—either a released exact version
(e.g., nvidia-cutlass-dsl==<released-version>) or a constrained range that
matches a published wheel, or add a documented alternative index/source for
fetching the 4.5.0 wheel; ensure the change targets the nvidia-cutlass-dsl spec
for both cu12 and cu13.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: 5727f60e-386a-4908-8b37-ebb925d68a3e

📥 Commits

Reviewing files that changed from the base of the PR and between b379ce3 and b0edb4c.

📒 Files selected for processing (2)
  • docker/install/install_python_packages.sh
  • pyproject.toml

@kahyunnam kahyunnam enabled auto-merge (squash) May 6, 2026 22:47
@kahyunnam kahyunnam disabled auto-merge May 6, 2026 22:47
@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been updated with latest changes, and the CI pipeline #50487011 is currently running. I'll report back once the pipeline job completes.

@coderabbitai (Bot, Contributor) left a comment


🧹 Nitpick comments (1)
scripts/test_utils.sh (1)

168-177: 💤 Low value

LGTM — minor note on the non-cu13 else branch

The cu13-path install (line 174) is meaningful: requirements.txt installs the base package without the [cu13] extra, and the explicit call adds it. The else-branch (line 176), however, re-installs a constraint (nvidia-cutlass-dsl>=4.5.0) that requirements.txt already satisfied, which causes a redundant PyPI round-trip on every non-cu13 CI run. Consider dropping the else branch if requirements.txt reliably covers the non-cu13 case:

♻️ Optional simplification
-        if [[ "${CUDA_VERSION}" == *"cu13"* ]] || [[ "${CUDA_VERSION}" == "13."* ]]; then
-            pip install --upgrade "nvidia-cutlass-dsl[cu13]>=4.5.0"
-        else
-            pip install --upgrade "nvidia-cutlass-dsl>=4.5.0"
-        fi
+        if [[ "${CUDA_VERSION}" == *"cu13"* ]] || [[ "${CUDA_VERSION}" == "13."* ]]; then
+            pip install --upgrade "nvidia-cutlass-dsl[cu13]>=4.5.0"
+        fi
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/test_utils.sh` around lines 168 - 177, The else branch redundantly
re-installs "nvidia-cutlass-dsl>=4.5.0" after requirements.txt already satisfied
it, causing unnecessary PyPI overhead; modify scripts/test_utils.sh to only run
the extra-cu13 pip install when CUDA_VERSION indicates cu13 (i.e., keep the if
branch that runs pip install --upgrade "nvidia-cutlass-dsl[cu13]>=4.5.0") and
remove the else branch so non-cu13 CI relies on requirements.txt instead.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@scripts/test_utils.sh`:
- Around line 168-177: The else branch redundantly re-installs
"nvidia-cutlass-dsl>=4.5.0" after requirements.txt already satisfied it, causing
unnecessary PyPI overhead; modify scripts/test_utils.sh to only run the
extra-cu13 pip install when CUDA_VERSION indicates cu13 (i.e., keep the if
branch that runs pip install --upgrade "nvidia-cutlass-dsl[cu13]>=4.5.0") and
remove the else branch so non-cu13 CI relies on requirements.txt instead.

ℹ️ Review info
⚙️ Run configuration

Configuration used: defaults

Review profile: CHILL

Plan: Pro

Run ID: b8e4bd56-8bca-4926-b391-672699edf8f5

📥 Commits

Reviewing files that changed from the base of the PR and between b0edb4c and 992f88d.

📒 Files selected for processing (2)
  • requirements.txt
  • scripts/test_utils.sh
💤 Files with no reviewable changes (1)
  • requirements.txt

@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been updated with latest changes, and the CI pipeline #50487672 is currently running. I'll report back once the pipeline job completes.

cutlass-dsl 4.5.0 no longer exposes Constexpr parameters (like
max_active_clusters) in the compiled kernel's runtime interface —
they are baked in at cute.compile() time. Remove the trailing mac
argument from static/micro and dynamic kernel launch calls.

Co-authored-by: Cursor <cursoragent@cursor.com>
@kahyunnam (Member, Author) commented:

/bot run

@flashinfer-bot (Collaborator) commented:

GitLab MR !636 has been updated with latest changes, and the CI pipeline #50517666 is currently running. I'll report back once the pipeline job completes.



3 participants