
Enable sm120f compilation#2650

Merged
yzh119 merged 5 commits into flashinfer-ai:main from kahyunnam:knam/120f_compilation
Mar 4, 2026

Conversation

@kahyunnam
Collaborator

@kahyunnam kahyunnam commented Feb 27, 2026

📌 Description

Enabled sm120f compilation for sm120-family-specific optimizations for nvfp4. Tested for functionality, not performance. See the related issue for more info.

🔍 Related Issues

#2649

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

Release Notes

  • New Features

    • Added support for NVIDIA Blackwell SM 120f GPUs with optimized FP4 quantization in CUDA 12.9+.
    • New capability detection function for SM 120f support.
  • Documentation

    • Updated capability checks documentation to include new SM 120f support details.
  • Tests

    • Expanded test coverage to support additional GPU compute capabilities.
    • Improved kernel variant selection for different CUDA versions.

@coderabbitai
Contributor

coderabbitai bot commented Feb 27, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

📝 Walkthrough


Added SM120f (Blackwell architecture) support for FP4 quantization across the codebase. Updated CUDA architecture lists in CI workflows to branch on CUDA 12.9 threshold. Implemented runtime capability detection, JIT module generation, NVCC flags, and updated test utilities to enable SM120f FP4 operations.

Changes

Cohort / File(s): Summary

  • CI/Workflow CUDA Architecture Configuration (.github/workflows/nightly-release.yml, .github/workflows/release.yml): Updated FLASHINFER_CUDA_ARCH_LIST to use a three-branch conditional on the CUDA version (< 12.9, < 13.0, >= 13.0), replacing the single-threshold logic, so that 11.0a is included for CUDA 13.0+ and 12.0f for CUDA 12.9+.
  • JIT Core NVCC Flags (flashinfer/jit/core.py, flashinfer/jit/__init__.py): Added the sm120f_nvcc_flags definition with a compute_120f gencode and exported it via the package __init__.py.
  • FP4 Quantization Module Generation (flashinfer/jit/fp4_quantization.py, flashinfer/fp4_quantization.py): Added gen_fp4_quantization_sm120f_module() and wired the "120f" backend into get_fp4_quantization_module with auto-switch logic for CUDA >= 12.9.
  • Runtime Capability Detection & AOT (flashinfer/aot.py, flashinfer/utils.py): Added detect_sm_capabilities recognition of "sm120f", introduced the is_sm120f_supported() utility function, and updated gen_all_modules to conditionally append the SM120f FP4 module.
  • Test Infrastructure & Documentation (CLAUDE.md, scripts/test_utils.sh, tests/utils/test_fp4_quantize.py): Updated documentation with is_sm120f_supported checks, adjusted the precompiled kernel mapping for the CUDA 12.8/12.9+ variants, and expanded the FP4 test skip conditions to include is_sm110a_supported and is_sm12x_supported.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes


Suggested reviewers

  • yzh119
  • cyx-6
  • yongwww
  • bkryu
  • nvmbreughe
  • ttyio

Poem

🐰 A Blackwell bloom in code takes flight,
SM120f shines through CUDA's night,
With FP4 kernels swift and small,
Version 12.9 conquers all!
Hopping through branches with flair, we compile bright ✨

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

  • Docstring Coverage (⚠️ Warning): Docstring coverage is 34.78%, below the required threshold of 80.00%. Resolution: write docstrings for the functions missing them to satisfy the coverage threshold.

✅ Passed checks (2 passed)

  • Title check (✅ Passed): The title clearly describes the main purpose of the PR: enabling sm120f compilation, the primary objective evidenced across multiple code changes.
  • Description check (✅ Passed): The PR description includes all key template sections: a clear description of what was done, a link to related issues, and completed pre-commit and testing checklists.


@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request enhances FlashInfer's support for NVIDIA's latest GPU architectures, specifically the Blackwell family (SM120), by enabling sm120f compilation for nvfp4 optimizations. It introduces a smart mechanism to select the most optimized architecture variant based on the available CUDA toolkit version, ensuring that advanced hardware features are utilized. The changes streamline the compilation process for these new architectures and broaden the scope of existing tests to cover the expanded support.

Highlights

  • SM120f Compilation Enabled: Enabled compilation for sm120f (Blackwell family) to leverage specific optimizations for nvfp4 when CUDA 12.9 or newer is available.
  • Dynamic Architecture Suffix Selection: Introduced a new utility function _get_arch_suffix to dynamically determine whether to use the 'a' (architecture-specific) or 'f' (feature-set) suffix for GPU architectures (SM100+), preferring 'f' for Blackwell+ with CUDA 12.9+ to enable native FP4 conversion instructions.
  • FP4 Quantization Module Integration: Integrated the gen_fp4_quantization_sm120f_module into the AOT compilation process and updated the get_fp4_quantization_module to prioritize the sm120f variant for fp4 quantization on sm120 devices when the CUDA version supports it.
  • Expanded Test Coverage: Updated FP4 quantization tests to include checks for sm110a and sm12x support, ensuring broader compatibility and correctness across different GPU architectures.
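The suffix-selection idea in the second highlight can be sketched as pure logic. This is a hypothetical standalone version: the real _get_arch_suffix reads the installed CUDA toolkit version itself, and the committed check quoted later in the thread gates specifically on SM 12.0.

```python
from typing import Optional

def get_arch_suffix(major: int, minor: int, cuda_version: Optional[str]) -> str:
    """Pick the 'f' (feature-set) suffix for SM 12.0 when the CUDA toolkit is
    >= 12.9; otherwise fall back to the safe 'a' (architecture-specific) suffix."""
    suffix = "a"
    if major == 12 and minor == 0 and cuda_version is not None:
        try:
            parsed = tuple(int(x) for x in cuda_version.split(".")[:2])
        except ValueError:
            return suffix  # unparseable version string: keep the safe default
        if parsed >= (12, 9):
            suffix = "f"
    return suffix
```

Keeping "a" as the default means an unknown or unparseable toolkit version can never accidentally request a gencode that older nvcc releases reject.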


Changelog
  • CLAUDE.md
    • Updated documentation to include is_sm100f_supported and is_sm120f_supported in the list of supported GPU architecture checks.
  • flashinfer/aot.py
    • Imported gen_fp4_quantization_sm120f_module.
    • Added has_sm120f check to gen_all_modules.
    • Conditionally appended gen_fp4_quantization_sm120f_module() to jit_specs if has_sm120f is true.
    • Added sm120f to the sm_capabilities dictionary with a CUDA 12.9 requirement.
  • flashinfer/compilation_context.py
    • Added _get_arch_suffix function to dynamically determine 'a' or 'f' suffix for GPU architectures based on CUDA version (>=12.9 for Blackwell+).
    • Modified CompilationContext to use _get_arch_suffix when determining TARGET_CUDA_ARCHS.
  • flashinfer/fp4_quantization.py
    • Imported sm120f_nvcc_flags.
    • Added gen_fp4_quantization_sm120f_module function.
    • Updated get_fp4_quantization_module to include 120f in backend_modules and to prefer 120f over 120 if CUDA version is 12.9 or higher.
  • flashinfer/jit/__init__.py
    • Exported sm120f_nvcc_flags for use within the JIT compilation system.
  • flashinfer/jit/core.py
    • Defined sm120f_nvcc_flags for the compute_120f architecture.
  • flashinfer/jit/fp4_quantization.py
    • Imported sm120f_nvcc_flags.
    • Added the gen_fp4_quantization_sm120f_module function.
  • flashinfer/utils.py
    • Added a utility function is_sm120f_supported to check for sm120f compute capability and CUDA version.
  • scripts/test_utils.sh
    • Corrected the logic for determining the effective JIT architecture for the sm120 family based on the CUDA version.
  • tests/utils/test_fp4_quantize.py
    • Imported is_sm110a_supported and is_sm12x_supported.
    • Expanded the GPU architecture support checks in FP4 quantization tests to include is_sm110a_supported and is_sm12x_supported alongside is_sm100a_supported.
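As a rough illustration of the new utility, here is a pure-logic sketch of the SM120f check; the name follows the PR but the signature is hypothetical (the real function presumably queries torch for the device's compute capability and the CUDA runtime version):

```python
from typing import Optional, Tuple

def is_sm120f_supported(capability: Tuple[int, int], cuda_version: Optional[str]) -> bool:
    """SM120f needs a compute-capability-12.0 device and CUDA >= 12.9,
    the first toolkit release that can compile the compute_120f gencode."""
    if cuda_version is None:  # e.g. a CPU-only torch build
        return False
    cu = tuple(int(x) for x in cuda_version.split(".")[:2])
    return capability == (12, 0) and cu >= (12, 9)
```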
Ignored Files
  • Ignored by pattern: .github/workflows/** (2)
    • .github/workflows/nightly-release.yml
    • .github/workflows/release.yml
Activity
  • The author noted that the changes were tested for functionality, but not for performance.
  • The pre-commit checks and test checklists in the PR description are currently unchecked, indicating pending verification or completion.

@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !354 has been created, and the CI pipeline #44996715 is currently running. I'll report back once the pipeline job completes.

Contributor

@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request successfully enables compilation for the sm120f architecture. The changes are consistently applied across documentation, build scripts, JIT compilation logic, and tests. I have two suggestions for improvement: one to enhance error handling by using a more specific exception, and another to refactor duplicated code in the tests for better maintainability. Overall, the changes are solid and achieve the stated goal.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🧹 Nitpick comments (3)
tests/utils/test_fp4_quantize.py (1)

158-159: Consider extracting the repeated skip check into a helper.

The same three-way capability check appears in 8 test functions. A helper would reduce duplication and make future updates easier.

♻️ Suggested refactor

Add a helper at module level:

def _is_nvfp4_supported(device: str = "cuda") -> bool:
    """Check if Nvfp4 is supported on the given device."""
    dev = torch.device(device)
    return (
        is_sm100a_supported(dev)
        or is_sm110a_supported(dev)
        or is_sm12x_supported(dev)
    )

Then replace each guard with:

if not _is_nvfp4_supported(device):
    pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/test_fp4_quantize.py` around lines 158 - 159, The repeated
three-way capability check using is_sm100a_supported, is_sm110a_supported, and
is_sm12x_supported should be extracted into a module-level helper (suggested
name _is_nvfp4_supported) that accepts an optional device string (default
"cuda") and returns the OR of those three checks; then replace each repeated
guard in tests/utils/test_fp4_quantize.py with a single if not
_is_nvfp4_supported(device): pytest.skip(...) call to remove duplication and
centralize updates.
flashinfer/compilation_context.py (1)

41-45: Consider logging the exception for debugging.

The silent except Exception: pass can mask version parsing issues. While the fallback to "a" suffix is safe, logging would aid debugging when CUDA version parsing fails unexpectedly.

🛠️ Suggested improvement
     if major >= 10 and cuda_version is not None:
         try:
             if pkg_version.parse(cuda_version) >= pkg_version.parse("12.9"):
                 suffix = "f"
-        except Exception:
-            pass
+        except Exception as e:
+            logger.debug(f"Failed to parse CUDA version '{cuda_version}': {e}, using default suffix 'a'")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/compilation_context.py` around lines 41 - 45, The try/except
around pkg_version.parse(cuda_version) swallows errors and hides parsing
failures; change it to catch the exception as a variable (e.g., except Exception
as e:) and log the error before proceeding so you still fall back to the default
suffix. Use the standard logger (e.g.,
logging.getLogger(__name__).exception(...) or logger.error(..., exc_info=True))
to record the exception for the pkg_version.parse(cuda_version) call while
leaving the fallback behavior for the suffix variable unchanged.
flashinfer/fp4_quantization.py (1)

154-160: Redundant import: torch is already imported at module level (line 22).

The import statement at line 158 is unnecessary since torch is already imported at the top of the file.

♻️ Suggested fix
     # Prefer 'f' (feature-set) variant when CUDA version supports it (>= 12.9),
     # as it enables native FP4 conversion instructions (cvt.rn.satfinite.e2m1x2.f32).
     if backend == "120":
         from .utils import version_at_least
-        import torch
         if version_at_least(torch.version.cuda, "12.9"):
             backend = "120f"
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fp4_quantization.py` around lines 154 - 160, The local import of
torch inside the backend selection block is redundant because torch is already
imported at the module level; remove the inner "import torch" line in the block
that checks version_at_least(torch.version.cuda, "12.9") (the code working with
the backend variable and the version_at_least function) so the block simply
calls version_at_least(torch.version.cuda, "12.9") and sets backend = "120f"
when true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@tests/utils/test_fp4_quantize.py`:
- Around line 115-116: Reformat the long boolean device-check condition in
tests/utils/test_fp4_quantize.py so it fits line-length rules: wrap the ORed
calls in a parenthesized multi-line expression (or assign torch.device(device)
to a local variable) and place each predicate (is_sm100a_supported(...),
is_sm110a_supported(...), is_sm12x_supported(...)) on its own line indented for
readability, then call pytest.skip(...) if the whole expression is False; apply
the same multi-line reformatting pattern to the other occurrences in the file
referenced by the review (the conditions at lines corresponding to the other
test functions).

---

Nitpick comments:
In `@flashinfer/compilation_context.py`:
- Around line 41-45: The try/except around pkg_version.parse(cuda_version)
swallows errors and hides parsing failures; change it to catch the exception as
a variable (e.g., except Exception as e:) and log the error before proceeding so
you still fall back to the default suffix. Use the standard logger (e.g.,
logging.getLogger(__name__).exception(...) or logger.error(..., exc_info=True))
to record the exception for the pkg_version.parse(cuda_version) call while
leaving the fallback behavior for the suffix variable unchanged.

In `@flashinfer/fp4_quantization.py`:
- Around line 154-160: The local import of torch inside the backend selection
block is redundant because torch is already imported at the module level; remove
the inner "import torch" line in the block that checks
version_at_least(torch.version.cuda, "12.9") (the code working with the backend
variable and the version_at_least function) so the block simply calls
version_at_least(torch.version.cuda, "12.9") and sets backend = "120f" when
true.

In `@tests/utils/test_fp4_quantize.py`:
- Around line 158-159: The repeated three-way capability check using
is_sm100a_supported, is_sm110a_supported, and is_sm12x_supported should be
extracted into a module-level helper (suggested name _is_nvfp4_supported) that
accepts an optional device string (default "cuda") and returns the OR of those
three checks; then replace each repeated guard in
tests/utils/test_fp4_quantize.py with a single if not
_is_nvfp4_supported(device): pytest.skip(...) call to remove duplication and
centralize updates.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ad94692 and b1e74fc.

📒 Files selected for processing (12)
  • .github/workflows/nightly-release.yml
  • .github/workflows/release.yml
  • CLAUDE.md
  • flashinfer/aot.py
  • flashinfer/compilation_context.py
  • flashinfer/fp4_quantization.py
  • flashinfer/jit/__init__.py
  • flashinfer/jit/core.py
  • flashinfer/jit/fp4_quantization.py
  • flashinfer/utils.py
  • scripts/test_utils.sh
  • tests/utils/test_fp4_quantize.py

@kahyunnam
Copy link
Collaborator Author

/bot run

@flashinfer-bot
Copy link
Collaborator

GitLab MR !354 has been updated with latest changes, and the CI pipeline #44997805 is currently running. I'll report back once the pipeline job completes.

env:
DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f' }}
FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '12.9' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || (matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 10.3a 12.0f' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f') }}
Collaborator Author

@kahyunnam kahyunnam Feb 27, 2026


The rationale here:

knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm   -v $(pwd):/workspace   -w /workspace   --ipc=host  dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu128  /bin/bash

==========
== CUDA ==
==========

CUDA Version 12.8.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

(py312) root@3c562d8b340b:/workspace# nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120
(py312) root@3c562d8b340b:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
  echo -n "compute_${arch}: "
  echo '__global__ void k(){}' > /tmp/test.cu
  nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: FAIL
compute_103a: FAIL
compute_103f: FAIL
compute_110a: FAIL
compute_110f: FAIL
compute_120a: OK
compute_120f: FAIL
compute_121a: FAIL
compute_121f: FAIL


knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm   -v $(pwd):/workspace   -w /workspace   --ipc=host  dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu129   /bin/bash

==========
== CUDA ==
==========

CUDA Version 12.9.0

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

(py312) root@3aa77b7298a5:/workspace# nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_103
compute_120
compute_121
(py312) root@7553296b7b81:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
  echo -n "compute_${arch}: "
  echo '__global__ void k(){}' > /tmp/test.cu
  nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: OK
compute_103a: OK
compute_103f: OK
compute_110a: FAIL
compute_110f: FAIL
compute_120a: OK
compute_120f: OK
compute_121a: OK
compute_121f: OK


knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm   -v $(pwd):/workspace   -w /workspace   --ipc=host  dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu130  /bin/bash

==========
== CUDA ==
==========

CUDA Version 13.0.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

(py312) root@ac43bacddc6a:/workspace# nvcc --list-gpu-arch
compute_75
compute_80
compute_86
compute_87
compute_88
compute_89
compute_90
compute_100
compute_110
compute_103
compute_120
compute_121
(py312) root@e1bec98f84d3:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
  echo -n "compute_${arch}: "
  echo '__global__ void k(){}' > /tmp/test.cu
  nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: OK
compute_103a: OK
compute_103f: OK
compute_110a: OK
compute_110f: OK
compute_120a: OK
compute_120f: OK
compute_121a: OK
compute_121f: OK


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 45-50: Replace the broad "except Exception: pass" in the CUDA
gating block (the if block checking major/minor and using pkg_version.parse on
cuda_version) with an explicit except InvalidVersion that logs why parsing
failed and that you're falling back to the default suffix; import InvalidVersion
from packaging.version (or from the same pkg_version module) and log the
exception message via the module logger or logging.warning, so failures aren't
silently swallowed and the fallback to suffix "a" is recorded.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b1e74fc and ff0ea8e.

📒 Files selected for processing (3)
  • .github/workflows/nightly-release.yml
  • .github/workflows/release.yml
  • flashinfer/compilation_context.py
🚧 Files skipped from review as they are similar to previous changes (2)
  • .github/workflows/release.yml
  • .github/workflows/nightly-release.yml

Comment on lines +45 to +50
if major == 12 and minor == 0 and cuda_version is not None:
    try:
        if pkg_version.parse(cuda_version) >= pkg_version.parse("12.9"):
            suffix = "f"
    except Exception:
        pass
Contributor


⚠️ Potential issue | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

cat -n flashinfer/compilation_context.py | head -80

Repository: flashinfer-ai/flashinfer

Length of output: 3662


🏁 Script executed:

python3 << 'EOF'
# Check what exceptions are available in packaging.version
try:
    from packaging.version import InvalidVersion
    print("InvalidVersion found and importable")
    print(f"InvalidVersion type: {InvalidVersion}")
    print(f"InvalidVersion bases: {InvalidVersion.__bases__}")
except ImportError as e:
    print(f"Cannot import InvalidVersion: {e}")

# Test what exception is raised by pkg_version.parse for invalid versions
from packaging import version as pkg_version
try:
    pkg_version.parse("invalid_version_string")
except Exception as e:
    print(f"Exception type raised by parse: {type(e).__name__}")
    print(f"Exception message: {e}")
EOF

Repository: flashinfer-ai/flashinfer

Length of output: 321


Avoid silent exception swallowing in CUDA version gating.

At lines 45–50, except Exception: pass hides why SM120 may silently fall back to a. Catch InvalidVersion explicitly and log the fallback reason.

Proposed fix
-    from packaging import version as pkg_version
+    from packaging import version as pkg_version
+    from packaging.version import InvalidVersion
@@
-        except Exception:
-            pass
+        except InvalidVersion:
+            logger.debug("Unparseable CUDA version '%s'; using 'a' suffix for sm_%s%s", cuda_version, major, minor)
🧰 Tools
🪛 Ruff (0.15.2)

[error] 49-50: try-except-pass detected, consider logging the exception

(S110)


[warning] 49-49: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/compilation_context.py` around lines 45 - 50, Replace the broad
"except Exception: pass" in the CUDA gating block (the if block checking
major/minor and using pkg_version.parse on cuda_version) with an explicit except
InvalidVersion that logs why parsing failed and that you're falling back to the
default suffix; import InvalidVersion from packaging.version (or from the same
pkg_version module) and log the exception message via the module logger or
logging.warning, so failures aren't silently swallowed and the fallback to
suffix "a" is recorded.

@aleozlx aleozlx added the run-ci label Feb 27, 2026
@johnnynunez
Contributor

thanks @aleozlx @kahyunnam <3

jjarquin added a commit to vistralis/flashinfer that referenced this pull request Feb 27, 2026
…on checks

- Restore sm120a/sm121a nvcc flags, add sm120f/sm121f alongside
- Add _get_arch_suffix() for dynamic a/f suffix selection based on CUDA version
  - SM120: 'f' suffix with CUDA >= 12.9 (matching upstream PR flashinfer-ai#2650)
  - SM121: 'f' suffix with CUDA >= 13.0 (DGX Spark support)
- Add is_sm120f_supported() and is_sm121f_supported() utilities
- Fix device_support_pdl() to return False on SM121 (GB10 lacks PDL support)
- Export sm120f_nvcc_flags and sm121f_nvcc_flags from jit package

Co-developed-with: flashinfer-ai#2650
instructions such as native FP4 conversion (cvt.rn.satfinite.e2m1x2.f32).

Note: 'a' and 'f' are different feature sets, not a superset relationship.
We only auto-select 'f' for SM120 where it's been verified to improve FP4 performance.
Member


nitpick: In the case of 120/121, is there anything we lose by going from 120a/121a to 120f? If not, I might say that our reasoning here is not just that 120f improves FP4 perf -- it's also that 120f doesn't drop a capability vs the unmodified (120/121) and 'a' variants (120a/121a), so there's no reason not to use 120f.

Collaborator Author

@kahyunnam kahyunnam Feb 27, 2026


I thought that, in theory, the 'a' arch-specific flag is supposed to enable more architecture-specific optimizations than the 'f' family-compatible flag. But it seems there's some driver bug here where 120f somehow compiles into more optimized PTX than 120a (for now this 120f might be more of a quick fix for nvfp4 perf for vLLM).


@flashinfer-bot
Collaborator

[FAILED] Pipeline #44997805: 7/20 passed

@johnnynunez
Contributor

@aleozlx @kahyunnam @yzh119 how is it going?

@johnnynunez
Contributor

@yzh119 could you take a look and merge it asap?
thank you!!

@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !354 has been updated with latest changes, and the CI pipeline #45217326 is currently running. I'll report back once the pipeline job completes.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@flashinfer/fp4_quantization.py`:
- Around line 154-159: The code calls version_at_least(torch.version.cuda,
"12.9") which will raise if torch.version.cuda is None; update the conditional
in fp4_quantization.py so you first check that torch.version.cuda is truthy (or
is a string) before calling version_at_least — e.g. change the inner if to
verify torch.version.cuda is not None/empty and then call
version_at_least(torch.version.cuda, "12.9") to decide whether to set backend =
"120f"; reference the existing backend variable and version_at_least function
when making the guard.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff0ea8e and c05d4f7.

📒 Files selected for processing (1)
  • flashinfer/fp4_quantization.py

Comment on lines +154 to +159
# Prefer 'f' (feature-set) variant when CUDA version supports it (>= 12.9),
# as it enables native FP4 conversion instructions (cvt.rn.satfinite.e2m1x2.f32).
if backend == "120":
    from .utils import version_at_least

    if version_at_least(torch.version.cuda, "12.9"):
        backend = "120f"
Contributor


⚠️ Potential issue | 🟠 Major

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Call site (flashinfer/fp4_quantization.py) =="
sed -n '150,162p' flashinfer/fp4_quantization.py

echo
echo "== version_at_least implementation (flashinfer/utils.py) =="
sed -n '505,512p' flashinfer/utils.py

echo
echo "== Nearby usage context for torch.version.cuda =="
rg -n -C2 'torch\.version\.cuda|version_at_least\(' flashinfer/fp4_quantization.py

Repository: flashinfer-ai/flashinfer

Length of output: 1522


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Check torch.version.cuda type when CPU-only =="
python3 << 'EOF'
import torch
print(f"torch.version.cuda = {torch.version.cuda}")
print(f"type = {type(torch.version.cuda)}")

# Simulate what happens with None
try:
    from packaging import version as pkg_version
    result = pkg_version.parse(None)
    print(f"parse(None) succeeded: {result}")
except Exception as e:
    print(f"parse(None) failed with: {type(e).__name__}: {e}")
EOF

echo
echo "== Search for other torch.version.cuda usages in the codebase =="
rg -n 'torch\.version\.cuda' flashinfer/

echo
echo "== Check if there's a wrapper function for safe version checking =="
rg -n 'is_cuda_version_at_least\|cuda.*version' flashinfer/utils.py | head -20

Repository: flashinfer-ai/flashinfer

Length of output: 252


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Search for other torch.version.cuda usages in the codebase =="
rg -n 'torch\.version\.cuda' flashinfer/

echo
echo "== Check if there's a wrapper function for safe version checking =="
rg -n 'is_cuda_version_at_least' flashinfer/

echo
echo "== Check entire utils.py for version-related functions =="
rg -n 'def.*version' flashinfer/utils.py

echo
echo "== Look for is_cuda_version_at_least implementation =="
rg -A 5 'def is_cuda_version_at_least' flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 3916


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Look at get_cuda_version() implementation in cpp_ext.py =="
rg -B 5 -A 10 'def get_cuda_version' flashinfer/jit/cpp_ext.py

echo
echo "== Check if version_at_least expects string or can handle None =="
sed -n '505,515p' flashinfer/utils.py

echo
echo "== Verify context around line 78 in cpp_ext.py that checks for None =="
sed -n '75,85p' flashinfer/jit/cpp_ext.py

echo
echo "== Check all callers of version_at_least to see if they guard =="
rg -B 2 'version_at_least\(torch\.version\.cuda' flashinfer/

Repository: flashinfer-ai/flashinfer

Length of output: 3775


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== Check what happens when pkg_version.parse receives None =="
python3 << 'EOF'
from packaging import version as pkg_version
try:
    result = pkg_version.parse(None)
    print(f"parse(None) succeeded: {result}")
except TypeError as e:
    print(f"parse(None) raises TypeError: {e}")
except Exception as e:
    print(f"parse(None) raises {type(e).__name__}: {e}")
EOF

echo
echo "== Full context around fp4_quantization.py lines 150-165 =="
sed -n '150,165p' flashinfer/fp4_quantization.py

echo
echo "== Check if this is called at import time or later =="
rg -B 5 'backend = "120"' flashinfer/fp4_quantization.py | head -30

Repository: flashinfer-ai/flashinfer

Length of output: 909


Guard torch.version.cuda against None before version comparison.

At line 158, version_at_least(torch.version.cuda, "12.9") will crash with a TypeError when torch.version.cuda is None (CPU-only PyTorch builds). The packaging.version.parse() function requires a string and does not accept None.

🔧 Proposed fix
     if backend == "120":
         from .utils import version_at_least
-        if version_at_least(torch.version.cuda, "12.9"):
+        cuda_version = torch.version.cuda
+        if cuda_version is not None and version_at_least(cuda_version, "12.9"):
             backend = "120f"
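The guard above can be exercised as a standalone sketch. `version_at_least` and `select_backend` here are hypothetical stand-ins (stdlib-only, not the actual flashinfer helpers); the point is that CPU-only PyTorch builds report `torch.version.cuda = None`, so the `None` check must come before any parsing.

```python
# Demo of guarding a None CUDA version before comparing versions.

def version_at_least(version_str: str, target: str) -> bool:
    # naive major.minor comparison; passing None here would raise AttributeError
    parse = lambda v: tuple(int(p) for p in v.split(".")[:2])
    return parse(version_str) >= parse(target)

def select_backend(backend: str, cuda_version) -> str:
    # short-circuit on None first so version_at_least never sees it
    if (
        backend == "120"
        and cuda_version is not None
        and version_at_least(cuda_version, "12.9")
    ):
        return "120f"
    return backend

print(select_backend("120", "12.9"))  # 120f
print(select_backend("120", None))    # 120
```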
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/fp4_quantization.py` around lines 154 - 159, The code calls
version_at_least(torch.version.cuda, "12.9") which will raise if
torch.version.cuda is None; update the conditional in fp4_quantization.py so you
first check that torch.version.cuda is truthy (or is a string) before calling
version_at_least — e.g. change the inner if to verify torch.version.cuda is not
None/empty and then call version_at_least(torch.version.cuda, "12.9") to decide
whether to set backend = "120f"; reference the existing backend variable and
version_at_least function when making the guard.

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45217326: 1/20 passed

@kahyunnam
Collaborator Author

kahyunnam commented Mar 3, 2026

[FAILED] Pipeline #45217326: 1/20 passed

~~Not sure what's happening here; all the errors are just docker image failures due to a trailing semicolon? I see this in the log: DOCKER_IMAGE=dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu129: @yongwww @dierksen~~

nvm, I figured out what was going on. I didn't commit my merge-conflict fix, so the prepare stage failed and didn't generate the artifacts needed for the docker version tag, which produced the seemingly trailing semicolon.

@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !354 has been created, and the CI pipeline #45258467 is currently running. I'll report back once the pipeline job completes.

@kahyunnam kahyunnam force-pushed the knam/120f_compilation branch from c05d4f7 to b7c0b7d Compare March 3, 2026 19:28
@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !354 has been updated with latest changes, and the CI pipeline #45260681 is currently running. I'll report back once the pipeline job completes.

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (2)
tests/utils/test_fp4_quantize.py (1)

115-116: ⚠️ Potential issue | 🟡 Minor

Reflow the long skip conditions to unblock pre-commit.

Lines 115, 158, 194, 243, 308, 336, 379, and 426 keep the boolean guard on one line; this is consistent with the reported ruff-format hook rewrite and can fail CI until committed in formatted form.

Suggested formatting pattern (apply to each occurrence)
-    if not (is_sm100a_supported(torch.device(device)) or is_sm110a_supported(torch.device(device)) or is_sm12x_supported(torch.device(device))):
+    if not (
+        is_sm100a_supported(torch.device(device))
+        or is_sm110a_supported(torch.device(device))
+        or is_sm12x_supported(torch.device(device))
+    ):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")

Also applies to: 158-159, 194-195, 243-244, 308-309, 336-337, 379-380, 426-427

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/utils/test_fp4_quantize.py` around lines 115 - 116, The long boolean
guard in the skip checks (the if not (...) using is_sm100a_supported,
is_sm110a_supported, is_sm12x_supported with torch.device and pytest.skip)
should be reflowed across multiple lines so the condition is wrapped in
parentheses and each or-clause is on its own line; update every occurrence of
that pattern (the if not (is_sm100a_supported(torch.device(device)) or
is_sm110a_supported(torch.device(device)) or
is_sm12x_supported(torch.device(device))): pytest.skip(...)) to use a
multi-line, parenthesized condition to satisfy ruff-format while keeping the
exact logic and skip message unchanged.
flashinfer/compilation_context.py (1)

46-50: ⚠️ Potential issue | 🟡 Minor

Narrow the fallback exception instead of swallowing all errors.

At Line 49, except Exception: pass can hide unexpected runtime errors in addition to version-parse failures. Catch InvalidVersion explicitly and log fallback behavior.

🔧 Proposed fix
 def _get_arch_suffix(major: int, minor: int) -> str:
@@
-    from packaging import version as pkg_version
+    from packaging import version as pkg_version
+    from packaging.version import InvalidVersion
@@
-        except Exception:
-            pass
+        except InvalidVersion as err:
+            logger.debug(
+                "Unparseable CUDA version '%s'; falling back to 'a' suffix for sm_%s%s (%s)",
+                cuda_version,
+                major,
+                minor,
+                err,
+            )
#!/bin/bash
python - <<'PY'
from packaging import version as pkg_version
from packaging.version import InvalidVersion

samples = ["12.9", "bad.version", None]
for v in samples:
    try:
        out = pkg_version.parse(v)
        print(f"{v!r} -> OK: {out}")
    except Exception as e:
        print(f"{v!r} -> {type(e).__name__}: {e}")
PY
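The narrowed exception handling suggested above can be sketched as follows. `packaging` is assumed available (it ships alongside pip/setuptools in most environments); `suffix_for` is an illustrative stand-in for the `_get_arch_suffix` helper, not the actual flashinfer implementation.

```python
# Narrow except: only version-parse failures trigger the fallback;
# unrelated errors (e.g. a None argument) still propagate to the caller.
import logging

from packaging import version as pkg_version
from packaging.version import InvalidVersion

logger = logging.getLogger(__name__)

def suffix_for(cuda_version: str) -> str:
    suffix = "a"
    try:
        if pkg_version.parse(cuda_version) >= pkg_version.parse("12.9"):
            suffix = "f"
    except InvalidVersion as err:
        # log the fallback so a bad version string is visible, not silent
        logger.warning(
            "Unparseable CUDA version %r; using 'a' suffix (%s)", cuda_version, err
        )
    return suffix

print(suffix_for("12.9"))         # f
print(suffix_for("bad.version"))  # a
```

Catching `InvalidVersion` instead of bare `Exception` keeps genuine bugs from being swallowed by the fallback path.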
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@flashinfer/compilation_context.py` around lines 46 - 50, The try/except
currently swallows all exceptions around pkg_version.parse(cuda_version) which
can hide real errors; change the handler to catch
packaging.version.InvalidVersion specifically (import InvalidVersion from
packaging.version) and in that except branch log a clear fallback message (e.g.,
logger.warning or module logger) indicating parsing failed and the suffix
fallback is being used; leave other exceptions to propagate so real errors are
not silently ignored and keep the same logic that sets suffix = "f" when parse
succeeds and version >= "12.9".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Duplicate comments:
In `@flashinfer/compilation_context.py`:
- Around line 46-50: The try/except currently swallows all exceptions around
pkg_version.parse(cuda_version) which can hide real errors; change the handler
to catch packaging.version.InvalidVersion specifically (import InvalidVersion
from packaging.version) and in that except branch log a clear fallback message
(e.g., logger.warning or module logger) indicating parsing failed and the suffix
fallback is being used; leave other exceptions to propagate so real errors are
not silently ignored and keep the same logic that sets suffix = "f" when parse
succeeds and version >= "12.9".

In `@tests/utils/test_fp4_quantize.py`:
- Around line 115-116: The long boolean guard in the skip checks (the if not
(...) using is_sm100a_supported, is_sm110a_supported, is_sm12x_supported with
torch.device and pytest.skip) should be reflowed across multiple lines so the
condition is wrapped in parentheses and each or-clause is on its own line;
update every occurrence of that pattern (the if not
(is_sm100a_supported(torch.device(device)) or
is_sm110a_supported(torch.device(device)) or
is_sm12x_supported(torch.device(device))): pytest.skip(...)) to use a
multi-line, parenthesized condition to satisfy ruff-format while keeping the
exact logic and skip message unchanged.

ℹ️ Review info

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between c05d4f7 and b7c0b7d.

📒 Files selected for processing (12)
  • .github/workflows/nightly-release.yml
  • .github/workflows/release.yml
  • CLAUDE.md
  • flashinfer/aot.py
  • flashinfer/compilation_context.py
  • flashinfer/fp4_quantization.py
  • flashinfer/jit/__init__.py
  • flashinfer/jit/core.py
  • flashinfer/jit/fp4_quantization.py
  • flashinfer/utils.py
  • scripts/test_utils.sh
  • tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (6)
  • flashinfer/utils.py
  • .github/workflows/release.yml
  • flashinfer/jit/fp4_quantization.py
  • flashinfer/jit/__init__.py
  • CLAUDE.md
  • flashinfer/aot.py

@flashinfer-bot
Collaborator

[FAILED] Pipeline #45260681: 8/20 passed

@kahyunnam kahyunnam self-assigned this Mar 4, 2026
@kahyunnam
Collaborator Author

/bot run

@flashinfer-bot
Collaborator

GitLab MR !354 has been updated with latest changes, and the CI pipeline #45281727 is currently running. I'll report back once the pipeline job completes.

@yzh119 yzh119 enabled auto-merge (squash) March 4, 2026 01:05
@flashinfer-bot
Collaborator

[FAILED] Pipeline #45281727: 10/20 passed

@yzh119 yzh119 merged commit 635505f into flashinfer-ai:main Mar 4, 2026
38 of 39 checks passed
ameynaik-hub pushed a commit to ameynaik-hub/flashinfer that referenced this pull request Mar 18, 2026
<!-- .github/pull_request_template.md -->

## 📌 Description

Enabled sm120f compilation for sm120 family related optimization for
nvfp4. Tested for functionality, not perf. See related issues for more
info.

## 🔍 Related Issues

[https://github.com/flashinfer-ai/flashinfer/issues/2649](https://github.com/flashinfer-ai/flashinfer/issues/2649)

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull
request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit`
(or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files`
and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the
pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

<!-- Optional: anything you'd like reviewers to focus on, concerns, etc.
-->

<!-- This is an auto-generated comment: release notes by coderabbit.ai
-->
## Summary by CodeRabbit

## Release Notes

* **New Features**
* Added support for NVIDIA Blackwell SM 120f GPUs with optimized FP4
quantization in CUDA 12.9+.
  * New capability detection function for SM 120f support.

* **Documentation**
* Updated capability checks documentation to include new SM 120f support
details.

* **Tests**
* Expanded test coverage to support additional GPU compute capabilities.
  * Improved kernel variant selection for different CUDA versions.
<!-- end of auto-generated comment: release notes by coderabbit.ai -->

Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>