Conversation
Note: Reviews paused. It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the review settings. Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 Walkthrough

Added SM120f (Blackwell architecture) support for FP4 quantization across the codebase. Updated CUDA architecture lists in CI workflows to branch on the CUDA 12.9 threshold. Implemented runtime capability detection, JIT module generation, NVCC flags, and updated test utilities to enable SM120f FP4 operations.

Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related issues
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches

🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.
Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request enhances FlashInfer's support for NVIDIA's latest GPU architectures, specifically the Blackwell family (SM120), by enabling SM120f compilation.

Highlights
🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog
Ignored Files
Activity
Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page; here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either
Customization

To customize the Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counterproductive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for GitHub and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Footnotes
/bot run
Code Review
This pull request successfully enables compilation for the sm120f architecture. The changes are consistently applied across documentation, build scripts, JIT compilation logic, and tests. I have two suggestions for improvement: one to enhance error handling by using a more specific exception, and another to refactor duplicated code in the tests for better maintainability. Overall, the changes are solid and achieve the stated goal.
Actionable comments posted: 1
🧹 Nitpick comments (3)
tests/utils/test_fp4_quantize.py (1)
158-159: Consider extracting the repeated skip check into a helper.

The same three-way capability check appears in 8 test functions. A helper would reduce duplication and make future updates easier.
♻️ Suggested refactor
Add a helper at module level:
```python
def _is_nvfp4_supported(device: str = "cuda") -> bool:
    """Check if Nvfp4 is supported on the given device."""
    dev = torch.device(device)
    return (
        is_sm100a_supported(dev)
        or is_sm110a_supported(dev)
        or is_sm12x_supported(dev)
    )
```

Then replace each guard with:

```python
if not _is_nvfp4_supported(device):
    pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utils/test_fp4_quantize.py` around lines 158 - 159, The repeated three-way capability check using is_sm100a_supported, is_sm110a_supported, and is_sm12x_supported should be extracted into a module-level helper (suggested name _is_nvfp4_supported) that accepts an optional device string (default "cuda") and returns the OR of those three checks; then replace each repeated guard in tests/utils/test_fp4_quantize.py with a single if not _is_nvfp4_supported(device): pytest.skip(...) call to remove duplication and centralize updates.

flashinfer/compilation_context.py (1)
41-45: Consider logging the exception for debugging.

The silent `except Exception: pass` can mask version parsing issues. While the fallback to the "a" suffix is safe, logging would aid debugging when CUDA version parsing fails unexpectedly.

🛠️ Suggested improvement
```diff
 if major >= 10 and cuda_version is not None:
     try:
         if pkg_version.parse(cuda_version) >= pkg_version.parse("12.9"):
             suffix = "f"
-    except Exception:
-        pass
+    except Exception as e:
+        logger.debug(f"Failed to parse CUDA version '{cuda_version}': {e}, using default suffix 'a'")
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/compilation_context.py` around lines 41 - 45, The try/except around pkg_version.parse(cuda_version) swallows errors and hides parsing failures; change it to catch the exception as a variable (e.g., except Exception as e:) and log the error before proceeding so you still fall back to the default suffix. Use the standard logger (e.g., logging.getLogger(__name__).exception(...) or logger.error(..., exc_info=True)) to record the exception for the pkg_version.parse(cuda_version) call while leaving the fallback behavior for the suffix variable unchanged.

flashinfer/fp4_quantization.py (1)
154-160: Redundant import: `torch` is already imported at module level (line 22).

The import statement at line 158 is unnecessary since `torch` is already imported at the top of the file.

♻️ Suggested fix
```diff
 # Prefer 'f' (feature-set) variant when CUDA version supports it (>= 12.9),
 # as it enables native FP4 conversion instructions (cvt.rn.satfinite.e2m1x2.f32).
 if backend == "120":
     from .utils import version_at_least
-    import torch

     if version_at_least(torch.version.cuda, "12.9"):
         backend = "120f"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/fp4_quantization.py` around lines 154 - 160, The local import of torch inside the backend selection block is redundant because torch is already imported at the module level; remove the inner "import torch" line in the block that checks version_at_least(torch.version.cuda, "12.9") (the code working with the backend variable and the version_at_least function) so the block simply calls version_at_least(torch.version.cuda, "12.9") and sets backend = "120f" when true.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@tests/utils/test_fp4_quantize.py`:
- Around line 115-116: Reformat the long boolean device-check condition in
tests/utils/test_fp4_quantize.py so it fits line-length rules: wrap the ORed
calls in a parenthesized multi-line expression (or assign torch.device(device)
to a local variable) and place each predicate (is_sm100a_supported(...),
is_sm110a_supported(...), is_sm12x_supported(...)) on its own line indented for
readability, then call pytest.skip(...) if the whole expression is False; apply
the same multi-line reformatting pattern to the other occurrences in the file
referenced by the review (the conditions at lines corresponding to the other
test functions).
---
Nitpick comments:
In `@flashinfer/compilation_context.py`:
- Around line 41-45: The try/except around pkg_version.parse(cuda_version)
swallows errors and hides parsing failures; change it to catch the exception as
a variable (e.g., except Exception as e:) and log the error before proceeding so
you still fall back to the default suffix. Use the standard logger (e.g.,
logging.getLogger(__name__).exception(...) or logger.error(..., exc_info=True))
to record the exception for the pkg_version.parse(cuda_version) call while
leaving the fallback behavior for the suffix variable unchanged.
In `@flashinfer/fp4_quantization.py`:
- Around line 154-160: The local import of torch inside the backend selection
block is redundant because torch is already imported at the module level; remove
the inner "import torch" line in the block that checks
version_at_least(torch.version.cuda, "12.9") (the code working with the backend
variable and the version_at_least function) so the block simply calls
version_at_least(torch.version.cuda, "12.9") and sets backend = "120f" when
true.
In `@tests/utils/test_fp4_quantize.py`:
- Around line 158-159: The repeated three-way capability check using
is_sm100a_supported, is_sm110a_supported, and is_sm12x_supported should be
extracted into a module-level helper (suggested name _is_nvfp4_supported) that
accepts an optional device string (default "cuda") and returns the OR of those
three checks; then replace each repeated guard in
tests/utils/test_fp4_quantize.py with a single if not
_is_nvfp4_supported(device): pytest.skip(...) call to remove duplication and
centralize updates.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- .github/workflows/nightly-release.yml
- .github/workflows/release.yml
- CLAUDE.md
- flashinfer/aot.py
- flashinfer/compilation_context.py
- flashinfer/fp4_quantization.py
- flashinfer/jit/__init__.py
- flashinfer/jit/core.py
- flashinfer/jit/fp4_quantization.py
- flashinfer/utils.py
- scripts/test_utils.sh
- tests/utils/test_fp4_quantize.py
/bot run
```diff
 env:
   DOCKER_IMAGE: ${{ matrix.arch == 'aarch64' && format('pytorch/manylinuxaarch64-builder:cuda{0}', matrix.cuda) || format('pytorch/manylinux2_28-builder:cuda{0}', matrix.cuda) }}
-  FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f' }}
+  FLASHINFER_CUDA_ARCH_LIST: ${{ matrix.cuda < '12.9' && '7.5 8.0 8.9 9.0a 10.0a 12.0a' || (matrix.cuda < '13.0' && '7.5 8.0 8.9 9.0a 10.0a 10.3a 12.0f' || '7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f') }}
```
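Expressed as a plain function, the workflow ternary branches on the 12.9 and 13.0 thresholds. A minimal sketch (the helper name is hypothetical; note that GitHub Actions compares `matrix.cuda` as a string, and the plain string comparison used below is likewise safe for the 12.x/13.x versions in the build matrix):

```python
def arch_list_for(cuda: str) -> str:
    """Hypothetical helper mirroring the workflow expression above."""
    if cuda < "12.9":
        # CUDA 12.8 and older: no 'f' feature-set targets available.
        return "7.5 8.0 8.9 9.0a 10.0a 12.0a"
    if cuda < "13.0":
        # CUDA 12.9: 10.3a and 12.0f become compilable.
        return "7.5 8.0 8.9 9.0a 10.0a 10.3a 12.0f"
    # CUDA 13.0+: additionally adds 11.0a.
    return "7.5 8.0 8.9 9.0a 10.0a 10.3a 11.0a 12.0f"
```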
The rationale here:
knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace --ipc=host dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu128 /bin/bash
==========
== CUDA ==
==========
CUDA Version 12.8.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
(py312) root@3c562d8b340b:/workspace# nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_120
(py312) root@3c562d8b340b:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
echo -n "compute_${arch}: "
echo '__global__ void k(){}' > /tmp/test.cu
nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: FAIL
compute_103a: FAIL
compute_103f: FAIL
compute_110a: FAIL
compute_110f: FAIL
compute_120a: OK
compute_120f: FAIL
compute_121a: FAIL
compute_121f: FAIL
knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace --ipc=host dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu129 /bin/bash
==========
== CUDA ==
==========
CUDA Version 12.9.0
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
(py312) root@3aa77b7298a5:/workspace# nvcc --list-gpu-arch
compute_50
compute_52
compute_53
compute_60
compute_61
compute_62
compute_70
compute_72
compute_75
compute_80
compute_86
compute_87
compute_89
compute_90
compute_100
compute_101
compute_103
compute_120
compute_121
(py312) root@7553296b7b81:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
echo -n "compute_${arch}: "
echo '__global__ void k(){}' > /tmp/test.cu
nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: OK
compute_103a: OK
compute_103f: OK
compute_110a: FAIL
compute_110f: FAIL
compute_120a: OK
compute_120f: OK
compute_121a: OK
compute_121f: OK
knam@cudnn-dev-ballast-22-04:/home/scratch.knam$ docker run --gpus all -it --rm -v $(pwd):/workspace -w /workspace --ipc=host dockerhub.nvidia.com/flashinfer/flashinfer-ci-cu130 /bin/bash
==========
== CUDA ==
==========
CUDA Version 13.0.1
Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license
A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.
(py312) root@ac43bacddc6a:/workspace# nvcc --list-gpu-arch
compute_75
compute_80
compute_86
compute_87
compute_88
compute_89
compute_90
compute_100
compute_110
compute_103
compute_120
compute_121
(py312) root@e1bec98f84d3:/workspace# for arch in 100a 100f 103a 103f 110a 110f 120a 120f 121a 121f; do
echo -n "compute_${arch}: "
echo '__global__ void k(){}' > /tmp/test.cu
nvcc -gencode=arch=compute_${arch},code=sm_${arch} -c /tmp/test.cu -o /dev/null 2>/dev/null && echo "OK" || echo "FAIL"
done
compute_100a: OK
compute_100f: OK
compute_103a: OK
compute_103f: OK
compute_110a: OK
compute_110f: OK
compute_120a: OK
compute_120f: OK
compute_121a: OK
compute_121f: OK
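The probe output above lends itself to a small parser; a minimal sketch (hypothetical helper, operating on captured `nvcc --list-gpu-arch` text rather than invoking nvcc):

```python
def parse_gpu_archs(listing: str) -> set:
    """Collect compute capabilities from `nvcc --list-gpu-arch` output."""
    return {
        line.strip().removeprefix("compute_")
        for line in listing.splitlines()
        if line.strip().startswith("compute_")
    }
```

Against the CUDA 12.9 listing above this yields a set containing "103" and "121" but not "110", consistent with the gencode probe results.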
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/compilation_context.py`:
- Around line 45-50: Replace the broad "except Exception: pass" in the CUDA
gating block (the if block checking major/minor and using pkg_version.parse on
cuda_version) with an explicit except InvalidVersion that logs why parsing
failed and that you're falling back to the default suffix; import InvalidVersion
from packaging.version (or from the same pkg_version module) and log the
exception message via the module logger or logging.warning, so failures aren't
silently swallowed and the fallback to suffix "a" is recorded.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (3)
- .github/workflows/nightly-release.yml
- .github/workflows/release.yml
- flashinfer/compilation_context.py
🚧 Files skipped from review as they are similar to previous changes (2)
- .github/workflows/release.yml
- .github/workflows/nightly-release.yml
flashinfer/compilation_context.py
Outdated
```python
if major == 12 and minor == 0 and cuda_version is not None:
    try:
        if pkg_version.parse(cuda_version) >= pkg_version.parse("12.9"):
            suffix = "f"
    except Exception:
        pass
```
🧩 Analysis chain
🏁 Script executed:
cat -n flashinfer/compilation_context.py | head -80

Repository: flashinfer-ai/flashinfer
Length of output: 3662
🏁 Script executed:
python3 << 'EOF'
# Check what exceptions are available in packaging.version
try:
from packaging.version import InvalidVersion
print("InvalidVersion found and importable")
print(f"InvalidVersion type: {InvalidVersion}")
print(f"InvalidVersion bases: {InvalidVersion.__bases__}")
except ImportError as e:
print(f"Cannot import InvalidVersion: {e}")
# Test what exception is raised by pkg_version.parse for invalid versions
from packaging import version as pkg_version
try:
pkg_version.parse("invalid_version_string")
except Exception as e:
print(f"Exception type raised by parse: {type(e).__name__}")
print(f"Exception message: {e}")
EOF

Repository: flashinfer-ai/flashinfer
Length of output: 321
Avoid silent exception swallowing in CUDA version gating.
At lines 45–50, `except Exception: pass` hides why SM120 may silently fall back to `a`. Catch `InvalidVersion` explicitly and log the fallback reason.
Proposed fix
```diff
-from packaging import version as pkg_version
+from packaging import version as pkg_version
+from packaging.version import InvalidVersion
 @@
-        except Exception:
-            pass
+        except InvalidVersion:
+            logger.debug("Unparseable CUDA version '%s'; using 'a' suffix for sm_%s%s", cuda_version, major, minor)
```

🧰 Tools
🪛 Ruff (0.15.2)
[error] 49-50: try-except-pass detected, consider logging the exception
(S110)
[warning] 49-49: Do not catch blind exception: Exception
(BLE001)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/compilation_context.py` around lines 45 - 50, Replace the broad
"except Exception: pass" in the CUDA gating block (the if block checking
major/minor and using pkg_version.parse on cuda_version) with an explicit except
InvalidVersion that logs why parsing failed and that you're falling back to the
default suffix; import InvalidVersion from packaging.version (or from the same
pkg_version module) and log the exception message via the module logger or
logging.warning, so failures aren't silently swallowed and the fallback to
suffix "a" is recorded.
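A stdlib-only sketch of how the suggested fix could look (hypothetical: the real code uses `packaging.version`, replaced here with a plain numeric parse so the snippet is self-contained; the SM120-only gating follows the `major == 12 and minor == 0` condition shown above):

```python
import logging

logger = logging.getLogger(__name__)


def get_arch_suffix(major: int, minor: int, cuda_version) -> str:
    """Return 'f' for SM120 when CUDA >= 12.9 is confirmed, else 'a'."""
    suffix = "a"
    if major == 12 and minor == 0 and cuda_version:
        try:
            # Compare only the (major, minor) components, numerically.
            parsed = tuple(int(p) for p in cuda_version.split(".")[:2])
            if parsed >= (12, 9):
                suffix = "f"
        except ValueError:
            # Log the fallback instead of silently swallowing the error.
            logger.debug("Unparseable CUDA version %r; keeping suffix 'a'", cuda_version)
    return suffix
```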
thanks @aleozlx @kahyunnam <3
…on checks

- Restore sm120a/sm121a nvcc flags, add sm120f/sm121f alongside
- Add _get_arch_suffix() for dynamic a/f suffix selection based on CUDA version
  - SM120: 'f' suffix with CUDA >= 12.9 (matching upstream PR flashinfer-ai#2650)
  - SM121: 'f' suffix with CUDA >= 13.0 (DGX Spark support)
- Add is_sm120f_supported() and is_sm121f_supported() utilities
- Fix device_support_pdl() to return False on SM121 (GB10 lacks PDL support)
- Export sm120f_nvcc_flags and sm121f_nvcc_flags from jit package

Co-developed-with: flashinfer-ai#2650
flashinfer/compilation_context.py
Outdated
```
instructions such as native FP4 conversion (cvt.rn.satfinite.e2m1x2.f32).

Note: 'a' and 'f' are different feature sets, not a superset relationship.
We only auto-select 'f' for SM120 where it's been verified to improve FP4 performance.
```
nitpick: In the case of 120/121, is there anything we lose by going from 120a/121a to 120f? If not, I might say that our reasoning here is not just that 120f improves FP4 perf -- it's also that 120f doesn't drop any capability vs the unmodified (120/121) and `a` variants (120a/121a), so there's no reason not to use 120f.
I thought in theory the 'a' arch specific flag is supposed to have more architecture-specific optimizations than 'f' family compatible flag. But it seems like there's some driver bug here where 120f somehow compiles into more optimized ptx than 120a (for now this 120f might be more of a quick fix for nvfp4 perf for vllm)
[FAILED] Pipeline #44997805: 7/20 passed
@aleozlx @kahyunnam @yzh119 how is it going?
@yzh119 could you take a look and merge it asap?
/bot run
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@flashinfer/fp4_quantization.py`:
- Around line 154-159: The code calls version_at_least(torch.version.cuda,
"12.9") which will raise if torch.version.cuda is None; update the conditional
in fp4_quantization.py so you first check that torch.version.cuda is truthy (or
is a string) before calling version_at_least — e.g. change the inner if to
verify torch.version.cuda is not None/empty and then call
version_at_least(torch.version.cuda, "12.9") to decide whether to set backend =
"120f"; reference the existing backend variable and version_at_least function
when making the guard.
flashinfer/fp4_quantization.py
Outdated
```python
# Prefer 'f' (feature-set) variant when CUDA version supports it (>= 12.9),
# as it enables native FP4 conversion instructions (cvt.rn.satfinite.e2m1x2.f32).
if backend == "120":
    from .utils import version_at_least

    if version_at_least(torch.version.cuda, "12.9"):
        backend = "120f"
```
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Call site (flashinfer/fp4_quantization.py) =="
sed -n '150,162p' flashinfer/fp4_quantization.py
echo
echo "== version_at_least implementation (flashinfer/utils.py) =="
sed -n '505,512p' flashinfer/utils.py
echo
echo "== Nearby usage context for torch.version.cuda =="
rg -n -C2 'torch\.version\.cuda|version_at_least\(' flashinfer/fp4_quantization.py

Repository: flashinfer-ai/flashinfer
Length of output: 1522
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Check torch.version.cuda type when CPU-only =="
python3 << 'EOF'
import torch
print(f"torch.version.cuda = {torch.version.cuda}")
print(f"type = {type(torch.version.cuda)}")
# Simulate what happens with None
try:
from packaging import version as pkg_version
result = pkg_version.parse(None)
print(f"parse(None) succeeded: {result}")
except Exception as e:
print(f"parse(None) failed with: {type(e).__name__}: {e}")
EOF
echo
echo "== Search for other torch.version.cuda usages in the codebase =="
rg -n 'torch\.version\.cuda' flashinfer/
echo
echo "== Check if there's a wrapper function for safe version checking =="
rg -n 'is_cuda_version_at_least\|cuda.*version' flashinfer/utils.py | head -20

Repository: flashinfer-ai/flashinfer
Length of output: 252
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Search for other torch.version.cuda usages in the codebase =="
rg -n 'torch\.version\.cuda' flashinfer/
echo
echo "== Check if there's a wrapper function for safe version checking =="
rg -n 'is_cuda_version_at_least' flashinfer/
echo
echo "== Check entire utils.py for version-related functions =="
rg -n 'def.*version' flashinfer/utils.py
echo
echo "== Look for is_cuda_version_at_least implementation =="
rg -A 5 'def is_cuda_version_at_least' flashinfer/

Repository: flashinfer-ai/flashinfer
Length of output: 3916
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Look at get_cuda_version() implementation in cpp_ext.py =="
rg -B 5 -A 10 'def get_cuda_version' flashinfer/jit/cpp_ext.py
echo
echo "== Check if version_at_least expects string or can handle None =="
sed -n '505,515p' flashinfer/utils.py
echo
echo "== Verify context around line 78 in cpp_ext.py that checks for None =="
sed -n '75,85p' flashinfer/jit/cpp_ext.py
echo
echo "== Check all callers of version_at_least to see if they guard =="
rg -B 2 'version_at_least\(torch\.version\.cuda' flashinfer/

Repository: flashinfer-ai/flashinfer
Length of output: 3775
🏁 Script executed:
#!/bin/bash
set -euo pipefail
echo "== Check what happens when pkg_version.parse receives None =="
python3 << 'EOF'
from packaging import version as pkg_version
try:
result = pkg_version.parse(None)
print(f"parse(None) succeeded: {result}")
except TypeError as e:
print(f"parse(None) raises TypeError: {e}")
except Exception as e:
print(f"parse(None) raises {type(e).__name__}: {e}")
EOF
echo
echo "== Full context around fp4_quantization.py lines 150-165 =="
sed -n '150,165p' flashinfer/fp4_quantization.py
echo
echo "== Check if this is called at import time or later =="
rg -B 5 'backend = "120"' flashinfer/fp4_quantization.py | head -30

Repository: flashinfer-ai/flashinfer
Length of output: 909
Guard torch.version.cuda against None before version comparison.
At line 158, version_at_least(torch.version.cuda, "12.9") will crash with a TypeError when torch.version.cuda is None (CPU-only PyTorch builds). The packaging.version.parse() function requires a string and does not accept None.
🔧 Proposed fix
```diff
 if backend == "120":
     from .utils import version_at_least
-    if version_at_least(torch.version.cuda, "12.9"):
+    cuda_version = torch.version.cuda
+    if cuda_version is not None and version_at_least(cuda_version, "12.9"):
         backend = "120f"
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@flashinfer/fp4_quantization.py` around lines 154 - 159, The code calls
version_at_least(torch.version.cuda, "12.9") which will raise if
torch.version.cuda is None; update the conditional in fp4_quantization.py so you
first check that torch.version.cuda is truthy (or is a string) before calling
version_at_least — e.g. change the inner if to verify torch.version.cuda is not
None/empty and then call version_at_least(torch.version.cuda, "12.9") to decide
whether to set backend = "120f"; reference the existing backend variable and
version_at_least function when making the guard.
[FAILED] Pipeline #45217326: 1/20 passed
~~Not sure what's happening here; all the errors are just docker image failures due to a trailing semicolon? I see this in log:~~ nvm, I figured out what was going on. I didn't commit my merge-conflict fix, so the prepare stage would fail, not generating the artifacts needed for the docker version tag, causing the seeming trailing semicolon.
/bot run
Force-pushed c05d4f7 to b7c0b7d (Compare)
/bot run
♻️ Duplicate comments (2)
tests/utils/test_fp4_quantize.py (1)
115-116: ⚠️ Potential issue | 🟡 Minor

Reflow the long skip conditions to unblock pre-commit.

Lines 115, 158, 194, 243, 308, 336, 379, and 426 keep the boolean guard on one line; this is consistent with the reported `ruff-format` hook rewrite and can fail CI until committed in formatted form.

Suggested formatting pattern (apply to each occurrence)

```diff
-    if not (is_sm100a_supported(torch.device(device)) or is_sm110a_supported(torch.device(device)) or is_sm12x_supported(torch.device(device))):
+    if not (
+        is_sm100a_supported(torch.device(device))
+        or is_sm110a_supported(torch.device(device))
+        or is_sm12x_supported(torch.device(device))
+    ):
         pytest.skip("Nvfp4 Requires compute capability >= 10 and CUDA >= 12.8")
```

Also applies to: 158-159, 194-195, 243-244, 308-309, 336-337, 379-380, 426-427
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/utils/test_fp4_quantize.py` around lines 115 - 116, The long boolean guard in the skip checks (the if not (...) using is_sm100a_supported, is_sm110a_supported, is_sm12x_supported with torch.device and pytest.skip) should be reflowed across multiple lines so the condition is wrapped in parentheses and each or-clause is on its own line; update every occurrence of that pattern (the if not (is_sm100a_supported(torch.device(device)) or is_sm110a_supported(torch.device(device)) or is_sm12x_supported(torch.device(device))): pytest.skip(...)) to use a multi-line, parenthesized condition to satisfy ruff-format while keeping the exact logic and skip message unchanged.

flashinfer/compilation_context.py (1)
46-50: ⚠️ Potential issue | 🟡 Minor

Narrow the fallback exception instead of swallowing all errors.

At Line 49, `except Exception: pass` can hide unexpected runtime errors in addition to version-parse failures. Catch `InvalidVersion` explicitly and log fallback behavior.

🔧 Proposed fix
```diff
 def _get_arch_suffix(major: int, minor: int) -> str:
 @@
-    from packaging import version as pkg_version
+    from packaging import version as pkg_version
+    from packaging.version import InvalidVersion
 @@
-        except Exception:
-            pass
+        except InvalidVersion as err:
+            logger.debug(
+                "Unparseable CUDA version '%s'; falling back to 'a' suffix for sm_%s%s (%s)",
+                cuda_version,
+                major,
+                minor,
+                err,
+            )
```

```shell
#!/bin/bash
python - <<'PY'
from packaging import version as pkg_version
from packaging.version import InvalidVersion

samples = ["12.9", "bad.version", None]
for v in samples:
    try:
        out = pkg_version.parse(v)
        print(f"{v!r} -> OK: {out}")
    except Exception as e:
        print(f"{v!r} -> {type(e).__name__}: {e}")
PY
```

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@flashinfer/compilation_context.py` around lines 46 - 50, The try/except currently swallows all exceptions around pkg_version.parse(cuda_version) which can hide real errors; change the handler to catch packaging.version.InvalidVersion specifically (import InvalidVersion from packaging.version) and in that except branch log a clear fallback message (e.g., logger.warning or module logger) indicating parsing failed and the suffix fallback is being used; leave other exceptions to propagate so real errors are not silently ignored and keep the same logic that sets suffix = "f" when parse succeeds and version >= "12.9".
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@flashinfer/compilation_context.py`:
- Around line 46-50: The try/except currently swallows all exceptions around
pkg_version.parse(cuda_version) which can hide real errors; change the handler
to catch packaging.version.InvalidVersion specifically (import InvalidVersion
from packaging.version) and in that except branch log a clear fallback message
(e.g., logger.warning or module logger) indicating parsing failed and the suffix
fallback is being used; leave other exceptions to propagate so real errors are
not silently ignored and keep the same logic that sets suffix = "f" when parse
succeeds and version >= "12.9".
In `@tests/utils/test_fp4_quantize.py`:
- Around line 115-116: The long boolean guard in the skip checks (the if not
(...) using is_sm100a_supported, is_sm110a_supported, is_sm12x_supported with
torch.device and pytest.skip) should be reflowed across multiple lines so the
condition is wrapped in parentheses and each or-clause is on its own line;
update every occurrence of that pattern (the if not
(is_sm100a_supported(torch.device(device)) or
is_sm110a_supported(torch.device(device)) or
is_sm12x_supported(torch.device(device))): pytest.skip(...)) to use a
multi-line, parenthesized condition to satisfy ruff-format while keeping the
exact logic and skip message unchanged.
ℹ️ Review info
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (12)
- .github/workflows/nightly-release.yml
- .github/workflows/release.yml
- CLAUDE.md
- flashinfer/aot.py
- flashinfer/compilation_context.py
- flashinfer/fp4_quantization.py
- flashinfer/jit/__init__.py
- flashinfer/jit/core.py
- flashinfer/jit/fp4_quantization.py
- flashinfer/utils.py
- scripts/test_utils.sh
- tests/utils/test_fp4_quantize.py
🚧 Files skipped from review as they are similar to previous changes (6)
- flashinfer/utils.py
- .github/workflows/release.yml
- flashinfer/jit/fp4_quantization.py
- flashinfer/jit/__init__.py
- CLAUDE.md
- flashinfer/aot.py
[FAILED] Pipeline #45260681: 8/20 passed
/bot run
[FAILED] Pipeline #45281727: 10/20 passed
## 📌 Description

Enabled sm120f compilation for sm120 family related optimization for nvfp4. Tested for functionality, not perf. See related issues for more info.

## 🔍 Related Issues

https://github.com/flashinfer-ai/flashinfer/issues/2649

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Summary by CodeRabbit

Release Notes

* **New Features**
  * Added support for NVIDIA Blackwell SM 120f GPUs with optimized FP4 quantization in CUDA 12.9+.
  * New capability detection function for SM 120f support.
* **Documentation**
  * Updated capability checks documentation to include new SM 120f support details.
* **Tests**
  * Expanded test coverage to support additional GPU compute capabilities.
  * Improved kernel variant selection for different CUDA versions.

Signed-off-by: Amey Naik <212485788+ameynaik-hub@users.noreply.github.com>
📌 Description
Enabled sm120f compilation for sm120 family related optimization for nvfp4. Tested for functionality, not perf. See related issues for more info.
🔍 Related Issues
#2649
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

Reviewer Notes
Summary by CodeRabbit
Release Notes
New Features
Documentation
Tests