dependency: update nvidia-cutlass-dsl #2288

Merged
yzh119 merged 1 commit into flashinfer-ai:main from yzh119:cutlass-4.3.4
Jan 5, 2026

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Jan 5, 2026

📌 Description

Update the minimal version requirement of nvidia-cutlass-dsl to 4.3.4, which should resolve the ARM issue in #2279.

🔍 Related Issues

#2279

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Updated internal dependencies to improve stability and compatibility.


@coderabbitai
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

Updated the nvidia-cutlass-dsl dependency version constraint from >=4.3.2 to >=4.3.4 in requirements.txt. No other dependencies or code changes were made.

Changes

  • Dependency Version Bump (requirements.txt): updated the nvidia-cutlass-dsl version constraint from >=4.3.2 to >=4.3.4

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~1 minute

Suggested reviewers

  • cyx-6
  • bkryu
  • nvmbreughe

Poem

🐰 A tiny hop, a version bump so small,
From 4.3.2 to 4.3.4 we call,
Dependencies dance, ever fresh and bright,
This cutlass-dsl shines ever more right!

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 75.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title clearly and accurately describes the main change: updating the nvidia-cutlass-dsl dependency version.
  • Description check ✅ Passed: the pull request description follows the template and includes the required sections: description of changes with rationale, related issues reference, and completed pre-commit checklist items.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce6f083 and 7012235.

📒 Files selected for processing (1)
  • requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily addresses a critical dependency update for nvidia-cutlass-dsl to version 4.3.4, which is expected to resolve an ARM-related compatibility issue. Beyond this core update, the PR also includes significant internal improvements by explicitly defining the public API of various Python modules using __all__ lists, enhancing code clarity and maintainability. Furthermore, it expands the project's documentation to cover newly exposed or refined functionalities across several kernel categories, including attention, GEMM, and quantization.

Highlights

  • Dependency Update: The nvidia-cutlass-dsl dependency has been updated to version 4.3.4, specifically to resolve an ARM compatibility issue referenced in [WIP] Refactor: simplify torch -> cute-dsl boilerplate and enable tvm-ffi for cute-dsl kernels #2279.
  • API Exposure Refinement: Numerous Python modules across the flashinfer package have been updated to explicitly define their public API using __all__ lists, enhancing code clarity and maintainability.
  • Documentation Expansion: The API documentation has been significantly expanded to cover new functionalities and kernels related to attention (including POD attention), communication (Unified AllReduce Fusion API), GEMM (Blackwell, TensorRT-LLM Low Latency, CuTe-DSL), normalization (quantized RMSNorm), and quantization (FP8 kernels).
  • Conditional CuTe-DSL GEMM Imports: Conditional imports for CuTe-DSL GEMM kernels have been introduced, allowing for their use when available without breaking functionality otherwise.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the nvidia-cutlass-dsl dependency to version 4.3.4. In addition, it introduces a large-scale improvement to the codebase by adding __all__ declarations to numerous modules, which explicitly defines their public APIs and improves maintainability. The documentation has also been updated to reflect new features and API changes, including new functionalities in attention, GEMM, normalization, and quantization modules. The changes are well-structured and follow good Python practices. The conditional imports for optional features like CuTe-DSL are handled correctly. Overall, this is a solid pull request that improves both functionality and code quality.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Fix all issues with AI Agents 🤖
In @flashinfer/api_logging.py:
- Around lines 568-570: the package's public API does not expose the new decorator. Import flashinfer_api from api_logging in the package initializer and append "flashinfer_api" to the package-level __all__ so it becomes part of the public API surface.

In @flashinfer/cuda_utils.py:
- Around lines 64-66: re-export the checkCudaErrors symbol in the package initializer: add an import in the main __init__.py that pulls checkCudaErrors from the cuda_utils module, and include checkCudaErrors in the package __all__ so it becomes part of the public API (matching the pattern used for other modules like activation, attention, and decode).

In @flashinfer/mla.py:
- Around lines 802-804: add the two public decode functions to the module exports: include "trtllm_batch_decode_with_kv_cache_mla" and "xqa_batch_decode_with_kv_cache_mla" in the __all__ list in flashinfer/mla.py (alongside "BatchMLAPagedAttentionWrapper"), then re-export both symbols from flashinfer/__init__.py and add them to the package-level __all__ so they are available as top-level flashinfer APIs, matching the other public APIs.

In @flashinfer/norm.py:
- Around lines 415-425: the current __all__ in flashinfer/norm.py unconditionally lists add_rmsnorm_fp4quant and rmsnorm_fp4quant, even though those symbols may be set to None when the cute_dsl import fails. Switch to the conditional-extend pattern used in flashinfer/gemm/__init__.py: define a base __all__ containing only the core exports (e.g., "fused_add_rmsnorm", "fused_add_rmsnorm_quant", "gemma_fused_add_rmsnorm", "gemma_rmsnorm", "layernorm", "rmsnorm", "rmsnorm_quant"), keep the optional symbols (add_rmsnorm_fp4quant, rmsnorm_fp4quant) defined as before but out of the base list, and extend __all__ with ["add_rmsnorm_fp4quant", "rmsnorm_fp4quant"] only if _CUTE_DSL_AVAILABLE (or the existing flag used for the import) is set, so wildcard imports and introspection never see None entries.
🧹 Nitpick comments (10)
flashinfer/green_ctx.py (1)

298-307: LGTM! Public API surface correctly established.

The __all__ declaration properly exposes all public green context utility functions. The selected symbols align with the module's purpose and documented functionality.

Optional: Sort the __all__ list alphabetically for consistency

Applying alphabetical sorting can improve maintainability:

 __all__ = [
-    "get_sm_count_constraint",
-    "get_cudevice",
-    "get_device_resource",
-    "split_resource",
-    "split_resource_by_sm_count",
     "create_green_ctx_streams",
+    "get_cudevice",
+    "get_device_resource",
+    "get_sm_count_constraint",
     "split_device_green_ctx",
     "split_device_green_ctx_by_sm_count",
+    "split_resource",
+    "split_resource_by_sm_count",
 ]
flashinfer/tllm_utils.py (1)

15-18: LGTM! Public API surface correctly defined.

The __all__ export list appropriately exposes the two utility functions from this module.

Note: Static analysis suggests alphabetical sorting of __all__, but the current order is logical (primary function first, then helper). Sorting is optional.

flashinfer/topk.py (1)

424-428: Consider exporting can_implement_filtered_topk.

The three main Top-K functions are correctly exported. However, can_implement_filtered_topk() (Line 141) has a complete public docstring and appears intended for users to check GPU capability before using FilteredTopK.

Suggested addition to exports
 __all__ = [
     "top_k",
     "top_k_page_table_transform",
     "top_k_ragged_transform",
+    "can_implement_filtered_topk",
 ]

Optionally, you might also export the topk alias (Line 254) for backward compatibility, though this is less critical.

flashinfer/concat_ops.py (1)

85-88: Consider sorting __all__ entries to satisfy Ruff (RUF022)

The exports are correct, but Ruff flags this __all__ as unsorted. Reordering them alphabetically would address the lint without behavior change.

Proposed `__all__` reordering
-__all__ = [
-    "get_concat_mla_module",
-    "concat_mla_k",
-]
+__all__ = [
+    "concat_mla_k",
+    "get_concat_mla_module",
+]
flashinfer/fp4_quantization.py (1)

1004-1018: LGTM!

The __all__ declaration correctly exposes the FP4 quantization public API. All 13 symbols are properly defined in the module.

The static analysis tool suggests sorting the __all__ list alphabetically for consistency. This is purely a style preference and can be addressed at your discretion.

flashinfer/testing/__init__.py (1)

32-45: LGTM!

The __all__ declaration properly exposes the testing utilities, matching the imported symbols from .utils.

The static analysis tool suggests sorting the __all__ list alphabetically. This is a minor style preference and entirely optional.

flashinfer/logits_processor/__init__.py (1)

36-62: Public API surface correctly established.

The __all__ declaration properly exposes the logits processor public API.

Optionally, consider sorting the __all__ list alphabetically (or by category then alphabetically within each category) to improve maintainability, as suggested by the static analysis tool. The current grouping by category is already helpful for readability.

flashinfer/comm/__init__.py (1)

70-123: Public API surface correctly established.

The __all__ declaration properly exposes the communication module's public API. The categorical grouping with comments enhances readability.

Optionally, consider alphabetically sorting entries within each category to make it easier to locate specific symbols and maintain consistency. The current categorical organization is already beneficial for understanding the module structure.

flashinfer/jit/__init__.py (1)

96-158: Good addition of public API surface definition.

Adding __all__ explicitly defines the module's public API, which is excellent practice.

The static analysis tool suggests sorting the __all__ list using isort-style ordering for consistency. This is optional and can be applied using the Ruff formatter.

📋 How to apply sorting

Run the following command to auto-fix:

ruff check --select RUF022 --fix flashinfer/jit/__init__.py
flashinfer/gemm/__init__.py (1)

49-53: Conditional export correctly extends public API when CuTe-DSL is available.

The conditional extension of __all__ properly gates the new GEMM kernel exports behind the availability flag.

The static analysis tool suggests alphabetically sorting this list, though the current logical ordering (function before class) is also reasonable. If you prefer consistency with typical Python conventions, consider alphabetical sorting:

Optional: alphabetically sort the conditional exports
 if _CUTE_DSL_AVAILABLE:
     __all__ += [
-        "grouped_gemm_nt_masked",
         "Sm100BlockScaledPersistentDenseGemmKernel",
+        "grouped_gemm_nt_masked",
     ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff41a8f and ce6f083.

📒 Files selected for processing (42)
  • docs/api/attention.rst
  • docs/api/comm.rst
  • docs/api/gemm.rst
  • docs/api/norm.rst
  • docs/api/quantization.rst
  • flashinfer/__init__.py
  • flashinfer/activation.py
  • flashinfer/aot.py
  • flashinfer/api_logging.py
  • flashinfer/artifacts.py
  • flashinfer/attention.py
  • flashinfer/autotuner.py
  • flashinfer/cascade.py
  • flashinfer/comm/__init__.py
  • flashinfer/compilation_context.py
  • flashinfer/concat_ops.py
  • flashinfer/cuda_utils.py
  • flashinfer/decode.py
  • flashinfer/deep_gemm.py
  • flashinfer/fp4_quantization.py
  • flashinfer/fp8_quantization.py
  • flashinfer/gemm/__init__.py
  • flashinfer/green_ctx.py
  • flashinfer/jit/__init__.py
  • flashinfer/logits_processor/__init__.py
  • flashinfer/mla.py
  • flashinfer/norm.py
  • flashinfer/page.py
  • flashinfer/pod.py
  • flashinfer/prefill.py
  • flashinfer/quantization.py
  • flashinfer/rope.py
  • flashinfer/sampling.py
  • flashinfer/sparse.py
  • flashinfer/testing/__init__.py
  • flashinfer/tllm_utils.py
  • flashinfer/topk.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/utils.py
  • flashinfer/version.py
  • flashinfer/xqa.py
  • requirements.txt
🧰 Additional context used
📓 Path-based instructions (4)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/tllm_utils.py
  • flashinfer/quantization.py
  • flashinfer/cuda_utils.py
  • flashinfer/sparse.py
  • flashinfer/compilation_context.py
  • flashinfer/pod.py
  • flashinfer/deep_gemm.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/cascade.py
  • flashinfer/fp4_quantization.py
  • flashinfer/logits_processor/__init__.py
  • flashinfer/utils.py
  • flashinfer/prefill.py
  • flashinfer/api_logging.py
  • flashinfer/green_ctx.py
  • flashinfer/concat_ops.py
  • flashinfer/topk.py
  • flashinfer/testing/__init__.py
  • flashinfer/page.py
  • flashinfer/artifacts.py
  • flashinfer/activation.py
  • flashinfer/autotuner.py
  • flashinfer/xqa.py
  • flashinfer/norm.py
  • flashinfer/rope.py
  • flashinfer/jit/__init__.py
  • flashinfer/mla.py
  • flashinfer/sampling.py
  • flashinfer/attention.py
  • flashinfer/gemm/__init__.py
  • flashinfer/comm/__init__.py
  • flashinfer/version.py
  • flashinfer/aot.py
  • flashinfer/fp8_quantization.py
  • flashinfer/decode.py
  • flashinfer/__init__.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/__init__.py
flashinfer/aot.py

📄 CodeRabbit inference engine (CLAUDE.md)

Register new operations in flashinfer/aot.py by calling the gen_*_module() function for AOT (Ahead-Of-Time) pre-compilation support

Files:

  • flashinfer/aot.py
flashinfer/__init__.py

📄 CodeRabbit inference engine (CLAUDE.md)

Export new operations in flashinfer/__init__.py to make them available as public API

Files:

  • flashinfer/__init__.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • flashinfer/tllm_utils.py
  • flashinfer/quantization.py
  • flashinfer/cuda_utils.py
  • flashinfer/sparse.py
  • flashinfer/compilation_context.py
  • flashinfer/pod.py
  • flashinfer/deep_gemm.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/cascade.py
  • flashinfer/fp4_quantization.py
  • flashinfer/logits_processor/__init__.py
  • docs/api/attention.rst
  • flashinfer/utils.py
  • flashinfer/prefill.py
  • flashinfer/api_logging.py
  • flashinfer/green_ctx.py
  • flashinfer/concat_ops.py
  • flashinfer/topk.py
  • flashinfer/testing/__init__.py
  • flashinfer/page.py
  • flashinfer/artifacts.py
  • flashinfer/activation.py
  • flashinfer/autotuner.py
  • flashinfer/xqa.py
  • flashinfer/norm.py
  • flashinfer/rope.py
  • flashinfer/jit/__init__.py
  • flashinfer/mla.py
  • flashinfer/sampling.py
  • flashinfer/attention.py
  • flashinfer/gemm/__init__.py
  • flashinfer/comm/__init__.py
  • flashinfer/version.py
  • flashinfer/aot.py
  • flashinfer/fp8_quantization.py
  • flashinfer/decode.py
  • flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Applied to files:

  • flashinfer/api_logging.py
  • flashinfer/testing/__init__.py
  • flashinfer/version.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation

Applied to files:

  • flashinfer/api_logging.py
  • flashinfer/decode.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures

Applied to files:

  • flashinfer/testing/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/__init__.py
  • flashinfer/gemm/__init__.py
  • flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/aot.py
🧬 Code graph analysis (1)
flashinfer/gemm/__init__.py (1)
flashinfer/cute_dsl/blockscaled_gemm.py (2)
  • grouped_gemm_nt_masked (2945-3046)
  • Sm100BlockScaledPersistentDenseGemmKernel (464-2449)
🪛 Ruff (0.14.10)
flashinfer/tllm_utils.py

15-18: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/fp4_quantization.py

1004-1018: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/logits_processor/__init__.py

36-62: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/green_ctx.py

298-307: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/concat_ops.py

85-88: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/testing/__init__.py

32-45: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/artifacts.py

253-265: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/jit/__init__.py

96-158: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/gemm/__init__.py

50-53: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/comm/__init__.py

70-123: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/version.py

26-29: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (cu126, amd64)
  • GitHub Check: build (cu126, arm64)
  • GitHub Check: build (cu128, amd64)
  • GitHub Check: Deploy Docs
🔇 Additional comments (30)
flashinfer/attention.py (1)

282-285: Public API surface correctly defined.

The __all__ declaration properly exports both BatchAttention and BatchAttentionWithAttentionSinkWrapper classes, and these are re-exported in flashinfer/__init__.py to establish the public API surface.

flashinfer/utils.py (1)

1187-1189: LGTM! Public API surface explicitly defined.

The __all__ declaration correctly exports next_positive_power_of_2, establishing an explicit public API surface for this module. The function is properly re-exported in flashinfer/__init__.py, making it available as part of the top-level public API. The minimal export list appears intentional as part of the PR-wide effort to refine public APIs.

flashinfer/quantization.py (1)

142-145: LGTM! Quantization public API correctly defined.

The __all__ declaration appropriately exports packbits and segment_packbits, establishing the quantization module's public API surface. Both functions are properly decorated with @flashinfer_api and are correctly re-exported in flashinfer/__init__.py.

flashinfer/version.py (1)

26-29: LGTM! Version exports are correctly defined.

The __all__ declaration properly exposes the version metadata symbols, which are the only public exports for this module.

flashinfer/xqa.py (1)

532-535: LGTM! Public API exports are appropriate.

The __all__ declaration correctly exposes the two main XQA attention functions while keeping internal helper functions (e.g., get_xqa_module, get_xqa_module_mla) private. This follows sound API design principles.

requirements.txt (1)

7-7: Dependency version verified — nvidia-cutlass-dsl 4.3.4 exists on PyPI with no known security advisories.

The version bump from 4.3.2 to 4.3.4 is valid and safe to merge.

flashinfer/artifacts.py (1)

253-265: LGTM! Comprehensive public API definition with helpful organization.

The __all__ list appropriately exports all artifact management utilities. The grouping by type (Classes, Functions) provides clarity that would be lost with alphabetical sorting.

flashinfer/compilation_context.py (1)

71-73: LGTM! Public API correctly defined.

The __all__ export appropriately exposes the CompilationContext class as the module's public interface.

flashinfer/autotuner.py (1)

794-796: The current __all__ export list appropriately restricts the public API.

The classes mentioned (TunableRunner, TuningConfig, DynamicTensorSpec, ConstraintSpec, OptimizationProfile, AutoTuner) are internal implementation details used within flashinfer and benchmarks. While heavily used internally, they are deliberately excluded from __all__, indicating the framework is designed for users to interact only through the autotune context manager, not to extend or directly instantiate these classes. No public documentation or examples encourage custom TunableRunner implementations, and there is no evidence of intended user-facing API extension patterns. The current design is consistent with the stated interface.

flashinfer/sparse.py (1)

1171-1174: Sparse attention wrappers correctly exported via __all__

__all__ cleanly exposes the two wrapper classes defined in this module and does not leak internal helpers. Looks good as a public surface.

flashinfer/cascade.py (1)

1083-1090: Cascade attention public API surface is coherent

The __all__ list accurately exposes the cascade wrappers and merge helpers defined here, without exporting internal utilities. This matches the intended public API.

flashinfer/decode.py (1)

2682-2689: Decode module __all__ cleanly reflects intended public API

The decode __all__ entries correspond to existing, user-facing decode wrappers and functions (including fast_decode_plan) and avoid exporting internal helpers. This is a sound explicit public surface.

flashinfer/deep_gemm.py (1)

1614-1622: Deep GEMM entrypoints exported appropriately

The __all__ list exposes the main DeepGEMM-facing APIs (KernelMap, loader helpers, and grouped FP8 GEMM functions) and keeps internal helpers private. This is a sensible public surface.

flashinfer/page.py (1)

384-389: LGTM!

The __all__ declaration correctly exposes the public API for this module. All four symbols are properly defined functions in the file.

flashinfer/aot.py (1)

884-886: LGTM!

Correctly exposes register_default_modules as the public API for this module, consistent with the repository pattern of explicit exports.

flashinfer/activation.py (1)

227-232: LGTM!

The __all__ declaration properly exposes the fused activation functions as the public API. All symbols are decorated with @flashinfer_api and well-defined.

flashinfer/prefill.py (1)

3759-3764: Public API surface for flashinfer.prefill looks consistent

The new __all__ cleanly exposes the two batch wrappers and the single-request helpers, all of which are defined in this module and documented in docs/api/attention.rst. Internal helpers (JIT/module getters, trtllm/deepseek utilities) correctly remain unexported.

flashinfer/pod.py (1)

1204-1207: POD wrapper exports are correct and aligned with docs

Exporting only PODWithPagedKVCacheWrapper and BatchPODWithPagedKVCacheWrapper via __all__ matches the documented public surface and keeps lower-level helpers internal.

docs/api/attention.rst (1)

28-29: New attention docs entries are consistent with the Python API

Adding fast_decode_plan to the decode autosummary and documenting PODWithPagedKVCacheWrapper / BatchPODWithPagedKVCacheWrapper under flashinfer.pod matches the exported symbols in the codebase and follows the existing documentation structure.

Also applies to: 114-129

flashinfer/rope.py (1)

1674-1685: RoPE public exports are well-scoped

The new __all__ exposes only the high-level RoPE application helpers while keeping quantization and cache-append primitives internal. All exported symbols exist in this module and are already wrapped with @flashinfer_api, so the public surface is coherent.

flashinfer/sampling.py (1)

1590-1603: LGTM!

The __all__ declaration properly establishes the public API surface for the sampling module. All listed symbols correspond to functions defined in this module.

docs/api/norm.rst (1)

14-21: LGTM!

Documentation properly updated to reflect the expanded public API surface. The new autosummary entries align with the __all__ declaration in flashinfer/norm.py.

docs/api/comm.rst (1)

49-49: LGTM! Documentation updates are well-structured.

The rename from FP4QuantizationSFLayout to QuantizationSFLayout and the new Unified AllReduce Fusion API section are clearly documented and properly formatted with appropriate Sphinx directives.

Also applies to: 97-116

docs/api/quantization.rst (1)

16-27: LGTM! FP8 quantization documentation properly structured.

The new documentation section for the FP8 quantization module is well-organized and follows the established pattern in this file.

flashinfer/trtllm_low_latency_gemm.py (1)

227-229: LGTM! Appropriate public API surface definition.

The __all__ declaration correctly exports prepare_low_latency_gemm_weights as the public API. The trtllm_low_latency_gemm function appears to be an internal implementation detail not meant for direct external use.

docs/api/gemm.rst (2)

25-25: LGTM! Comprehensive GEMM documentation updates.

The new sections for Blackwell GEMM, TensorRT-LLM Low Latency GEMM, and CuTe-DSL GEMM are well-organized and properly document the expanded API surface. The addition of mm_fp8 to the FP8 GEMM section completes the documentation for that API.

Also applies to: 48-75


66-75: Dependency update is already present.

The nvidia-cutlass-dsl dependency has been correctly updated to version 4.3.4 in requirements.txt. The pyproject.toml dynamically loads dependencies from this file, so no additional changes are needed. The CuTe-DSL GEMM APIs documented in this file are properly supported by the specified dependency version.

Likely an incorrect or invalid review comment.

flashinfer/fp8_quantization.py (1)

211-214: LGTM: Public API surface properly defined.

The __all__ declaration correctly exports the two public FP8 quantization functions, establishing a clear module API surface.

flashinfer/__init__.py (1)

93-100: LGTM: Conditional import pattern correctly handles optional CuTe-DSL dependency.

The try/except block appropriately exposes CuTe-DSL GEMM kernels when the nvidia-cutlass-dsl package (>=4.3.4) is available, while gracefully degrading when it's not installed.

flashinfer/gemm/__init__.py (1)

22-31: LGTM: Availability flag pattern clearly indicates CuTe-DSL support.

The _CUTE_DSL_AVAILABLE flag provides an explicit indicator of whether CuTe-DSL GEMM kernels are available, which is helpful for runtime feature detection.

upd

fix

Update flashinfer/norm.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Revert "Update flashinfer/norm.py"

This reverts commit c350bf3.

upd

Revert "upd"

This reverts commit bf3d49d3e66d5d8e48808b579547d20492143818.
@yzh119 yzh119 enabled auto-merge (squash) January 5, 2026 21:39
@yzh119 yzh119 merged commit a97b5d7 into flashinfer-ai:main Jan 5, 2026
15 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 11, 2026
5 tasks
