dependency: update nvidia-cutlass-dsl #2288

Merged
yzh119 merged 1 commit into flashinfer-ai:main from yzh119:cutlass-4.3.4
Jan 5, 2026

Conversation

@yzh119
Collaborator

@yzh119 yzh119 commented Jan 5, 2026

📌 Description

Update the minimal version requirement of nvidia-cutlass-dsl to 4.3.4, which should resolve the ARM issue in #2279.

🔍 Related Issues

#2279

🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

✅ Pre-commit Checks

  • I have installed pre-commit by running pip install pre-commit (or used your preferred method).
  • I have installed the hooks with pre-commit install.
  • I have run the hooks manually with pre-commit run --all-files and fixed any reported issues.

If you are unsure about how to set up pre-commit, see the pre-commit documentation.

🧪 Tests

  • Tests have been added or updated as needed.
  • All tests are passing (unittest, etc.).

Reviewer Notes

Summary by CodeRabbit

  • Chores
    • Updated internal dependencies to improve stability and compatibility.


@coderabbitai
Contributor

coderabbitai bot commented Jan 5, 2026

📝 Walkthrough

Walkthrough

Updated the nvidia-cutlass-dsl dependency version constraint from >=4.3.2 to >=4.3.4 in requirements.txt. No other dependencies or code changes were made.

Changes

  • Dependency Version Bump (requirements.txt): updated the nvidia-cutlass-dsl version constraint from >=4.3.2 to >=4.3.4

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~1 minute

Suggested reviewers

  • cyx-6
  • bkryu
  • nvmbreughe

Poem

🐰 A tiny hop, a version bump so small,
From 4.3.2 to 4.3.4 we call,
Dependencies dance, ever fresh and bright,
This cutlass-dsl shines ever more right!

Pre-merge checks

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning: docstring coverage is 75.00%, which is below the required threshold of 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
  • Title check ✅ Passed: the title clearly and accurately describes the main change: updating the nvidia-cutlass-dsl dependency version.
  • Description check ✅ Passed: the pull request description follows the template and includes the required sections: description of changes with rationale, related issues reference, and completed pre-commit checklist items.

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ce6f083 and 7012235.

📒 Files selected for processing (1)
  • requirements.txt
🚧 Files skipped from review as they are similar to previous changes (1)
  • requirements.txt
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
  • GitHub Check: Deploy Docs


@gemini-code-assist
Contributor

Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request primarily addresses a critical dependency update for nvidia-cutlass-dsl to version 4.3.4, which is expected to resolve an ARM-related compatibility issue. Beyond this core update, the PR also includes significant internal improvements by explicitly defining the public API of various Python modules using __all__ lists, enhancing code clarity and maintainability. Furthermore, it expands the project's documentation to cover newly exposed or refined functionalities across several kernel categories, including attention, GEMM, and quantization.

Highlights

  • Dependency Update: The nvidia-cutlass-dsl dependency has been updated to version 4.3.4, specifically to resolve an ARM compatibility issue referenced in [WIP] Refactor: simplify torch -> cute-dsl boilerplate and enable tvm-ffi for cute-dsl kernels #2279.
  • API Exposure Refinement: Numerous Python modules across the flashinfer package have been updated to explicitly define their public API using __all__ lists, enhancing code clarity and maintainability.
  • Documentation Expansion: The API documentation has been significantly expanded to cover new functionalities and kernels related to attention (including POD attention), communication (Unified AllReduce Fusion API), GEMM (Blackwell, TensorRT-LLM Low Latency, CuTe-DSL), normalization (quantized RMSNorm), and quantization (FP8 kernels).
  • Conditional CuTe-DSL GEMM Imports: Conditional imports for CuTe-DSL GEMM kernels have been introduced, allowing for their use when available without breaking functionality otherwise.



@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request updates the nvidia-cutlass-dsl dependency to version 4.3.4. In addition, it introduces a large-scale improvement to the codebase by adding __all__ declarations to numerous modules, which explicitly defines their public APIs and improves maintainability. The documentation has also been updated to reflect new features and API changes, including new functionalities in attention, GEMM, normalization, and quantization modules. The changes are well-structured and follow good Python practices. The conditional imports for optional features like CuTe-DSL are handled correctly. Overall, this is a solid pull request that improves both functionality and code quality.


@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 4

Fix all issues with AI Agents 🤖
In @flashinfer/api_logging.py:
- Around lines 568-570: the package's public API does not expose the new decorator. Import flashinfer_api from api_logging in the package initializer and append "flashinfer_api" to the package-level __all__ so it becomes part of the public API surface.

In @flashinfer/cuda_utils.py:
- Around lines 64-66: re-export the checkCudaErrors symbol in the package initializer: add an import in the main __init__.py that pulls checkCudaErrors from the cuda_utils module, and include checkCudaErrors in the package __all__ so it becomes part of the public API (matching the pattern used for other modules like activation, attention, and decode).

In @flashinfer/mla.py:
- Around lines 802-804: add the two public decode functions to the module exports: include "trtllm_batch_decode_with_kv_cache_mla" and "xqa_batch_decode_with_kv_cache_mla" in the __all__ list in flashinfer/mla.py (alongside "BatchMLAPagedAttentionWrapper"), then re-export both symbols from flashinfer/__init__.py and add them to the package-level __all__ so they are available as top-level flashinfer APIs, matching the other public APIs.

In @flashinfer/norm.py:
- Around lines 415-425: the current __all__ in flashinfer/norm.py unconditionally lists add_rmsnorm_fp4quant and rmsnorm_fp4quant, even though those symbols may be set to None when the cute_dsl import fails. Switch to the conditional-extend pattern used in flashinfer/gemm/__init__.py: define a base __all__ containing only the core exports (e.g., "fused_add_rmsnorm", "fused_add_rmsnorm_quant", "gemma_fused_add_rmsnorm", "gemma_rmsnorm", "layernorm", "rmsnorm", "rmsnorm_quant"), keep the optional symbols (add_rmsnorm_fp4quant, rmsnorm_fp4quant) defined as before but out of the base list, and extend __all__ with ["add_rmsnorm_fp4quant", "rmsnorm_fp4quant"] only if _CUTE_DSL_AVAILABLE (or the existing flag used for the import) is set, so wildcard imports and introspection never see None entries.
🧹 Nitpick comments (10)
flashinfer/green_ctx.py (1)

298-307: LGTM! Public API surface correctly established.

The __all__ declaration properly exposes all public green context utility functions. The selected symbols align with the module's purpose and documented functionality.

Optional: Sort the __all__ list alphabetically for consistency

Applying alphabetical sorting can improve maintainability:

 __all__ = [
-    "get_sm_count_constraint",
-    "get_cudevice",
-    "get_device_resource",
-    "split_resource",
-    "split_resource_by_sm_count",
     "create_green_ctx_streams",
+    "get_cudevice",
+    "get_device_resource",
+    "get_sm_count_constraint",
     "split_device_green_ctx",
     "split_device_green_ctx_by_sm_count",
+    "split_resource",
+    "split_resource_by_sm_count",
 ]
flashinfer/tllm_utils.py (1)

15-18: LGTM! Public API surface correctly defined.

The __all__ export list appropriately exposes the two utility functions from this module.

Note: Static analysis suggests alphabetical sorting of __all__, but the current order is logical (primary function first, then helper). Sorting is optional.

flashinfer/topk.py (1)

424-428: Consider exporting can_implement_filtered_topk.

The three main Top-K functions are correctly exported. However, can_implement_filtered_topk() (Line 141) has a complete public docstring and appears intended for users to check GPU capability before using FilteredTopK.

Suggested addition to exports
 __all__ = [
     "top_k",
     "top_k_page_table_transform",
     "top_k_ragged_transform",
+    "can_implement_filtered_topk",
 ]

Optionally, you might also export the topk alias (Line 254) for backward compatibility, though this is less critical.

flashinfer/concat_ops.py (1)

85-88: Consider sorting __all__ entries to satisfy Ruff (RUF022)

The exports are correct, but Ruff flags this __all__ as unsorted. Reordering them alphabetically would address the lint without behavior change.

Proposed `__all__` reordering
-__all__ = [
-    "get_concat_mla_module",
-    "concat_mla_k",
-]
+__all__ = [
+    "concat_mla_k",
+    "get_concat_mla_module",
+]
flashinfer/fp4_quantization.py (1)

1004-1018: LGTM!

The __all__ declaration correctly exposes the FP4 quantization public API. All 13 symbols are properly defined in the module.

The static analysis tool suggests sorting the __all__ list alphabetically for consistency. This is purely a style preference and can be addressed at your discretion.

flashinfer/testing/__init__.py (1)

32-45: LGTM!

The __all__ declaration properly exposes the testing utilities, matching the imported symbols from .utils.

The static analysis tool suggests sorting the __all__ list alphabetically. This is a minor style preference and entirely optional.

flashinfer/logits_processor/__init__.py (1)

36-62: Public API surface correctly established.

The __all__ declaration properly exposes the logits processor public API.

Optionally, consider sorting the __all__ list alphabetically (or by category then alphabetically within each category) to improve maintainability, as suggested by the static analysis tool. The current grouping by category is already helpful for readability.

flashinfer/comm/__init__.py (1)

70-123: Public API surface correctly established.

The __all__ declaration properly exposes the communication module's public API. The categorical grouping with comments enhances readability.

Optionally, consider alphabetically sorting entries within each category to make it easier to locate specific symbols and maintain consistency. The current categorical organization is already beneficial for understanding the module structure.

flashinfer/jit/__init__.py (1)

96-158: Good addition of public API surface definition.

Adding __all__ explicitly defines the module's public API, which is excellent practice.

The static analysis tool suggests sorting the __all__ list using isort-style ordering for consistency. This is optional and can be applied using the Ruff formatter.

📋 How to apply sorting

Run the following command to auto-fix:

ruff check --select RUF022 --fix flashinfer/jit/__init__.py
flashinfer/gemm/__init__.py (1)

49-53: Conditional export correctly extends public API when CuTe-DSL is available.

The conditional extension of __all__ properly gates the new GEMM kernel exports behind the availability flag.

The static analysis tool suggests alphabetically sorting this list, though the current logical ordering (function before class) is also reasonable. If you prefer consistency with typical Python conventions, consider alphabetical sorting:

Optional: alphabetically sort the conditional exports
 if _CUTE_DSL_AVAILABLE:
     __all__ += [
-        "grouped_gemm_nt_masked",
         "Sm100BlockScaledPersistentDenseGemmKernel",
+        "grouped_gemm_nt_masked",
     ]
📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff41a8f and ce6f083.

📒 Files selected for processing (42)
  • docs/api/attention.rst
  • docs/api/comm.rst
  • docs/api/gemm.rst
  • docs/api/norm.rst
  • docs/api/quantization.rst
  • flashinfer/__init__.py
  • flashinfer/activation.py
  • flashinfer/aot.py
  • flashinfer/api_logging.py
  • flashinfer/artifacts.py
  • flashinfer/attention.py
  • flashinfer/autotuner.py
  • flashinfer/cascade.py
  • flashinfer/comm/__init__.py
  • flashinfer/compilation_context.py
  • flashinfer/concat_ops.py
  • flashinfer/cuda_utils.py
  • flashinfer/decode.py
  • flashinfer/deep_gemm.py
  • flashinfer/fp4_quantization.py
  • flashinfer/fp8_quantization.py
  • flashinfer/gemm/__init__.py
  • flashinfer/green_ctx.py
  • flashinfer/jit/__init__.py
  • flashinfer/logits_processor/__init__.py
  • flashinfer/mla.py
  • flashinfer/norm.py
  • flashinfer/page.py
  • flashinfer/pod.py
  • flashinfer/prefill.py
  • flashinfer/quantization.py
  • flashinfer/rope.py
  • flashinfer/sampling.py
  • flashinfer/sparse.py
  • flashinfer/testing/__init__.py
  • flashinfer/tllm_utils.py
  • flashinfer/topk.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/utils.py
  • flashinfer/version.py
  • flashinfer/xqa.py
  • requirements.txt
🧰 Additional context used
📓 Path-based instructions (4)
flashinfer/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/**/*.py: Use @functools.cache decorator on Python API functions to implement module-level caching and avoid recompilation
Use @flashinfer_api decorator for debugging API calls, enable via FLASHINFER_LOGLEVEL environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Files:

  • flashinfer/tllm_utils.py
  • flashinfer/quantization.py
  • flashinfer/cuda_utils.py
  • flashinfer/sparse.py
  • flashinfer/compilation_context.py
  • flashinfer/pod.py
  • flashinfer/deep_gemm.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/cascade.py
  • flashinfer/fp4_quantization.py
  • flashinfer/logits_processor/__init__.py
  • flashinfer/utils.py
  • flashinfer/prefill.py
  • flashinfer/api_logging.py
  • flashinfer/green_ctx.py
  • flashinfer/concat_ops.py
  • flashinfer/topk.py
  • flashinfer/testing/__init__.py
  • flashinfer/page.py
  • flashinfer/artifacts.py
  • flashinfer/activation.py
  • flashinfer/autotuner.py
  • flashinfer/xqa.py
  • flashinfer/norm.py
  • flashinfer/rope.py
  • flashinfer/jit/__init__.py
  • flashinfer/mla.py
  • flashinfer/sampling.py
  • flashinfer/attention.py
  • flashinfer/gemm/__init__.py
  • flashinfer/comm/__init__.py
  • flashinfer/version.py
  • flashinfer/aot.py
  • flashinfer/fp8_quantization.py
  • flashinfer/decode.py
  • flashinfer/__init__.py
flashinfer/jit/**/*.py

📄 CodeRabbit inference engine (CLAUDE.md)

flashinfer/jit/**/*.py: JIT module generators in flashinfer/jit/ must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use gen_jit_spec() function to return a properly configured JitSpec from module generators with appropriate sources and extra_cuda_cflags
Specify supported_major_versions in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)

Files:

  • flashinfer/jit/__init__.py
flashinfer/aot.py

📄 CodeRabbit inference engine (CLAUDE.md)

Register new operations in flashinfer/aot.py by calling the gen_*_module() function for AOT (Ahead-Of-Time) pre-compilation support

Files:

  • flashinfer/aot.py
flashinfer/__init__.py

📄 CodeRabbit inference engine (CLAUDE.md)

Export new operations in flashinfer/__init__.py to make them available as public API

Files:

  • flashinfer/__init__.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API

Applied to files:

  • flashinfer/tllm_utils.py
  • flashinfer/quantization.py
  • flashinfer/cuda_utils.py
  • flashinfer/sparse.py
  • flashinfer/compilation_context.py
  • flashinfer/pod.py
  • flashinfer/deep_gemm.py
  • flashinfer/trtllm_low_latency_gemm.py
  • flashinfer/cascade.py
  • flashinfer/fp4_quantization.py
  • flashinfer/logits_processor/__init__.py
  • docs/api/attention.rst
  • flashinfer/utils.py
  • flashinfer/prefill.py
  • flashinfer/api_logging.py
  • flashinfer/green_ctx.py
  • flashinfer/concat_ops.py
  • flashinfer/topk.py
  • flashinfer/testing/__init__.py
  • flashinfer/page.py
  • flashinfer/artifacts.py
  • flashinfer/activation.py
  • flashinfer/autotuner.py
  • flashinfer/xqa.py
  • flashinfer/norm.py
  • flashinfer/rope.py
  • flashinfer/jit/__init__.py
  • flashinfer/mla.py
  • flashinfer/sampling.py
  • flashinfer/attention.py
  • flashinfer/gemm/__init__.py
  • flashinfer/comm/__init__.py
  • flashinfer/version.py
  • flashinfer/aot.py
  • flashinfer/fp8_quantization.py
  • flashinfer/decode.py
  • flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)

Applied to files:

  • flashinfer/api_logging.py
  • flashinfer/testing/__init__.py
  • flashinfer/version.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation

Applied to files:

  • flashinfer/api_logging.py
  • flashinfer/decode.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures

Applied to files:

  • flashinfer/testing/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed

Applied to files:

  • flashinfer/jit/__init__.py
  • flashinfer/gemm/__init__.py
  • flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`

Applied to files:

  • flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec

Applied to files:

  • flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support

Applied to files:

  • flashinfer/aot.py
🧬 Code graph analysis (1)
flashinfer/gemm/__init__.py (1)
flashinfer/cute_dsl/blockscaled_gemm.py (2)
  • grouped_gemm_nt_masked (2945-3046)
  • Sm100BlockScaledPersistentDenseGemmKernel (464-2449)
🪛 Ruff (0.14.10)
flashinfer/tllm_utils.py

15-18: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/fp4_quantization.py

1004-1018: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/logits_processor/__init__.py

36-62: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/green_ctx.py

298-307: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/concat_ops.py

85-88: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/testing/__init__.py

32-45: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/artifacts.py

253-265: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/jit/__init__.py

96-158: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/gemm/__init__.py

50-53: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/comm/__init__.py

70-123: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

flashinfer/version.py

26-29: __all__ is not sorted

Apply an isort-style sorting to __all__

(RUF022)

⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
  • GitHub Check: build (cu126, amd64)
  • GitHub Check: build (cu126, arm64)
  • GitHub Check: build (cu128, amd64)
  • GitHub Check: Deploy Docs
🔇 Additional comments (30)
flashinfer/attention.py (1)

282-285: Public API surface correctly defined.

The __all__ declaration properly exports both BatchAttention and BatchAttentionWithAttentionSinkWrapper classes, and these are re-exported in flashinfer/__init__.py to establish the public API surface.

flashinfer/utils.py (1)

1187-1189: LGTM! Public API surface explicitly defined.

The __all__ declaration correctly exports next_positive_power_of_2, establishing an explicit public API surface for this module. The function is properly re-exported in flashinfer/__init__.py, making it available as part of the top-level public API. The minimal export list appears intentional as part of the PR-wide effort to refine public APIs.

flashinfer/quantization.py (1)

142-145: LGTM! Quantization public API correctly defined.

The __all__ declaration appropriately exports packbits and segment_packbits, establishing the quantization module's public API surface. Both functions are properly decorated with @flashinfer_api and are correctly re-exported in flashinfer/__init__.py.

flashinfer/version.py (1)

26-29: LGTM! Version exports are correctly defined.

The __all__ declaration properly exposes the version metadata symbols, which are the only public exports for this module.

flashinfer/xqa.py (1)

532-535: LGTM! Public API exports are appropriate.

The __all__ declaration correctly exposes the two main XQA attention functions while keeping internal helper functions (e.g., get_xqa_module, get_xqa_module_mla) private. This follows sound API design principles.

requirements.txt (1)

7-7: Dependency version verified — nvidia-cutlass-dsl 4.3.4 exists on PyPI with no known security advisories.

The version bump from 4.3.2 to 4.3.4 is valid and safe to merge.

flashinfer/artifacts.py (1)

253-265: LGTM! Comprehensive public API definition with helpful organization.

The __all__ list appropriately exports all artifact management utilities. The grouping by type (Classes, Functions) provides clarity that would be lost with alphabetical sorting.

flashinfer/compilation_context.py (1)

71-73: LGTM! Public API correctly defined.

The __all__ export appropriately exposes the CompilationContext class as the module's public interface.

flashinfer/autotuner.py (1)

794-796: The current __all__ export list appropriately restricts the public API.

The classes mentioned (TunableRunner, TuningConfig, DynamicTensorSpec, ConstraintSpec, OptimizationProfile, AutoTuner) are internal implementation details used within flashinfer and benchmarks. While heavily used internally, they are deliberately excluded from __all__, indicating the framework is designed for users to interact only through the autotune context manager, not to extend or directly instantiate these classes. No public documentation or examples encourage custom TunableRunner implementations, and there is no evidence of intended user-facing API extension patterns. The current design is consistent with the stated interface.

flashinfer/sparse.py (1)

1171-1174: Sparse attention wrappers correctly exported via __all__

__all__ cleanly exposes the two wrapper classes defined in this module and does not leak internal helpers. Looks good as a public surface.

flashinfer/cascade.py (1)

1083-1090: Cascade attention public API surface is coherent

The __all__ list accurately exposes the cascade wrappers and merge helpers defined here, without exporting internal utilities. This matches the intended public API.

flashinfer/decode.py (1)

2682-2689: Decode module __all__ cleanly reflects intended public API

The decode __all__ entries correspond to existing, user-facing decode wrappers and functions (including fast_decode_plan) and avoid exporting internal helpers. This is a sound explicit public surface.

flashinfer/deep_gemm.py (1)

1614-1622: Deep GEMM entrypoints exported appropriately

The __all__ list exposes the main DeepGEMM-facing APIs (KernelMap, loader helpers, and grouped FP8 GEMM functions) and keeps internal helpers private. This is a sensible public surface.

flashinfer/page.py (1)

384-389: LGTM!

The __all__ declaration correctly exposes the public API for this module. All four symbols are properly defined functions in the file.

flashinfer/aot.py (1)

884-886: LGTM!

Correctly exposes register_default_modules as the public API for this module, consistent with the repository pattern of explicit exports.

flashinfer/activation.py (1)

227-232: LGTM!

The __all__ declaration properly exposes the fused activation functions as the public API. All symbols are decorated with @flashinfer_api and well-defined.

flashinfer/prefill.py (1)

3759-3764: Public API surface for flashinfer.prefill looks consistent

The new __all__ cleanly exposes the two batch wrappers and the single-request helpers, all of which are defined in this module and documented in docs/api/attention.rst. Internal helpers (JIT/module getters, trtllm/deepseek utilities) correctly remain unexported.

flashinfer/pod.py (1)

1204-1207: POD wrapper exports are correct and aligned with docs

Exporting only PODWithPagedKVCacheWrapper and BatchPODWithPagedKVCacheWrapper via __all__ matches the documented public surface and keeps lower-level helpers internal.

docs/api/attention.rst (1)

28-29: New attention docs entries are consistent with the Python API

Adding fast_decode_plan to the decode autosummary and documenting PODWithPagedKVCacheWrapper / BatchPODWithPagedKVCacheWrapper under flashinfer.pod matches the exported symbols in the codebase and follows the existing documentation structure.

Also applies to: 114-129

flashinfer/rope.py (1)

1674-1685: RoPE public exports are well-scoped

The new __all__ exposes only the high-level RoPE application helpers while keeping quantization and cache-append primitives internal. All exported symbols exist in this module and are already wrapped with @flashinfer_api, so the public surface is coherent.

flashinfer/sampling.py (1)

1590-1603: LGTM!

The __all__ declaration properly establishes the public API surface for the sampling module. All listed symbols correspond to functions defined in this module.

docs/api/norm.rst (1)

14-21: LGTM!

Documentation properly updated to reflect the expanded public API surface. The new autosummary entries align with the __all__ declaration in flashinfer/norm.py.

docs/api/comm.rst (1)

49-49: LGTM! Documentation updates are well-structured.

The rename from FP4QuantizationSFLayout to QuantizationSFLayout and the new Unified AllReduce Fusion API section are clearly documented and properly formatted with appropriate Sphinx directives.

Also applies to: 97-116

docs/api/quantization.rst (1)

16-27: LGTM! FP8 quantization documentation properly structured.

The new documentation section for the FP8 quantization module is well-organized and follows the established pattern in this file.

flashinfer/trtllm_low_latency_gemm.py (1)

227-229: LGTM! Appropriate public API surface definition.

The __all__ declaration correctly exports prepare_low_latency_gemm_weights as the public API. The trtllm_low_latency_gemm function appears to be an internal implementation detail not meant for direct external use.

docs/api/gemm.rst (2)

25-25: LGTM! Comprehensive GEMM documentation updates.

The new sections for Blackwell GEMM, TensorRT-LLM Low Latency GEMM, and CuTe-DSL GEMM are well-organized and properly document the expanded API surface. The addition of mm_fp8 to the FP8 GEMM section completes the documentation for that API.

Also applies to: 48-75


66-75: Dependency update is already present.

The nvidia-cutlass-dsl dependency has been correctly updated to version 4.3.4 in requirements.txt. The pyproject.toml dynamically loads dependencies from this file, so no additional changes are needed. The CuTe-DSL GEMM APIs documented in this file are properly supported by the specified dependency version.

Likely an incorrect or invalid review comment.

flashinfer/fp8_quantization.py (1)

211-214: LGTM: Public API surface properly defined.

The __all__ declaration correctly exports the two public FP8 quantization functions, establishing a clear module API surface.

flashinfer/__init__.py (1)

93-100: LGTM: Conditional import pattern correctly handles optional CuTe-DSL dependency.

The try/except block appropriately exposes CuTe-DSL GEMM kernels when the nvidia-cutlass-dsl package (>=4.3.4) is available, while gracefully degrading when it's not installed.

flashinfer/gemm/__init__.py (1)

22-31: LGTM: Availability flag pattern clearly indicates CuTe-DSL support.

The _CUTE_DSL_AVAILABLE flag provides an explicit indicator of whether CuTe-DSL GEMM kernels are available, which is helpful for runtime feature detection.

upd

fix

Update flashinfer/norm.py

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>

Revert "Update flashinfer/norm.py"

This reverts commit c350bf3.

upd

Revert "upd"

This reverts commit bf3d49d3e66d5d8e48808b579547d20492143818.
@yzh119 yzh119 enabled auto-merge (squash) January 5, 2026 21:39
@yzh119 yzh119 merged commit a97b5d7 into flashinfer-ai:main Jan 5, 2026
15 checks passed
@coderabbitai coderabbitai bot mentioned this pull request Mar 11, 2026
5 tasks
