dependency: update nvidia-cutlass-dsl#2288
📝 Walkthrough

Updated the `nvidia-cutlass-dsl` dependency version constraint from `>=4.3.2` to `>=4.3.4` in `requirements.txt`. No other dependencies or code changes were made.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~1 minute
Pre-merge checks: ❌ Failed checks (1 warning) | ✅ Passed checks (2 passed)
📜 Recent review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (1)
Summary of Changes

Hello @yzh119, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request primarily addresses a critical dependency update for `nvidia-cutlass-dsl`.
Code Review
This pull request updates the `nvidia-cutlass-dsl` dependency to version 4.3.4. In addition, it introduces a large-scale improvement to the codebase by adding `__all__` declarations to numerous modules, which explicitly defines their public APIs and improves maintainability. The documentation has also been updated to reflect new features and API changes, including new functionality in the attention, GEMM, normalization, and quantization modules. The changes are well-structured and follow good Python practices. The conditional imports for optional features like CuTe-DSL are handled correctly. Overall, this is a solid pull request that improves both functionality and code quality.
Actionable comments posted: 4
In flashinfer/api_logging.py:
- Around lines 568-570: The package's public API does not expose the new decorator. Import and re-export the `flashinfer_api` symbol from `api_logging` in the package initializer, and add `"flashinfer_api"` to the package-level `__all__` so it becomes part of the public API surface.
In flashinfer/cuda_utils.py:
- Around lines 64-66: Import and re-export the `checkCudaErrors` symbol in the package initializer: add an import in the main `__init__.py` that pulls `checkCudaErrors` from the `cuda_utils` module, and include it in the package `__all__` so it becomes part of the public API (matching the pattern used for other modules like `activation`, `attention`, `decode`).
In flashinfer/mla.py:
- Around lines 802-804: Add the two public decode functions to the module exports: include `"trtllm_batch_decode_with_kv_cache_mla"` and `"xqa_batch_decode_with_kv_cache_mla"` in the `__all__` list in flashinfer/mla.py (alongside `"BatchMLAPagedAttentionWrapper"`), then re-export those symbols from flashinfer/__init__.py to match other public APIs: import the functions from `flashinfer.mla` and add them to the package-level `__all__` so they are available as top-level flashinfer APIs.
In flashinfer/norm.py:
- Around lines 415-425: The current `__all__` unconditionally lists `add_rmsnorm_fp4quant` and `rmsnorm_fp4quant`, even though those symbols may be set to `None` when the cute_dsl import fails. Switch to the conditional-extend pattern used in flashinfer/gemm/__init__.py: define a base `__all__` containing only core exports (e.g., "fused_add_rmsnorm", "fused_add_rmsnorm_quant", "gemma_fused_add_rmsnorm", "gemma_rmsnorm", "layernorm", "rmsnorm", "rmsnorm_quant"), keep the optional symbols (`add_rmsnorm_fp4quant`, `rmsnorm_fp4quant`) defined as before but remove them from the base list, and extend `__all__` with `["add_rmsnorm_fp4quant", "rmsnorm_fp4quant"]` only when `_CUTE_DSL_AVAILABLE` (or the existing flag used for the import) is true, so wildcard imports and introspection never see `None` entries.
🧹 Nitpick comments (10)
flashinfer/green_ctx.py (1)
298-307: LGTM! Public API surface correctly established. The `__all__` declaration properly exposes all public green context utility functions. The selected symbols align with the module's purpose and documented functionality.

Optional: sorting the `__all__` list alphabetically can improve maintainability:

```diff
 __all__ = [
-    "get_sm_count_constraint",
-    "get_cudevice",
-    "get_device_resource",
-    "split_resource",
-    "split_resource_by_sm_count",
     "create_green_ctx_streams",
+    "get_cudevice",
+    "get_device_resource",
+    "get_sm_count_constraint",
     "split_device_green_ctx",
     "split_device_green_ctx_by_sm_count",
+    "split_resource",
+    "split_resource_by_sm_count",
 ]
```

flashinfer/tllm_utils.py (1)
15-18: LGTM! Public API surface correctly defined. The `__all__` export list appropriately exposes the two utility functions from this module. Note: static analysis suggests alphabetical sorting of `__all__`, but the current order is logical (primary function first, then helper). Sorting is optional.

flashinfer/topk.py (1)
424-428: Consider exporting `can_implement_filtered_topk`. The three main Top-K functions are correctly exported. However, `can_implement_filtered_topk()` (line 141) has a complete public docstring and appears intended for users to check GPU capability before using FilteredTopK. Suggested addition to exports:

```diff
 __all__ = [
     "top_k",
     "top_k_page_table_transform",
     "top_k_ragged_transform",
+    "can_implement_filtered_topk",
 ]
```

Optionally, you might also export the `topk` alias (line 254) for backward compatibility, though this is less critical.

flashinfer/concat_ops.py (1)
85-88: Consider sorting `__all__` entries to satisfy Ruff (RUF022). The exports are correct, but Ruff flags this `__all__` as unsorted. Reordering them alphabetically would address the lint without behavior change:

```diff
-__all__ = [
-    "get_concat_mla_module",
-    "concat_mla_k",
-]
+__all__ = [
+    "concat_mla_k",
+    "get_concat_mla_module",
+]
```

flashinfer/fp4_quantization.py (1)
1004-1018: LGTM! The `__all__` declaration correctly exposes the FP4 quantization public API. All 13 symbols are properly defined in the module. The static analysis tool suggests sorting the `__all__` list alphabetically for consistency; this is purely a style preference and can be addressed at your discretion.

flashinfer/testing/__init__.py (1)
32-45: LGTM! The `__all__` declaration properly exposes the testing utilities, matching the imported symbols from `.utils`. The static analysis tool suggests sorting the `__all__` list alphabetically; this is a minor style preference and entirely optional.

flashinfer/logits_processor/__init__.py (1)
36-62: Public API surface correctly established. The `__all__` declaration properly exposes the logits processor public API. Optionally, consider sorting the `__all__` list alphabetically (or by category, then alphabetically within each category) to improve maintainability, as suggested by the static analysis tool. The current grouping by category is already helpful for readability.

flashinfer/comm/__init__.py (1)
70-123: Public API surface correctly established. The `__all__` declaration properly exposes the communication module's public API. The categorical grouping with comments enhances readability. Optionally, consider alphabetically sorting entries within each category to make it easier to locate specific symbols and maintain consistency. The current categorical organization is already beneficial for understanding the module structure.
flashinfer/jit/__init__.py (1)
96-158: Good addition of public API surface definition. Adding `__all__` explicitly defines the module's public API, which is excellent practice. The static analysis tool suggests sorting the `__all__` list using isort-style ordering for consistency. This is optional and can be applied with Ruff. Run the following command to auto-fix:

```shell
ruff check --select RUF022 --fix flashinfer/jit/__init__.py
```

flashinfer/gemm/__init__.py (1)
49-53: Conditional export correctly extends the public API when CuTe-DSL is available. The conditional extension of `__all__` properly gates the new GEMM kernel exports behind the availability flag. The static analysis tool suggests alphabetically sorting this list, though the current logical ordering (function before class) is also reasonable. If you prefer consistency with typical Python conventions, consider alphabetical sorting:

```diff
 if _CUTE_DSL_AVAILABLE:
     __all__ += [
-        "grouped_gemm_nt_masked",
         "Sm100BlockScaledPersistentDenseGemmKernel",
+        "grouped_gemm_nt_masked",
     ]
```
📜 Review details
Configuration used: defaults
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (42)
docs/api/attention.rst, docs/api/comm.rst, docs/api/gemm.rst, docs/api/norm.rst, docs/api/quantization.rst, flashinfer/__init__.py, flashinfer/activation.py, flashinfer/aot.py, flashinfer/api_logging.py, flashinfer/artifacts.py, flashinfer/attention.py, flashinfer/autotuner.py, flashinfer/cascade.py, flashinfer/comm/__init__.py, flashinfer/compilation_context.py, flashinfer/concat_ops.py, flashinfer/cuda_utils.py, flashinfer/decode.py, flashinfer/deep_gemm.py, flashinfer/fp4_quantization.py, flashinfer/fp8_quantization.py, flashinfer/gemm/__init__.py, flashinfer/green_ctx.py, flashinfer/jit/__init__.py, flashinfer/logits_processor/__init__.py, flashinfer/mla.py, flashinfer/norm.py, flashinfer/page.py, flashinfer/pod.py, flashinfer/prefill.py, flashinfer/quantization.py, flashinfer/rope.py, flashinfer/sampling.py, flashinfer/sparse.py, flashinfer/testing/__init__.py, flashinfer/tllm_utils.py, flashinfer/topk.py, flashinfer/trtllm_low_latency_gemm.py, flashinfer/utils.py, flashinfer/version.py, flashinfer/xqa.py, requirements.txt
🧰 Additional context used
📓 Path-based instructions (4)
flashinfer/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
flashinfer/**/*.py: Use the `@functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation
Use the `@flashinfer_api` decorator for debugging API calls; enable via the `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)
Files:
flashinfer/tllm_utils.py, flashinfer/quantization.py, flashinfer/cuda_utils.py, flashinfer/sparse.py, flashinfer/compilation_context.py, flashinfer/pod.py, flashinfer/deep_gemm.py, flashinfer/trtllm_low_latency_gemm.py, flashinfer/cascade.py, flashinfer/fp4_quantization.py, flashinfer/logits_processor/__init__.py, flashinfer/utils.py, flashinfer/prefill.py, flashinfer/api_logging.py, flashinfer/green_ctx.py, flashinfer/concat_ops.py, flashinfer/topk.py, flashinfer/testing/__init__.py, flashinfer/page.py, flashinfer/artifacts.py, flashinfer/activation.py, flashinfer/autotuner.py, flashinfer/xqa.py, flashinfer/norm.py, flashinfer/rope.py, flashinfer/jit/__init__.py, flashinfer/mla.py, flashinfer/sampling.py, flashinfer/attention.py, flashinfer/gemm/__init__.py, flashinfer/comm/__init__.py, flashinfer/version.py, flashinfer/aot.py, flashinfer/fp8_quantization.py, flashinfer/decode.py, flashinfer/__init__.py
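For reference, a minimal sketch of the `functools.cache` convention that the guideline above describes. The `build_module` function, its body, and the `calls` tracker are illustrative, not flashinfer code.

```python
import functools

calls = []  # tracks how many times the body actually runs

@functools.cache
def build_module(arch: str):
    # In flashinfer, an expensive JIT compilation would run here; the
    # cache keys on the arguments, so each arch compiles at most once.
    calls.append(arch)
    return {"arch": arch}

a = build_module("sm90")
b = build_module("sm90")   # cache hit: body does not run again
c = build_module("sm100")  # new argument: body runs once more

assert a is b
assert calls == ["sm90", "sm100"]
```

Because the cache returns the identical object for repeated arguments, callers can hold onto the compiled module without worrying about duplicate compilation.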
flashinfer/jit/**/*.py
📄 CodeRabbit inference engine (CLAUDE.md)
flashinfer/jit/**/*.py: JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Use the `gen_jit_spec()` function to return a properly configured JitSpec from module generators, with appropriate `sources` and `extra_cuda_cflags`
Specify `supported_major_versions` in JitSpec to restrict kernel compilation to supported GPU architectures (e.g., SM versions 9, 10, 11, 12 for Hopper/newer)
Files:
flashinfer/jit/__init__.py
flashinfer/aot.py
📄 CodeRabbit inference engine (CLAUDE.md)
Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support
Files:
flashinfer/aot.py
flashinfer/__init__.py
📄 CodeRabbit inference engine (CLAUDE.md)
Export new operations in `flashinfer/__init__.py` to make them available as public API
Files:
flashinfer/__init__.py
🧠 Learnings (9)
📓 Common learnings
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/__init__.py : Export new operations in `flashinfer/__init__.py` to make them available as public API
Applied to files:
flashinfer/tllm_utils.py, flashinfer/quantization.py, flashinfer/cuda_utils.py, flashinfer/sparse.py, flashinfer/compilation_context.py, flashinfer/pod.py, flashinfer/deep_gemm.py, flashinfer/trtllm_low_latency_gemm.py, flashinfer/cascade.py, flashinfer/fp4_quantization.py, flashinfer/logits_processor/__init__.py, docs/api/attention.rst, flashinfer/utils.py, flashinfer/prefill.py, flashinfer/api_logging.py, flashinfer/green_ctx.py, flashinfer/concat_ops.py, flashinfer/topk.py, flashinfer/testing/__init__.py, flashinfer/page.py, flashinfer/artifacts.py, flashinfer/activation.py, flashinfer/autotuner.py, flashinfer/xqa.py, flashinfer/norm.py, flashinfer/rope.py, flashinfer/jit/__init__.py, flashinfer/mla.py, flashinfer/sampling.py, flashinfer/attention.py, flashinfer/gemm/__init__.py, flashinfer/comm/__init__.py, flashinfer/version.py, flashinfer/aot.py, flashinfer/fp8_quantization.py, flashinfer/decode.py, flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `flashinfer_api` decorator for debugging API calls, enable via `FLASHINFER_LOGLEVEL` environment variable (0=off, 1=basic, 3=detailed, 5=with stats)
Applied to files:
flashinfer/api_logging.py, flashinfer/testing/__init__.py, flashinfer/version.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/**/*.py : Use `functools.cache` decorator on Python API functions to implement module-level caching and avoid recompilation
Applied to files:
flashinfer/api_logging.py, flashinfer/decode.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to tests/**/*.py : Test implementations should use `flashinfer.utils` functions (`get_compute_capability`, `is_sm90a_supported`, `is_sm100a_supported`, etc.) to skip tests on unsupported GPU architectures
Applied to files:
flashinfer/testing/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to include/**/*.cuh : Kernel code in `include/flashinfer/` is automatically picked up by JIT compilation on changes - no pip reinstall needed
Applied to files:
flashinfer/jit/__init__.py, flashinfer/gemm/__init__.py, flashinfer/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : Use `gen_jit_spec()` function to return a properly configured JitSpec from module generators with appropriate `sources` and `extra_cuda_cflags`
Applied to files:
flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/jit/**/*.py : JIT module generators in `flashinfer/jit/` must follow the pattern: compute URI → create directory → (optional) render Jinja template → copy sources → return JitSpec
Applied to files:
flashinfer/jit/__init__.py
📚 Learning: 2025-12-30T09:34:39.900Z
Learnt from: CR
Repo: flashinfer-ai/flashinfer PR: 0
File: CLAUDE.md:0-0
Timestamp: 2025-12-30T09:34:39.900Z
Learning: Applies to flashinfer/aot.py : Register new operations in `flashinfer/aot.py` by calling the `gen_*_module()` function for AOT (Ahead-Of-Time) pre-compilation support
Applied to files:
flashinfer/aot.py
🧬 Code graph analysis (1)
flashinfer/gemm/__init__.py (1)
flashinfer/cute_dsl/blockscaled_gemm.py (2)
grouped_gemm_nt_masked (2945-3046), Sm100BlockScaledPersistentDenseGemmKernel (464-2449)
🪛 Ruff (0.14.10)
flashinfer/tllm_utils.py
15-18: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/fp4_quantization.py
1004-1018: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/logits_processor/__init__.py
36-62: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/green_ctx.py
298-307: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/concat_ops.py
85-88: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/testing/__init__.py
32-45: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/artifacts.py
253-265: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/jit/__init__.py
96-158: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/gemm/__init__.py
50-53: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/comm/__init__.py
70-123: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
flashinfer/version.py
26-29: __all__ is not sorted
Apply an isort-style sorting to __all__
(RUF022)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (4)
- GitHub Check: build (cu126, amd64)
- GitHub Check: build (cu126, arm64)
- GitHub Check: build (cu128, amd64)
- GitHub Check: Deploy Docs
🔇 Additional comments (30)
flashinfer/attention.py (1)
282-285: Public API surface correctly defined. The `__all__` declaration properly exports both the `BatchAttention` and `BatchAttentionWithAttentionSinkWrapper` classes, and these are re-exported in `flashinfer/__init__.py` to establish the public API surface.

flashinfer/utils.py (1)
1187-1189: LGTM! Public API surface explicitly defined. The `__all__` declaration correctly exports `next_positive_power_of_2`, establishing an explicit public API surface for this module. The function is properly re-exported in `flashinfer/__init__.py`, making it available as part of the top-level public API. The minimal export list appears intentional as part of the PR-wide effort to refine public APIs.

flashinfer/quantization.py (1)
142-145: LGTM! Quantization public API correctly defined. The `__all__` declaration appropriately exports `packbits` and `segment_packbits`, establishing the quantization module's public API surface. Both functions are properly decorated with `@flashinfer_api` and are correctly re-exported in `flashinfer/__init__.py`.

flashinfer/version.py (1)
26-29: LGTM! Version exports are correctly defined. The `__all__` declaration properly exposes the version metadata symbols, which are the only public exports for this module.

flashinfer/xqa.py (1)
532-535: LGTM! Public API exports are appropriate. The `__all__` declaration correctly exposes the two main XQA attention functions while keeping internal helper functions (e.g., `get_xqa_module`, `get_xqa_module_mla`) private. This follows sound API design principles.

requirements.txt (1)
7-7: Dependency version verified — `nvidia-cutlass-dsl` 4.3.4 exists on PyPI with no known security advisories. The version bump from 4.3.2 to 4.3.4 is valid and safe to merge.
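As an aside, a minimum-version constraint like this can be sanity-checked at runtime. The sketch below uses a naive numeric parse; real resolution should use `packaging.version`, which handles pre-releases and non-numeric segments. The helper names are illustrative.

```python
from importlib.metadata import PackageNotFoundError, version

def parse_ver(s: str) -> tuple:
    # Naive numeric parse: "4.3.4" -> (4, 3, 4). Breaks on tags like "4.3.4rc1".
    return tuple(int(p) for p in s.split(".")[:3])

def meets_minimum(pkg: str, minimum: str) -> bool:
    # True only if the package is installed and at least `minimum`.
    try:
        return parse_ver(version(pkg)) >= parse_ver(minimum)
    except PackageNotFoundError:
        return False

# The constraint bumped by this PR:
assert parse_ver("4.3.4") >= parse_ver("4.3.2")
# A package that is not installed simply fails the check:
assert meets_minimum("definitely-not-installed-pkg-xyz", "1.0") is False
```

Tuple comparison makes `(4, 3, 4) >= (4, 3, 2)` do the right lexicographic thing for plain numeric versions.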
flashinfer/artifacts.py (1)
253-265: LGTM! Comprehensive public API definition with helpful organization. The `__all__` list appropriately exports all artifact management utilities. The grouping by type (classes, functions) provides clarity that would be lost with alphabetical sorting.

flashinfer/compilation_context.py (1)
71-73: LGTM! Public API correctly defined. The `__all__` export appropriately exposes the `CompilationContext` class as the module's public interface.

flashinfer/autotuner.py (1)
794-796: The current `__all__` export list appropriately restricts the public API. The classes mentioned (`TunableRunner`, `TuningConfig`, `DynamicTensorSpec`, `ConstraintSpec`, `OptimizationProfile`, `AutoTuner`) are internal implementation details used within flashinfer and benchmarks. While heavily used internally, they are deliberately excluded from `__all__`, indicating the framework is designed for users to interact only through the `autotune` context manager, not to extend or directly instantiate these classes. No public documentation or examples encourage custom `TunableRunner` implementations, and there is no evidence of intended user-facing API extension patterns. The current design is consistent with the stated interface.

flashinfer/sparse.py (1)
1171-1174: Sparse attention wrappers correctly exported via `__all__`. The `__all__` cleanly exposes the two wrapper classes defined in this module and does not leak internal helpers. Looks good as a public surface.

flashinfer/cascade.py (1)
1083-1090: Cascade attention public API surface is coherent. The `__all__` list accurately exposes the cascade wrappers and merge helpers defined here, without exporting internal utilities. This matches the intended public API.

flashinfer/decode.py (1)
2682-2689: Decode module `__all__` cleanly reflects the intended public API. The decode `__all__` entries correspond to existing, user-facing decode wrappers and functions (including `fast_decode_plan`) and avoid exporting internal helpers. This is a sound explicit public surface.

flashinfer/deep_gemm.py (1)
1614-1622: Deep GEMM entry points exported appropriately. The `__all__` list exposes the main DeepGEMM-facing APIs (`KernelMap`, loader helpers, and grouped FP8 GEMM functions) and keeps internal helpers private. This is a sensible public surface.

flashinfer/page.py (1)
384-389: LGTM! The `__all__` declaration correctly exposes the public API for this module. All four symbols are properly defined functions in the file.

flashinfer/aot.py (1)
884-886: LGTM! Correctly exposes `register_default_modules` as the public API for this module, consistent with the repository pattern of explicit exports.

flashinfer/activation.py (1)
227-232: LGTM! The `__all__` declaration properly exposes the fused activation functions as the public API. All symbols are decorated with `@flashinfer_api` and well-defined.

flashinfer/prefill.py (1)
3759-3764: Public API surface for `flashinfer.prefill` looks consistent. The new `__all__` cleanly exposes the two batch wrappers and the single-request helpers, all of which are defined in this module and documented in docs/api/attention.rst. Internal helpers (JIT/module getters, trtllm/deepseek utilities) correctly remain unexported.

flashinfer/pod.py (1)
1204-1207: POD wrapper exports are correct and aligned with docs. Exporting only `PODWithPagedKVCacheWrapper` and `BatchPODWithPagedKVCacheWrapper` via `__all__` matches the documented public surface and keeps lower-level helpers internal.

docs/api/attention.rst (1)
28-29: New attention docs entries are consistent with the Python API. Adding `fast_decode_plan` to the decode autosummary and documenting `PODWithPagedKVCacheWrapper`/`BatchPODWithPagedKVCacheWrapper` under `flashinfer.pod` matches the exported symbols in the codebase and follows the existing documentation structure.

Also applies to: 114-129
flashinfer/rope.py (1)
1674-1685: RoPE public exports are well-scoped. The new `__all__` exposes only the high-level RoPE application helpers while keeping quantization and cache-append primitives internal. All exported symbols exist in this module and are already wrapped with `@flashinfer_api`, so the public surface is coherent.

flashinfer/sampling.py (1)
1590-1603: LGTM! The `__all__` declaration properly establishes the public API surface for the sampling module. All listed symbols correspond to functions defined in this module.

docs/api/norm.rst (1)
14-21: LGTM! Documentation properly updated to reflect the expanded public API surface. The new autosummary entries align with the `__all__` declaration in `flashinfer/norm.py`.

docs/api/comm.rst (1)
49-49: LGTM! Documentation updates are well-structured. The rename from `FP4QuantizationSFLayout` to `QuantizationSFLayout` and the new Unified AllReduce Fusion API section are clearly documented and properly formatted with appropriate Sphinx directives.

Also applies to: 97-116
docs/api/quantization.rst (1)
16-27: LGTM! FP8 quantization documentation properly structured. The new documentation section for the FP8 quantization module is well-organized and follows the established pattern in this file.
flashinfer/trtllm_low_latency_gemm.py (1)
227-229: LGTM! Appropriate public API surface definition. The `__all__` declaration correctly exports `prepare_low_latency_gemm_weights` as the public API. The `trtllm_low_latency_gemm` function appears to be an internal implementation detail not meant for direct external use.

docs/api/gemm.rst (2)
25-25: LGTM! Comprehensive GEMM documentation updates. The new sections for Blackwell GEMM, TensorRT-LLM Low Latency GEMM, and CuTe-DSL GEMM are well-organized and properly document the expanded API surface. The addition of `mm_fp8` to the FP8 GEMM section completes the documentation for that API.

Also applies to: 48-75

66-75: Dependency update is already present. The `nvidia-cutlass-dsl` dependency has been correctly updated to version 4.3.4 in requirements.txt. The pyproject.toml dynamically loads dependencies from this file, so no additional changes are needed. The CuTe-DSL GEMM APIs documented in this file are properly supported by the specified dependency version.
Likely an incorrect or invalid review comment.
flashinfer/fp8_quantization.py (1)
211-214: LGTM: Public API surface properly defined. The `__all__` declaration correctly exports the two public FP8 quantization functions, establishing a clear module API surface.

flashinfer/__init__.py (1)
93-100: LGTM: Conditional import pattern correctly handles the optional CuTe-DSL dependency. The try/except block appropriately exposes CuTe-DSL GEMM kernels when the nvidia-cutlass-dsl package (>=4.3.4) is available, while gracefully degrading when it's not installed.
flashinfer/gemm/__init__.py (1)
22-31: LGTM: Availability flag pattern clearly indicates CuTe-DSL support. The `_CUTE_DSL_AVAILABLE` flag provides an explicit indicator of whether CuTe-DSL GEMM kernels are available, which is helpful for runtime feature detection.
Subsequent commits:
- upd
- fix
- Update flashinfer/norm.py (Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>)
- Revert "Update flashinfer/norm.py" (reverts commit c350bf3)
- upd
- Revert "upd" (reverts commit bf3d49d3e66d5d8e48808b579547d20492143818)
📌 Description
Update the minimum version requirement of nvidia-cutlass-dsl to 4.3.4, which should resolve the ARM issue in #2279.
🔍 Related Issues
#2279
🚀 Pull Request Checklist
Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.
✅ Pre-commit Checks
- Installed pre-commit by running `pip install pre-commit` (or used your preferred method).
- Set up the hooks with `pre-commit install`.
- Ran `pre-commit run --all-files` and fixed any reported issues.

🧪 Tests
- … (unittest, etc.).

Reviewer Notes