Skip to content

Conversation

@LeiWang1999
Copy link
Member

@LeiWang1999 LeiWang1999 commented Dec 24, 2025

as title, previous pr introduced atomatically tma swizzle lowering, but some tma shape can not apply swizzling and then raise errors, we should fallback it into linear gemm layout.

Summary by CodeRabbit

Release Notes

  • Bug Fixes
    • Refined memory layout selection for 64-bit elements. Now uses linear layout fallback when stride alignment requirements are not met, preventing attempts to use bank swizzle layouts in incompatible configurations.
    • Enhanced stride alignment validation across supported GPU architectures for improved consistency and robustness.

✏️ Tip: You can customize this high-level summary in your review settings.

…mat_stride % 8 != 0. Refactor swizzling layout conditions to check mat_stride before mat_continuous, improving layout selection logic for better performance.
@github-actions
Copy link

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Copy link
Contributor

coderabbitai bot commented Dec 24, 2025

📝 Walkthrough

Walkthrough

A single file modification that adds stride alignment checks for 64-bit element sizes in GemmABLayout Hopper and Sm100 paths. When mat_stride is not divisible by 8, the code now falls back to a linear layout instead of attempting bank swizzle layouts, preventing misaligned stride scenarios.

Changes

Cohort / File(s) Summary
GemmABLayout Stride Alignment
src/layout/gemm_layouts.cc
Added stride alignment guards (mat_stride % 8 == 0) for 64-bit element handling in both Hopper and Sm100 architectural paths. Non-aligned strides now return linear layout instead of proceeding to bank swizzle layout branches, preventing unsafe stride misalignment scenarios.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~12 minutes

Poem

A rabbit hops through stride-aligned lands,
With checks for eight-division in hand,
When paths don't align, we take the straight way,
A linear fallback saves the day! 🐰✨

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title directly describes the main change: introducing a fallback to linear layout instead of raising errors, which aligns with the PR's core objective to handle TMA swizzle lowering failures gracefully.
✨ Finishing touches
  • 📝 Generate docstrings
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Post copyable unit tests in a comment

📜 Recent review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0006621 and 44eeb56.

📒 Files selected for processing (1)
  • src/layout/gemm_layouts.cc
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/layout/gemm_layouts.cc
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/layout/gemm_layouts.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (3)
src/layout/gemm_layouts.cc (3)

760-765: LGTM! Fallback prevents crashes for misaligned float64 strides.

The early fallback to linear layout when mat_stride % 8 != 0 correctly prevents the downstream ICHECK(stride % 8 == 0) failure in makeQuarterBankSwizzleLayout (line 388). This is appropriately placed after the more specialized F64_Kouter and F64_Kinner layouts are attempted.


769-779: LGTM! Stride alignment guard prevents swizzle layout failures.

Wrapping the swizzle layout selection with if (mat_stride % 8 == 0) correctly prevents calls to makeFullBankSwizzleLayout, makeHalfBankSwizzleLayout, and makeQuarterBankSwizzleLayout when the stride constraint isn't met. All three functions require stride % 8 == 0 (lines 388, 407, 426), so this guard is necessary.


797-807: LGTM! Consistent stride alignment guard for SM100.

The changes mirror the Hopper implementation, correctly guarding swizzle layout selection with mat_stride % 8 == 0. This ensures consistency across architectures and prevents the same class of failures.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@LeiWang1999 LeiWang1999 changed the title [Bugfix] Fallback to a Linear Layout instead of raise errors [Bugfix] Fallback to a Linear Layout instead of raising errors Dec 24, 2025
@LeiWang1999 LeiWang1999 merged commit cfccd63 into tile-ai:main Dec 24, 2025
6 of 7 checks passed
Copy link
Contributor

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/layout/gemm_layouts.cc (1)

732-751: Add stride alignment guard to makeGemmABLayout before calling bank swizzle layouts.

The makeGemmABLayout function lacks the mat_stride % 8 == 0 guard before calling swizzle layouts (lines 745, 747), while both makeGemmABLayoutHopper (line 769) and makeGemmABLayoutSm100 (line 797) include this check. All three bank swizzle functions—makeFullBankSwizzleLayout, makeHalfBankSwizzleLayout, and makeQuarterBankSwizzleLayout—have ICHECK(stride % 8 == 0) assertions. This inconsistency could cause crashes if makeGemmABLayout is called with misaligned strides when appropriate continuous divisibility conditions are met. Add the stride alignment guard to match the protective pattern used in the newer variants.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 0006621 and 44eeb56.

📒 Files selected for processing (1)
  • src/layout/gemm_layouts.cc
🧰 Additional context used
🧠 Learnings (3)
📓 Common learnings
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.

Applied to files:

  • src/layout/gemm_layouts.cc
📚 Learning: 2025-09-15T10:51:06.985Z
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.

Applied to files:

  • src/layout/gemm_layouts.cc
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: Test for Python 3.12 with Metal (on macos-latest)
  • GitHub Check: Test for Python 3.12 with CUDA-12.8 (on self-hosted-nvidia)
  • GitHub Check: Test for Python 3.12 with Nightly-ROCm-7.1 (on self-hosted-amd)
🔇 Additional comments (3)
src/layout/gemm_layouts.cc (3)

760-765: LGTM! Fallback prevents crashes for misaligned float64 strides.

The early fallback to linear layout when mat_stride % 8 != 0 correctly prevents the downstream ICHECK(stride % 8 == 0) failure in makeQuarterBankSwizzleLayout (line 388). This is appropriately placed after the more specialized F64_Kouter and F64_Kinner layouts are attempted.


769-779: LGTM! Stride alignment guard prevents swizzle layout failures.

Wrapping the swizzle layout selection with if (mat_stride % 8 == 0) correctly prevents calls to makeFullBankSwizzleLayout, makeHalfBankSwizzleLayout, and makeQuarterBankSwizzleLayout when the stride constraint isn't met. All three functions require stride % 8 == 0 (lines 388, 407, 426), so this guard is necessary.


797-807: LGTM! Consistent stride alignment guard for SM100.

The changes mirror the Hopper implementation, correctly guarding swizzle layout selection with mat_stride % 8 == 0. This ensures consistency across architectures and prevents the same class of failures.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant