
Conversation

Collaborator

@Rachmanino Rachmanino commented Dec 31, 2025

Summary by CodeRabbit

  • Documentation

    • Added inline comments to improve clarity in GEMM example code.
  • Chores

    • Updated block configuration parameters in the GEMM example to optimize performance characteristics.


@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

@coderabbitai
Contributor

coderabbitai bot commented Dec 31, 2025

📝 Walkthrough

Walkthrough

The block_N parameter was reduced from 256 to 128 in the GEMM kernel configuration, and the A_shared and B_shared buffer shapes adjust accordingly. Loop logic and GEMM operations remain unchanged. Inline comments were added to the T.copy statements for clarity.

Changes

Cohort / File(s) Summary
GEMM kernel parameter tuning
examples/gemm_sm100/gemm_tcgen5mma.py
Reduced block_N from 256 to 128; block_M and block_K unchanged at 128. A_shared and B_shared shapes automatically adjusted due to transposed flags. num_stages remains 2. Added inline comments to two T.copy operations. No logic changes.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Poem

🐰 A hop and a skip through matrix blocks,
From 256 down to 128 we dock,
Shared memory dancing, shapes realign,
Comments now whisper what's fine,
GEMM optimized—a tuning so neat!

Pre-merge checks and finishing touches

❌ Failed checks (1 warning, 1 inconclusive)
  • Docstring Coverage ⚠️ Warning: Docstring coverage is 0.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.
  • Title check ❓ Inconclusive: The title '[BugFix] fix tcgen5mma example' is vague and generic, giving no meaningful detail about what bug was fixed or what actually changed. Resolution: provide a more specific title, e.g. 'Reduce block_N from 256 to 128 in tcgen5mma example' or 'Fix tcgen5mma block size configuration for SM100'.

✅ Passed checks (1 passed)
  • Description Check ✅ Passed: Check skipped; CodeRabbit's high-level summary is enabled.

Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/gemm_sm100/gemm_tcgen5mma.py (1)

41-42: Consider clarifying the inline comments.

The comments # not trans_A and # trans_B accurately reflect the configuration values, but they could be more descriptive about what they document. Consider rephrasing to clarify whether they describe the data layout or the configuration, for example: # A is row-major (trans_A=False) and # B is column-major (trans_B=True).

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 53ea96c and aca04e6.

📒 Files selected for processing (1)
  • examples/gemm_sm100/gemm_tcgen5mma.py
🧰 Additional context used
🧠 Learnings (1)
📓 Common learnings
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h.
Learnt from: botbw
Repo: tile-ai/tilelang PR: 691
File: src/tl_templates/cuda/gemm_sp_sm80.h:81-85
Timestamp: 2025-09-15T10:51:06.985Z
Learning: In CUTLASS tensor operation layouts, crosswise constants should be computed using sizeof(T) (bytes), not cutlass::sizeof_bits<T>::value (bits). However, the layout template parameter should use sizeof_bits<T>::value (bits). This is the established pattern in the official CUTLASS codebase, as seen in default_mma_core_sparse_sm80.h where Crosswise uses sizeof(ElementA) but the layout template uses sizeof_bits<ElementA>::value.
🔇 Additional comments (2)
examples/gemm_sm100/gemm_tcgen5mma.py (2)

55-55: LGTM! The block size reduction addresses the bug.

Reducing block_N from 256 to 128 significantly reduces shared memory usage per block (from ~160KB to ~96KB), which likely addresses hardware constraints or memory limitations specific to TensorCore Gen5 MMA operations. The dependent buffer shapes (A_shared, B_shared, C_shared) are properly adjusted by the existing logic.
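The shared-memory figures quoted above can be checked with back-of-the-envelope arithmetic. This sketch assumes float16 (2-byte) elements and counts single-buffered A, B, and C tiles; the actual footprint also depends on num_stages and the accumulator layout.

```python
# Estimate per-block shared memory for the GEMM tiles, assuming fp16 tiles
# and single buffering (an assumption for this estimate, not the exact
# allocation in the example).
def smem_kib(block_M, block_N, block_K, elem_bytes=2):
    a = block_M * block_K * elem_bytes  # A_shared
    b = block_N * block_K * elem_bytes  # B_shared (trans_B = True)
    c = block_M * block_N * elem_bytes  # C_shared
    return (a + b + c) / 1024

before = smem_kib(128, 256, 128)  # old config, block_N = 256
after = smem_kib(128, 128, 128)   # new config, block_N = 128
print(before, after)  # 160.0 96.0
```

The numbers line up with the ~160KB and ~96KB cited in the review comment.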


58-58: No action needed. The transition from num_stages = 0 to num_stages = 2 is intentional behavior by design. The conditional logic explicitly enables 2-stage pipelining when all block dimensions are smaller than 256, and disables it when any dimension is 256 or larger. With the new tile configuration (128×128×128), this results in the correct num_stages = 2 setting. The code includes validation tests that confirm correct compilation and numerical accuracy.
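The staging rule described above can be sketched as a small predicate. The function name is illustrative; the actual condition lives in the example file.

```python
# Sketch of the described rule: enable 2-stage pipelining only when every
# block dimension is below 256, otherwise disable pipelining.
def choose_num_stages(block_M, block_N, block_K):
    return 2 if max(block_M, block_N, block_K) < 256 else 0

print(choose_num_stages(128, 256, 128))  # old tile config -> 0
print(choose_num_stages(128, 128, 128))  # new tile config -> 2
```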

@LeiWang1999 LeiWang1999 merged commit d6eb5d3 into tile-ai:main Jan 1, 2026
9 of 11 checks passed
@Rachmanino Rachmanino deleted the tcgen5 branch January 1, 2026 08:40
