Fix fp4 pointer arithmetic in CUDA codegen #1524

LJC00118 · 2025-12-24T10:52:21Z

Summary by CodeRabbit

Bug Fixes
- Refined pointer/index arithmetic for 4-bit and 1-bit scalar types to ensure correct memory loads
- Fixed fp4 (4-bit float) memory access index calculation for packed elements
- Improved code generation for buffer element access to use simplified index expressions
Documentation
- Added clarifying comments explaining pointer arithmetic adjustments

_{✏️ Tip: You can customize this high-level summary in your review settings.}

github-actions · 2025-12-24T10:52:29Z

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run pre-commit run --all-files in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀

coderabbitai · 2025-12-24T10:52:34Z

📝 Walkthrough

Walkthrough

Pre-simplified index division in GetBufferRef for 4-bit, 1-bit, and fp4 scalar types using truncdiv and arith::Analyzer().Simplify(). Pointer arithmetic emission now uses the simplified index and includes an explicit div_factor fix for fp4 pointer math.

Changes

Cohort / File(s)	Summary
CUDA Buffer Reference Indexing `src/target/codegen_cuda.cc`	In `GetBufferRef`, use `truncdiv(index, div_factor)` and `arith::Analyzer().Simplify()` for 4-bit and 1-bit scalar types; add explicit `div_factor` handling for `fp4` (float4 packed) when `lanes == 1` (div_factor = 2); replace inline `/ div_factor` in emitted pointer arithmetic with the simplified `index_str`; minor ldmatrix indexing format tweak and clarifying comments.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Poem

🐰 I nibble bytes in CUDA's glade,
Truncdiv trims the index braid,
Four-bit, one-bit, fp4 aligned,
Pointer paths now neatly signed,
Hop—codegen's tidy, carrot-fresh and brave.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The title 'Fix fp4 pointer arithmetic in CUDA codegen' clearly and specifically describes the main change in the changeset—addressing fp4 pointer arithmetic issues in CUDA code generation.
Docstring Coverage	✅ Passed	Docstring coverage is 100.00% which is sufficient. The required threshold is 80.00%.

✨ Finishing touches

📝 Generate docstrings

🧪 Generate unit tests (beta)

Create PR with unit tests
Post copyable unit tests in a comment

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

_{Comment @coderabbitai help to get the list of available commands and usage tips.}

coderabbitai

Actionable comments posted: 0

🧹 Nitpick comments (1)

src/target/codegen_cuda.cc (1)
1329-1338: Fix looks correct, but consider refactoring to reduce duplication.

The fp4 pointer arithmetic fix correctly addresses the issue where scalar fp4 pointer arithmetic was treating each 4-bit element as a full byte. The division by 2 accounts for packing 2 fp4 elements per byte.

However, there's code duplication with lines 1322-1323. Consider refactoring:
🔎 Suggested refactoring to reduce duplication
   } else if (t == buffer_element_dtype) {
     os << buffer_str << "[" << index_str << "]";
   } else {
-    // Fix fp4 pointer arithmetic: fp4 elements are 4-bit packed 2 per byte.
-    // fp4* + n incorrectly advances n bytes (skipping 2n elements).
-    int div_factor = 1;
-    if (buffer_element_dtype.is_float4() && buffer_element_dtype.lanes() == 1) {
-      div_factor = 2;
-    }
-    index_str =
-        PrintExpr(arith::Analyzer().Simplify(truncdiv(index, div_factor)));
-
+    // Fix fp4 pointer arithmetic: scalar fp4 elements are 4-bit packed, 2 per byte.
+    // Since fp4* + n advances n bytes, divide index by 2 to get correct byte offset.
+    if (buffer_element_dtype.is_float4() && buffer_element_dtype.lanes() == 1) {
+      index_str = PrintExpr(arith::Analyzer().Simplify(truncdiv(index, 2)));
+    }
     os << "*" << ptr_cast(t) << "(" << buffer_str << " + " << index_str << ")";
   }
This eliminates the div_factor variable and makes the logic more explicit.

📜 Review details

Configuration used: defaults

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 65654e5 and 7fcbfa8.

📒 Files selected for processing (1)

src/target/codegen_cuda.cc

🧰 Additional context used

🧬 Code graph analysis (1)

src/target/codegen_cuda.cc (1)

tilelang/language/tir/op.py (1)

truncdiv (3119-3142)

🔇 Additional comments (2)

src/target/codegen_cuda.cc (2)

1322-1325: LGTM: Pre-simplification improves code generation.

The pre-simplification of the index division using arith::Analyzer().Simplify(truncdiv(index, div_factor)) is a good approach. This allows the analyzer to fold constants and simplify expressions before code generation, potentially producing more efficient pointer arithmetic code.

2189-2189: Minor formatting fix.

Removed extraneous whitespace - no functional change.

Fix fp4 pointer arithmetic in CUDA codegen

65654e5

Fix fp4 pointer arithmetic in CUDA codegen

7fcbfa8

coderabbitai bot reviewed Dec 24, 2025

View reviewed changes

LeiWang1999 approved these changes Dec 24, 2025

View reviewed changes

LeiWang1999 merged commit d0bcc69 into tile-ai:main Dec 24, 2025
6 checks passed

This was referenced Jan 23, 2026

[Enhancement] Legalize subtype access #1724

Merged

[Bugfix] Handle offset handling for subtype ptr #1738

Merged

kurisu6912 mentioned this pull request Feb 11, 2026

[LoopVectorize] Loop Independent Var Optimization in IfThenElse Expr kurisu6912/tilelang#2

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix fp4 pointer arithmetic in CUDA codegen #1524

Fix fp4 pointer arithmetic in CUDA codegen #1524

Uh oh!

LJC00118 commented Dec 24, 2025 •

edited by coderabbitai bot

Loading

Uh oh!

github-actions bot commented Dec 24, 2025

Uh oh!

coderabbitai bot commented Dec 24, 2025 •

edited

Loading

Walkthrough

Changes

Estimated code review effort

Poem

Uh oh!

coderabbitai bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Fix fp4 pointer arithmetic in CUDA codegen #1524

Fix fp4 pointer arithmetic in CUDA codegen #1524

Uh oh!

Conversation

LJC00118 commented Dec 24, 2025 • edited by coderabbitai bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary by CodeRabbit

Uh oh!

github-actions bot commented Dec 24, 2025

Uh oh!

coderabbitai bot commented Dec 24, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Walkthrough

Changes

Estimated code review effort

Poem

Pre-merge checks and finishing touches

Uh oh!

coderabbitai bot left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

LJC00118 commented Dec 24, 2025 •

edited by coderabbitai bot

Loading

coderabbitai bot commented Dec 24, 2025 •

edited

Loading