[Feat] Add fast sine and cosine definitions in CUDA templates #865
Conversation
Walkthrough: Adds two macros in src/tl_templates/cuda/common.h: hsin and hcos, mapped to cutlass::fast_sin and cutlass::fast_cos. No other behavior or public declarations changed.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes: Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates optimized sine and cosine functions into the CUDA template library.
Code Review
This pull request introduces fast sine and cosine function definitions for CUDA templates by adding hsin and hcos macros that map to cutlass::fast_sin and cutlass::fast_cos. The change is consistent with existing fast math function definitions in common.h. My review includes a suggestion to use inline functions instead of macros for better type safety and to align with modern C++ best practices. This would be a good step towards improving the overall code quality by gradually moving away from preprocessor macros for function-like behavior.
```cpp
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos
```
While using macros is consistent with the existing definitions in this file, it's generally safer to use inline functions in C++. This avoids potential issues with macro expansion, improves type safety, and respects namespaces. Since the project targets C++17 (it already uses std::is_same_v), a small function template gives you generic wrappers concisely; note that bare auto parameters in ordinary functions are a C++20 feature, so under C++17 the template must be spelled out. This would be a good opportunity to start migrating away from macros for these function aliases.
```cpp
template <class T> TL_DEVICE auto hsin(T x) { return cutlass::fast_sin(x); }
template <class T> TL_DEVICE auto hcos(T x) { return cutlass::fast_cos(x); }
```
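To make the type-safety point concrete, here is a toy example (not code from this PR): a function-like macro pastes raw tokens, so an argument expression interacts with operator precedence, while an inline function evaluates its argument first.

```cpp
// Toy example (not from the PR): macro token substitution vs. real evaluation.
#define fast_square(x) x * x  // macro: pastes tokens verbatim, no type checking

inline int fast_square_fn(int x) { return x * x; }  // function: evaluates x first

// fast_square(1 + 2) expands to 1 + 2 * 1 + 2, which is 5, not 9;
// fast_square_fn(1 + 2) computes 3 * 3 == 9.
```

The same pasting behavior is why a macro alias like `#define hsin cutlass::fast_sin` silently renames every later use of the token `hsin`, regardless of namespace or scope.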
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tl_templates/cuda/common.h (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: format-check
```cpp
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos
```
🧩 Analysis chain
Adds fast trig macros consistently; verify CUTLASS support for all target types/versions.
hsin/hcos align with existing hexp/hlog/hsqrt patterns. Please confirm that cutlass::fast_sin / cutlass::fast_cos are available (and device-callable) for your targeted CUTLASS/CUDA versions and support the input types you pass (e.g., half_t, bfloat16_t, float). If bfloat16_t isn’t covered (similar to the custom bfloat16_t fast_exp wrapper below), consider adding analogous wrappers or fallbacks.
Run the following to check for potential collisions and usage sites:
Optionally, verify availability in docs:
🏁 Script executed:

```shell
#!/bin/bash
# 1) Find any other definitions/usages of hsin/hcos to avoid macro conflicts.
rg -n -C2 -P '(^\s*#\s*define\s+hsin\b)|\bhsin\s*\('
rg -n -C2 -P '(^\s*#\s*define\s+hcos\b)|\bhcos\s*\('
# 2) Confirm CUTLASS fast trig references in the repo (sanity check).
rg -n -C2 -P '\bcutlass::fast_(sin|cos)\b'
# 3) Optional: locate bfloat16 sin/cos usage that might require wrappers.
rg -n -C2 -P '\b(bfloat16_t|half_t)\b.*\b(hsin|hcos)\s*\('
```

Length of output: 1025
🌐 Web query:
Does CUTLASS provide cutlass::fast_sin and cutlass::fast_cos in cutlass/fast_math.h, and what types are supported (float, half_t, bfloat16_t)?
💡 Result:
Short answer: No — CUTLASS’s fast_math.h does not expose cutlass::fast_sin / cutlass::fast_cos. The fast_math header in CUTLASS implements a set of “fast_…” helpers (e.g. fast_tanh, fast_exp, etc.) with overloads for float / double and some device half support, but you won’t find cutlass::fast_sin/fast_cos there. The header and related GitHub issues show the library provides fast variants for operations like tanh (including float/double/half overloads) but not sin/cos or bfloat16 fast trig helpers. [1][2]
Sources:
- CUTLASS fast_math.h (repository header listing of the fast_math utilities). [1]
- CUTLASS GitHub issue showing fast_* functions (fast_tanh) and type overloads discussion. [2]
References:
[1] cutlass/include/cutlass/fast_math.h — NVIDIA/cutlass (GitHub).
[2] NVIDIA/cutlass issue discussing fast_tanh overloads in fast_math.h.
Remove/guard hsin/hcos — cutlass::fast_sin / cutlass::fast_cos not provided
CUTLASS’s fast_math.h does not expose cutlass::fast_sin/cutlass::fast_cos; the macros in src/tl_templates/cuda/common.h (lines 21–22) will reference undefined symbols and likely break builds.
- Replace the macros with device-callable alternatives (sinf/cosf or CUDA device intrinsics) or add explicit device wrappers that call a supported fast-trig implementation.
- Add overloads/fallbacks for half_t and bfloat16_t (convert to float → trig → convert back) — CUTLASS lacks bfloat16 fast-trig helpers.
- Ensure wrappers are annotated for device/host as needed.
src/tl_templates/cuda/common.h:21-22
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 21-22, the macros #define hsin
cutlass::fast_sin and #define hcos cutlass::fast_cos reference non-existent
CUTLASS symbols; replace them with guarded, device-callable wrappers that call
supported trig functions (e.g., sinf/cosf or CUDA device intrinsics) and mark
them __host__ __device__; provide overloads/fallbacks for half_t and bfloat16_t
by converting to float, performing the trig operation, then converting back;
protect the replacements with #ifdef/#else to use cutlass implementations if
available, and ensure proper includes and namespace qualification so builds
won’t reference undefined symbols.
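A minimal sketch of the guarded, device-callable wrappers this review recommends, under stated assumptions: TL_DEVICE is redefined here only so the sketch is self-contained (in common.h the project's own definition applies), the fallback uses CUDA's sinf/cosf, and the half-precision overloads are shown only as commented illustrations because the exact conversion helpers depend on the CUTLASS version.

```cpp
// Hypothetical sketch (not the PR's code): device-callable hsin/hcos backed
// by sinf/cosf rather than the nonexistent cutlass::fast_sin/fast_cos.
#include <cmath>

#ifdef __CUDACC__
#define TL_DEVICE __forceinline__ __device__
#else
#define TL_DEVICE inline  // host fallback so the sketch also compiles as plain C++
#endif

TL_DEVICE float hsin(float x) { return sinf(x); }
TL_DEVICE float hcos(float x) { return cosf(x); }

// For half_t / bfloat16_t, the suggested pattern is convert -> trig -> convert,
// e.g. (assuming CUTLASS's conversion constructors):
//   TL_DEVICE cutlass::half_t hsin(cutlass::half_t x) {
//     return cutlass::half_t(hsin(float(x)));
//   }
```

If a future CUTLASS release does ship fast trig helpers, the wrappers can be switched over behind an `#ifdef` without touching call sites, which is the main advantage over the raw macro aliases.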
…plicit fastmath op to invoke (#875)

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
* Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
* Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
* Add precision comparison tool for CUDA operations. This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
* Add precision comparison tool for CUDA operations. This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.

…ile-ai#882) (repeats the commits above, plus:)

* Add IEEE-compliant mathematical operations and refactor fast math module. This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.
* debug removed
* Refactor IEEE math tests for improved readability and consistency. This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.
* Update README.md to enhance formatting of precision comparison results. This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
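The core idea of the precision-comparison tool these commits describe, measuring a fast single-precision implementation against a double-precision reference, can be sketched in a few lines. This is an illustrative host-side sketch only; the sample range, metric, and function name are assumptions, not the actual tool's configuration.

```cpp
#include <cmath>
#include <cstddef>

// Max absolute error of single-precision sinf against a double-precision
// reference, sampled at n evenly spaced points in [0, 1).
inline double max_abs_error_sinf(std::size_t n) {
    double max_err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = static_cast<double>(i) / static_cast<double>(n);
        double approx = static_cast<double>(sinf(static_cast<float>(x)));
        double err = std::fabs(approx - std::sin(x));
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```

The real tool extends this pattern across many operations (division, reciprocal, exp, log, trig, sqrt) and backends (CUDA Precise/Fast, Triton, TileLang), then tabulates the error statistics in a README.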
This pull request adds new mathematical function macros to the CUDA common header to support fast sine and cosine operations, improving performance and consistency for trigonometric calculations.
Math function enhancements:
Defined fast sine (hsin) and fast cosine (hcos) using cutlass::fast_sin and cutlass::fast_cos in src/tl_templates/cuda/common.h.