
Conversation

@Rachmanino (Collaborator) commented Sep 23, 2025

This pull request adds new mathematical function macros to the CUDA common header to support fast sine and cosine operations, improving performance and consistency for trigonometric calculations.

Math function enhancements:

  • Added macros for fast sine (hsin) and fast cosine (hcos) using cutlass::fast_sin and cutlass::fast_cos in src/tl_templates/cuda/common.h.

Summary by CodeRabbit

  • New Features
    • Added fast sine and cosine operations to GPU-accelerated math helpers, enabling optimized trigonometric computations.
    • Complements existing fast math utilities and can reduce latency for workloads relying on trig calculations.
    • No breaking changes: existing functionality remains unaffected, with seamless integration for current users.

@coderabbitai coderabbitai bot (Contributor) commented Sep 23, 2025

Walkthrough

Adds two macros in src/tl_templates/cuda/common.h: hsin and hcos, mapped to cutlass::fast_sin and cutlass::fast_cos. No other behavior or public declarations changed.

Changes

Cohort: CUDA math helper macros
File(s): src/tl_templates/cuda/common.h
Summary: Added macros hsin and hcos mapping to cutlass::fast_sin and cutlass::fast_cos; no other modifications.

Estimated code review effort

🎯 1 (Trivial) | ⏱️ ~3 minutes

Poem

I twitched my whiskers, cosine in the air,
Sine waves ripple through CUDA’s lair.
Two swift hops—hsin, hcos—now in flight,
Fast as moonbeams on a byte-lit night.
I thump approval: math made light.

Pre-merge checks and finishing touches

✅ Passed checks (3 passed)
  • Description Check: Passed (skipped; CodeRabbit's high-level summary is enabled).
  • Title Check: Passed. The title "[Feat] Add fast sine and cosine definitions in CUDA templates" succinctly and accurately summarizes the primary change, matching the modification in src/tl_templates/cuda/common.h, without vague language.
  • Docstring Coverage: Passed. No functions found in the changes; docstring coverage check skipped.

@github-actions

👋 Hi! Thank you for contributing to the TileLang project.

Please remember to run bash format.sh in the root directory of the project to ensure your changes are properly linted and formatted. This will help ensure your contribution passes the format check.

We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work!

🚀

@gemini-code-assist (Contributor) commented

Summary of Changes

Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates optimized sine and cosine functions into the CUDA template library. By defining hsin and hcos macros, it provides convenient access to cutlass::fast_sin and cutlass::fast_cos, which are designed for high-performance trigonometric calculations on NVIDIA GPUs. This enhancement aims to improve the efficiency of CUDA kernels that rely on these mathematical operations.

Highlights

  • New Fast Math Definitions: Introduced hsin and hcos as new macro definitions, aliasing cutlass::fast_sin and cutlass::fast_cos respectively.
  • CUDA Template Integration: These new fast trigonometric functions are now available within the CUDA common templates, enhancing the library's mathematical capabilities for GPU-accelerated computations.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces fast sine and cosine function definitions for CUDA templates by adding hsin and hcos macros that map to cutlass::fast_sin and cutlass::fast_cos. The change is consistent with existing fast math function definitions in common.h. My review includes a suggestion to use inline functions instead of macros for better type safety and to align with modern C++ best practices. This would be a good step towards improving the overall code quality by gradually moving away from preprocessor macros for function-like behavior.

Comment on lines +21 to +22
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos

Severity: medium

While using macros is consistent with the existing definitions in this file, it is generally safer to use inline functions in C++: this avoids macro-expansion pitfalls, improves type safety, and respects namespaces. Since the project targets C++17 (it already uses std::is_same_v), a small function template keeps the wrappers generic; `auto` function parameters would be more concise still, but those require C++20 (only generic lambdas accept `auto` parameters as of C++14). This would be a good opportunity to start migrating away from macros for these function aliases.

template <typename T> TL_DEVICE auto hsin(T x) { return cutlass::fast_sin(x); }
template <typename T> TL_DEVICE auto hcos(T x) { return cutlass::fast_cos(x); }

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 48c9a35 and 09c1269.

📒 Files selected for processing (1)
  • src/tl_templates/cuda/common.h (1 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
  • GitHub Check: build-test-amd
  • GitHub Check: format-check

Comment on lines +21 to +22
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos

⚠️ Potential issue

🧩 Analysis chain

Adds fast trig macros consistently; verify CUTLASS support for all target types/versions.

hsin/hcos align with existing hexp/hlog/hsqrt patterns. Please confirm that cutlass::fast_sin / cutlass::fast_cos are available (and device-callable) for your targeted CUTLASS/CUDA versions and support the input types you pass (e.g., half_t, bfloat16_t, float). If bfloat16_t isn’t covered (similar to the custom bfloat16_t fast_exp wrapper below), consider adding analogous wrappers or fallbacks.

Run the following to check for potential collisions and usage sites:



🏁 Script executed:

#!/bin/bash
# 1) Find any other definitions/usages of hsin/hcos to avoid macro conflicts.
rg -n -C2 -P '(^\s*#\s*define\s+hsin\b)|\bhsin\s*\('
rg -n -C2 -P '(^\s*#\s*define\s+hcos\b)|\bhcos\s*\('

# 2) Confirm CUTLASS fast trig references in the repo (sanity check).
rg -n -C2 -P '\bcutlass::fast_(sin|cos)\b'

# 3) Optional: locate bfloat16 sin/cos usage that might require wrappers.
rg -n -C2 -P '\b(bfloat16_t|half_t)\b.*\b(hsin|hcos)\s*\('

Length of output: 1025


🌐 Web query:

Does CUTLASS provide cutlass::fast_sin and cutlass::fast_cos in cutlass/fast_math.h, and what types are supported (float, half_t, bfloat16_t)?

💡 Result:

Short answer: No — CUTLASS’s fast_math.h does not expose cutlass::fast_sin / cutlass::fast_cos. The fast_math header in CUTLASS implements a set of “fast_…” helpers (e.g. fast_tanh, fast_exp, etc.) with overloads for float / double and some device half support, but you won’t find cutlass::fast_sin/fast_cos there. The header and related GitHub issues show the library provides fast variants for operations like tanh (including float/double/half overloads) but not sin/cos or bfloat16 fast trig helpers. [1][2]

Sources:

  • CUTLASS fast_math.h (repository header listing of the fast_math utilities). [1]
  • CUTLASS GitHub issue showing fast_* functions (fast_tanh) and type overloads discussion. [2]

References:
[1] cutlass/include/cutlass/fast_math.h — NVIDIA/cutlass (GitHub).
[2] NVIDIA/cutlass issue discussing fast_tanh overloads in fast_math.h.


Remove/guard hsin/hcos — cutlass::fast_sin / cutlass::fast_cos not provided

CUTLASS’s fast_math.h does not expose cutlass::fast_sin/cutlass::fast_cos; the macros in src/tl_templates/cuda/common.h (lines 21–22) will reference undefined symbols and likely break builds.

  • Replace the macros with device-callable alternatives (sinf/cosf or CUDA device intrinsics) or add explicit device wrappers that call a supported fast-trig implementation.
  • Add overloads/fallbacks for half_t and bfloat16_t (convert to float → trig → convert back) — CUTLASS lacks bfloat16 fast-trig helpers.
  • Ensure wrappers are annotated for device/host as needed.

src/tl_templates/cuda/common.h:21-22

🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 21-22, the macros `#define hsin cutlass::fast_sin` and `#define hcos cutlass::fast_cos` reference non-existent CUTLASS symbols. Replace them with guarded, device-callable wrappers that call supported trig functions (e.g., sinf/cosf or CUDA device intrinsics) and mark them __host__ __device__. Provide overloads/fallbacks for half_t and bfloat16_t by converting to float, performing the trig operation, then converting back. Protect the replacements with #ifdef/#else to prefer cutlass implementations if available, and ensure proper includes and namespace qualification so builds won't reference undefined symbols.
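A minimal sketch of what such guarded wrappers could look like. Assumptions: TL_DEVICE is provided by the surrounding header (stubbed here so the sketch compiles standalone), and the intrinsic choice is illustrative, not the project's actual implementation:

```cpp
#include <cmath>

#ifndef TL_DEVICE          // stub so the sketch builds outside CUDA
#define TL_DEVICE inline
#endif

// Float path: use the CUDA fast-math intrinsics in device code,
// and libm on the host.
TL_DEVICE float hsin(float x) {
#if defined(__CUDA_ARCH__)
  return __sinf(x);        // device fast-math intrinsic
#else
  return sinf(x);
#endif
}

TL_DEVICE float hcos(float x) {
#if defined(__CUDA_ARCH__)
  return __cosf(x);
#else
  return cosf(x);
#endif
}

// Reduced-precision types (e.g. half_t, bfloat16_t): convert to float,
// compute, convert back, mirroring the custom bfloat16_t fast_exp
// wrapper the review mentions.
template <typename T>
TL_DEVICE T hsin(T x) { return static_cast<T>(hsin(static_cast<float>(x))); }

template <typename T>
TL_DEVICE T hcos(T x) { return static_cast<T>(hcos(static_cast<float>(x))); }
```

The non-template float overload is preferred for float arguments, so the template only handles the conversion path for other types.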

@LeiWang1999 LeiWang1999 merged commit 86aaf3c into tile-ai:main Sep 23, 2025
6 of 7 checks passed
@Rachmanino Rachmanino deleted the fix-math branch September 23, 2025 07:13
LeiWang1999 added a commit to LeiWang1999/tilelang that referenced this pull request Sep 25, 2025
LeiWang1999 added a commit that referenced this pull request Sep 26, 2025
…plicit fastmath op to invoke (#875)

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)

* Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.

* Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.

* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.

* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.

* Add precision comparison tool for CUDA operations

This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.

* Add precision comparison tool for CUDA operations

This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.
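The per-operation error summary that such a tool computes can be sketched in a few lines of NumPy; the error_stats helper and the chosen metrics are illustrative, not the tool's actual code:

```python
import numpy as np

def error_stats(approx, reference):
    # Compare an implementation's output against a float64 reference,
    # in the spirit of the comparison tool described above.
    abs_err = np.abs(approx.astype(np.float64) - reference)
    denom = np.maximum(np.abs(reference), np.finfo(np.float64).tiny)
    return {
        "max_abs": float(abs_err.max()),
        "mean_abs": float(abs_err.mean()),
        "max_rel": float((abs_err / denom).max()),
    }

x64 = np.linspace(-np.pi, np.pi, 10001)
reference = np.sin(x64)                  # double-precision reference
fast = np.sin(x64.astype(np.float32))    # float32 implementation under test
stats = error_stats(fast, reference)
```

Running each implementation (precise, fast, Triton, TileLang, ...) through the same harness yields the comparable error tables summarized in the README.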
LeiWang1999 added a commit that referenced this pull request Sep 26, 2025
…#882)


* Add IEEE-compliant mathematical operations and refactor fast math module

This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.

* debug removed

* Refactor IEEE math tests for improved readability and consistency

This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.

* Update README.md to enhance formatting of precision comparison results

This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025
RubiaCx pushed a commit to RubiaCx/tilelang that referenced this pull request Nov 24, 2025