[Feat] Add fast sine and cosine definitions in CUDA templates #865
Conversation
Walkthrough: Adds two macros in src/tl_templates/cuda/common.h: hsin and hcos, mapped to cutlass::fast_sin and cutlass::fast_cos. No other behavior or public declarations changed.
Estimated code review effort: 🎯 1 (Trivial) | ⏱️ ~3 minutes
Pre-merge checks and finishing touches: ✅ Passed checks (3 passed)
👋 Hi! Thank you for contributing to the TileLang project. Please remember to run […]. We appreciate you taking this step! Our team will review your contribution, and we look forward to your awesome work! 🚀
Summary of Changes: Hello @Rachmanino, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request integrates optimized sine and cosine functions into the CUDA template library.
Code Review
This pull request introduces fast sine and cosine function definitions for CUDA templates by adding hsin and hcos macros that map to cutlass::fast_sin and cutlass::fast_cos. The change is consistent with existing fast math function definitions in common.h. My review includes a suggestion to use inline functions instead of macros for better type safety and to align with modern C++ best practices. This would be a good step towards improving the overall code quality by gradually moving away from preprocessor macros for function-like behavior.
```cpp
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos
```
While using macros is consistent with the existing definitions in this file, it's generally safer to use inline functions in C++. This avoids potential issues with macro expansion, improves type safety, and respects namespaces. Since the project targets C++17 (it already uses std::is_same_v), a small function template gives you generic wrappers concisely; note that bare auto parameters in ordinary functions are a C++20 feature, so under C++17 the template must be spelled out. This would be a good opportunity to start migrating away from macros for these function aliases.
```cpp
template <class T> TL_DEVICE auto hsin(T x) { return cutlass::fast_sin(x); }
template <class T> TL_DEVICE auto hcos(T x) { return cutlass::fast_cos(x); }
```
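To make the type-safety point concrete, here is a toy example (not code from this PR): a function-like macro pastes raw tokens, so an argument expression interacts with operator precedence, while an inline function evaluates its argument first.

```cpp
// Toy example (not from the PR): macro token substitution vs. real evaluation.
#define fast_square(x) x * x  // macro: pastes tokens verbatim, no type checking

inline int fast_square_fn(int x) { return x * x; }  // function: evaluates x first

// fast_square(1 + 2) expands to 1 + 2 * 1 + 2, which is 5, not 9;
// fast_square_fn(1 + 2) computes 3 * 3 == 9.
```

The same pasting behavior is why a macro alias like `#define hsin cutlass::fast_sin` silently renames every later use of the token `hsin`, regardless of namespace or scope.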
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
src/tl_templates/cuda/common.h (1 hunk)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (2)
- GitHub Check: build-test-amd
- GitHub Check: format-check
```cpp
#define hsin cutlass::fast_sin
#define hcos cutlass::fast_cos
```
🧩 Analysis chain
Adds fast trig macros consistently; verify CUTLASS support for all target types/versions.
hsin/hcos align with existing hexp/hlog/hsqrt patterns. Please confirm that cutlass::fast_sin / cutlass::fast_cos are available (and device-callable) for your targeted CUTLASS/CUDA versions and support the input types you pass (e.g., half_t, bfloat16_t, float). If bfloat16_t isn’t covered (similar to the custom bfloat16_t fast_exp wrapper below), consider adding analogous wrappers or fallbacks.
Run the following to check for potential collisions and usage sites:
Optionally, verify availability in docs:
🏁 Script executed:

```shell
#!/bin/bash
# 1) Find any other definitions/usages of hsin/hcos to avoid macro conflicts.
rg -n -C2 -P '(^\s*#\s*define\s+hsin\b)|\bhsin\s*\('
rg -n -C2 -P '(^\s*#\s*define\s+hcos\b)|\bhcos\s*\('
# 2) Confirm CUTLASS fast trig references in the repo (sanity check).
rg -n -C2 -P '\bcutlass::fast_(sin|cos)\b'
# 3) Optional: locate bfloat16 sin/cos usage that might require wrappers.
rg -n -C2 -P '\b(bfloat16_t|half_t)\b.*\b(hsin|hcos)\s*\('
```

Length of output: 1025
🌐 Web query:
Does CUTLASS provide cutlass::fast_sin and cutlass::fast_cos in cutlass/fast_math.h, and what types are supported (float, half_t, bfloat16_t)?
💡 Result:
Short answer: No — CUTLASS’s fast_math.h does not expose cutlass::fast_sin / cutlass::fast_cos. The fast_math header in CUTLASS implements a set of “fast_…” helpers (e.g. fast_tanh, fast_exp, etc.) with overloads for float / double and some device half support, but you won’t find cutlass::fast_sin/fast_cos there. The header and related GitHub issues show the library provides fast variants for operations like tanh (including float/double/half overloads) but not sin/cos or bfloat16 fast trig helpers. [1][2]
Sources:
- CUTLASS fast_math.h (repository header listing of the fast_math utilities). [1]
- CUTLASS GitHub issue showing fast_* functions (fast_tanh) and type overloads discussion. [2]
References:
[1] cutlass/include/cutlass/fast_math.h — NVIDIA/cutlass (GitHub).
[2] NVIDIA/cutlass issue discussing fast_tanh overloads in fast_math.h.
Remove/guard hsin/hcos — cutlass::fast_sin / cutlass::fast_cos not provided
CUTLASS’s fast_math.h does not expose cutlass::fast_sin/cutlass::fast_cos; the macros in src/tl_templates/cuda/common.h (lines 21–22) will reference undefined symbols and likely break builds.
- Replace the macros with device-callable alternatives (sinf/cosf or CUDA device intrinsics) or add explicit device wrappers that call a supported fast-trig implementation.
- Add overloads/fallbacks for half_t and bfloat16_t (convert to float → trig → convert back) — CUTLASS lacks bfloat16 fast-trig helpers.
- Ensure wrappers are annotated for device/host as needed.
src/tl_templates/cuda/common.h:21-22
🤖 Prompt for AI Agents
In src/tl_templates/cuda/common.h around lines 21-22, the macros #define hsin
cutlass::fast_sin and #define hcos cutlass::fast_cos reference non-existent
CUTLASS symbols; replace them with guarded, device-callable wrappers that call
supported trig functions (e.g., sinf/cosf or CUDA device intrinsics) and mark
them __host__ __device__; provide overloads/fallbacks for half_t and bfloat16_t
by converting to float, performing the trig operation, then converting back;
protect the replacements with #ifdef/#else to use cutlass implementations if
available, and ensure proper includes and namespace qualification so builds
won’t reference undefined symbols.
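A minimal sketch of the guarded, device-callable wrappers this review recommends, under stated assumptions: TL_DEVICE is redefined here only so the sketch is self-contained (in common.h the project's own definition applies), the fallback uses CUDA's sinf/cosf, and the half-precision overloads are shown only as commented illustrations because the exact conversion helpers depend on the CUTLASS version.

```cpp
// Hypothetical sketch (not the PR's code): device-callable hsin/hcos backed
// by sinf/cosf rather than the nonexistent cutlass::fast_sin/fast_cos.
#include <cmath>

#ifdef __CUDACC__
#define TL_DEVICE __forceinline__ __device__
#else
#define TL_DEVICE inline  // host fallback so the sketch also compiles as plain C++
#endif

TL_DEVICE float hsin(float x) { return sinf(x); }
TL_DEVICE float hcos(float x) { return cosf(x); }

// For half_t / bfloat16_t, the suggested pattern is convert -> trig -> convert,
// e.g. (assuming CUTLASS's conversion constructors):
//   TL_DEVICE cutlass::half_t hsin(cutlass::half_t x) {
//     return cutlass::half_t(hsin(float(x)));
//   }
```

If a future CUTLASS release does ship fast trig helpers, the wrappers can be switched over behind an `#ifdef` without touching call sites, which is the main advantage over the raw macro aliases.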
…plicit fastmath op to invoke (#875)

* Add fast math operations for CUDA: exp, exp10, log, log2, log10, tan, cos, and sin (#865)
* Refactor fast math operation definitions for consistency and readability in CUDA code. Consolidated multiple definitions into single lines and improved formatting in related test files for better clarity.
* Remove unnecessary pass configurations for warp specialization and TMA lowering in fast math operation tests for CUDA. This simplifies the test setup while maintaining the focus on fast math functionality.
* Update fastmath tests to reflect that tl.* intrinsics generate no fastmath versions and disable cache in main execution.
* Fix formatting in fastmath test comments for clarity on tl.* intrinsics behavior.
* Add precision comparison tool for CUDA operations. This commit introduces a new Python script and CUDA source file for a precision comparison tool that evaluates the accuracy of various CUDA operations (including division, reciprocal, exponential, logarithmic, and trigonometric functions) across different implementations: CUDA Precise, CUDA Fast, Triton, Triton LibDevice, and TileLang. The tool generates test data, executes the operations, and summarizes the error statistics for each implementation against a double precision reference. Additionally, a README file is added to document the results of the comparisons for various operations.
* Add precision comparison tool for CUDA operations. This commit introduces a new precision comparison tool implemented in Python and CUDA, designed to evaluate the accuracy of various mathematical operations (division, reciprocal, exponential, logarithmic, trigonometric, square root, etc.) across different frameworks including CUDA Precise/Fast, Triton, Triton LibDevice, PyTorch, and TileLang. The tool includes functionality for generating test data, executing operations, and summarizing error statistics for each implementation. Additionally, it provides a comprehensive README with error metrics for each operation tested.

…ile-ai#882) (repeats the commits above, plus:)

* Add IEEE-compliant mathematical operations and refactor fast math module. This commit introduces new high precision mathematical operations including ieee_add, ieee_sub, ieee_mul, ieee_fmaf, ieee_frcp, ieee_fsqrt, ieee_frsqrt, and ieee_fdiv to the TileLang framework. The fast math module has been refactored to remove the deprecated fastmath.py file and update the import paths accordingly. Additionally, the CUDA code generation has been enhanced to support these new operations, ensuring compatibility with IEEE standards for floating-point arithmetic.
* debug removed
* Refactor IEEE math tests for improved readability and consistency. This commit enhances the formatting of the `test_ieee_math.py` and `test_mathops_fastmath.py` files by adjusting line breaks for better clarity. It also removes unnecessary comments and ensures that the main execution of tests is streamlined. These changes aim to improve the overall maintainability of the test code.
* Update README.md to enhance formatting of precision comparison results. This commit reformats the precision comparison results in the README.md file, converting the error statistics tables into a more structured markdown format. This change improves readability and accessibility of the data for various mathematical operations across different implementations, including FP32 Precise, Triton, TileLang, and CUDA.
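The core idea of the precision-comparison tool these commits describe, measuring a fast single-precision implementation against a double-precision reference, can be sketched in a few lines. This is an illustrative host-side sketch only; the sample range, metric, and function name are assumptions, not the actual tool's configuration.

```cpp
#include <cmath>
#include <cstddef>

// Max absolute error of single-precision sinf against a double-precision
// reference, sampled at n evenly spaced points in [0, 1).
inline double max_abs_error_sinf(std::size_t n) {
    double max_err = 0.0;
    for (std::size_t i = 0; i < n; ++i) {
        double x = static_cast<double>(i) / static_cast<double>(n);
        double approx = static_cast<double>(sinf(static_cast<float>(x)));
        double err = std::fabs(approx - std::sin(x));
        if (err > max_err) max_err = err;
    }
    return max_err;
}
```

The real tool extends this pattern across many operations (division, reciprocal, exp, log, trig, sqrt) and backends (CUDA Precise/Fast, Triton, TileLang), then tabulates the error statistics in a README.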
This pull request adds new mathematical function macros to the CUDA common header to support fast sine and cosine operations, improving performance and consistency for trigonometric calculations.
Math function enhancements:
Defined fast sine (hsin) and fast cosine (hcos) using cutlass::fast_sin and cutlass::fast_cos in src/tl_templates/cuda/common.h.