Skip to content

host group mutex #23

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 16 commits into from
Dec 8, 2024
Merged

host group mutex #23

merged 16 commits into from
Dec 8, 2024

Conversation

PhilipDeegan
Copy link
Member

@PhilipDeegan PhilipDeegan commented Oct 12, 2024

Summary by CodeRabbit

  • New Features

    • Introduced new synchronization mechanisms for improved GPU multi-launch functionality.
    • Added methods for managing host functions with mutexes and calculating group indices.
    • Enhanced testing capabilities with new functions for threaded group operations.
    • Added a new launcher type, DLauncher, for GPU kernel execution.
    • Introduced a new inline function for group index calculations.
    • Added a global kernel function template for simpler kernel launches.
    • Implemented a callback mechanism for stream event completion.
  • Bug Fixes

    • Enhanced error handling for group size validation.
  • Refactor

    • Improved structure and inheritance of existing classes related to stream functions.
    • Simplified the StreamEvent structure for better usability.
    • Reformatted macro definitions for better readability.
    • Adjusted preprocessor definitions for threaded stream launcher wait times.
    • Updated error handling mechanism in tests for improved clarity and maintainability.
    • Enhanced type flexibility in memory copy operations.
    • Encapsulated as_values functionality within a new namespace for better organization.

Copy link

coderabbitai bot commented Oct 12, 2024

Walkthrough

The changes primarily involve modifications to the multi_launch.hpp file in the mkn::gpu namespace, enhancing the GPU multi-launch framework. Key updates include adjustments to the StreamFunction constructor, the introduction of new structs like StreamGroupFunction and StreamHostGroupMutexFunction, and the addition of new methods for group index calculation and mutex management. A corresponding test function has been added to async_streaming.cpp to validate the new functionality, ensuring synchronized operations across host and device threads. Additional formatting changes were made in rocm.hpp, and updates were applied to def.hpp related to floating point checks and wait times.

Changes

File Change Summary
inc/mkn/gpu/multi_launch.hpp - Updated StreamFunction constructor to accept StreamFunctionMode const mode_.
- Added group_idx_modulo method for group index calculation.
- Introduced StreamGroupFunction struct inheriting from StreamFunction.
- Refactored StreamGroupBarrierFunction to inherit from StreamGroupFunction.
- Added StreamHostGroupMutexFunction struct for mutex management.
- Added StreamHostGroupIndexFunction struct for executing functions based on group indices.
- Updated StreamLauncher with host_group_mutex and host_group_idx methods for mutex-based host functions.
- Refactored run methods in StreamHostGroupMutexFunction and StreamGroupBarrierFunction.
- Modified error handling for group size validation.
test/any/async_streaming.cpp - Introduced test_threaded_host_group_mutex function for testing host group mutex functionality.
- Introduced test_threaded_host_group_idx function for testing group index functionality.
- Updated main function to include the new test functions.
inc/mkn/gpu/rocm.hpp - Reformatted MKN_GPU_ASSERT macro for improved readability.
- Introduced bytes variable in alloc_managed function for size calculation.
- Modified StreamEvent struct to simplify its design and update its methods.
inc/mkn/gpu/def.hpp - Modified is_floating_point_v template variable based on MKN_GPU_CPU.
- Updated threaded stream launcher wait time definitions.
inc/mkn/gpu/cpu.hpp - Modified StreamEvent structure to simplify its member variables and methods.
inc/mkn/gpu/cuda.hpp - Modified StreamEvent struct to simplify its constructor and update its methods.
inc/mkn/gpu/alloc.hpp - Updated copy function template signature for type flexibility.
test/any/construct.cpp - Replaced assertions with mkn::kul::abort_if_not for error handling.

Possibly related PRs

  • threaded stream launcher group barrier sync points #22: The changes in this PR introduce the StreamGroupBarrierFunction, which is directly related to the new synchronization mechanisms introduced in the main PR, particularly in how barriers are managed within the ThreadedStreamLauncher.

🐰 In the garden where we play,
New functions hop and dance today.
With mutex locks and groups so fine,
Our streams align, a perfect line!
Synchronized, we leap and bound,
In this code, new joys are found! 🌼


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (6)
test/any/async_streaming.cpp (1)

142-142: Consider adding more comprehensive result verification

The current check if (vals != std::vector<size_t>{3, 12}) return 1; verifies the final state of vals. While this is a good basic check, consider adding more detailed assertions or error messages to provide better feedback in case of failure.

You could enhance the verification like this:

if (vals != std::vector<size_t>{3, 12}) {
    std::cerr << "Unexpected vals: ";
    for (const auto& val : vals) {
        std::cerr << val << " ";
    }
    std::cerr << std::endl;
    return 1;
}
inc/mkn/gpu/multi_launch.hpp (5)

53-53: Avoid unnecessary const qualifier on pass-by-value parameter

The mode_ parameter is passed by value in the constructor. Adding const to a pass-by-value parameter has no effect since it doesn't prevent the caller or callee from modifying the local copy. Consider removing the redundant const qualifier.

Apply this diff to remove the unnecessary const:

-      StreamFunction(Strat& strat_, StreamFunctionMode const mode_) : strat{strat_}, mode{mode_} {}
+      StreamFunction(Strat& strat_, StreamFunctionMode mode_) : strat{strat_}, mode{mode_} {}

61-63: Prefer passing fundamental types by value instead of by const reference

For fundamental types like std::size_t, passing by value is more efficient than passing by const reference due to potential overhead associated with references. Consider changing the function parameters to pass by value.

Apply this diff to update the parameter passing:

-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+std::size_t group_idx_modulo(std::size_t gs, std::size_t i) {

68-69: Prefer passing fundamental types by value instead of by const reference

In the constructor of StreamGroupFunction, gs is a std::size_t, a fundamental type. Passing it by value is generally more efficient than passing by const reference. Similarly, consider removing the unnecessary const from the mode_ parameter.

Apply this diff:

-      StreamGroupFunction(std::size_t const& gs, Strat& strat_, StreamFunctionMode const mode_)
+      StreamGroupFunction(std::size_t gs, Strat& strat_, StreamFunctionMode mode_)

281-283: Prefer passing fundamental types by value instead of by const reference

In the constructor of StreamHostGroupMutexFunction, gs is a std::size_t. Passing fundamental types by value is more efficient than passing by const reference. Consider changing the parameter to pass by value.

Apply this diff:

-      StreamHostGroupMutexFunction(std::size_t const& gs, Strat& strat, Fn&& fn_)
+      StreamHostGroupMutexFunction(std::size_t gs, Strat& strat, Fn&& fn_)

342-344: Prefer passing fundamental types by value instead of by const reference

In the host_group_mutex method, group_size is a std::size_t. Passing fundamental types by value is more efficient than passing by const reference. Consider changing the parameter to pass by value.

Apply this diff:

-      This& host_group_mutex(std::size_t const& group_size, Fn&& fn) {
+      This& host_group_mutex(std::size_t group_size, Fn&& fn) {
🛑 Comments failed to post (1)
inc/mkn/gpu/multi_launch.hpp (1)

289-296: 🛠️ Refactor suggestion

Ensure mutex is properly acquired before using

In the run method, the unique_lock is constructed with std::defer_lock and try_lock is used. If the lock is not acquired, fn(i) is not called, and strat.status[i] is set to SFS::FIRST to retry. This logic may lead to busy waiting and potential performance issues if the mutex is heavily contended. Consider implementing a back-off strategy or using condition variables to avoid tight loops.

Consider refactoring the code to include a back-off mechanism:

      void run(std::uint32_t const i) override {
        auto const gidx = Super::group_idx(i);

        std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

        if (lock.try_lock()) {
          fn(i);
          strat.status[i] = SFS::WAIT;  // done
        } else {
+         // Introduce a small sleep to prevent tight loop in case of contention
+         std::this_thread::sleep_for(std::chrono::milliseconds(1));
          strat.status[i] = SFS::FIRST;  // retry
        }
      }

Committable suggestion was skipped due to low confidence.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/multi_launch.hpp (4)

61-63: LGTM: New group_idx_modulo function added

The group_idx_modulo function is a useful addition for group-based operations. Its implementation is correct and improves code readability by encapsulating the group index calculation.

Consider marking this function as constexpr and inline to allow for compile-time evaluation and potential performance improvements:

-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+constexpr inline std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
   return ((i - (i % gs)) / gs);
 }

65-75: LGTM: New StreamGroupFunction struct added

The StreamGroupFunction struct is a well-designed abstraction for group-based stream functions. It properly inherits from StreamFunction and provides useful group-related functionality.

Consider marking the group_idx method as constexpr and noexcept for potential compile-time optimizations and to indicate it doesn't throw exceptions:

-  std::size_t group_idx(std::size_t const& i) const { return group_idx_modulo(group_size, i); }
+  constexpr std::size_t group_idx(std::size_t const& i) const noexcept { return group_idx_modulo(group_size, i); }

267-298: LGTM: New StreamHostGroupMutexFunction added for group-based mutex synchronization

The StreamHostGroupMutexFunction is a valuable addition for synchronizing host-side operations within groups. Its implementation using std::mutex is appropriate for host-side synchronization.

Consider using std::shared_mutex instead of std::mutex if multiple readers are expected to access shared resources simultaneously. This could potentially improve performance in scenarios where read operations are more frequent than write operations:

-  std::vector<std::mutex> mutices;
+  std::vector<std::shared_mutex> mutices;

Also, to prevent potential busy-waiting, you might want to add a small sleep or yield when the lock isn't acquired:

   if (lock.try_lock()) {
     fn(i);
     strat.status[i] = SFS::WAIT;  // done
   } else {
     strat.status[i] = SFS::FIRST;  // retry
+    std::this_thread::yield();  // or std::this_thread::sleep_for(std::chrono::milliseconds(1));
   }

Line range hint 1-461: Overall: Excellent additions to enhance group-based operations and synchronization

The changes introduce well-designed group-based functionality and mutex synchronization to the GPU multi-launch framework. The new features are cohesively integrated with the existing codebase and provide valuable enhancements for parallel processing scenarios.

Consider adding more inline documentation or comments to explain the usage and benefits of the new group-based features and mutex synchronization. This would help users understand when and how to best utilize these new capabilities in their GPU multi-launch operations.

test/any/async_streaming.cpp (1)

135-137: Remove unnecessary mutable keyword in lambda

The lambda function passed to .host() does not modify any captured variables, yet it's marked as mutable. Unless there's a specific reason, consider removing the mutable keyword for clarity.

Apply this diff to remove mutable:

-.host([&](auto i) mutable {
+.host([&](auto i) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 7938041 and 0707fb0.

📒 Files selected for processing (2)
  • inc/mkn/gpu/multi_launch.hpp (5 hunks)
  • test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (3)

53-53: LGTM: Improved parameter passing in StreamFunction constructor

The change to pass mode_ as a constant reference is a good practice. It potentially improves performance by avoiding unnecessary copies without changing the functionality.


Line range hint 222-265: LGTM: StreamGroupBarrierFunction updated to use new group-based abstraction

The changes to StreamGroupBarrierFunction properly integrate it with the new StreamGroupFunction base class. This improves code consistency and leverages the new group functionality.

The run method now correctly uses the group_idx method from the base class, which is a good improvement in code reuse and consistency.


339-344: LGTM: New host_group_mutex method added to ThreadedStreamLauncher

The host_group_mutex method is a well-implemented addition to the ThreadedStreamLauncher class. It provides a convenient way to add group-based, mutex-synchronized host functions to the launcher.

The method correctly creates and adds a new StreamHostGroupMutexFunction to the fns vector, properly forwarding the provided function and group size.

test/any/async_streaming.cpp (2)

119-152: Function test_threaded_host_group_mutex implementation looks correct

The implementation of the test_threaded_host_group_mutex function appears to correctly utilize the ThreadedStreamLauncher with host_group_mutex to synchronize host operations across groups. The logic aligns with the intended functionality.


139-139: ⚠️ Potential issue

Undefined function group_idx_modulo

The function group_idx_modulo used here is not defined within the provided code or the included headers. This will lead to a compilation error due to the undefined reference.

Run the following script to check if group_idx_modulo is defined elsewhere in the codebase:

Ensure that group_idx_modulo is properly defined or included from the appropriate header file.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (3)
inc/mkn/gpu/multi_launch.hpp (3)

61-63: New group_idx_modulo function looks good, but consider adding a safety check

The new group_idx_modulo function is a good addition for calculating group indices. However, to prevent potential undefined behavior, consider adding a check for division by zero.

Consider adding a check for gs != 0 at the beginning of the function:

 std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+  if (gs == 0) {
+    throw std::invalid_argument("Group size cannot be zero");
+  }
   return ((i - (i % gs)) / gs);
 }

267-298: Well-implemented StreamHostGroupMutexFunction with a suggestion for improvement

The new StreamHostGroupMutexFunction struct is a good addition that provides mutex-based synchronization for host group functions. The implementation is consistent with the existing code style and error handling.

The run method's try-lock mechanism prevents deadlocks, which is good. However, it might lead to busy waiting in high-contention scenarios.

Consider implementing a backoff strategy or using a condition variable to reduce CPU usage in case of high contention. For example:

void run(std::uint32_t const i) override {
  std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
  
  if (lock.try_lock()) {
    fn(i);
    strat.status[i] = SFS::WAIT;  // done
  } else {
    // Implement exponential backoff
    std::this_thread::sleep_for(std::chrono::milliseconds(backoff_time));
    backoff_time = std::min(backoff_time * 2, max_backoff_time);
    strat.status[i] = SFS::FIRST;  // retry
  }
}

This approach would reduce CPU usage while still maintaining the non-blocking nature of the current implementation.


Line range hint 385-409: Improved efficiency in get_work, but potential race condition introduced

The changes to the get_work method improve efficiency by allowing work to resume from where it left off in previous calls. This is a good optimization for scenarios with many work items.

However, the introduction of the shared work_i variable, which is modified outside the lock, could lead to race conditions in a multi-threaded environment.

To address the potential race condition, consider one of the following approaches:

  1. Move the work_i = 0 reset inside the lock:
 std::pair<SFP, std::size_t> get_work() {
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
+  work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
-  work_i = 0;
  1. Use a thread-local variable instead of a shared member variable:
 std::pair<SFP, std::size_t> get_work() {
+  static thread_local std::size_t work_i = 0;
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
   work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
-  std::size_t work_i = 0;

Either of these approaches would maintain the efficiency improvement while avoiding potential race conditions.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between 0707fb0 and cd06ef9.

📒 Files selected for processing (1)
  • inc/mkn/gpu/multi_launch.hpp (8 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)

53-53: Improved efficiency in StreamFunction constructor

The change to pass mode_ as a const reference is a good optimization. It prevents unnecessary copying and ensures that the mode_ parameter cannot be accidentally modified within the constructor.


65-75: Well-designed StreamGroupFunction struct

The new StreamGroupFunction struct is a good addition that provides a clear abstraction for group-based stream functions. It properly inherits from StreamFunction and introduces the necessary group-related functionality.

The group_idx method effectively utilizes the group_idx_modulo function, maintaining consistency in the codebase.


Line range hint 222-265: Improved StreamGroupBarrierFunction with better inheritance

The changes to StreamGroupBarrierFunction are well-implemented:

  1. Inheriting from StreamGroupFunction improves code organization and reuse.
  2. The constructor properly initializes the base class with the group size.
  3. The run method now uses the group_idx method from the base class, which is more consistent.

These modifications enhance the overall design and maintainability of the code.


339-344: Well-implemented host_group_mutex method

The new host_group_mutex method in ThreadedStreamLauncher is a good addition. It provides a clean and consistent way to add mutex-based host group functions to the launcher. The implementation correctly creates and adds a new StreamHostGroupMutexFunction to the fns vector, maintaining consistency with other similar methods in the class.


Line range hint 1-445: Overall assessment of changes to multi_launch.hpp

The changes to this file significantly enhance the GPU multi-launch framework by introducing well-designed group-based abstractions and improving existing functionality. Key improvements include:

  1. The new StreamGroupFunction provides a solid foundation for group-based operations.
  2. The StreamHostGroupMutexFunction adds useful mutex-based synchronization for host group functions.
  3. The ThreadedStreamLauncher class has been extended with new capabilities that leverage these group-based abstractions.

While these changes are generally positive, there are a few areas that require attention:

  1. The group_idx_modulo function should include a check for division by zero.
  2. The locking mechanism in StreamHostGroupMutexFunction could be optimized to reduce potential busy waiting.
  3. The use of the shared work_i variable in ThreadedStreamLauncher introduces potential thread safety issues that should be addressed.

Addressing these points will further improve the robustness and efficiency of the framework. Overall, this update represents a significant step forward in the capabilities of the GPU multi-launch system.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/rocm.hpp (1)

208-210: LGTM: Improved readability in alloc_managed function

The introduction of the bytes variable enhances code clarity by separating the size calculation from the hipMallocManaged call. This change improves readability and maintainability without altering the function's behavior.

Consider adding a comment explaining the purpose of the bytes variable for even better clarity:

+ // Calculate total bytes to allocate
  auto const bytes = size * sizeof(T);
  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
  MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));
test/any/async_streaming.cpp (1)

142-142: Define expected values as constants for clarity

The comparison if (vals != std::vector<size_t>{3, 12}) uses hardcoded magic numbers, which can hinder code readability. Consider defining these values as named constants or adding comments to explain their significance.

inc/mkn/gpu/multi_launch.hpp (3)

61-63: Simplify group index calculation

The group_idx_modulo function can be simplified. Currently, it calculates the group index using:

return ((i - (i % gs)) / gs);

This expression is equivalent to integer division i / gs for integers. You can simplify the function to:

return i / gs;

This makes the code clearer and may improve performance slightly.


61-63: Avoid passing primitive types by const reference

In the group_idx_modulo function, the parameters gs and i are passed as const&. Since std::size_t is a primitive type, it's more efficient to pass them by value:

std::size_t group_idx_modulo(std::size_t gs, std::size_t i) {
  return i / gs;
}

This change avoids unnecessary indirection and aligns with best practices.


361-365: Avoid passing std::size_t by const reference

In the host_group_mutex method, group_size is passed as std::size_t const&. Since std::size_t is a primitive type, it's more efficient to pass it by value:

This& host_group_mutex(std::size_t group_size, Fn&& fn) {
  // ...
}

This change eliminates unnecessary indirection and follows C++ best practices for passing primitive types.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cd06ef9 and cb24f6d.

📒 Files selected for processing (3)
  • inc/mkn/gpu/multi_launch.hpp (8 hunks)
  • inc/mkn/gpu/rocm.hpp (2 hunks)
  • test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (4)
inc/mkn/gpu/rocm.hpp (2)

48-51: LGTM: Improved macro formatting

The reformatting of the MKN_GPU_ASSERT macro enhances code readability without altering its functionality. This change aligns with best practices for macro definitions.


Line range hint 1-458: Overall assessment: Minor improvements to code quality

The changes in this file, while minor, contribute positively to the overall code quality and readability. They align well with the broader objectives of the pull request to enhance the GPU multi-launch framework. No functional changes or potential issues were identified in this file.

test/any/async_streaming.cpp (1)

196-201: Refactor 'main' to better identify test failures

As previously mentioned, returning the sum of test results can make it difficult to identify which specific test failed. Consider using bitwise operations or handling each test result individually to improve failure detection.

inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 413-437: Resolution of previous concurrency issue with work_i

The potential race condition with the shared variable work_i, as previously noted in past reviews, has been addressed. Access to work_i is now properly synchronized using the work_ mutex in the get_work method, ensuring thread safety.

Also applies to: 471-471

KUL_DBG_FUNC_ENTER;

std::size_t constexpr group_size = 3;
std::vector<size_t> vals((C + 1) / group_size); // 2 values;
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Adjust the size calculation of 'vals' to handle all group sizes

The current calculation for the size of the vals vector may not correctly account for cases where (C + 1) is not perfectly divisible by group_size. This could lead to incorrect indexing or out-of-bounds errors.

Apply this diff to ensure the size of vals correctly accommodates all elements:

- std::vector<size_t> vals((C + 1) / group_size);  // 2 values;
+ std::vector<size_t> vals((C + group_size - 1) / group_size);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
std::vector<size_t> vals((C + 1) / group_size); // 2 values;
std::vector<size_t> vals((C + group_size - 1) / group_size);

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (2)
inc/mkn/gpu/multi_launch.hpp (2)

53-53: Nitpick: Pass StreamFunctionMode by value without const

Since StreamFunctionMode is likely an enum or small type, passing it by value without const is more idiomatic in C++. The const qualifier is unnecessary when passing by value.

Apply this diff to update the constructor signature:

- StreamFunction(Strat& strat_, StreamFunctionMode const mode_) : strat{strat_}, mode{mode_} {}
+ StreamFunction(Strat& strat_, StreamFunctionMode mode_) : strat{strat_}, mode{mode_} {}

Line range hint 222-266: Issue: Correct the declaration of MOD_GROUP_ERROR

The declaration of MOD_GROUP_ERROR uses an incorrect order of specifiers. In C++, the static and constexpr specifiers should precede the type.

Apply this diff to fix the declaration:

- std::string_view constexpr static MOD_GROUP_ERROR =
+ static constexpr std::string_view MOD_GROUP_ERROR =
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between cb24f6d and 02836f8.

📒 Files selected for processing (2)
  • inc/mkn/gpu/def.hpp (1 hunks)
  • inc/mkn/gpu/multi_launch.hpp (8 hunks)
🧰 Additional context used
🔇 Additional comments (3)
inc/mkn/gpu/def.hpp (1)

26-26: Verify the impact of reduced wait times on CPU usage and synchronization.

The significant reduction in wait times (WAIT_MS_ADD_ from 10→1ms, WAIT_MS_MAX_ from 100→25ms) could affect the thread synchronization behavior:

  • Pros: Potentially faster response times for mutex operations
  • Cons: May increase CPU usage due to more frequent polling

Let's check where these constants are used:

Also applies to: 30-30

✅ Verification successful

---

Based on the gathered information, I can now provide a final response:

**The reduced wait times are appropriate for the implementation**

The changes to wait times (WAIT_MS_ADD_: 10→1ms, WAIT_MS_MAX_: 100→25ms) are justified based on the codebase analysis:

1. The `ThreadedStreamLauncher` is used primarily in test scenarios with short-duration operations (200ms sleeps)
2. The implementation includes synchronization primitives like `barrier()`, `group_barrier()`, and `host_group_mutex()` which handle thread coordination
3. The reduced polling intervals align well with the test cases' timing requirements, where end-to-end execution is expected to complete within 1 second (test fails if `end - start > 1e3`)

The changes optimize the polling frequency for faster response times without risking synchronization issues, as proper mutex and barrier mechanisms are in place to ensure thread safety.



</details>

<details>
<summary>🏁 Scripts executed</summary>

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Search for usage of wait time constants
echo "Searching for wait time constant usage..."
rg -A 3 "_MKN_GPU_THREADED_STREAM_LAUNCHER_WAIT"

Length of output: 2362


Script:

#!/bin/bash
# Search for all usages of ThreadedStreamLauncher
echo "Searching for all usages of ThreadedStreamLauncher..."
rg "ThreadedStreamLauncher"

Length of output: 793


Script:

#!/bin/bash
# Let's examine the implementation details of ThreadedStreamLauncher
echo "Examining ThreadedStreamLauncher implementation..."
rg -A 20 "struct ThreadedStreamLauncher"

# Also check the test cases to understand usage patterns
echo -e "\nExamining test cases..."
rg -A 10 "ThreadedStreamLauncher\{" test/

Length of output: 4883

inc/mkn/gpu/multi_launch.hpp (2)

Line range hint 413-440: Approved: Proper synchronization of work_i variable

The previous concern about the shared work_i variable has been addressed. Access to work_i within get_work is now properly synchronized using the work_ mutex, ensuring thread safety.


281-299: ⚠️ Potential issue

Issue: Potential data race on strat.status[i]

In StreamHostGroupMutexFunction::run, strat.status[i] is modified without synchronization. Since strat.status[i] may be accessed concurrently by multiple threads, this could lead to data races and undefined behavior.

Consider making status a vector of atomic values to ensure thread-safe access:

- std::vector<SFS> status;
+ std::vector<std::atomic<SFS>> status;

Ensure all reads and writes to status[i] use atomic operations.

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (2)
inc/mkn/gpu/cuda.hpp (1)

115-141: Consider enhancing RAII guarantees.

The StreamEvent class could benefit from stronger RAII principles to ensure proper cleanup of CUDA resources and callback state, especially in error scenarios.

Consider:

  1. Adding a destructor to ensure cleanup of any pending callbacks
  2. Using std::unique_ptr for managing the function object
  3. Adding error handling for callback registration failures
inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 239-282: Enhance error message and const correctness.

The implementation looks good, but consider these improvements:

  1. Make the error message more descriptive
  2. Add const qualifier to member functions that don't modify state
-  std::string_view constexpr static MOD_GROUP_ERROR =
-      "mkn.gpu error: StreamGroupBarrierFunction Group size must be a divisor of datas";
+  std::string_view constexpr static MOD_GROUP_ERROR =
+      "mkn.gpu error: StreamGroupBarrierFunction group size (%) must be a divisor of data size (%)";

-  void arrive() { [[maybe_unused]] auto ret = sync_point.arrive(); }
+  void arrive() const { [[maybe_unused]] auto ret = sync_point.arrive(); }
🛑 Comments failed to post (7)
inc/mkn/gpu/cpu.hpp (1)

109-110: ⚠️ Potential issue

Fix the operator() implementation to handle the callback function.

The operator() accepts a callback function but doesn't execute it, which could lead to silent failures in synchronization code. Either implement the callback execution or document why it's intentionally ignored.

-  auto& operator()(std::function<void()> fn = {}) { return *this; }
+  auto& operator()(std::function<void()> fn = {}) {
+    if (fn) fn();
+    return *this;
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  auto& operator()(std::function<void()> fn = {}) {
    if (fn) fn();
    return *this;
  }
  bool finished() const { return fin; }
inc/mkn/gpu/cuda.hpp (2)

136-136: ⚠️ Potential issue

Add memory ordering to finished() check.

The finished() method reads the completion flag without proper memory ordering guarantees.

Add proper memory ordering for the flag check:

-  bool finished() const { return fin; }
+  bool finished() const { return fin.load(std::memory_order_acquire); }

Committable suggestion was skipped due to low confidence.


129-133: ⚠️ Potential issue

Ensure thread-safe callback execution.

The callback function modifies shared state without proper synchronization. The order of operations (executing function, clearing it, setting flag) should be carefully controlled.

Consider this safer implementation:

   static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
     auto& self = *reinterpret_cast<StreamEvent*>(ptr);
+    std::function<void()> fn = std::move(self._fn);
+    self._fn = {};
     self._fn();
-    self._fn = [] {};
-    self.fin = 1;
+    self.fin.store(true, std::memory_order_release);
   }

Committable suggestion was skipped due to low confidence.

inc/mkn/gpu/multi_launch.hpp (4)

330-333: ⚠️ Potential issue

Race condition in status update.

Similar to StreamHostGroupMutexFunction, the status update is not thread-safe.

   void run(std::uint32_t const i) override {
     if (i % Super::group_size == gid) fn(i);
-    strat.status[i] = SFS::WAIT;  // done
+    std::atomic_ref<SFS> status_ref(strat.status[i]);
+    status_ref.store(SFS::WAIT, std::memory_order_release);
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  void run(std::uint32_t const i) override {
    if (i % Super::group_size == gid) fn(i);
    std::atomic_ref<SFS> status_ref(strat.status[i]);
    status_ref.store(SFS::WAIT, std::memory_order_release);
  }

481-493: 🛠️ Refactor suggestion

Enhance print_times formatting and efficiency.

The print_times method could be improved for better readability and performance:

  1. Use a more structured output format
  2. Pre-calculate the indices
   void print_times() const {
-    std::size_t fn_idx = 0, data_idx = 0;
-
-    for (auto const& t : super().times) {
-      KOUT(NON) << data_idx << " " << fn_idx << " " << (t.time() / 1e6);
-
-      ++fn_idx;
-      if (fn_idx == fns.size()) {
-        ++data_idx;
-        fn_idx = 0;
-      }
+    KOUT(NON) << "Data_Index Function_Index Time_ms";
+    for (std::size_t i = 0; i < super().times.size(); ++i) {
+      auto const data_idx = i / fns.size();
+      auto const fn_idx = i % fns.size();
+      KOUT(NON) << std::setw(9) << data_idx 
+                << std::setw(13) << fn_idx 
+                << std::setw(8) << std::fixed << std::setprecision(3) 
+                << (super().times[i].time() / 1e6);
     }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

  void print_times() const {
    KOUT(NON) << "Data_Index Function_Index Time_ms";
    for (std::size_t i = 0; i < super().times.size(); ++i) {
      auto const data_idx = i / fns.size();
      auto const fn_idx = i % fns.size();
      KOUT(NON) << std::setw(9) << data_idx 
                << std::setw(13) << fn_idx 
                << std::setw(8) << std::fixed << std::setprecision(3) 
                << (super().times[i].time() / 1e6);
    }
  }

304-311: ⚠️ Potential issue

Race condition in status updates.

The status updates are not protected by the mutex, which could lead to data races. Consider these improvements:

  1. Move the status update inside the mutex lock
  2. Use atomic operations for status updates
   void run(std::uint32_t const i) override {
     std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
 
     if (lock.try_lock()) {
       fn(i);
-      strat.status[i] = SFS::WAIT;  // done
+      {
+        std::atomic_ref<SFS> status_ref(strat.status[i]);
+        status_ref.store(SFS::WAIT, std::memory_order_release);
+      }
     } else {
-      strat.status[i] = SFS::FIRST;  // retry
+      std::atomic_ref<SFS> status_ref(strat.status[i]);
+      status_ref.store(SFS::FIRST, std::memory_order_release);
     }
   }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

    std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

    if (lock.try_lock()) {
      fn(i);
      {
        std::atomic_ref<SFS> status_ref(strat.status[i]);
        status_ref.store(SFS::WAIT, std::memory_order_release);
      }
    } else {
      std::atomic_ref<SFS> status_ref(strat.status[i]);
      status_ref.store(SFS::FIRST, std::memory_order_release);
    }

66-76: 🛠️ Refactor suggestion

Consider enhancing Timer struct with thread safety and better error messages.

The Timer struct could benefit from the following improvements:

  1. Thread safety for concurrent timing operations
  2. More descriptive assertion messages
 struct Timer {
   auto time() const {
-    assert(s > 0);
-    assert(e > 0);
+    assert(s > 0 && "Timer not started");
+    assert(e > 0 && "Timer not stopped");
+    assert(e >= s && "Timer stopped before start");
     return e - s;
   }
-  void start() { s = kul::Now::NANOS(); }
-  void stop() { e = kul::Now::NANOS(); }
+  void start() {
+    std::lock_guard<std::mutex> lock(mutex);
+    s = kul::Now::NANOS();
+  }
+  void stop() {
+    std::lock_guard<std::mutex> lock(mutex);
+    e = kul::Now::NANOS();
+  }
 
   std::size_t s = 0, e = 0;
+private:
+  std::mutex mutex;
 };
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

struct Timer {
  auto time() const {
    assert(s > 0 && "Timer not started");
    assert(e > 0 && "Timer not stopped");
    assert(e >= s && "Timer stopped before start");
    return e - s;
  }
  void start() {
    std::lock_guard<std::mutex> lock(mutex);
    s = kul::Now::NANOS();
  }
  void stop() {
    std::lock_guard<std::mutex> lock(mutex);
    e = kul::Now::NANOS();
  }

  std::size_t s = 0, e = 0;
private:
  std::mutex mutex;
};

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (2)
test/any/async_streaming.cpp (1)

137-138: Consider parameterizing the sleep duration.

The hardcoded sleep duration of 200ms could be made configurable to allow testing different timing scenarios.

+constexpr auto DEFAULT_SLEEP_DURATION = 200ms;
+
 std::uint32_t test_threaded_host_group_mutex(std::size_t const& nthreads = 2) {
   // ...
-  std::this_thread::sleep_for(200ms);
+  std::this_thread::sleep_for(DEFAULT_SLEEP_DURATION);
inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 239-282: Consider refactoring GroupBarrier to improve encapsulation

The GroupBarrier inner struct could be moved to a separate implementation namespace and its interface could be simplified.

Consider these improvements:

-  struct GroupBarrier {
+  class GroupBarrier {
+    This* self;
+    std::uint16_t group_id;
+    std::barrier<std::function<void()>> sync_point;
+
+    public:
     GroupBarrier(This& slf, std::uint16_t const gid) : self{&slf}, group_id{gid} {
       sync_point = std::barrier(static_cast<std::int64_t>(self->group_size),
         [this]() {
           std::size_t const offset = self->group_size * group_id;
           for (std::size_t i = offset; i < offset + self->group_size; ++i)
             self->strat.status[i] = SFS::WAIT;
         });
     }
     GroupBarrier(GroupBarrier const&) = delete;
     GroupBarrier& operator=(GroupBarrier const&) = delete;
     void arrive() { [[maybe_unused]] auto ret = sync_point.arrive(); }
   };
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between eaab343 and e4d2d8b.

📒 Files selected for processing (5)
  • inc/mkn/gpu/cpu.hpp (1 hunks)
  • inc/mkn/gpu/cuda.hpp (2 hunks)
  • inc/mkn/gpu/multi_launch.hpp (15 hunks)
  • inc/mkn/gpu/rocm.hpp (3 hunks)
  • test/any/async_streaming.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • inc/mkn/gpu/cpu.hpp
  • inc/mkn/gpu/rocm.hpp
🧰 Additional context used
🔇 Additional comments (5)
test/any/async_streaming.cpp (2)

66-67: LGTM: Improved readability and debugging.

The changes to store the ThreadedStreamLauncher instance in a variable and add timing information output are good improvements for debugging and code clarity.

Also applies to: 82-83


134-142: Verify thread safety of host_group_mutex implementation.

While the test looks correct, we should verify that the mutex implementation in ThreadedStreamLauncher properly handles concurrent access across thread groups.

✅ Verification successful

Let me gather more information about the mutex implementation by checking the StreamHostGroupMutexFunction class.


Based on the gathered information, I can now provide the final response:

Thread safety implementation in host_group_mutex is properly handled

The implementation shows correct thread-safety mechanisms:

  • Each group has its own dedicated mutex in std::vector<std::mutex> mutices
  • Uses std::unique_lock with try_lock() for non-blocking mutex acquisition
  • Proper RAII-style mutex management through std::unique_lock
  • Group index calculation ensures mutexes are correctly mapped to thread groups
  • Status tracking prevents race conditions during thread synchronization
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the mutex implementation in ThreadedStreamLauncher

# Test: Look for mutex usage in the launcher implementation
rg -A 5 "host_group_mutex.*\{" inc/mkn/gpu/multi_launch.hpp

# Test: Check for potential race conditions in mutex handling
ast-grep --pattern 'class ThreadedStreamLauncher {
  $$$
  host_group_mutex($$$) {
    $$$
  }
  $$$
}'

Length of output: 387


Script:

#!/bin/bash
# Check StreamHostGroupMutexFunction implementation
rg -A 20 "class StreamHostGroupMutexFunction" inc/mkn/gpu/multi_launch.hpp

# Look for mutex member variables and their usage
rg -A 5 "std::mutex" inc/mkn/gpu/multi_launch.hpp

# Check if there are any other mutex-related implementations
rg -A 5 "mutex" inc/mkn/gpu/multi_launch.hpp

Length of output: 2386

inc/mkn/gpu/cuda.hpp (2)

49-52: LGTM! Well-formatted CUDA assertion macro.

The macro is properly scoped with curly braces and follows standard CUDA error checking patterns.


115-141: LGTM! Improved StreamEvent design with callback-based completion tracking.

The refactoring from event-based to callback-based completion tracking simplifies the code and reduces CUDA API calls. The use of std::function provides flexibility in handling completions.

inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 427-455: Verify thread safety of work distribution

The work distribution logic has shared state access that needs careful review.

Let's verify the thread safety:

The work distribution appears thread-safe due to the mutex protection, but consider these improvements:

  1. Use atomic operations for the work counter
  2. Consider using a work-stealing queue for better load balancing
✅ Verification successful

Let me analyze the code access patterns to verify thread safety. I'll need to check the mutex protection scope and the shared state modifications.

The initial results show mutex usage, but I need to verify the complete protection scope and any potential access to shared state outside the mutex lock.


Let me check one more aspect to understand the complete synchronization mechanism and shared state access patterns.

This will help verify the complete synchronization mechanism and thread safety of the implementation.


Let me try one final verification with modified patterns to better understand the synchronization mechanism.

This should help us understand the complete thread management and synchronization pattern.


Based on the gathered evidence, I can now provide a final assessment of the thread safety concerns.

Thread safety of work distribution is properly implemented

The implementation shows appropriate thread-safety mechanisms:

  • Mutex protection (work_) for the work distribution logic
  • Proper scoping of locked sections using std::unique_lock with try_lock()
  • Thread-safe state transitions through protected access to shared state variables
  • Clean thread management with proper join semantics

The current implementation uses a non-blocking approach with try_lock() which is appropriate for this work-stealing pattern, allowing threads to skip and retry when work is not available rather than blocking. The shared state variables (work_i, status, step) are properly protected within the mutex scope.

The suggested improvements in the review comment about atomic operations and work-stealing queues would be premature optimizations that might actually complicate the current clean and correct implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for all accesses to shared state variables
rg -A 2 "work_i|status\[|step\[" 

# Search for potential race conditions in mutex usage
ast-grep --pattern 'lock.try_lock()'

Length of output: 3644


Script:

#!/bin/bash
# Check the class definition and member variables
rg -B 5 -A 2 "class Strategy" inc/mkn/gpu/multi_launch.hpp

# Check mutex declarations and usage
rg -B 2 -A 2 "std::mutex" inc/mkn/gpu/multi_launch.hpp

# Look for other methods that might access shared state
ast-grep --pattern 'void operator()(std::uint32_t const& i) {'

Length of output: 909


Script:

#!/bin/bash
# Check the complete class structure and member variables
rg -B 10 -A 10 "class MultiLaunchStrategy" inc/mkn/gpu/multi_launch.hpp

# Look for any other concurrent access patterns
rg -B 2 -A 2 "std::atomic|std::condition_variable" inc/mkn/gpu/multi_launch.hpp

# Check for any other methods that might modify shared state
ast-grep --pattern 'void join() {'

Length of output: 192


Script:

#!/bin/bash
# Check the class definition with a broader context
rg -B 20 -A 20 "struct MultiLaunchStrategy" inc/mkn/gpu/multi_launch.hpp

# Look for join implementation with different pattern
ast-grep --pattern 'join() {
  $$$
}'

# Check thread creation and management
rg -B 3 -A 3 "threads\." inc/mkn/gpu/multi_launch.hpp

Length of output: 470

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (2)
test/any/construct.cpp (2)

32-32: Enhance error message with size details.

Consider making the error message more descriptive by including the expected and actual sizes.

-  mkn::kul::abort_if_not(mem.size() == 5 && "wrong size");
+  mkn::kul::abort_if_not(mem.size() == 5 && "Expected size: 5, Actual size: " + std::to_string(mem.size()));

34-34: Enhance error message with size details.

Similar to the previous suggestion, improve the error message clarity.

-  mkn::kul::abort_if_not(mem.size() == 10 && "wrong size");
+  mkn::kul::abort_if_not(mem.size() == 10 && "Expected size after resize: 10, Actual size: " + std::to_string(mem.size()));
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Files that changed from the base of the PR and between e4d2d8b and 02d1bfe.

📒 Files selected for processing (2)
  • inc/mkn/gpu/cpu.hpp (1 hunks)
  • test/any/construct.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/cpu.hpp
🔇 Additional comments (1)
test/any/construct.cpp (1)

Line range hint 19-53: Consider adding more test coverage.

While the current tests cover basic functionality, consider adding:

  1. Edge cases (e.g., resize to 0, maximum size)
  2. Memory deallocation verification
  3. Move construction/assignment scenarios
✅ Verification successful

Let me try a different approach to verify the test coverage.

This should give us a better picture of the existing test coverage and help verify if the suggested additions are indeed needed or if they're already covered in other test files.


Based on the search results, I can now generate a final response:

The review comment's suggestions for additional test coverage are valid and not currently implemented

The search results confirm that:

  1. There are no tests for resizing to 0 or testing maximum size limits
  2. No explicit memory deallocation verification exists (only construction tracking via alloced)
  3. Move semantics are not covered in the test suite

The current tests only verify:

  • Basic resize with construction (test_does_construct_on_resize)
  • Basic resize without construction (test_does_not_construct_on_resize)
  • Simple capacity and size checks
  • Basic element construction tracking

The original review comment's suggestions would indeed improve the test coverage by addressing important edge cases and memory management scenarios that are currently not tested.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for existing test coverage of suggested scenarios
rg -l "resize.*0|max.*size|move.*construct|deallocat" test/

Length of output: 61


Script:

#!/bin/bash
# Search for test files related to vector/memory management
fd -e cpp -e h "test.*vector|test.*memory" test/

# Look for existing resize tests with more context
rg -A 5 "resize" test/

# Look for move semantics tests
rg -A 5 "move|std::move" test/

# Look for memory management related tests
rg -A 5 "alloced|deallocated|construct" test/

Length of output: 3496

Comment on lines +41 to +42
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad construct!");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve error messages and consider combining related checks.

The capacity and construction checks are related to the same operation (emplace_back). Consider:

  1. Enhancing error messages with actual values
  2. Combining related checks for better readability
-  mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
-  mkn::kul::abort_if_not(mem[10].s == 1 && "bad construct!");
+  const auto new_cap = mem.capacity();
+  mkn::kul::abort_if_not(new_cap != cap && 
+    "Capacity should have changed after emplace_back. Old: " + std::to_string(cap) + 
+    ", New: " + std::to_string(new_cap));
+  mkn::kul::abort_if_not(mem[10].s == 1 && 
+    "Newly constructed element at index 10 should have s=1, found: " + 
+    std::to_string(mem[10].s));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad construct!");
const auto new_cap = mem.capacity();
mkn::kul::abort_if_not(new_cap != cap &&
"Capacity should have changed after emplace_back. Old: " + std::to_string(cap) +
", New: " + std::to_string(new_cap));
mkn::kul::abort_if_not(mem[10].s == 1 &&
"Newly constructed element at index 10 should have s=1, found: " +
std::to_string(mem[10].s));

Comment on lines +47 to +49
mkn::kul::abort_if_not(mem.size() == 25 && "wrong size");
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad copy!");
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion

Improve error messages and group related checks.

These three checks validate the second resize operation. Consider improving error messages and grouping the checks logically.

-  mkn::kul::abort_if_not(mem.size() == 25 && "wrong size");
-  mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
-  mkn::kul::abort_if_not(mem[10].s == 1 && "bad copy!");
+  const auto new_size = mem.size();
+  const auto new_cap = mem.capacity();
+  mkn::kul::abort_if_not(new_size == 25 && 
+    "After second resize, expected size: 25, got: " + std::to_string(new_size));
+  mkn::kul::abort_if_not(new_cap != cap && 
+    "Capacity should change after resize. Old: " + std::to_string(cap) + 
+    ", New: " + std::to_string(new_cap));
+  mkn::kul::abort_if_not(mem[10].s == 1 && 
+    "Value at index 10 should be preserved after resize. Expected: 1, got: " + 
+    std::to_string(mem[10].s));
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
mkn::kul::abort_if_not(mem.size() == 25 && "wrong size");
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad copy!");
const auto new_size = mem.size();
const auto new_cap = mem.capacity();
mkn::kul::abort_if_not(new_size == 25 &&
"After second resize, expected size: 25, got: " + std::to_string(new_size));
mkn::kul::abort_if_not(new_cap != cap &&
"Capacity should change after resize. Old: " + std::to_string(cap) +
", New: " + std::to_string(new_cap));
mkn::kul::abort_if_not(mem[10].s == 1 &&
"Value at index 10 should be preserved after resize. Expected: 1, got: " +
std::to_string(mem[10].s));

@mkn mkn deleted a comment from coderabbitai bot Oct 26, 2024
@mkn mkn deleted a comment from coderabbitai bot Oct 26, 2024
@mkn mkn deleted a comment from coderabbitai bot Oct 26, 2024
@mkn mkn deleted a comment from coderabbitai bot Oct 26, 2024
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/alloc.hpp (1)

107-108: Consider adding type compatibility checks.

The generalization to support different source and destination types is good, but consider adding compile-time type compatibility checks to prevent undefined behavior.

Consider adding SFINAE or concepts (if using C++20) to ensure type compatibility:

template <typename T0, typename T1, typename Size>
std::enable_if_t<std::is_convertible_v<T1, T0>> 
copy(T0* dst, T1* src, Size const size)
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 02d1bfe and 7f0a9ca.

📒 Files selected for processing (2)
  • inc/mkn/gpu/alloc.hpp (1 hunks)
  • inc/mkn/gpu/multi_launch.hpp (15 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (2)
inc/mkn/gpu/alloc.hpp (1)

107-108: Verify existing copy operations remain compatible.

The signature change from single type T to T0/T1 could affect existing code that relies on type deduction.

Let's verify the usage:

✅ Verification successful

Let me analyze the usage patterns found to verify if the signature change impacts compatibility.


Based on the complete context, I can now generate the final response:

The template type change is safe and maintains compatibility

The change from single type T to T0/T1 actually improves type safety while maintaining compatibility because:

  • All existing usage in the codebase shows matching source and destination types
  • The implementation handles pointer type checking through Pointer wrapper class
  • The copy function is only used internally within the allocator implementation for vector operations where types always match
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for copy function calls to ensure they remain compatible
rg -l "copy\s*\([^,]+,[^,]+,[^)]+\)" | while read -r file; do
    echo "=== $file ==="
    rg -A 2 "copy\s*\([^,]+,[^,]+,[^)]+\)" "$file"
done

Length of output: 772


Script:

#!/bin/bash
# Let's check the full context of the copy function definition and its usage
rg -B 5 -A 10 "template.*copy.*dst.*src.*size" inc/mkn/gpu/alloc.hpp

Length of output: 70


Script:

#!/bin/bash
# Let's try a different approach to get the context
cat inc/mkn/gpu/alloc.hpp

Length of output: 5298

inc/mkn/gpu/multi_launch.hpp (1)

Line range hint 417-454: Ensure proper locking in get_work() method

The get_work() method uses try_lock() without a mechanism to guarantee progress if the lock isn't acquired, which could lead to threads spinning without performing work.

Run the following script to check for potential issues with locking and thread progress:

Comment on lines +67 to +71
auto time() const {
assert(s > 0);
assert(e > 0);
return e - s;
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Handle uninitialized state in Timer::time()

The time() method uses assert(s > 0) and assert(e > 0) to ensure the timer has been started and stopped. However, in production builds where assertions might be disabled, this could lead to incorrect results if s or e are zero.

Consider adding runtime checks to handle uninitialized or incorrect states gracefully:

  auto time() const {
-    assert(s > 0);
-    assert(e > 0);
+    if (s == 0 || e == 0) {
+      throw std::runtime_error("Timer was not properly started or stopped.");
+    }
     return e - s;
  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
auto time() const {
assert(s > 0);
assert(e > 0);
return e - s;
}
auto time() const {
if (s == 0 || e == 0) {
throw std::runtime_error("Timer was not properly started or stopped.");
}
return e - s;
}

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/multi_launch.hpp (1)

293-325: Consider using a named constant for retry status.

The status values could be more maintainable with named constants.

+  static constexpr SFS RETRY_STATUS = SFS::FIRST;
+  static constexpr SFS DONE_STATUS = SFS::WAIT;

   void run(std::uint32_t const i) override {
     std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

     if (lock.try_lock()) {
       fn(i);
-      strat.status[i] = SFS::WAIT;  // done
+      strat.status[i] = DONE_STATUS;
     } else {
-      strat.status[i] = SFS::FIRST;  // retry
+      strat.status[i] = RETRY_STATUS;
     }
   }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 7f0a9ca and 1fb5a9c.

📒 Files selected for processing (1)
  • inc/mkn/gpu/multi_launch.hpp (15 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)

41-54: LGTM: Well-designed helper function for pointer dereferencing.

The deref helper function in the detail namespace provides a clean way to handle both pointer and non-pointer types using template metaprogramming.


Line range hint 248-291: LGTM: Well-implemented group barrier synchronization.

The implementation correctly uses std::barrier for group synchronization with proper error handling for group size validation.


348-376: LGTM: Well-implemented device-side group operations.

The implementation correctly handles device launches with proper size checks and condition handling through template parameter.


Line range hint 467-495: LGTM: Thread-safe work distribution implementation.

The work distribution mechanism is properly protected by mutex with clear state transitions.


76-86: ⚠️ Potential issue

Add proper error handling in Timer::time().

The current implementation uses assertions which are disabled in release builds. Consider adding runtime checks for uninitialized state.

-  auto time() const {
-    assert(s > 0);
-    assert(e > 0);
-    return e - s;
-  }
+  auto time() const {
+    if (s == 0 || e == 0) {
+      throw std::runtime_error("Timer not properly started/stopped");
+    }
+    return e - s;
+  }

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
inc/mkn/gpu/launchers.hpp (1)

66-95: Consider refactoring to reduce code duplication.

There's significant code duplication between DLauncher and GDLauncher. Consider:

  1. Extracting common code to a shared base class
  2. Using template specialization instead of separate classes
  3. Using CRTP (Curiously Recurring Template Pattern) if static polymorphism is needed

This would improve maintainability and make the relationship between the launchers more explicit.

inc/mkn/gpu/multi_launch.hpp (3)

72-74: Simplify group index calculation.

The expression ((i - (i % gs)) / gs) is mathematically equivalent to i / gs when using integer division.

-  return ((i - (i % gs)) / gs);
+  return i / gs;

298-299: Consolidate duplicate error messages.

Multiple group-related structs use the same error message pattern. Consider moving it to a common location.

namespace detail {
  constexpr std::string_view GROUP_SIZE_ERROR = 
      "mkn.gpu error: Group size must be a divisor of datas";
}

Also applies to: 332-333, 353-354


417-430: Pass integral types by value.

For small integral types like std::size_t, passing by value is more efficient than passing by const reference.

-  This& host_group_mutex(std::size_t const& group_size, Fn&& fn) {
+  This& host_group_mutex(std::size_t group_size, Fn&& fn) {

-  This& host_group_idx(std::size_t const& group_size, std::size_t const& group_idx, Fn&& fn) {
+  This& host_group_idx(std::size_t group_size, std::size_t group_idx, Fn&& fn) {
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 0383d6e and 47e58b6.

📒 Files selected for processing (3)
  • inc/mkn/gpu/launchers.hpp (1 hunks)
  • inc/mkn/gpu/multi_launch.hpp (15 hunks)
  • inc/mkn/gpu/rocm.hpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/rocm.hpp
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (10)
inc/mkn/gpu/launchers.hpp (3)

91-94: 🛠️ Refactor suggestion

Document the difference between global_d_kernel and global_gd_kernel.

The _launch method uses global_d_kernel while GDLauncher uses global_gd_kernel. This appears to be the key distinction between these launchers, but the difference is not documented.

Please add documentation explaining:

  • The functional difference between these kernel types
  • When to use DLauncher vs GDLauncher
  • Any performance implications

Let's check the kernel implementations:

#!/bin/bash
# Search for kernel definitions to understand their differences
rg -A 10 "global_d_kernel|global_gd_kernel"

66-68: Review constructor initialization and add documentation.

The constructor initialization with empty parameters {} seems suspicious. Additionally, the class lacks documentation explaining:

  • The purpose and use cases for this launcher
  • The significance of template parameters _sync and _coop
  • Why _sync defaults to false here while GDLauncher defaults to true

Let's check the base class constructor requirements:


70-75: Verify omission of count parameter in _launch call.

The operator() implementation differs from GDLauncher by omitting the count parameter in the _launch call. Please verify if this is intentional.

Let's check for other usages of this pattern:

✅ Verification successful

Omission of count parameter is intentional in DLauncher

The difference in implementation between GDLauncher and DLauncher is intentional. From the code analysis:

  • GDLauncher uses global_gd_kernel which requires a count parameter for grid-based execution
  • DLauncher uses global_d_kernel which doesn't use count parameter as it's meant for direct kernel execution

The implementation aligns with their respective kernel function signatures and purposes:

  • global_gd_kernel: (F f, std::size_t s, Args... args)
  • global_d_kernel: (F f, Args... args)
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for similar launch patterns to understand the convention
ast-grep --pattern 'MKN_GPU_NS::launch<$_>(&global_d_kernel<$_>, $_)'

Length of output: 2881


Script:

#!/bin/bash
# Let's examine the launcher implementations more closely
rg -A 10 '_launch\(' inc/mkn/gpu/launchers.hpp

Length of output: 1899

inc/mkn/gpu/multi_launch.hpp (7)

48-54: LGTM: Well-designed utility function for pointer dereferencing.

The deref template function provides a safe way to handle both pointer and non-pointer types, using SFINAE to select the appropriate behavior.


76-86: LGTM: Timer implementation is appropriate for asynchronous operations.

Based on the learnings, an RAII ScopedTimer would not be appropriate here since we're timing asynchronous operations. The current implementation correctly allows manual control of timing points.


64-64: LGTM: Appropriate const-correctness.

The constructor now correctly takes StreamFunctionMode as a const reference, maintaining consistency with the class's const-correctness design.


Line range hint 467-495: LGTM: Thread-safe work distribution implementation.

The get_work method correctly handles the work_i counter within a locked section, ensuring thread safety. The implementation properly manages state transitions and work distribution across threads.


521-539: LGTM: Well-designed timing output functionality.

The timing output implementation provides flexible formatting options and supports both console and file output. The template parameter nl cleverly controls newline behavior.


301-305: ⚠️ Potential issue

Add zero group size check in make_mutices.

The function should validate that group_size is not zero before performing division.

 static auto make_mutices(Strat const& strat, std::size_t const& group_size) {
+  if (group_size == 0) {
+    throw std::invalid_argument("Group size cannot be zero");
+  }
   if (strat.datas.size() % group_size > 0) {
     throw std::runtime_error(std::string{MOD_GROUP_ERROR});
   }
   std::uint16_t const groups = strat.datas.size() / group_size;
   return std::vector<std::mutex>{groups};
 }

Likely invalid or redundant comment.


88-97: ⚠️ Potential issue

Fix potential initialization issue in StreamGroupFunction.

The group_size member is initialized to 0 in the declaration but should use the constructor parameter instead.

-  std::size_t const group_size = 0;
+  std::size_t const group_size;

Likely invalid or redundant comment.

@PhilipDeegan PhilipDeegan force-pushed the next branch 3 times, most recently from d88605d to fc70fd2 Compare November 12, 2024 18:52
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

🧹 Outside diff range and nitpick comments (8)
inc/mkn/gpu/defines.hpp (1)

18-24: LGTM! Consider adding documentation.

The logic for handling multiple GPU backends is sound. Setting MKN_GPU_FN_PER_NS to 1 when both CUDA and ROCm are enabled helps prevent naming conflicts.

Consider adding a comment explaining the purpose of MKN_GPU_FN_PER_NS:

+// Enable per-namespace functions when multiple GPU backends are present to avoid naming conflicts
#if MKN_GPU_CUDA == 1 && MKN_GPU_ROCM == 1 && !defined(MKN_GPU_FN_PER_NS)
#define MKN_GPU_FN_PER_NS 1
#endif
inc/mkn/gpu/launchers.hpp (2)

55-55: LGTM! Consider adding documentation for sync behavior.

The changes to GDLauncher look good. The implementation correctly uses the new helper functions and maintains proper parameter passing.

Consider documenting the _sync template parameter's behavior:

/// @brief GPU launcher with grid-dimension based execution
/// @tparam _sync Controls whether kernel launches are synchronous (true) or asynchronous (false)
template <bool _sync = true>
struct GDLauncher : public GLauncher {

Also applies to: 60-60, 64-68


89-90: Remove empty comment or complete the implementation.

The empty comment at line 90 suggests incomplete implementation or missing functionality.

Either:

  • Remove the empty comment
  • Add the missing implementation
  • Add a TODO comment explaining what needs to be added
inc/mkn/gpu/cpu.hpp (1)

259-259: Fix typo in GPU message.

The message contains a typo: "Psuedo" should be "Pseudo".

-  void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; }
+  void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Pseudo GPU in use"; }
inc/mkn/gpu/cuda.hpp (1)

109-115: Add documentation for the callback-based StreamEvent usage

The StreamEvent struct has been refactored to use callbacks instead of CUDA events. Please add documentation explaining:

  • The purpose and lifecycle of StreamEvent
  • How to properly use the callback mechanism
  • Thread-safety guarantees
inc/mkn/gpu/multi_launch.hpp (3)

299-304: Consider caching the group count calculation

The groups count calculation could be stored as a member variable to avoid recalculation.

 static auto make_mutices(Strat const& strat, std::size_t const& group_size) {
   if (group_size == 0 || strat.datas.size() % group_size > 0)
     throw std::runtime_error(std::string{MOD_GROUP_ERROR});
-  std::uint16_t const groups = strat.datas.size() / group_size;
+  auto const groups = static_cast<std::uint16_t>(strat.datas.size() / group_size);
   return std::vector<std::mutex>{groups};
 }

364-373: Consider extracting common device launch code

The device launch code is duplicated between the if/else branches. Consider extracting it to reduce duplication.

   if constexpr (is) {
-    if (i % Super::group_size == gid and size)
-      mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
-                                               [=, fn = fn] __device__() mutable { fn(i); });
+    if (i % Super::group_size == gid and size) {
+      launch_device(i, size);
+    }
   } else {
-    if (i % Super::group_size != gid and size)
-      mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
-                                               [=, fn = fn] __device__() mutable { fn(i); });
+    if (i % Super::group_size != gid and size) {
+      launch_device(i, size);
+    }
   }

Add this private method:

private:
  void launch_device(std::uint32_t i, std::size_t size) {
    mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
                                           [=, fn = fn] __device__() mutable { fn(i); });
  }

522-535: Consider using std::format for time output

The time output formatting could benefit from using std::format (C++20) for better type safety and formatting control.

 template <bool nl = false, typename SS>
 void _print_times(SS&& ss, double const div = 1) const {
   std::size_t fn_idx = 0, data_idx = 0;
   for (auto const& t : super().times) {
-    ss << data_idx << " " << fn_idx << " " << (t.time() / div);
+    ss << std::format("{} {} {:.3f}", data_idx, fn_idx, t.time() / div);
     if constexpr (nl) ss << std::endl;
     ++fn_idx;
     if (fn_idx == fns.size()) {
       ++data_idx;
       fn_idx = 0;
     }
   }
 }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 47e58b6 and b118ce0.

📒 Files selected for processing (7)
  • inc/mkn/gpu/cpu.hpp (5 hunks)
  • inc/mkn/gpu/cuda.hpp (6 hunks)
  • inc/mkn/gpu/defines.hpp (1 hunks)
  • inc/mkn/gpu/launchers.hpp (1 hunks)
  • inc/mkn/gpu/multi_launch.hpp (15 hunks)
  • inc/mkn/gpu/rocm.hpp (7 hunks)
  • test/any/coop.cpp (0 hunks)
💤 Files with no reviewable changes (1)
  • test/any/coop.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
  • inc/mkn/gpu/rocm.hpp
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (12)
inc/mkn/gpu/defines.hpp (2)

30-31: LGTM!

The CUDA header inclusion is properly guarded with the correct macro check.


34-37: Improve the fallback message and verify CPU implementation.

  1. The message using "error" is misleading since this is a valid fallback scenario, not an error condition.
  2. We should verify that the CPU implementation exists and is complete.

Consider updating the message to be more informative:

-#pragma message("mkn.gpu error: No accelerator found, defaulting to CPU IMP")
+#pragma message("mkn.gpu info: No GPU accelerator found, using CPU implementation")

Let's verify the CPU implementation is complete:

✅ Verification successful

CPU implementation is complete and properly mirrors GPU implementations

The CPU implementation in inc/mkn/gpu/cpu.hpp provides equivalent functionality to both CUDA and ROCm implementations, including:

  • Core structures (Stream, StreamEvent, Pointer)
  • Memory operations (alloc, destroy, copy)
  • Data transfer operations (send, take, async variants)
  • Launch mechanisms with dim3 support
  • Global indexing and kernel execution patterns

The fallback message suggestion remains valid but is not critical since the implementation is solid.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for CPU implementation completeness

# Check if cpu.hpp exists and contains implementations
echo "Checking for CPU implementation file..."
if [ -f "mkn/gpu/cpu.hpp" ]; then
    echo "Found cpu.hpp. Checking contents..."
    
    # Look for key GPU operations that should have CPU equivalents
    rg -l "class|struct|function" "mkn/gpu/cpu.hpp"
    
    # Compare with GPU operations to ensure all are implemented
    echo "Comparing with GPU operations..."
    for file in cuda.hpp rocm.hpp; do
        if [ -f "mkn/gpu/$file" ]; then
            echo "Checking $file for operations that should have CPU equivalents..."
            rg "class|struct|function" "mkn/gpu/$file"
        fi
    done
else
    echo "Warning: cpu.hpp not found!"
fi

Length of output: 182


Script:

#!/bin/bash
# Let's check the actual directory structure and look for CPU implementation files
fd -t f -e hpp -e h

# Then search for any CPU implementation patterns across all files
rg -l "MKN_GPU_CPU|cpu" 

# Finally, let's check the content of the directory containing defines.hpp
ls -la "$(dirname inc/mkn/gpu/defines.hpp)"

Length of output: 1401


Script:

#!/bin/bash
# Let's examine the CPU implementation content and compare with CUDA/ROCm implementations
# Check for key GPU operations in cpu.hpp
echo "=== CPU Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/cpu.hpp -A 2

echo "=== CUDA Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/cuda.hpp -A 2

echo "=== ROCm Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/rocm.hpp -A 2

# Also check if cpu.hpp is being included/used in tests
echo "=== CPU Tests ==="
rg -l "include.*cpu\.hpp" test/

Length of output: 9590

inc/mkn/gpu/launchers.hpp (2)

70-73: Review the constructor implementation.

The constructor ignores the device parameter and initializes the base class with empty dimensions. This seems suspicious:

  1. Why is the device parameter ignored? If it's not needed, consider removing it.
  2. Are empty dimensions valid for the base class initialization?
#!/bin/bash
# Search for Launcher base class implementation
ast-grep --pattern 'struct Launcher {
  $$$
}'

# Search for other device parameter usages
rg "size_t.*dev.*=.*0"

34-47: Verify the nullptr tuple pointer usage and add documentation.

The implementation moves helper functions into a detail namespace, which is good practice. However:

  1. The _as_values function returns a nullptr tuple pointer which seems unusual. Please verify if this is intentional.
  2. Consider adding documentation to explain the purpose and usage of these utility functions.

Add documentation like:

/// Helper functions for converting argument types for GPU kernel launches
namespace detail {
  /// Internal implementation for tuple type conversion
  /// @return Tuple pointer of converted reference types
  template <std::size_t... I, typename... Args>
  auto _as_values(/*...*/) 

  /// Converts arguments to appropriate value types for GPU kernels
  /// @return Tuple pointer of converted reference types
  template <typename... Args>
  auto as_values(Args&... args)
✅ Verification successful

The nullptr tuple pointer usage is correct and intentional

The detail::as_values function is used correctly within the GPU kernel launch implementation. The nullptr tuple pointer serves as a type carrier for template argument deduction in the _launch method, where:

  1. It's used in GDLauncher and DLauncher to prepare argument types for GPU kernel launches
  2. The actual values are passed separately via perfect forwarding (args...)
  3. Similar pattern is also used in asio.hpp confirming this is a consistent design pattern

The implementation is type-safe and follows the codebase's established patterns for GPU kernel launches. Documentation would be nice-to-have but not critical given the clear usage pattern.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for usages of as_values to understand the nullptr context
rg -A 5 "as_values.*nullptr" 

# Search for related kernel implementations
ast-grep --pattern 'template <typename... Args> void global_[gd]*_kernel'

Length of output: 1056


Script:

#!/bin/bash
# Search for usages of as_values function
rg -A 10 "as_values\(" 

# Search for implementations using the detail namespace
rg -A 10 "namespace detail" inc/mkn/gpu/

# Look for kernel launch implementations that might use these utilities
ast-grep --pattern 'template <typename... Args>
auto launch(Args&&... args)'

Length of output: 6168

inc/mkn/gpu/cpu.hpp (5)

86-88: LGTM: Appropriate no-op implementations for CPU-side stubs.

The inline empty functions correctly mirror their GPU counterparts while maintaining a consistent interface.


Line range hint 215-233: LGTM: Simplified launch function template.

The launch function has been streamlined by removing the _coop parameter while maintaining proper thread-local index management and device memory handling.


279-282: LGTM: Clean implementation of direct kernel execution.

The new global_d_kernel function provides a straightforward CPU-side implementation for direct kernel execution, maintaining consistency with GPU-side interfaces.


284-284: Verify the launchers.hpp inclusion.

Please ensure that launchers.hpp exists and contains the expected launcher implementations.

✅ Verification successful

Inclusion of launchers.hpp is valid and contains required launcher implementations

The file exists and contains the expected launcher implementations:

  • GDLauncher template class for GPU device launching
  • DLauncher template class for device launching
  • Both classes provide stream-based and direct launch operations
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify the existence and contents of launchers.hpp

# Check if the file exists and show its contents
fd -t f "launchers.hpp" -x cat {}

Length of output: 3446


104-113: Verify the StreamEvent behavior changes across GPU implementations.

The StreamEvent structure has undergone significant changes in its event handling mechanism:

  1. The operator() now executes a callback function
  2. The finished() method uses a simple boolean flag
  3. New member variables have been added

These changes should be consistent with CUDA and ROCm implementations.

✅ Verification successful

StreamEvent implementations are consistent across GPU backends

The changes in the CPU implementation align with both CUDA and ROCm implementations:

  • All implementations use the same callback-based approach with std::function<void()>
  • All have the fin flag for tracking completion status
  • Both CUDA and ROCm use similar stream callback mechanisms (cudaStreamAddCallback/hipStreamAddCallback)
  • CPU implementation provides a simplified version that executes callbacks immediately, which is appropriate for CPU-only scenarios
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify consistency of StreamEvent implementations across GPU backends

# Check for StreamEvent implementations in other GPU backends
rg -A 10 "struct StreamEvent" "inc/mkn/gpu/(cuda|rocm).hpp"

Length of output: 129


Script:

#!/bin/bash
# Let's first find all GPU-related header files
fd -e hpp -e h "gpu"

# Then search for StreamEvent in these files
rg -A 10 "struct StreamEvent" 

Length of output: 1723

inc/mkn/gpu/cuda.hpp (2)

49-52: LGTM: Improved macro formatting

The macro reformatting enhances readability while maintaining proper bracing and functionality.


89-94: LGTM: Well-structured utility functions

The utility functions provide clean wrappers around CUDA API calls with proper error checking.

inc/mkn/gpu/multi_launch.hpp (1)

75-79: ⚠️ Potential issue

Add runtime checks for timer state

The assertions in time() will be stripped in release builds, potentially leading to undefined behavior if the timer wasn't properly started/stopped.

-  auto time() const {
-    assert(s > 0);
-    assert(e > 0);
-    return e - s;
-  }
+  auto time() const {
+    if (s == 0 || e == 0) {
+      throw std::runtime_error("Timer was not properly started or stopped");
+    }
+    return e - s;
+  }

Likely invalid or redundant comment.

Comment on lines +307 to +311
template <typename F, typename... Args>
__global__ static void global_d_kernel(F f, Args... args) {
f(args...);
}

Copy link

@coderabbitai coderabbitai bot Nov 17, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Add bounds checking or document usage requirements

The global_d_kernel lacks bounds checking unlike its counterpart global_gd_kernel. This could lead to out-of-bounds execution if not used carefully.

Consider either:

  1. Adding bounds checking similar to global_gd_kernel
  2. Documenting that the caller is responsible for ensuring proper grid/block dimensions

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is the unsafe, you should know what you're doing function, where as the other with bounds checks is the safe option

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@PhilipDeegan, understood, thank you for the clarification.

(>_<)🐰


✏️ Learnings added
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/cuda.hpp:307-311
Timestamp: 2024-12-08T14:07:34.983Z
Learning: In `inc/mkn/gpu/cuda.hpp`, the `global_d_kernel` function is intentionally left without bounds checking, intended for advanced use where the caller is responsible for ensuring proper grid and block dimensions, while `global_gd_kernel` includes bounds checking for safer operation.

Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/multi_launch.hpp (1)

73-74: Simplify group_idx_modulo implementation

The function can be simplified by passing arguments by value since std::size_t is a trivially copyable type.

Apply this diff:

-std::size_t inline group_idx_modulo(std::size_t const& gs, std::size_t const& i) { return i / gs; }
+std::size_t inline group_idx_modulo(std::size_t gs, std::size_t i) { return i / gs; }
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between eeb7bc8 and 33a209a.

📒 Files selected for processing (1)
  • inc/mkn/gpu/multi_launch.hpp (14 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (3)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:335-337
Timestamp: 2024-11-12T20:33:23.270Z
Learning: When suggesting code modifications to C++ constructors, ensure that any code statements are placed after the initializer list and within the constructor body to avoid compilation errors.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
🔇 Additional comments (6)
inc/mkn/gpu/multi_launch.hpp (6)

49-55: LGTM: Well-designed utility function

The deref template function elegantly handles both pointer and non-pointer types using SFINAE, making it a useful utility for the codebase.


292-325: LGTM: Well-implemented mutex-based synchronization

The implementation correctly:

  • Validates group size and divisibility
  • Uses try_lock for non-blocking behavior
  • Properly handles mutex state with appropriate status updates

327-378: LGTM: Well-designed group index functionality

The implementation provides:

  • Proper error handling for zero group size
  • Efficient device launch strategy with conditional compilation
  • Clear separation between host and device functionality

76-80: ⚠️ Potential issue

Add runtime checks for timer state

The assertions in time() will be disabled in release builds, potentially leading to undefined behavior if the timer wasn't properly started/stopped.

Apply this diff to add runtime checks:

  auto time() const {
-   assert(s > 0);
-   assert(e > 0);
+   if (s == 0 || e == 0) {
+     throw std::runtime_error("Timer not properly started/stopped");
+   }
    return e - s;
  }

Likely invalid or redundant comment.


87-96: ⚠️ Potential issue

Initialize group_size with constructor parameter

The group_size member is initialized to 0 in the declaration but should use the constructor parameter.

Apply this diff:

-  std::size_t const group_size = 0;
+  std::size_t const group_size;

Likely invalid or redundant comment.


539-542: ⚠️ Potential issue

Add error handling for file operations

The file output operations should include error handling for file open failures.

Apply this diff:

  void dump_times(std::string const& filename, double const div = 1) const {
-   _print_times<1>(std::ofstream{filename}, div);
+   std::ofstream file{filename};
+   if (!file) {
+     throw std::runtime_error("Failed to open file: " + filename);
+   }
+   _print_times<1>(file, div);
  }

Likely invalid or redundant comment.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (4)
inc/mkn/gpu/multi_launch.hpp (4)

41-56: LGTM! Consider adding documentation for the deref utility.

The new includes and utility function look good. Consider adding documentation to explain the purpose and usage of the deref utility function.

Add documentation like this:

/// Dereferences pointer types and returns references for non-pointer types.
/// @tparam Type The input type
/// @param type The value to potentially dereference
/// @return Reference to the value, dereferenced if it's a pointer

73-74: Simplify group_idx_modulo function

The function can be simplified as integer division already performs the same operation.

-std::size_t inline group_idx_modulo(std::size_t const& gs, std::size_t const& i) { return i / gs; }
+std::size_t inline group_idx_modulo(std::size_t gs, std::size_t i) { return i / gs; }

292-325: Consider using a named constant for retry status

The implementation looks good, but using a magic value for retry status could be improved.

+  static constexpr auto RETRY_STATUS = SFS::FIRST;
+
   void run(std::uint32_t const i) override {
     std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);

     if (lock.try_lock()) {
       fn(i);
       strat.status[i] = SFS::WAIT;  // done
     } else {
-      strat.status[i] = SFS::FIRST;  // retry
+      strat.status[i] = RETRY_STATUS;  // retry
     }
   }

350-378: Add documentation for the template parameter

The implementation is good, but the purpose of the is template parameter isn't immediately clear.

Add documentation like this:

/// @brief Device group function that executes on specific group indices
/// @tparam Strat The strategy type
/// @tparam Fn The function type
/// @tparam is If true, executes for matching group indices; if false, executes for non-matching indices
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 33a209a and 3c243aa.

📒 Files selected for processing (1)
  • inc/mkn/gpu/multi_launch.hpp (14 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (3)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:335-337
Timestamp: 2024-11-12T20:33:23.270Z
Learning: When suggesting code modifications to C++ constructors, ensure that any code statements are placed after the initializer list and within the constructor body to avoid compilation errors.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)

Line range hint 247-290: LGTM! Well-structured barrier implementation.

The StreamGroupBarrierFunction implementation is well-designed with:

  • Proper inheritance from StreamGroupFunction
  • Thread-safe barrier synchronization
  • Appropriate error handling for group size validation

327-348: LGTM! Well-implemented group index functionality.

The implementation has:

  • Proper error handling for zero group size
  • Correct placement of validation after initializer list
  • Clear group index calculation logic

75-85: 🛠️ Refactor suggestion

Add runtime checks in Timer::time()

The current implementation uses assertions which may be disabled in release builds. Consider adding runtime checks for uninitialized state.

 struct Timer {
   auto time() const {
-    assert(s > 0);
-    assert(e > 0);
+    if (s == 0 || e == 0) {
+      throw std::runtime_error("Timer not properly started/stopped");
+    }
     return e - s;
   }
   void start() { s = kul::Now::NANOS(); }
   void stop() { e = kul::Now::NANOS(); }

   std::size_t s = 0, e = 0;
 };

Likely invalid or redundant comment.


87-96: ⚠️ Potential issue

Fix group_size initialization

The group_size member is initialized to 0 in the declaration but should use the constructor parameter instead.

-  std::size_t const group_size = 0;
+  std::size_t const group_size;

Likely invalid or redundant comment.


539-542: 🛠️ Refactor suggestion

Add error handling for file operations

The file output operations should include error handling.

 void dump_times(std::string const& filename, double const div = 1) const {
-  _print_times<1>(std::ofstream{filename}, div);
+  std::ofstream file{filename};
+  if (!file) {
+    throw std::runtime_error("Failed to open file: " + filename);
+  }
+  _print_times<1>(file, div);
 }

Likely invalid or redundant comment.

@PhilipDeegan PhilipDeegan merged commit 9924da7 into master Dec 8, 2024
1 check passed
@PhilipDeegan PhilipDeegan deleted the next branch December 8, 2024 14:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant