host group mutex #23
Conversation
Caution: Inline review comments failed to post
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (6)
test/any/async_streaming.cpp (1)
**142-142: Consider adding more comprehensive result verification**

The current check

```cpp
if (vals != std::vector<size_t>{3, 12}) return 1;
```

verifies the final state of `vals`. While this is a good basic check, consider adding more detailed assertions or error messages to provide better feedback in case of failure. You could enhance the verification like this:

```cpp
if (vals != std::vector<size_t>{3, 12}) {
  std::cerr << "Unexpected vals: ";
  for (const auto& val : vals) std::cerr << val << " ";
  std::cerr << std::endl;
  return 1;
}
```

inc/mkn/gpu/multi_launch.hpp (5)
**53-53: Avoid unnecessary `const` qualifier on pass-by-value parameter**

The `mode_` parameter is passed by value in the constructor. Adding `const` to a pass-by-value parameter does not affect the caller or the function's signature; it only constrains the callee's local copy. Consider removing the redundant `const` qualifier. Apply this diff:

```diff
- StreamFunction(Strat& strat_, StreamFunctionMode const mode_) : strat{strat_}, mode{mode_} {}
+ StreamFunction(Strat& strat_, StreamFunctionMode mode_) : strat{strat_}, mode{mode_} {}
```
**61-63: Prefer passing fundamental types by value instead of by `const` reference**

For fundamental types like `std::size_t`, passing by value is more efficient than passing by `const` reference, which adds an unnecessary level of indirection. Consider changing the function parameters to pass by value. Apply this diff:

```diff
-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+std::size_t group_idx_modulo(std::size_t gs, std::size_t i) {
```
**68-69: Prefer passing fundamental types by value instead of by `const` reference**

In the constructor of `StreamGroupFunction`, `gs` is a `std::size_t`, a fundamental type. Passing it by value is generally more efficient than passing by `const` reference. Similarly, consider removing the unnecessary `const` from the `mode_` parameter. Apply this diff:

```diff
- StreamGroupFunction(std::size_t const& gs, Strat& strat_, StreamFunctionMode const mode_)
+ StreamGroupFunction(std::size_t gs, Strat& strat_, StreamFunctionMode mode_)
```
**281-283: Prefer passing fundamental types by value instead of by `const` reference**

In the constructor of `StreamHostGroupMutexFunction`, `gs` is a `std::size_t`. Passing fundamental types by value is more efficient than passing by `const` reference. Consider changing the parameter to pass by value. Apply this diff:

```diff
- StreamHostGroupMutexFunction(std::size_t const& gs, Strat& strat, Fn&& fn_)
+ StreamHostGroupMutexFunction(std::size_t gs, Strat& strat, Fn&& fn_)
```
**342-344: Prefer passing fundamental types by value instead of by `const` reference**

In the `host_group_mutex` method, `group_size` is a `std::size_t`. Passing fundamental types by value is more efficient than passing by `const` reference. Consider changing the parameter to pass by value. Apply this diff:

```diff
- This& host_group_mutex(std::size_t const& group_size, Fn&& fn) {
+ This& host_group_mutex(std::size_t group_size, Fn&& fn) {
```
🛑 Comments failed to post (1)
inc/mkn/gpu/multi_launch.hpp (1)
**289-296: 🛠️ Refactor suggestion: Ensure mutex is properly acquired before using**

In the `run` method, the `unique_lock` is constructed with `std::defer_lock` and `try_lock` is used. If the lock is not acquired, `fn(i)` is not called, and `strat.status[i]` is set to `SFS::FIRST` to retry. This logic may lead to busy waiting and potential performance issues if the mutex is heavily contended. Consider implementing a back-off strategy or using condition variables to avoid tight loops. Consider refactoring the code to include a back-off mechanism:

```diff
 void run(std::uint32_t const i) override {
   auto const gidx = Super::group_idx(i);
   std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
   if (lock.try_lock()) {
     fn(i);
     strat.status[i] = SFS::WAIT;  // done
   } else {
+    // Introduce a small sleep to prevent tight loop in case of contention
+    std::this_thread::sleep_for(std::chrono::milliseconds(1));
     strat.status[i] = SFS::FIRST;  // retry
   }
 }
```

Committable suggestion was skipped due to low confidence.
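For illustration, the condition-variable alternative mentioned above could look like this minimal sketch; `GroupGate` and its members are hypothetical, not part of this PR:

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical per-group gate: a waiting worker sleeps until the current
// holder signals release, instead of spinning through SFS::FIRST retries.
struct GroupGate {
  std::mutex mtx;
  std::condition_variable cv;
  bool busy = false;

  template <typename Fn>
  void run_exclusive(Fn&& fn) {
    std::unique_lock<std::mutex> lock(mtx);
    cv.wait(lock, [this] { return !busy; });  // sleep, no busy-wait
    busy = true;
    lock.unlock();
    fn();  // run the host function outside the gate's internal lock
    lock.lock();
    busy = false;
    lock.unlock();
    cv.notify_one();  // wake one waiting worker
  }
};
```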
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/multi_launch.hpp (4)
**61-63: LGTM: New group_idx_modulo function added**

The `group_idx_modulo` function is a useful addition for group-based operations. Its implementation is correct and improves code readability by encapsulating the group index calculation. Consider marking this function as `constexpr` and `inline` to allow for compile-time evaluation and potential performance improvements:

```diff
-std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+constexpr inline std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
   return ((i - (i % gs)) / gs);
 }
```
**65-75: LGTM: New StreamGroupFunction struct added**

The `StreamGroupFunction` struct is a well-designed abstraction for group-based stream functions. It properly inherits from `StreamFunction` and provides useful group-related functionality. Consider marking the `group_idx` method as `constexpr` and `noexcept` for potential compile-time optimizations and to indicate it doesn't throw exceptions:

```diff
- std::size_t group_idx(std::size_t const& i) const { return group_idx_modulo(group_size, i); }
+ constexpr std::size_t group_idx(std::size_t const& i) const noexcept { return group_idx_modulo(group_size, i); }
```
**267-298: LGTM: New StreamHostGroupMutexFunction added for group-based mutex synchronization**

The `StreamHostGroupMutexFunction` is a valuable addition for synchronizing host-side operations within groups. Its implementation using `std::mutex` is appropriate for host-side synchronization. Consider using `std::shared_mutex` instead of `std::mutex` if multiple readers are expected to access shared resources simultaneously. This could potentially improve performance in scenarios where read operations are more frequent than write operations:

```diff
- std::vector<std::mutex> mutices;
+ std::vector<std::shared_mutex> mutices;
```

Also, to prevent potential busy-waiting, you might want to add a small sleep or yield when the lock isn't acquired:

```diff
 if (lock.try_lock()) {
   fn(i);
   strat.status[i] = SFS::WAIT;  // done
 } else {
   strat.status[i] = SFS::FIRST;  // retry
+  std::this_thread::yield();  // or std::this_thread::sleep_for(std::chrono::milliseconds(1));
 }
```
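To illustrate the reader/writer split that `std::shared_mutex` enables, a minimal standalone sketch (all names here are hypothetical, not from the PR):

```cpp
#include <cstddef>
#include <shared_mutex>
#include <vector>

std::shared_mutex group_mtx;       // hypothetical per-group lock
std::vector<int> shared_data(16);  // hypothetical shared resource

int read_value(std::size_t i) {
  std::shared_lock lock(group_mtx);  // many readers may hold this concurrently
  return shared_data[i];
}

void write_value(std::size_t i, int v) {
  std::unique_lock lock(group_mtx);  // writers get exclusive access
  shared_data[i] = v;
}
```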
**Line range hint 1-461: Overall: Excellent additions to enhance group-based operations and synchronization**

The changes introduce well-designed group-based functionality and mutex synchronization to the GPU multi-launch framework. The new features are cohesively integrated with the existing codebase and provide valuable enhancements for parallel processing scenarios. Consider adding more inline documentation or comments to explain the usage and benefits of the new group-based features and mutex synchronization. This would help users understand when and how to best utilize these new capabilities in their GPU multi-launch operations.
test/any/async_streaming.cpp (1)
**135-137: Remove unnecessary `mutable` keyword in lambda**

The lambda function passed to `.host()` does not modify any captured variables, yet it's marked as `mutable`. Unless there's a specific reason, consider removing the `mutable` keyword for clarity. Apply this diff:

```diff
-.host([&](auto i) mutable {
+.host([&](auto i) {
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- inc/mkn/gpu/multi_launch.hpp (5 hunks)
- test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (3)
**53-53: LGTM: Improved parameter passing in StreamFunction constructor**

The change to pass `mode_` as a constant reference is a good practice. It potentially improves performance by avoiding unnecessary copies without changing the functionality.
**Line range hint 222-265: LGTM: StreamGroupBarrierFunction updated to use new group-based abstraction**

The changes to `StreamGroupBarrierFunction` properly integrate it with the new `StreamGroupFunction` base class. This improves code consistency and leverages the new group functionality. The `run` method now correctly uses the `group_idx` method from the base class, which is a good improvement in code reuse and consistency.
**339-344: LGTM: New host_group_mutex method added to ThreadedStreamLauncher**

The `host_group_mutex` method is a well-implemented addition to the `ThreadedStreamLauncher` class. It provides a convenient way to add group-based, mutex-synchronized host functions to the launcher. The method correctly creates and adds a new `StreamHostGroupMutexFunction` to the `fns` vector, properly forwarding the provided function and group size.

test/any/async_streaming.cpp (2)
**119-152: Function test_threaded_host_group_mutex implementation looks correct**

The implementation of the `test_threaded_host_group_mutex` function appears to correctly utilize the `ThreadedStreamLauncher` with `host_group_mutex` to synchronize host operations across groups. The logic aligns with the intended functionality.
**139-139: ⚠️ Potential issue: Undefined function group_idx_modulo**

The function `group_idx_modulo` used here is not defined within the provided code or the included headers. This will lead to a compilation error due to the undefined reference. Run a script to check whether `group_idx_modulo` is defined elsewhere in the codebase, and ensure that it is properly defined or included from the appropriate header file.
Actionable comments posted: 1
🧹 Outside diff range and nitpick comments (3)
inc/mkn/gpu/multi_launch.hpp (3)
**61-63: New group_idx_modulo function looks good, but consider adding a safety check**

The new `group_idx_modulo` function is a good addition for calculating group indices. However, to prevent potential undefined behavior, consider adding a check for `gs != 0` at the beginning of the function:

```diff
 std::size_t group_idx_modulo(std::size_t const& gs, std::size_t const& i) {
+  if (gs == 0) {
+    throw std::invalid_argument("Group size cannot be zero");
+  }
   return ((i - (i % gs)) / gs);
 }
```
**267-298: Well-implemented StreamHostGroupMutexFunction with a suggestion for improvement**

The new `StreamHostGroupMutexFunction` struct is a good addition that provides mutex-based synchronization for host group functions. The implementation is consistent with the existing code style and error handling. The `run` method's try-lock mechanism prevents deadlocks, which is good. However, it might lead to busy waiting in high-contention scenarios. Consider implementing a backoff strategy or using a condition variable to reduce CPU usage in case of high contention. For example:

```cpp
void run(std::uint32_t const i) override {
  std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
  if (lock.try_lock()) {
    fn(i);
    strat.status[i] = SFS::WAIT;  // done
  } else {
    // Implement exponential backoff
    std::this_thread::sleep_for(std::chrono::milliseconds(backoff_time));
    backoff_time = std::min(backoff_time * 2, max_backoff_time);
    strat.status[i] = SFS::FIRST;  // retry
  }
}
```

This approach would reduce CPU usage while still maintaining the non-blocking nature of the current implementation.
**Line range hint 385-409: Improved efficiency in get_work, but potential race condition introduced**

The changes to the `get_work` method improve efficiency by allowing work to resume from where it left off in previous calls. This is a good optimization for scenarios with many work items. However, the introduction of the shared `work_i` variable, which is modified outside the lock, could lead to race conditions in a multi-threaded environment. To address the potential race condition, consider one of the following approaches:

1. Move the `work_i = 0` reset inside the lock:

```diff
 std::pair<SFP, std::size_t> get_work() {
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
+  work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
- work_i = 0;
```

2. Use a thread-local variable instead of a shared member variable:

```diff
 std::pair<SFP, std::size_t> get_work() {
+  static thread_local std::size_t work_i = 0;
   std::scoped_lock<std::mutex> lk(work_);
   for (; work_i < datas.size(); ++work_i) {
     // ... existing code ...
   }
   work_i = 0;
   if (check_finished()) done = 1;
   return std::make_pair(SFP::SKIP, 0);
 }
- std::size_t work_i = 0;
```

Either of these approaches would maintain the efficiency improvement while avoiding potential race conditions.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- inc/mkn/gpu/multi_launch.hpp (8 hunks)
🧰 Additional context used
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)
**53-53: Improved efficiency in StreamFunction constructor**

The change to pass `mode_` as a const reference is a good optimization. It prevents unnecessary copying and ensures that the `mode_` parameter cannot be accidentally modified within the constructor.
**65-75: Well-designed StreamGroupFunction struct**

The new `StreamGroupFunction` struct is a good addition that provides a clear abstraction for group-based stream functions. It properly inherits from `StreamFunction` and introduces the necessary group-related functionality. The `group_idx` method effectively utilizes the `group_idx_modulo` function, maintaining consistency in the codebase.
**Line range hint 222-265: Improved StreamGroupBarrierFunction with better inheritance**

The changes to `StreamGroupBarrierFunction` are well-implemented:

- Inheriting from `StreamGroupFunction` improves code organization and reuse.
- The constructor properly initializes the base class with the group size.
- The `run` method now uses the `group_idx` method from the base class, which is more consistent.

These modifications enhance the overall design and maintainability of the code.
**339-344: Well-implemented host_group_mutex method**

The new `host_group_mutex` method in `ThreadedStreamLauncher` is a good addition. It provides a clean and consistent way to add mutex-based host group functions to the launcher. The implementation correctly creates and adds a new `StreamHostGroupMutexFunction` to the `fns` vector, maintaining consistency with other similar methods in the class.
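Based on the test described in this PR, usage could look roughly like the following sketch; `datas` and `vals` are placeholders and the exact launch call is an assumption, not confirmed API:

```cpp
// Sketch only: within each group of 3 streams, the per-group mutex
// ensures only one invocation of the lambda runs at a time.
auto launcher = mkn::gpu::ThreadedStreamLauncher{datas};
launcher.host_group_mutex(/*group_size=*/3, [&](auto const i) {
  vals[group_idx_modulo(3, i)] += i;  // serialized per group
});
```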
**Line range hint 1-445: Overall assessment of changes to multi_launch.hpp**

The changes to this file significantly enhance the GPU multi-launch framework by introducing well-designed group-based abstractions and improving existing functionality. Key improvements include:

- The new `StreamGroupFunction` provides a solid foundation for group-based operations.
- The `StreamHostGroupMutexFunction` adds useful mutex-based synchronization for host group functions.
- The `ThreadedStreamLauncher` class has been extended with new capabilities that leverage these group-based abstractions.

While these changes are generally positive, there are a few areas that require attention:

- The `group_idx_modulo` function should include a check for division by zero.
- The locking mechanism in `StreamHostGroupMutexFunction` could be optimized to reduce potential busy waiting.
- The use of the shared `work_i` variable in `ThreadedStreamLauncher` introduces potential thread safety issues that should be addressed.

Addressing these points will further improve the robustness and efficiency of the framework. Overall, this update represents a significant step forward in the capabilities of the GPU multi-launch system.
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (5)
inc/mkn/gpu/rocm.hpp (1)
**208-210: LGTM: Improved readability in alloc_managed function**

The introduction of the `bytes` variable enhances code clarity by separating the size calculation from the `hipMallocManaged` call. This change improves readability and maintainability without altering the function's behavior. Consider adding a comment explaining the purpose of the `bytes` variable for even better clarity:

```diff
+ // Calculate total bytes to allocate
  auto const bytes = size * sizeof(T);
  KLOG(TRC) << "GPU alloced: " << size * sizeof(T);
  MKN_GPU_ASSERT(hipMallocManaged((void**)&p, bytes));
```
test/any/async_streaming.cpp (1)
**142-142: Define expected values as constants for clarity**

The comparison `if (vals != std::vector<size_t>{3, 12})` uses hardcoded magic numbers, which can hinder code readability. Consider defining these values as named constants or adding comments to explain their significance.

inc/mkn/gpu/multi_launch.hpp (3)
**61-63: Simplify group index calculation**

The `group_idx_modulo` function can be simplified. Currently, it calculates the group index using:

```cpp
return ((i - (i % gs)) / gs);
```

This expression is equivalent to integer division `i / gs` for integers. You can simplify the function to:

```cpp
return i / gs;
```

This makes the code clearer and may improve performance slightly.
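The equivalence holds because, for unsigned integers, `i % gs == i - (i / gs) * gs`, so subtracting the remainder merely rounds `i` down to a multiple of `gs` before dividing. A quick exhaustive check over small inputs:

```cpp
#include <cassert>
#include <cstddef>

int main() {
  for (std::size_t gs = 1; gs < 16; ++gs)
    for (std::size_t i = 0; i < 256; ++i)
      assert(((i - (i % gs)) / gs) == i / gs);  // identical for all inputs
}
```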
**61-63: Avoid passing primitive types by const reference**

In the `group_idx_modulo` function, the parameters `gs` and `i` are passed as `const&`. Since `std::size_t` is a primitive type, it's more efficient to pass them by value:

```cpp
std::size_t group_idx_modulo(std::size_t gs, std::size_t i) { return i / gs; }
```

This change avoids unnecessary indirection and aligns with best practices.
**361-365: Avoid passing std::size_t by const reference**

In the `host_group_mutex` method, `group_size` is passed as `std::size_t const&`. Since `std::size_t` is a primitive type, it's more efficient to pass it by value:

```cpp
This& host_group_mutex(std::size_t group_size, Fn&& fn) {
  // ...
}
```

This change eliminates unnecessary indirection and follows C++ best practices for passing primitive types.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
- inc/mkn/gpu/multi_launch.hpp (8 hunks)
- inc/mkn/gpu/rocm.hpp (2 hunks)
- test/any/async_streaming.cpp (1 hunks)
🧰 Additional context used
🔇 Additional comments (4)
inc/mkn/gpu/rocm.hpp (2)
**48-51: LGTM: Improved macro formatting**

The reformatting of the `MKN_GPU_ASSERT` macro enhances code readability without altering its functionality. This change aligns with best practices for macro definitions.
**Line range hint 1-458: Overall assessment: Minor improvements to code quality**

The changes in this file, while minor, contribute positively to the overall code quality and readability. They align well with the broader objectives of the pull request to enhance the GPU multi-launch framework. No functional changes or potential issues were identified in this file.

test/any/async_streaming.cpp (1)
test/any/async_streaming.cpp (1)
196-201
: Refactor 'main' to better identify test failuresAs previously mentioned, returning the sum of test results can make it difficult to identify which specific test failed. Consider using bitwise operations or handling each test result individually to improve failure detection.
inc/mkn/gpu/multi_launch.hpp (1)
**Line range hint 413-437: Resolution of previous concurrency issue with work_i**

The potential race condition with the shared variable `work_i`, as previously noted in past reviews, has been addressed. Access to `work_i` is now properly synchronized using the `work_` mutex in the `get_work` method, ensuring thread safety.

Also applies to: 471-471
```cpp
KUL_DBG_FUNC_ENTER;

std::size_t constexpr group_size = 3;
std::vector<size_t> vals((C + 1) / group_size);  // 2 values;
```
**Adjust the size calculation of 'vals' to handle all group sizes**

The current calculation for the size of the `vals` vector may not correctly account for cases where `(C + 1)` is not perfectly divisible by `group_size`. This could lead to incorrect indexing or out-of-bounds errors. Apply this diff to ensure the size of `vals` correctly accommodates all elements:

```diff
- std::vector<size_t> vals((C + 1) / group_size);  // 2 values;
+ std::vector<size_t> vals((C + group_size - 1) / group_size);
```
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (2)
inc/mkn/gpu/multi_launch.hpp (2)
**53-53: Nitpick: Pass StreamFunctionMode by value without const**

Since `StreamFunctionMode` is likely an `enum` or small type, passing it by value without `const` is more idiomatic in C++. The `const` qualifier is unnecessary when passing by value. Apply this diff to update the constructor signature:

```diff
- StreamFunction(Strat& strat_, StreamFunctionMode const mode_) : strat{strat_}, mode{mode_} {}
+ StreamFunction(Strat& strat_, StreamFunctionMode mode_) : strat{strat_}, mode{mode_} {}
```
**Line range hint 222-266: Issue: Correct the declaration of MOD_GROUP_ERROR**

The declaration of `MOD_GROUP_ERROR` uses an unconventional order of specifiers. Although any order is legal in C++, by convention the `static` and `constexpr` specifiers precede the type. Apply this diff to fix the declaration:

```diff
- std::string_view constexpr static MOD_GROUP_ERROR =
+ static constexpr std::string_view MOD_GROUP_ERROR =
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- inc/mkn/gpu/def.hpp (1 hunks)
- inc/mkn/gpu/multi_launch.hpp (8 hunks)
🧰 Additional context used
🔇 Additional comments (3)
inc/mkn/gpu/def.hpp (1)
**26-26: Verify the impact of reduced wait times on CPU usage and synchronization.**

The significant reduction in wait times (`WAIT_MS_ADD_` from 10→1ms, `WAIT_MS_MAX_` from 100→25ms) could affect the thread synchronization behavior:

- Pros: Potentially faster response times for mutex operations
- Cons: May increase CPU usage due to more frequent polling

Let's check where these constants are used.

Also applies to: 30-30

✅ Verification successful

**The reduced wait times are appropriate for the implementation**

The changes to wait times (`WAIT_MS_ADD_`: 10→1ms, `WAIT_MS_MAX_`: 100→25ms) are justified based on the codebase analysis:

1. The `ThreadedStreamLauncher` is used primarily in test scenarios with short-duration operations (200ms sleeps)
2. The implementation includes synchronization primitives like `barrier()`, `group_barrier()`, and `host_group_mutex()` which handle thread coordination
3. The reduced polling intervals align well with the test cases' timing requirements, where end-to-end execution is expected to complete within 1 second (the test fails if `end - start > 1e3`)

The changes optimize the polling frequency for faster response times without risking synchronization issues, as proper mutex and barrier mechanisms are in place to ensure thread safety.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Search for usage of wait time constants
echo "Searching for wait time constant usage..."
rg -A 3 "_MKN_GPU_THREADED_STREAM_LAUNCHER_WAIT"
```

Length of output: 2362

```shell
#!/bin/bash
# Search for all usages of ThreadedStreamLauncher
echo "Searching for all usages of ThreadedStreamLauncher..."
rg "ThreadedStreamLauncher"
```

Length of output: 793

```shell
#!/bin/bash
# Examine the implementation details of ThreadedStreamLauncher
echo "Examining ThreadedStreamLauncher implementation..."
rg -A 20 "struct ThreadedStreamLauncher"

# Also check the test cases to understand usage patterns
echo -e "\nExamining test cases..."
rg -A 10 "ThreadedStreamLauncher\{" test/
```

Length of output: 4883
inc/mkn/gpu/multi_launch.hpp (2)
**Line range hint 413-440: Approved: Proper synchronization of work_i variable**

The previous concern about the shared `work_i` variable has been addressed. Access to `work_i` within `get_work` is now properly synchronized using the `work_` mutex, ensuring thread safety.
**281-299: ⚠️ Potential issue: Potential data race on strat.status[i]**

In `StreamHostGroupMutexFunction::run`, `strat.status[i]` is modified without synchronization. Since `strat.status[i]` may be accessed concurrently by multiple threads, this could lead to data races and undefined behavior. Consider making `status` a vector of atomic values to ensure thread-safe access:

```diff
- std::vector<SFS> status;
+ std::vector<std::atomic<SFS>> status;
```

Ensure all reads and writes to `status[i]` use atomic operations.

Likely invalid or redundant comment.
Caution: Inline review comments failed to post. This is likely due to GitHub's limits when posting large numbers of comments.
Actionable comments posted: 7
🧹 Outside diff range and nitpick comments (2)
inc/mkn/gpu/cuda.hpp (1)
**115-141: Consider enhancing RAII guarantees.**

The StreamEvent class could benefit from stronger RAII principles to ensure proper cleanup of CUDA resources and callback state, especially in error scenarios. Consider the following (a sketch of the first item appears after this list):

- Adding a destructor to ensure cleanup of any pending callbacks
- Using std::unique_ptr for managing the function object
- Adding error handling for callback registration failures
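A minimal sketch of the destructor idea; the member names and the use of `cudaStreamSynchronize` here are assumptions, not the PR's code:

```cpp
#include <atomic>
#include <cuda_runtime.h>

// Sketch only: the real StreamEvent's members differ; this illustrates a
// destructor that keeps a pending callback from outliving the object.
struct StreamEventSketch {
  cudaStream_t stream{};
  std::atomic<bool> fin{false};

  ~StreamEventSketch() {
    // Assumption: draining the stream guarantees any registered callback
    // has finished before the members it references are destroyed.
    if (!fin.load(std::memory_order_acquire)) cudaStreamSynchronize(stream);
  }
};
```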
inc/mkn/gpu/multi_launch.hpp (1)
**Line range hint 239-282: Enhance error message and const correctness.**

The implementation looks good, but consider these improvements:

- Make the error message more descriptive
- Add const qualifier to member functions that don't modify state

```diff
- std::string_view constexpr static MOD_GROUP_ERROR =
-     "mkn.gpu error: StreamGroupBarrierFunction Group size must be a divisor of datas";
+ std::string_view constexpr static MOD_GROUP_ERROR =
+     "mkn.gpu error: StreamGroupBarrierFunction group size (%) must be a divisor of data size (%)";

- void arrive() { [[maybe_unused]] auto ret = sync_point.arrive(); }
+ void arrive() const { [[maybe_unused]] auto ret = sync_point.arrive(); }
```
🛑 Comments failed to post (7)
inc/mkn/gpu/cpu.hpp (1)
**109-110: ⚠️ Potential issue: Fix the operator() implementation to handle the callback function.**

The operator() accepts a callback function but doesn't execute it, which could lead to silent failures in synchronization code. Either implement the callback execution or document why it's intentionally ignored.

```diff
- auto& operator()(std::function<void()> fn = {}) { return *this; }
+ auto& operator()(std::function<void()> fn = {}) {
+   if (fn) fn();
+   return *this;
+ }
```
inc/mkn/gpu/cuda.hpp (2)
**136-136: ⚠️ Potential issue: Add memory ordering to finished() check.**

The finished() method reads the completion flag without proper memory ordering guarantees. Add proper memory ordering for the flag check:

```diff
- bool finished() const { return fin; }
+ bool finished() const { return fin.load(std::memory_order_acquire); }
```

Committable suggestion was skipped due to low confidence.
**129-133: ⚠️ Potential issue: Ensure thread-safe callback execution.**

The callback function modifies shared state without proper synchronization. The order of operations (executing the function, clearing it, setting the flag) should be carefully controlled. Consider this safer implementation, which moves the function out before invoking it:

```diff
 static void Callback(cudaStream_t /*stream*/, cudaError_t /*status*/, void* ptr) {
   auto& self = *reinterpret_cast<StreamEvent*>(ptr);
+  std::function<void()> fn = std::move(self._fn);
+  self._fn = {};
+  fn();
-  self._fn();
-  self._fn = [] {};
-  self.fin = 1;
+  self.fin.store(true, std::memory_order_release);
 }
```

Committable suggestion was skipped due to low confidence.
inc/mkn/gpu/multi_launch.hpp (4)
**330-333: ⚠️ Potential issue: Race condition in status update.**

Similar to StreamHostGroupMutexFunction, the status update is not thread-safe.

```diff
 void run(std::uint32_t const i) override {
   if (i % Super::group_size == gid) fn(i);
-  strat.status[i] = SFS::WAIT;  // done
+  std::atomic_ref<SFS> status_ref(strat.status[i]);
+  status_ref.store(SFS::WAIT, std::memory_order_release);
 }
```
**481-493: 🛠️ Refactor suggestion: Enhance print_times formatting and efficiency.**

The print_times method could be improved for better readability and performance:

- Use a more structured output format
- Pre-calculate the indices

```diff
 void print_times() const {
-  std::size_t fn_idx = 0, data_idx = 0;
-
-  for (auto const& t : super().times) {
-    KOUT(NON) << data_idx << " " << fn_idx << " " << (t.time() / 1e6);
-
-    ++fn_idx;
-    if (fn_idx == fns.size()) {
-      ++data_idx;
-      fn_idx = 0;
-    }
+  KOUT(NON) << "Data_Index Function_Index Time_ms";
+  for (std::size_t i = 0; i < super().times.size(); ++i) {
+    auto const data_idx = i / fns.size();
+    auto const fn_idx = i % fns.size();
+    KOUT(NON) << std::setw(9) << data_idx
+              << std::setw(13) << fn_idx
+              << std::setw(8) << std::fixed << std::setprecision(3)
+              << (super().times[i].time() / 1e6);
   }
 }
```
**304-311: ⚠️ Potential issue: Race condition in status updates.**

The status updates are not protected by the mutex, which could lead to data races. Consider these improvements:

- Move the status update inside the mutex lock
- Use atomic operations for status updates

```diff
 void run(std::uint32_t const i) override {
   std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
   if (lock.try_lock()) {
     fn(i);
-    strat.status[i] = SFS::WAIT;  // done
+    {
+      std::atomic_ref<SFS> status_ref(strat.status[i]);
+      status_ref.store(SFS::WAIT, std::memory_order_release);
+    }
   } else {
-    strat.status[i] = SFS::FIRST;  // retry
+    std::atomic_ref<SFS> status_ref(strat.status[i]);
+    status_ref.store(SFS::FIRST, std::memory_order_release);
   }
 }
```
**66-76: 🛠️ Refactor suggestion: Consider enhancing Timer struct with thread safety and better error messages.**

The Timer struct could benefit from the following improvements:

- Thread safety for concurrent timing operations
- More descriptive assertion messages

```diff
 struct Timer {
   auto time() const {
-    assert(s > 0);
-    assert(e > 0);
+    assert(s > 0 && "Timer not started");
+    assert(e > 0 && "Timer not stopped");
+    assert(e >= s && "Timer stopped before start");
     return e - s;
   }
-  void start() { s = kul::Now::NANOS(); }
-  void stop() { e = kul::Now::NANOS(); }
+  void start() {
+    std::lock_guard<std::mutex> lock(mutex);
+    s = kul::Now::NANOS();
+  }
+  void stop() {
+    std::lock_guard<std::mutex> lock(mutex);
+    e = kul::Now::NANOS();
+  }
   std::size_t s = 0, e = 0;
+ private:
+  std::mutex mutex;
 };
```
Actionable comments posted: 6
🧹 Outside diff range and nitpick comments (2)
test/any/async_streaming.cpp (1)
**137-138: Consider parameterizing the sleep duration.**

The hardcoded sleep duration of 200ms could be made configurable to allow testing different timing scenarios.

```diff
+constexpr auto DEFAULT_SLEEP_DURATION = 200ms;
+
 std::uint32_t test_threaded_host_group_mutex(std::size_t const& nthreads = 2) {
   // ...
-  std::this_thread::sleep_for(200ms);
+  std::this_thread::sleep_for(DEFAULT_SLEEP_DURATION);
```

inc/mkn/gpu/multi_launch.hpp (1)
**Line range hint 239-282: Consider refactoring GroupBarrier to improve encapsulation**

The GroupBarrier inner struct could be moved to a separate implementation namespace and its interface could be simplified. Consider these improvements:

```diff
- struct GroupBarrier {
+ class GroupBarrier {
+   This* self;
+   std::uint16_t group_id;
+   std::barrier<std::function<void()>> sync_point;
+
+  public:
    GroupBarrier(This& slf, std::uint16_t const gid) : self{&slf}, group_id{gid} {
      sync_point = std::barrier(static_cast<std::int64_t>(self->group_size), [this]() {
        std::size_t const offset = self->group_size * group_id;
        for (std::size_t i = offset; i < offset + self->group_size; ++i)
          self->strat.status[i] = SFS::WAIT;
      });
    }
    GroupBarrier(GroupBarrier const&) = delete;
    GroupBarrier& operator=(GroupBarrier const&) = delete;
    void arrive() { [[maybe_unused]] auto ret = sync_point.arrive(); }
  };
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- inc/mkn/gpu/cpu.hpp (1 hunks)
- inc/mkn/gpu/cuda.hpp (2 hunks)
- inc/mkn/gpu/multi_launch.hpp (15 hunks)
- inc/mkn/gpu/rocm.hpp (3 hunks)
- test/any/async_streaming.cpp (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
- inc/mkn/gpu/cpu.hpp
- inc/mkn/gpu/rocm.hpp
🧰 Additional context used
🔇 Additional comments (5)
test/any/async_streaming.cpp (2)
**66-67: LGTM: Improved readability and debugging.**

The changes to store the ThreadedStreamLauncher instance in a variable and add timing information output are good improvements for debugging and code clarity.

Also applies to: 82-83
**134-142: Verify thread safety of host_group_mutex implementation.**

While the test looks correct, we should verify that the mutex implementation in ThreadedStreamLauncher properly handles concurrent access across thread groups.

✅ Verification successful

Let me gather more information about the mutex implementation by checking the `StreamHostGroupMutexFunction` class. Based on the gathered information, I can now provide the final response:

**Thread safety implementation in host_group_mutex is properly handled**

The implementation shows correct thread-safety mechanisms:

- Each group has its own dedicated mutex in `std::vector<std::mutex> mutices`
- Uses `std::unique_lock` with `try_lock()` for non-blocking mutex acquisition
- Proper RAII-style mutex management through `std::unique_lock`
- Group index calculation ensures mutexes are correctly mapped to thread groups
- Status tracking prevents race conditions during thread synchronization

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Description: Check the mutex implementation in ThreadedStreamLauncher

# Test: Look for mutex usage in the launcher implementation
rg -A 5 "host_group_mutex.*\{" inc/mkn/gpu/multi_launch.hpp

# Test: Check for potential race conditions in mutex handling
ast-grep --pattern 'class ThreadedStreamLauncher { $$$ host_group_mutex($$$) { $$$ } $$$ }'
```

Length of output: 387

```shell
#!/bin/bash
# Check StreamHostGroupMutexFunction implementation
rg -A 20 "class StreamHostGroupMutexFunction" inc/mkn/gpu/multi_launch.hpp

# Look for mutex member variables and their usage
rg -A 5 "std::mutex" inc/mkn/gpu/multi_launch.hpp

# Check if there are any other mutex-related implementations
rg -A 5 "mutex" inc/mkn/gpu/multi_launch.hpp
```

Length of output: 2386
inc/mkn/gpu/cuda.hpp (2)
**49-52: LGTM! Well-formatted CUDA assertion macro.**

The macro is properly scoped with curly braces and follows standard CUDA error checking patterns.

**115-141: LGTM! Improved StreamEvent design with callback-based completion tracking.**

The refactoring from event-based to callback-based completion tracking simplifies the code and reduces CUDA API calls. The use of std::function provides flexibility in handling completions.
inc/mkn/gpu/multi_launch.hpp (1)

**Line range hint 427-455: Verify thread safety of work distribution**

The work distribution logic has shared state access that needs careful review. Let's verify the thread safety. The work distribution appears thread-safe due to the mutex protection, but consider these improvements:

- Use atomic operations for the work counter
- Consider using a work-stealing queue for better load balancing

✅ Verification successful

Let me analyze the code access patterns to verify thread safety, checking the mutex protection scope and the shared state modifications. The initial results show mutex usage, but the complete protection scope, and any access to shared state outside the mutex lock, still need verifying. After several further searches into the synchronization mechanism, shared state access patterns, and thread management, the gathered evidence supports a final assessment of the thread safety concerns.

**Thread safety of work distribution is properly implemented**

The implementation shows appropriate thread-safety mechanisms:

- Mutex protection (`work_`) for the work distribution logic
- Proper scoping of locked sections using `std::unique_lock` with `try_lock()`
- Thread-safe state transitions through protected access to shared state variables
- Clean thread management with proper join semantics

The current implementation uses a non-blocking approach with `try_lock()`, which is appropriate for this work-stealing pattern, allowing threads to skip and retry when work is not available rather than blocking. The shared state variables (`work_i`, `status`, `step`) are properly protected within the mutex scope. The suggested improvements about atomic operations and work-stealing queues would be premature optimizations that might actually complicate the current clean and correct implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

```shell
#!/bin/bash
# Search for all accesses to shared state variables
rg -A 2 "work_i|status\[|step\["

# Search for potential race conditions in mutex usage
ast-grep --pattern 'lock.try_lock()'
```

Length of output: 3644

```shell
#!/bin/bash
# Check the class definition and member variables
rg -B 5 -A 2 "class Strategy" inc/mkn/gpu/multi_launch.hpp

# Check mutex declarations and usage
rg -B 2 -A 2 "std::mutex" inc/mkn/gpu/multi_launch.hpp

# Look for other methods that might access shared state
ast-grep --pattern 'void operator()(std::uint32_t const& i) {'
```

Length of output: 909

```shell
#!/bin/bash
# Check the complete class structure and member variables
rg -B 10 -A 10 "class MultiLaunchStrategy" inc/mkn/gpu/multi_launch.hpp

# Look for any other concurrent access patterns
rg -B 2 -A 2 "std::atomic|std::condition_variable" inc/mkn/gpu/multi_launch.hpp

# Check for any other methods that might modify shared state
ast-grep --pattern 'void join() {'
```

Length of output: 192

```shell
#!/bin/bash
# Check the class definition with a broader context
rg -B 20 -A 20 "struct MultiLaunchStrategy" inc/mkn/gpu/multi_launch.hpp

# Look for join implementation with different pattern
ast-grep --pattern 'join() { $$$ }'

# Check thread creation and management
rg -B 3 -A 3 "threads\." inc/mkn/gpu/multi_launch.hpp
```

Length of output: 470
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (2)
test/any/construct.cpp (2)
**32-32: Enhance error message with size details.**

Consider making the error message more descriptive by including the expected and actual sizes.

```diff
- mkn::kul::abort_if_not(mem.size() == 5 && "wrong size");
+ mkn::kul::abort_if_not(mem.size() == 5 && "Expected size: 5, Actual size: " + std::to_string(mem.size()));
```
**34-34: Enhance error message with size details.**

Similar to the previous suggestion, improve the error message clarity.

```diff
- mkn::kul::abort_if_not(mem.size() == 10 && "wrong size");
+ mkn::kul::abort_if_not(mem.size() == 10 && "Expected size after resize: 10, Actual size: " + std::to_string(mem.size()));
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- inc/mkn/gpu/cpu.hpp (1 hunks)
- test/any/construct.cpp (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- inc/mkn/gpu/cpu.hpp
🔇 Additional comments (1)
test/any/construct.cpp (1)
**Line range hint 19-53: Consider adding more test coverage.**

While the current tests cover basic functionality, consider adding:

- Edge cases (e.g., resize to 0, maximum size)
- Memory deallocation verification
- Move construction/assignment scenarios
✅ Verification successful
Let me try a different approach to verify the test coverage. This should give us a better picture of the existing test coverage and help verify whether the suggested additions are indeed needed or already covered in other test files. Based on the search results, I can now generate a final response:

**The review comment's suggestions for additional test coverage are valid and not currently implemented**

The search results confirm that:

- There are no tests for resizing to 0 or testing maximum size limits
- No explicit memory deallocation verification exists (only construction tracking via `alloced`)
- Move semantics are not covered in the test suite

The current tests only verify:

- Basic resize with construction (`test_does_construct_on_resize`)
- Basic resize without construction (`test_does_not_construct_on_resize`)
- Simple capacity and size checks
- Basic element construction tracking

The original review comment's suggestions would indeed improve the test coverage by addressing important edge cases and memory management scenarios that are currently not tested.
🏁 Scripts executed
The following scripts were executed for the analysis:
```shell
#!/bin/bash
# Check for existing test coverage of suggested scenarios
rg -l "resize.*0|max.*size|move.*construct|deallocat" test/
```

Length of output: 61

```shell
#!/bin/bash
# Search for test files related to vector/memory management
fd -e cpp -e h "test.*vector|test.*memory" test/

# Look for existing resize tests with more context
rg -A 5 "resize" test/

# Look for move semantics tests
rg -A 5 "move|std::move" test/

# Look for memory management related tests
rg -A 5 "alloced|deallocated|construct" test/
```

Length of output: 3496
```cpp
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad construct!");
```
**🛠️ Refactor suggestion: Improve error messages and consider combining related checks.**

The capacity and construction checks are related to the same operation (emplace_back). Consider:

- Enhancing error messages with actual values
- Combining related checks for better readability

```diff
- mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
- mkn::kul::abort_if_not(mem[10].s == 1 && "bad construct!");
+ const auto new_cap = mem.capacity();
+ mkn::kul::abort_if_not(new_cap != cap &&
+                        "Capacity should have changed after emplace_back. Old: " + std::to_string(cap) +
+                        ", New: " + std::to_string(new_cap));
+ mkn::kul::abort_if_not(mem[10].s == 1 &&
+                        "Newly constructed element at index 10 should have s=1, found: " +
+                        std::to_string(mem[10].s));
```
```cpp
mkn::kul::abort_if_not(mem.size() == 25 && "wrong size");
mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
mkn::kul::abort_if_not(mem[10].s == 1 && "bad copy!");
```
**🛠️ Refactor suggestion: Improve error messages and group related checks.**

These three checks validate the second resize operation. Consider improving error messages and grouping the checks logically.

```diff
- mkn::kul::abort_if_not(mem.size() == 25 && "wrong size");
- mkn::kul::abort_if_not(mem.capacity() != cap && "capacity bad!");
- mkn::kul::abort_if_not(mem[10].s == 1 && "bad copy!");
+ const auto new_size = mem.size();
+ const auto new_cap = mem.capacity();
+ mkn::kul::abort_if_not(new_size == 25 &&
+                        "After second resize, expected size: 25, got: " + std::to_string(new_size));
+ mkn::kul::abort_if_not(new_cap != cap &&
+                        "Capacity should change after resize. Old: " + std::to_string(cap) +
+                        ", New: " + std::to_string(new_cap));
+ mkn::kul::abort_if_not(mem[10].s == 1 &&
+                        "Value at index 10 should be preserved after resize. Expected: 1, got: " +
+                        std::to_string(mem[10].s));
```
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/alloc.hpp (1)
**107-108: Consider adding type compatibility checks.**

The generalization to support different source and destination types is good, but consider adding compile-time type compatibility checks to prevent undefined behavior. Consider adding SFINAE or concepts (if using C++20) to ensure type compatibility:

```cpp
template <typename T0, typename T1, typename Size>
std::enable_if_t<std::is_convertible_v<T1, T0>> copy(T0* dst, T1* src, Size const size)
```
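Fleshed out into a compilable sketch; the loop body here is a placeholder, not the library's actual copy implementation:

```cpp
#include <type_traits>

// Sketch: reject incompatible element types at compile time.
template <typename T0, typename T1, typename Size>
std::enable_if_t<std::is_convertible_v<T1, T0>> copy(T0* dst, T1* src, Size const size) {
  for (Size i = 0; i < size; ++i) dst[i] = src[i];  // placeholder element-wise copy
}
```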
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (2)
- inc/mkn/gpu/alloc.hpp (1 hunks)
- inc/mkn/gpu/multi_launch.hpp (15 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (2)
inc/mkn/gpu/alloc.hpp (1)
**107-108: Verify existing copy operations remain compatible.**

The signature change from single type T to T0/T1 could affect existing code that relies on type deduction. Let's verify the usage:

✅ Verification successful

Let me analyze the usage patterns found to verify whether the signature change impacts compatibility. Based on the complete context, I can now generate the final response:

**The template type change is safe and maintains compatibility**

The change from single type `T` to `T0`/`T1` actually improves type safety while maintaining compatibility because:

- All existing usage in the codebase shows matching source and destination types
- The implementation handles pointer type checking through the `Pointer` wrapper class
- The copy function is only used internally within the allocator implementation for vector operations where types always match
🏁 Scripts executed
The following scripts were executed for the analysis:
```shell
#!/bin/bash
# Search for copy function calls to ensure they remain compatible
rg -l "copy\s*\([^,]+,[^,]+,[^)]+\)" | while read -r file; do
  echo "=== $file ==="
  rg -A 2 "copy\s*\([^,]+,[^,]+,[^)]+\)" "$file"
done
```

Length of output: 772

```shell
#!/bin/bash
# Check the full context of the copy function definition and its usage
rg -B 5 -A 10 "template.*copy.*dst.*src.*size" inc/mkn/gpu/alloc.hpp
```

Length of output: 70

```shell
#!/bin/bash
# Try a different approach to get the context
cat inc/mkn/gpu/alloc.hpp
```

Length of output: 5298
inc/mkn/gpu/multi_launch.hpp (1)
**Line range hint 417-454: Ensure proper locking in get_work() method**

The `get_work()` method uses `try_lock()` without a mechanism to guarantee progress if the lock isn't acquired, which could lead to threads spinning without performing work. A script was run to check for potential issues with locking and thread progress.
```cpp
auto time() const {
  assert(s > 0);
  assert(e > 0);
  return e - s;
}
```
**Handle uninitialized state in Timer::time()**

The `time()` method uses `assert(s > 0)` and `assert(e > 0)` to ensure the timer has been started and stopped. However, in production builds where assertions might be disabled, this could lead to incorrect results if `s` or `e` are zero. Consider adding runtime checks to handle uninitialized or incorrect states gracefully:

```diff
 auto time() const {
-  assert(s > 0);
-  assert(e > 0);
+  if (s == 0 || e == 0) {
+    throw std::runtime_error("Timer was not properly started or stopped.");
+  }
   return e - s;
 }
```
Actionable comments posted: 2
🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/multi_launch.hpp (1)
**293-325: Consider using a named constant for retry status.**

The status values could be more maintainable with named constants.

```diff
+ static constexpr SFS RETRY_STATUS = SFS::FIRST;
+ static constexpr SFS DONE_STATUS = SFS::WAIT;
 void run(std::uint32_t const i) override {
   std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
   if (lock.try_lock()) {
     fn(i);
-    strat.status[i] = SFS::WAIT;  // done
+    strat.status[i] = DONE_STATUS;
   } else {
-    strat.status[i] = SFS::FIRST;  // retry
+    strat.status[i] = RETRY_STATUS;
   }
 }
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
- inc/mkn/gpu/multi_launch.hpp (15 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)
**41-54: LGTM: Well-designed helper function for pointer dereferencing.**

The `deref` helper function in the detail namespace provides a clean way to handle both pointer and non-pointer types using template metaprogramming.
Line range hint 248-291
: LGTM: Well-implemented group barrier synchronization.
The implementation correctly uses std::barrier for group synchronization with proper error handling for group size validation.
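A self-contained sketch of the std::barrier rendezvous this relies on (C++20); it illustrates the synchronization pattern, not the reviewed class itself:

#include <barrier>
#include <cstddef>
#include <cstdio>
#include <thread>
#include <vector>

int main() {
  constexpr std::size_t group_size = 3;
  std::barrier sync_point(group_size);  // released once all members arrive

  std::vector<std::jthread> group;
  for (std::size_t i = 0; i < group_size; ++i)
    group.emplace_back([&, i] {
      std::printf("stream %zu: phase one\n", i);
      sync_point.arrive_and_wait();  // no member proceeds until all arrive
      std::printf("stream %zu: phase two\n", i);
    });
}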
348-376
: LGTM: Well-implemented device-side group operations.
The implementation correctly handles device launches with proper size checks and condition handling through the template parameter.
Line range hint 467-495
: LGTM: Thread-safe work distribution implementation.
The work distribution mechanism is properly protected by mutex with clear state transitions.
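A minimal sketch of that pattern: the counter is only touched while the mutex is held, so each worker receives a distinct index (member names follow the review notes; the surrounding class is assumed):

#include <cstddef>
#include <cstdint>
#include <mutex>

struct WorkCounter {  // assumed shape, showing only the locking discipline
  std::int64_t get_work() {
    std::scoped_lock lock(work_);     // work_i is only accessed under work_
    if (work_i == n_work) return -1;  // no work left
    return static_cast<std::int64_t>(work_i++);
  }
  std::mutex work_;
  std::size_t work_i = 0;
  std::size_t n_work = 0;
};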
76-86
:
Add proper error handling in Timer::time().
The current implementation uses assertions which are disabled in release builds. Consider adding runtime checks for uninitialized state.
- auto time() const {
- assert(s > 0);
- assert(e > 0);
- return e - s;
- }
+ auto time() const {
+ if (s == 0 || e == 0) {
+ throw std::runtime_error("Timer not properly started/stopped");
+ }
+ return e - s;
+ }
Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
inc/mkn/gpu/launchers.hpp (1)
66-95
: Consider refactoring to reduce code duplication. There's significant code duplication between DLauncher and GDLauncher. Consider:
- Extracting common code to a shared base class
- Using template specialization instead of separate classes
- Using CRTP (Curiously Recurring Template Pattern) if static polymorphism is needed
This would improve maintainability and make the relationship between the launchers more explicit.
inc/mkn/gpu/multi_launch.hpp (3)
72-74
: Simplify group index calculation. The expression ((i - (i % gs)) / gs) is mathematically equivalent to i / gs when using integer division.

- return ((i - (i % gs)) / gs);
+ return i / gs;
298-299
: Consolidate duplicate error messages. Multiple group-related structs use the same error message pattern. Consider moving it to a common location.

namespace detail {
constexpr std::string_view GROUP_SIZE_ERROR =
    "mkn.gpu error: Group size must be a divisor of datas";
}
417-430
: Pass integral types by value. For small integral types like std::size_t, passing by value is more efficient than passing by const reference.

- This& host_group_mutex(std::size_t const& group_size, Fn&& fn) {
+ This& host_group_mutex(std::size_t group_size, Fn&& fn) {

- This& host_group_idx(std::size_t const& group_size, std::size_t const& group_idx, Fn&& fn) {
+ This& host_group_idx(std::size_t group_size, std::size_t group_idx, Fn&& fn) {
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (3)
inc/mkn/gpu/launchers.hpp (1 hunks)
inc/mkn/gpu/multi_launch.hpp (15 hunks)
inc/mkn/gpu/rocm.hpp (4 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- inc/mkn/gpu/rocm.hpp
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-10-26T12:34:33.791Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-10-26T12:39:00.871Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (10)
inc/mkn/gpu/launchers.hpp (3)
91-94
: 🛠️ Refactor suggestion
Document the difference between global_d_kernel and global_gd_kernel.
The _launch method uses global_d_kernel while GDLauncher uses global_gd_kernel. This appears to be the key distinction between these launchers, but the difference is not documented.
Please add documentation explaining:
- The functional difference between these kernel types
- When to use DLauncher vs GDLauncher
- Any performance implications
Let's check the kernel implementations:
#!/bin/bash
# Search for kernel definitions to understand their differences
rg -A 10 "global_d_kernel|global_gd_kernel"
66-68
: Review constructor initialization and add documentation.
The constructor initialization with empty parameters {} seems suspicious. Additionally, the class lacks documentation explaining:
- The purpose and use cases for this launcher
- The significance of template parameters _sync and _coop
- Why _sync defaults to false here while GDLauncher defaults to true
Let's check the base class constructor requirements:
70-75
: Verify omission of count parameter in _launch call.
The operator() implementation differs from GDLauncher by omitting the count parameter in the _launch call. Please verify if this is intentional.
Let's check for other usages of this pattern:
✅ Verification successful
Omission of count parameter is intentional in DLauncher
The difference in implementation between GDLauncher and DLauncher is intentional. From the code analysis:
- GDLauncher uses global_gd_kernel, which requires a count parameter for grid-based execution
- DLauncher uses global_d_kernel, which doesn't use a count parameter, as it's meant for direct kernel execution
The implementation aligns with their respective kernel function signatures and purposes:
- global_gd_kernel: (F f, std::size_t s, Args... args)
- global_d_kernel: (F f, Args... args)
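To make the distinction concrete, a hedged sketch of the two kernel shapes implied by those signatures; the index expression is illustrative and the library may compute it differently:

// Illustrative CUDA-style sketch, not the actual mkn.gpu source.
template <typename F, typename... Args>
__global__ static void global_gd_kernel(F f, std::size_t s, Args... args) {
  std::size_t const i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < s) f(args...);  // bounds-checked: tolerates over-provisioned grids
}

template <typename F, typename... Args>
__global__ static void global_d_kernel(F f, Args... args) {
  f(args...);  // unchecked: the caller must size grid and block exactly
}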
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for similar launch patterns to understand the convention
ast-grep --pattern 'MKN_GPU_NS::launch<$_>(&global_d_kernel<$_>, $_)'
Length of output: 2881
Script:
#!/bin/bash
# Let's examine the launcher implementations more closely
rg -A 10 '_launch\(' inc/mkn/gpu/launchers.hpp
Length of output: 1899
inc/mkn/gpu/multi_launch.hpp (7)
48-54
: LGTM: Well-designed utility function for pointer dereferencing.
The deref template function provides a safe way to handle both pointer and non-pointer types, using SFINAE to select the appropriate behavior.
76-86
: LGTM: Timer implementation is appropriate for asynchronous operations.
Based on the learnings, an RAII ScopedTimer would not be appropriate here since we're timing asynchronous operations. The current implementation correctly allows manual control of timing points.
64-64
: LGTM: Appropriate const-correctness.
The constructor now correctly takes StreamFunctionMode as a const reference, maintaining consistency with the class's const-correctness design.
: LGTM: Thread-safe work distribution implementation.
The get_work
method correctly handles the work_i
counter within a locked section, ensuring thread safety. The implementation properly manages state transitions and work distribution across threads.
521-539
: LGTM: Well-designed timing output functionality.
The timing output implementation provides flexible formatting options and supports both console and file output. The template parameter nl cleverly controls newline behavior.
301-305
:
Add zero group size check in make_mutices.
The function should validate that group_size is not zero before performing division.
static auto make_mutices(Strat const& strat, std::size_t const& group_size) {
+ if (group_size == 0) {
+ throw std::invalid_argument("Group size cannot be zero");
+ }
if (strat.datas.size() % group_size > 0) {
throw std::runtime_error(std::string{MOD_GROUP_ERROR});
}
std::uint16_t const groups = strat.datas.size() / group_size;
return std::vector<std::mutex>{groups};
}
Likely invalid or redundant comment.
88-97
:
Fix potential initialization issue in StreamGroupFunction.
The group_size member is initialized to 0 in the declaration but should use the constructor parameter instead.
- std::size_t const group_size = 0;
+ std::size_t const group_size;
Likely invalid or redundant comment.
d88605d to fc70fd2 Compare
Actionable comments posted: 3
🧹 Outside diff range and nitpick comments (8)
inc/mkn/gpu/defines.hpp (1)
18-24
: LGTM! Consider adding documentation. The logic for handling multiple GPU backends is sound. Setting MKN_GPU_FN_PER_NS to 1 when both CUDA and ROCm are enabled helps prevent naming conflicts. Consider adding a comment explaining the purpose of MKN_GPU_FN_PER_NS:

+// Enable per-namespace functions when multiple GPU backends are present to avoid naming conflicts
 #if MKN_GPU_CUDA == 1 && MKN_GPU_ROCM == 1 && !defined(MKN_GPU_FN_PER_NS)
 #define MKN_GPU_FN_PER_NS 1
 #endif

inc/mkn/gpu/launchers.hpp (2)
55-55
: LGTM! Consider adding documentation for sync behavior. The changes to GDLauncher look good. The implementation correctly uses the new helper functions and maintains proper parameter passing. Consider documenting the _sync template parameter's behavior:

/// @brief GPU launcher with grid-dimension based execution
/// @tparam _sync Controls whether kernel launches are synchronous (true) or asynchronous (false)
template <bool _sync = true>
struct GDLauncher : public GLauncher {

Also applies to: 60-60, 64-68
89-90
: Remove empty comment or complete the implementation. The empty comment at line 90 suggests incomplete implementation or missing functionality.
Either:
- Remove the empty comment
- Add the missing implementation
- Add a TODO comment explaining what needs to be added
inc/mkn/gpu/cpu.hpp (1)
259-259
: Fix typo in GPU message. The message contains a typo: "Psuedo" should be "Pseudo".

- void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Psuedo GPU in use"; }
+ void inline prinfo(std::size_t /*dev*/ = 0) { KOUT(NON) << "Pseudo GPU in use"; }
109-115
: Add documentation for the callback-based StreamEvent usageThe StreamEvent struct has been refactored to use callbacks instead of CUDA events. Please add documentation explaining:
- The purpose and lifecycle of StreamEvent
- How to properly use the callback mechanism
- Thread-safety guarantees
inc/mkn/gpu/multi_launch.hpp (3)
299-304
: Consider caching the group count calculationThe groups count calculation could be stored as a member variable to avoid recalculation.
static auto make_mutices(Strat const& strat, std::size_t const& group_size) { if (group_size == 0 || strat.datas.size() % group_size > 0) throw std::runtime_error(std::string{MOD_GROUP_ERROR}); - std::uint16_t const groups = strat.datas.size() / group_size; + auto const groups = static_cast<std::uint16_t>(strat.datas.size() / group_size); return std::vector<std::mutex>{groups}; }
364-373
: Consider extracting common device launch code. The device launch code is duplicated between the if/else branches. Consider extracting it to reduce duplication.

 if constexpr (is) {
-  if (i % Super::group_size == gid and size)
-    mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
-                                             [=, fn = fn] __device__() mutable { fn(i); });
+  if (i % Super::group_size == gid and size) {
+    launch_device(i, size);
+  }
 } else {
-  if (i % Super::group_size != gid and size)
-    mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
-                                             [=, fn = fn] __device__() mutable { fn(i); });
+  if (i % Super::group_size != gid and size) {
+    launch_device(i, size);
+  }
 }

Add this private method:

private:
  void launch_device(std::uint32_t i, std::size_t size) {
    mkn::gpu::GDLauncher<false>{size}.stream(strat.streams[i],
                                             [=, fn = fn] __device__() mutable { fn(i); });
  }
522-535
: Consider using std::format for time output. The time output formatting could benefit from using std::format (C++20) for better type safety and formatting control.

template <bool nl = false, typename SS>
void _print_times(SS&& ss, double const div = 1) const {
  std::size_t fn_idx = 0, data_idx = 0;
  for (auto const& t : super().times) {
-   ss << data_idx << " " << fn_idx << " " << (t.time() / div);
+   ss << std::format("{} {} {:.3f}", data_idx, fn_idx, t.time() / div);
    if constexpr (nl) ss << std::endl;
    ++fn_idx;
    if (fn_idx == fns.size()) {
      ++data_idx;
      fn_idx = 0;
    }
  }
}
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (7)
inc/mkn/gpu/cpu.hpp (5 hunks)
inc/mkn/gpu/cuda.hpp (6 hunks)
inc/mkn/gpu/defines.hpp (1 hunks)
inc/mkn/gpu/launchers.hpp (1 hunks)
inc/mkn/gpu/multi_launch.hpp (15 hunks)
inc/mkn/gpu/rocm.hpp (7 hunks)
test/any/coop.cpp (0 hunks)
💤 Files with no reviewable changes (1)
- test/any/coop.cpp
🚧 Files skipped from review as they are similar to previous changes (1)
- inc/mkn/gpu/rocm.hpp
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (2)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
🔇 Additional comments (12)
inc/mkn/gpu/defines.hpp (2)
30-31
: LGTM!
The CUDA header inclusion is properly guarded with the correct macro check.
34-37
: Improve the fallback message and verify CPU implementation.
- The message using "error" is misleading since this is a valid fallback scenario, not an error condition.
- We should verify that the CPU implementation exists and is complete.
Consider updating the message to be more informative:
-#pragma message("mkn.gpu error: No accelerator found, defaulting to CPU IMP")
+#pragma message("mkn.gpu info: No GPU accelerator found, using CPU implementation")
Let's verify the CPU implementation is complete:
✅ Verification successful
CPU implementation is complete and properly mirrors GPU implementations
The CPU implementation in inc/mkn/gpu/cpu.hpp provides equivalent functionality to both CUDA and ROCm implementations, including:
- Core structures (Stream, StreamEvent, Pointer)
- Memory operations (alloc, destroy, copy)
- Data transfer operations (send, take, async variants)
- Launch mechanisms with dim3 support
- Global indexing and kernel execution patterns
The fallback message suggestion remains valid but is not critical since the implementation is solid.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Check for CPU implementation completeness
# Check if cpu.hpp exists and contains implementations
echo "Checking for CPU implementation file..."
if [ -f "mkn/gpu/cpu.hpp" ]; then
echo "Found cpu.hpp. Checking contents..."
# Look for key GPU operations that should have CPU equivalents
rg -l "class|struct|function" "mkn/gpu/cpu.hpp"
# Compare with GPU operations to ensure all are implemented
echo "Comparing with GPU operations..."
for file in cuda.hpp rocm.hpp; do
if [ -f "mkn/gpu/$file" ]; then
echo "Checking $file for operations that should have CPU equivalents..."
rg "class|struct|function" "mkn/gpu/$file"
fi
done
else
echo "Warning: cpu.hpp not found!"
fi
Length of output: 182
Script:
#!/bin/bash
# Let's check the actual directory structure and look for CPU implementation files
fd -t f -e hpp -e h
# Then search for any CPU implementation patterns across all files
rg -l "MKN_GPU_CPU|cpu"
# Finally, let's check the content of the directory containing defines.hpp
ls -la "$(dirname inc/mkn/gpu/defines.hpp)"
Length of output: 1401
Script:
#!/bin/bash
# Let's examine the CPU implementation content and compare with CUDA/ROCm implementations
# Check for key GPU operations in cpu.hpp
echo "=== CPU Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/cpu.hpp -A 2
echo "=== CUDA Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/cuda.hpp -A 2
echo "=== ROCm Implementation ==="
rg "class|struct|template|typename" inc/mkn/gpu/rocm.hpp -A 2
# Also check if cpu.hpp is being included/used in tests
echo "=== CPU Tests ==="
rg -l "include.*cpu\.hpp" test/
Length of output: 9590
inc/mkn/gpu/launchers.hpp (2)
70-73
: Review the constructor implementation.
The constructor ignores the device parameter and initializes the base class with empty dimensions. This seems suspicious:
- Why is the device parameter ignored? If it's not needed, consider removing it.
- Are empty dimensions valid for the base class initialization?
#!/bin/bash
# Search for Launcher base class implementation
ast-grep --pattern 'struct Launcher {
$$$
}'
# Search for other device parameter usages
rg "size_t.*dev.*=.*0"
34-47
: Verify the nullptr tuple pointer usage and add documentation.
The implementation moves helper functions into a detail namespace, which is good practice. However:
- The _as_values function returns a nullptr tuple pointer, which seems unusual. Please verify if this is intentional.
- Consider adding documentation to explain the purpose and usage of these utility functions.
Add documentation like:
/// Helper functions for converting argument types for GPU kernel launches
namespace detail {
/// Internal implementation for tuple type conversion
/// @return Tuple pointer of converted reference types
template <std::size_t... I, typename... Args>
auto _as_values(/*...*/)
/// Converts arguments to appropriate value types for GPU kernels
/// @return Tuple pointer of converted reference types
template <typename... Args>
auto as_values(Args&... args)
✅ Verification successful
The nullptr tuple pointer usage is correct and intentional
The detail::as_values function is used correctly within the GPU kernel launch implementation. The nullptr tuple pointer serves as a type carrier for template argument deduction in the _launch method, where:
- It's used in GDLauncher and DLauncher to prepare argument types for GPU kernel launches
- The actual values are passed separately via perfect forwarding (args...)
- A similar pattern is also used in asio.hpp, confirming this is a consistent design pattern
The implementation is type-safe and follows the codebase's established patterns for GPU kernel launches. Documentation would be nice-to-have but not critical given the clear usage pattern.
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Search for usages of as_values to understand the nullptr context
rg -A 5 "as_values.*nullptr"
# Search for related kernel implementations
ast-grep --pattern 'template <typename... Args> void global_[gd]*_kernel'
Length of output: 1056
Script:
#!/bin/bash
# Search for usages of as_values function
rg -A 10 "as_values\("
# Search for implementations using the detail namespace
rg -A 10 "namespace detail" inc/mkn/gpu/
# Look for kernel launch implementations that might use these utilities
ast-grep --pattern 'template <typename... Args>
auto launch(Args&&... args)'
Length of output: 6168
inc/mkn/gpu/cpu.hpp (5)
86-88
: LGTM: Appropriate no-op implementations for CPU-side stubs.
The inline empty functions correctly mirror their GPU counterparts while maintaining a consistent interface.
Line range hint 215-233
: LGTM: Simplified launch function template.
The launch function has been streamlined by removing the _coop parameter while maintaining proper thread-local index management and device memory handling.
279-282
: LGTM: Clean implementation of direct kernel execution.
The new global_d_kernel function provides a straightforward CPU-side implementation for direct kernel execution, maintaining consistency with GPU-side interfaces.
284-284
: Verify the launchers.hpp inclusion.
Please ensure that launchers.hpp exists and contains the expected launcher implementations.
✅ Verification successful
Inclusion of launchers.hpp is valid and contains required launcher implementations
The file exists and contains the expected launcher implementations:
- GDLauncher template class for GPU device launching
- DLauncher template class for device launching
- Both classes provide stream-based and direct launch operations
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify the existence and contents of launchers.hpp
# Check if the file exists and show its contents
fd -t f "launchers.hpp" -x cat {}
Length of output: 3446
104-113
: Verify the StreamEvent behavior changes across GPU implementations.
The StreamEvent structure has undergone significant changes in its event handling mechanism:
- The operator() now executes a callback function
- The finished() method uses a simple boolean flag
- New member variables have been added
These changes should be consistent with CUDA and ROCm implementations.
✅ Verification successful
StreamEvent implementations are consistent across GPU backends
The changes in the CPU implementation align with both CUDA and ROCm implementations:
- All implementations use the same callback-based approach with std::function<void()>
- All have the fin flag for tracking completion status
- Both CUDA and ROCm use similar stream callback mechanisms (cudaStreamAddCallback / hipStreamAddCallback)
- The CPU implementation provides a simplified version that executes callbacks immediately, which is appropriate for CPU-only scenarios
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
#!/bin/bash
# Description: Verify consistency of StreamEvent implementations across GPU backends
# Check for StreamEvent implementations in other GPU backends
rg -A 10 "struct StreamEvent" "inc/mkn/gpu/(cuda|rocm).hpp"
Length of output: 129
Script:
#!/bin/bash
# Let's first find all GPU-related header files
fd -e hpp -e h "gpu"
# Then search for StreamEvent in these files
rg -A 10 "struct StreamEvent"
Length of output: 1723
inc/mkn/gpu/cuda.hpp (2)
49-52
: LGTM: Improved macro formatting
The macro reformatting enhances readability while maintaining proper bracing and functionality.
89-94
: LGTM: Well-structured utility functions
The utility functions provide clean wrappers around CUDA API calls with proper error checking.
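For illustration, a sketch of what such a wrapper typically looks like; the macro and function names here are assumptions, not necessarily those in cuda.hpp:

#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>

// Assumed names: an error-checking macro plus thin wrappers over CUDA calls.
#define GPU_CHECK(call)                                              \
  do {                                                               \
    cudaError_t const err_ = (call);                                 \
    if (err_ != cudaSuccess) {                                       \
      std::fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                   cudaGetErrorString(err_), __FILE__, __LINE__);    \
      std::abort();                                                  \
    }                                                                \
  } while (0)

inline void sync() { GPU_CHECK(cudaDeviceSynchronize()); }
inline void sync(cudaStream_t s) { GPU_CHECK(cudaStreamSynchronize(s)); }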
inc/mkn/gpu/multi_launch.hpp (1)
75-79
:
Add runtime checks for timer state
The assertions in time() will be stripped in release builds, potentially leading to undefined behavior if the timer wasn't properly started/stopped.
- auto time() const {
- assert(s > 0);
- assert(e > 0);
- return e - s;
- }
+ auto time() const {
+ if (s == 0 || e == 0) {
+ throw std::runtime_error("Timer was not properly started or stopped");
+ }
+ return e - s;
+ }
Likely invalid or redundant comment.
template <typename F, typename... Args>
__global__ static void global_d_kernel(F f, Args... args) {
  f(args...);
}
Add bounds checking or document usage requirements
The global_d_kernel lacks bounds checking, unlike its counterpart global_gd_kernel. This could lead to out-of-bounds execution if not used carefully.
Consider either:
- Adding bounds checking similar to global_gd_kernel
- Documenting that the caller is responsible for ensuring proper grid/block dimensions
this is the unsafe, you-should-know-what-you're-doing function, whereas the other, with bounds checks, is the safe option
@PhilipDeegan, understood, thank you for the clarification.
(>_<)🐰
✏️ Learnings added
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/cuda.hpp:307-311
Timestamp: 2024-12-08T14:07:34.983Z
Learning: In `inc/mkn/gpu/cuda.hpp`, the `global_d_kernel` function is intentionally left without bounds checking, intended for advanced use where the caller is responsible for ensuring proper grid and block dimensions, while `global_gd_kernel` includes bounds checking for safer operation.
Note: Learnings are effective only in the context of similar code segments. To apply general coding guidelines, please configure review instructions. You can manage existing learnings in the app.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (1)
inc/mkn/gpu/multi_launch.hpp (1)
73-74
: Simplify the group_idx_modulo implementation. The function can be simplified by passing arguments by value, since std::size_t is a trivially copyable type.
Apply this diff:

-std::size_t inline group_idx_modulo(std::size_t const& gs, std::size_t const& i) { return i / gs; }
+std::size_t inline group_idx_modulo(std::size_t gs, std::size_t i) { return i / gs; }
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
inc/mkn/gpu/multi_launch.hpp (14 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (3)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:335-337
Timestamp: 2024-11-12T20:33:23.270Z
Learning: When suggesting code modifications to C++ constructors, ensure that any code statements are placed after the initializer list and within the constructor body to avoid compilation errors.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
🔇 Additional comments (6)
inc/mkn/gpu/multi_launch.hpp (6)
49-55
: LGTM: Well-designed utility function
The deref template function elegantly handles both pointer and non-pointer types using SFINAE, making it a useful utility for the codebase.
292-325
: LGTM: Well-implemented mutex-based synchronization
The implementation correctly:
- Validates group size and divisibility
- Uses try_lock for non-blocking behavior
- Properly handles mutex state with appropriate status updates
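Putting those points together, a minimal standalone sketch of the non-blocking group-mutex step; the SFS values and surrounding types are assumed from the excerpts above:

#include <cstddef>
#include <functional>
#include <mutex>
#include <vector>

enum class SFS { FIRST, WAIT };  // assumed status values from the review

// One mutex per group: a stream either takes its group's lock and runs,
// or marks itself for retry instead of blocking the worker thread.
inline void group_mutex_step(std::size_t i, std::size_t group_size,
                             std::vector<std::mutex>& mutices,
                             std::vector<SFS>& status,
                             std::function<void(std::size_t)> const& fn) {
  std::unique_lock<std::mutex> lock(mutices[i / group_size], std::defer_lock);
  if (lock.try_lock()) {
    fn(i);
    status[i] = SFS::WAIT;   // done
  } else {
    status[i] = SFS::FIRST;  // retry on a later pass
  }
}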
327-378
: LGTM: Well-designed group index functionality
The implementation provides:
- Proper error handling for zero group size
- Efficient device launch strategy with conditional compilation
- Clear separation between host and device functionality
76-80
:
Add runtime checks for timer state
The assertions in time() will be disabled in release builds, potentially leading to undefined behavior if the timer wasn't properly started/stopped.
Apply this diff to add runtime checks:
auto time() const {
- assert(s > 0);
- assert(e > 0);
+ if (s == 0 || e == 0) {
+ throw std::runtime_error("Timer not properly started/stopped");
+ }
return e - s;
}
Likely invalid or redundant comment.
87-96
:
Initialize group_size with constructor parameter
The group_size member is initialized to 0 in the declaration but should use the constructor parameter.
Apply this diff:
- std::size_t const group_size = 0;
+ std::size_t const group_size;
Likely invalid or redundant comment.
539-542
:
Add error handling for file operations
The file output operations should include error handling for file open failures.
Apply this diff:
void dump_times(std::string const& filename, double const div = 1) const {
- _print_times<1>(std::ofstream{filename}, div);
+ std::ofstream file{filename};
+ if (!file) {
+ throw std::runtime_error("Failed to open file: " + filename);
+ }
+ _print_times<1>(file, div);
}
Likely invalid or redundant comment.
Actionable comments posted: 0
🧹 Outside diff range and nitpick comments (4)
inc/mkn/gpu/multi_launch.hpp (4)
41-56
: LGTM! Consider adding documentation for the deref utility. The new includes and utility function look good. Consider adding documentation to explain the purpose and usage of the deref utility function. Add documentation like this:

/// Dereferences pointer types and returns references for non-pointer types.
/// @tparam Type The input type
/// @param type The value to potentially dereference
/// @return Reference to the value, dereferenced if it's a pointer
73-74
: Simplify the group_idx_modulo function. The function can be simplified, as integer division already performs the same operation.

-std::size_t inline group_idx_modulo(std::size_t const& gs, std::size_t const& i) { return i / gs; }
+std::size_t inline group_idx_modulo(std::size_t gs, std::size_t i) { return i / gs; }
292-325
: Consider using a named constant for retry status. The implementation looks good, but using a magic value for retry status could be improved.

+ static constexpr auto RETRY_STATUS = SFS::FIRST;
+
  void run(std::uint32_t const i) override {
    std::unique_lock<std::mutex> lock(mutices[Super::group_idx(i)], std::defer_lock);
    if (lock.try_lock()) {
      fn(i);
      strat.status[i] = SFS::WAIT;  // done
    } else {
-     strat.status[i] = SFS::FIRST;    // retry
+     strat.status[i] = RETRY_STATUS;  // retry
    }
  }
350-378
: Add documentation for the template parameter. The implementation is good, but the purpose of the is template parameter isn't immediately clear. Add documentation like this:

/// @brief Device group function that executes on specific group indices
/// @tparam Strat The strategy type
/// @tparam Fn The function type
/// @tparam is If true, executes for matching group indices; if false, executes for non-matching indices
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (1)
inc/mkn/gpu/multi_launch.hpp (14 hunks)
🧰 Additional context used
📓 Learnings (1)
inc/mkn/gpu/multi_launch.hpp (3)
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:335-337
Timestamp: 2024-11-12T20:33:23.270Z
Learning: When suggesting code modifications to C++ constructors, ensure that any code statements are placed after the initializer list and within the constructor body to avoid compilation errors.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:66-76
Timestamp: 2024-11-12T05:19:45.927Z
Learning: An RAII ScopedTimer is not appropriate when timing asynchronous functions in this codebase.
Learnt from: PhilipDeegan
PR: mkn/mkn.gpu#23
File: inc/mkn/gpu/multi_launch.hpp:506-506
Timestamp: 2024-11-12T05:19:45.927Z
Learning: In the `ThreadedStreamLauncher` class in `multi_launch.hpp`, the member variable `work_i` is safely accessed within the locked section guarded by the `work_` mutex in the `get_work()` method, ensuring thread safety in the multi-threaded environment.
🔇 Additional comments (5)
inc/mkn/gpu/multi_launch.hpp (5)
Line range hint 247-290
: LGTM! Well-structured barrier implementation.
The StreamGroupBarrierFunction implementation is well-designed with:
- Proper inheritance from StreamGroupFunction
- Thread-safe barrier synchronization
- Appropriate error handling for group size validation
327-348
: LGTM! Well-implemented group index functionality.
The implementation has:
- Proper error handling for zero group size
- Correct placement of validation after initializer list
- Clear group index calculation logic
75-85
: 🛠️ Refactor suggestion
Add runtime checks in Timer::time()
The current implementation uses assertions which may be disabled in release builds. Consider adding runtime checks for uninitialized state.
struct Timer {
auto time() const {
- assert(s > 0);
- assert(e > 0);
+ if (s == 0 || e == 0) {
+ throw std::runtime_error("Timer not properly started/stopped");
+ }
return e - s;
}
void start() { s = kul::Now::NANOS(); }
void stop() { e = kul::Now::NANOS(); }
std::size_t s = 0, e = 0;
};
Likely invalid or redundant comment.
87-96
:
Fix group_size initialization
The group_size member is initialized to 0 in the declaration but should use the constructor parameter instead.
- std::size_t const group_size = 0;
+ std::size_t const group_size;
Likely invalid or redundant comment.
539-542
: 🛠️ Refactor suggestion
Add error handling for file operations
The file output operations should include error handling.
void dump_times(std::string const& filename, double const div = 1) const {
- _print_times<1>(std::ofstream{filename}, div);
+ std::ofstream file{filename};
+ if (!file) {
+ throw std::runtime_error("Failed to open file: " + filename);
+ }
+ _print_times<1>(file, div);
}
Likely invalid or redundant comment.
Summary by CodeRabbit
New Features
- DLauncher, for GPU kernel execution.

Bug Fixes

Refactor
- StreamEvent structure for better usability.
- as_values functionality within a new namespace for better organization.