[SYCL] Fix endless-loop in reduction with nd_range having 1 element l… #2022
Conversation
…ocal range

The reduction implementation for data types that do not have fast atomics may require running an additional kernel as many times as needed to converge all partial sums into one final scalar sum, which is possible only when the work-group size is greater than 1. The additional kernel used the work-group size specified in the original user's kernel, which is unnecessary and causes an endless loop when the local range has only 1 element. The patch checks the maximum available work-group size on the device as well as the available local memory, and chooses the work-group size for the additional kernels accordingly. This eliminates the endless loop and makes convergence faster, since a bigger work-group size is chosen.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
LGTM. I left some comments about things we should remember to keep looking into, and a suggestion to add a TODO that you should feel free to ignore.
sycl/include/CL/sycl/handler.hpp (outdated)
constexpr bool HFR = Reduction::has_fast_reduce;
size_t OneElemSize = HFR ? 0 : sizeof(typename Reduction::result_type);
MaxWGSize = intel::detail::reduGetMaxWGSize(QueueCopy, OneElemSize);
assert(MaxWGSize > 1 &&
I think this is fine for now, because the new behavior is an improvement and it's pretty unlikely that this assertion will ever trigger. But we might want to consider introducing a fallback path that always works with 1 work-item, just by iterating over all the partial results sequentially. This would be terribly slow and we'd probably want to issue a performance warning, but at least there would be no cases that didn't work.
Thank you. Your comment made me think that the main kernel also needs that safety check.
The only bad thing about such a check is that it does nothing in almost all cases (except pretty crazy test cases), but still takes time. Perhaps it is a really minor overhead and the info requests to the device work really fast.
I'm not sure I follow what you mean about the main kernel. The main kernel is the one that the user provided, plus the reduction prologue/epilogue code, right?
Are you suggesting that if the user provides an ND-range of 1 work-item, you could just perform a serial reduction? Or something else?
For this scenario:
a) the user wants nd_range<1>{16,16};
b) Reduction::result_type is, let's say, 32 KB, and thus MaxWGSize == 2 (because the local memory is 64 KB);
the local accessor at the line https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L652 cannot be used, as it would require 16 * 32 KB.
Handling nd_range<1>{16,16} would require a kernel that runs the user's lambda but does not do the reduction loop https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L678;
instead it would simply write Reducer::MValue to the global accessor/array of partial sums.
// Let's say MaxWGSize = 128 and NWorkItems is (128+32).
// It seems better to have 5 groups 32 work-items each than 2 groups with
// 128 work-items in the 1st group and 32 work-items in the 2nd group.
We should set up some benchmarks for this sort of thing. I agree that load balance is important, but we might need to pay attention to the total number of kernels we're launching as well.
device Dev = Queue->get_device();
size_t WGSize = Dev.get_info<info::device::max_work_group_size>();
Do you think it's worth adding a TODO here noting that the code assumes that the reduction kernel can be launched with the maximum work-group size? I think eventually we want to pay attention to the maximum work-group size for the kernel itself (which may be different).
I added a TODO comment.
If some kernel uses a local 1-dim accessor, do you think a query to kernel-on-device would return a max work-group size that takes the local memory limits into account? I doubt it can, because for such a query there is no reliable information proving that the local accessor(s) would have the same number of elements as local_range.size().
Right, I don't think we can count on the query to account for local memory. We'd need two queries as you have here, but the max work-group size one would be kernel-specific instead of device-specific.
It might be hard to do this right now, because I think the program interface only allows you to access this information today by re-compiling the kernel. Paying the JIT cost for the kernel every time it's launched just to check what sizes it supports doesn't seem like a good idea.
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
…ional/aux kernel Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
sycl/include/CL/sycl/handler.hpp (outdated)
assert(MaxWGSize > 1 &&
       "Work group size must be greater than 1 to avoid endless loop.");
Does the old code work in this case?
If not, then I believe there should be a test for this case with an XFAIL now.
The old code simply used the WGSize specified by the user for the main kernel. If the user set WGSize to 1 work-item, then the tests hung. I have already added new test cases to the LIT tests.
With the patch, MaxWGSize == 1 is still a possible situation, for example, if the device has 64 KB of local memory and the user passes a custom type to the reduction where 1 element of that type is larger than 32 KB (i.e. only 1 element fits into the work-group local memory). I believe that is a very atypical corner case for which we normally don't create LIT tests, right?
LGTM overall
…ving an assert Signed-off-by: Vyacheslav N Klochkov <[email protected]>
@s-kanaev , @sergey-semenov , @Pennycook - please approve if you are OK with the current patch. @s-kanaev: I wrote a comment regarding your request to create a LIT test checking one specific situation. Briefly: such a LIT test covers a very special corner case. If you still think it is needed, I'll add it.
If the change I highlighted was deliberate, LGTM.