[SYCL] Fix endless-loop in reduction with nd_range having 1 element l… #2022
Conversation
…ocal range

The reduction implementation for data types that do not have fast atomics may require running an additional kernel as many times as needed to converge all partial sums into one final scalar sum, which is possible only when the work-group size is greater than 1. The additional kernel used the work-group size specified in the original user's kernel, which is unnecessary and causes an endless loop when the local range has only 1 element. The patch checks the maximum available work-group size on the device as well as the available local memory, and chooses the work-group size for the additional kernels accordingly. This eliminates the endless loop and makes convergence faster, since a bigger work-group size is chosen.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
LGTM. I left some comments about things we should remember to keep looking into, and a suggestion to add a TODO that you should feel free to ignore.
sycl/include/CL/sycl/handler.hpp (outdated)
constexpr bool HFR = Reduction::has_fast_reduce;
size_t OneElemSize = HFR ? 0 : sizeof(typename Reduction::result_type);
MaxWGSize = intel::detail::reduGetMaxWGSize(QueueCopy, OneElemSize);
assert(MaxWGSize > 1 &&
I think this is fine for now, because the new behavior is an improvement and it's pretty unlikely that this assertion will ever trigger. But we might want to consider introducing a fallback path that always works with 1 work-item, just by iterating over all the partial results sequentially. This would be terribly slow and we'd probably want to issue a performance warning, but at least there would be no cases that didn't work.
Thank you. Your comment made me think that the main kernel also needs that safety check.
The only bad thing about such a check is that it does nothing in almost all cases (except pretty crazy test cases), but still takes time. Perhaps it is a really minor overhead and the info requests to the device work really fast.
I'm not sure I follow what you mean about the main kernel. The main kernel is the one that the user provided, plus the reduction prologue/epilogue code, right?
Are you suggesting that if the user provides an ND-range of 1 work-item, you could just perform a serial reduction? Or something else?
For this scenario:
a) the user wants nd_range<1>{16,16};
b) Reduction::result_type is, let's say, 32 KB, and thus MaxWGSize == 2 (because the local memory is 64 KB);
the local accessor at the line https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L652 cannot be used, as it would require 16 * 32 KB.
Handling nd_range<1>{16,16} would require a kernel that runs the user's lambda but does not do the reduction loop https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L678;
instead it would simply write Reducer::MValue to the global accessor/array of partial sums.
// Let's say MaxWGSize = 128 and NWorkItems is (128+32).
// It seems better to have 5 groups 32 work-items each than 2 groups with
// 128 work-items in the 1st group and 32 work-items in the 2nd group.
We should set up some benchmarks for this sort of thing. I agree that load balance is important, but we might need to pay attention to the total number of kernels we're launching as well.
device Dev = Queue->get_device();
size_t WGSize = Dev.get_info<info::device::max_work_group_size>();
Do you think it's worth adding a TODO here noting that the code assumes that the reduction kernel can be launched with the maximum work-group size? I think eventually we want to pay attention to the maximum work-group size for the kernel itself (which may be different).
I added a TODO comment.
If some kernel uses a local 1-dim accessor, do you think a query to kernel-on-device would return a max work-group size that takes the local memory limits into account? I doubt it can, because for such a query there is no reliable information proving that the local accessor(s) would have the same number of elements as local_range.size().
Right, I don't think we can count on the query to account for local memory. We'd need two queries as you have here, but the max work-group size one would be kernel-specific instead of device-specific.
It might be hard to do this right now, because I think the program interface only allows you to access this information today by re-compiling the kernel. Paying the JIT cost for the kernel every time it's launched just to check what sizes it supports doesn't seem like a good idea.
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
…ional/aux kernel Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Signed-off-by: Vyacheslav N Klochkov <[email protected]>
sycl/include/CL/sycl/handler.hpp (outdated)
assert(MaxWGSize > 1 &&
       "Work group size must be greater than 1 to avoid endless loop.");
Does the old code work in this case?
If not, then I believe there should be a test for this case with an XFAIL now.
The old code simply used the WGSize specified by the user for the main kernel. If the user set WGSize to 1 work-item, then the tests hung. I have already added new test cases to the LIT tests.
With the patch, MaxWGSize == 1 is still a possible situation, for example, if the device has 64 KB of local memory and the user passes a custom type to the reduction where 1 element of that type is larger than 32 KB (i.e. only 1 element fits into the work-group local memory). I believe that is a very atypical corner case for which we normally don't create LIT tests, right?
LGTM overall
…ving an assert Signed-off-by: Vyacheslav N Klochkov <[email protected]>
@s-kanaev , @sergey-semenov , @Pennycook - please approve if you are OK with the current patch. @s-kanaev: I wrote a comment regarding your request to create a LIT test checking one specific situation. Briefly: such a LIT test covers a very special corner case. If you still think it is needed, I'll add it.
If the change I highlighted was deliberate, LGTM.