
[SYCL] Fix endless-loop in reduction with nd_range having 1 element l… #2022

Merged
merged 6 commits into from
Jul 21, 2020

Conversation

v-klochkov
Copy link
Contributor

…ocal range

The reduction implementation for data types that do not have fast atomics may require running an additional kernel as many times as needed to converge all partial sums into a single scalar sum, which is possible only when the work-group size is greater than 1.
The additional kernel used the work-group size specified in the original user's kernel, which is not necessary, and caused an endless loop when the local range had only 1 element.

The patch checks the maximum work-group size available on the device as well as the available local memory, and chooses the work-group size for the additional kernels accordingly. This eliminates the endless loop and makes the convergence faster, since a bigger work-group size is chosen.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>

@v-klochkov v-klochkov requested a review from a team as a code owner July 1, 2020 06:39
Pennycook
Pennycook previously approved these changes Jul 1, 2020
Contributor

@Pennycook Pennycook left a comment


LGTM. I left some comments about things we should remember to keep looking into, and a suggestion to add a TODO that you should feel free to ignore.

constexpr bool HFR = Reduction::has_fast_reduce;
size_t OneElemSize = HFR ? 0 : sizeof(typename Reduction::result_type);
MaxWGSize = intel::detail::reduGetMaxWGSize(QueueCopy, OneElemSize);
assert(MaxWGSize > 1 &&
Contributor

I think this is fine for now, because the new behavior is an improvement and it's pretty unlikely that this assertion will ever trigger. But we might want to consider introducing a fallback path that always works with 1 work-item, just by iterating over all the partial results sequentially. This would be terribly slow and we'd probably want to issue a performance warning, but at least there would be no cases that didn't work.

Contributor Author

Thank you. Your comment made me think that the main kernel also needs that safety check.
The only bad thing about such a check is that it does nothing in almost all cases (except pretty crazy test cases), but does take time to execute. Perhaps it is really a minor overhead, and the info requests to the device work really fast.

Contributor

I'm not sure I follow what you mean about the main kernel. The main kernel is the one that the user provided, plus the reduction prologue/epilogue code, right?

Are you suggesting that if the user provides an ND-range of 1 work-item, you could just perform a serial reduction? Or something else?

Contributor Author

For this scenario:
a) the user wants nd_range<1>{16,16};
b) Reduction::result_type is, let's say, 32Kb, and thus MaxWGSize == 2 (because the local memory is 64Kb);
the local accessor at the line https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L652 cannot be used, as it would require 16*32Kb.

Handling nd_range<1>{16,16} would require a kernel that runs the user's lambda but does not do the reduction loop https://github.com/intel/llvm/blob/sycl/sycl/include/CL/sycl/intel/reduction.hpp#L678; instead it would simply write Reducer::MValue to a global accessor/array with partial sums.

Comment on lines +29 to +31
// Let's say MaxWGSize = 128 and NWorkItems is (128+32).
// It seems better to have 5 groups 32 work-items each than 2 groups with
// 128 work-items in the 1st group and 32 work-items in the 2nd group.
Contributor

We should set up some benchmarks for this sort of thing. I agree that load balance is important, but we might need to pay attention to the total number of kernels we're launching as well.

Comment on lines +48 to +49
device Dev = Queue->get_device();
size_t WGSize = Dev.get_info<info::device::max_work_group_size>();
Contributor

Do you think it's worth adding a TODO here noting that the code assumes that the reduction kernel can be launched with the maximum work-group size? I think eventually we want to pay attention to the maximum work-group size for the kernel itself (which may be different).

Contributor Author

I added a TODO comment.
If some kernel uses a local 1-dim accessor, do you think a query to kernel-on-device would return a max work-group size that takes the local memory limits into account? I doubt it can, because for such a query there is no reliable information proving that the local accessor(s) would have the same number of elements as local_range.size().

Contributor

Right, I don't think we can count on the query to account for local memory. We'd need two queries as you have here, but the max work-group size one would be kernel-specific instead of device-specific.

It might be hard to do this right now, because I think the program interface only allows you to access this information today by re-compiling the kernel. Paying the JIT cost for the kernel every time it's launched just to check what sizes it supports doesn't seem like a good idea.

Signed-off-by: Vyacheslav N Klochkov <[email protected]>
Comment on lines 1076 to 1077
assert(MaxWGSize > 1 &&
"Work group size must be greater than 1 to avoid endless loop.");
Contributor

Does the old code work in this case?
If not, then I believe there should now be a test for this case with an XFAIL.

Contributor Author

@v-klochkov v-klochkov Jul 7, 2020

The old code simply used the WGSize specified by the user for the main kernel. If the user set WGSize to 1 work-item, the tests hung. I have already added new test cases to the LIT tests.

With the patch, MaxWGSize=1 is still a possible situation, for example, if the device has 64k of local memory and the user passes a custom type to the reduction where 1 element of that custom type is larger than 32k (i.e. only 1 element fits into WG local memory). I believe that is a very atypical corner case for which we normally don't create LIT tests, right?

@v-klochkov v-klochkov requested review from rdeodhar and s-kanaev July 7, 2020 16:41
Contributor

@sergey-semenov sergey-semenov left a comment

LGTM overall

@v-klochkov
Copy link
Contributor Author

@s-kanaev , @sergey-semenov , @Pennycook - please approve if you are Ok with the current patch.

@s-kanaev: I wrote a comment regarding your request to create a LIT test checking one specific situation. Briefly: such a LIT test would cover a very special corner case. If you still think it is needed, I'll add it.

Contributor

@Pennycook Pennycook left a comment

If the change I highlighted was deliberate, LGTM.

@bader bader merged commit e6b6ae7 into intel:sycl Jul 21, 2020
@v-klochkov v-klochkov deleted the public_reduction_fix_endless_loop branch July 22, 2020 04:54