Skip to content
Merged
Show file tree
Hide file tree
Changes from 3 commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
5 changes: 5 additions & 0 deletions sycl/include/sycl/ext/oneapi/experimental/ballot_group.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -142,6 +142,7 @@ inline std::enable_if_t<sycl::is_group_v<std::decay_t<Group>> &&
get_ballot_group(Group group, bool predicate) {
(void)group;
#ifdef __SYCL_DEVICE_ONLY__
#if defined(__SPIR__) || defined(__NVPTX__)
// ballot_group partitions into two groups using the predicate
// Membership mask for one group is negation of the other
sub_group_mask mask = sycl::ext::oneapi::group_ballot(group, predicate);
Expand All @@ -150,6 +151,10 @@ get_ballot_group(Group group, bool predicate) {
} else {
return ballot_group<sycl::sub_group>(~mask, predicate);
}
#else
static_assert(false,
"ballot_group is not currently supported on this platform.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is a static assertion our best option here? I worry it may be a little to harsh on users that don't intend to run such kernels on other platforms.

@Pennycook - What are your thoughts here? My thoughts are that ideally if we want to restrict the platforms of these, we would want to have an aspect for it and mark related features as requiring it so we can report failure at runtime instead if the user tried to launch a kernel with these.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @steffenlarsen. Furthermore, won't these asserts trigger any time the functions are compiled, regardless of whether they're actually used? I don't think we want to be in a position where we can't compile for a specific backend just because there's a usage of one of these non-uniform groups somewhere in a header.

I think something like an aspect, optional feature, or other device queries makes sense here. We didn't include anything like this in the initial proposal because it wasn't clear what should and shouldn't be allowed... For example, should a device be allowed to support only certain fixed_size_group sizes, or certain scopes? I think we need to get more implementation experience -- across more than just SPIR-V and NVPTX -- before we can properly answer those questions.

I'm not opposed to adding some broad aspects/queries for whether the groups are supported at all.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

RE implementations on other backends:

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

HIP does support some subsets of these group features, (ballot,any, or taking a mask), such that I think it would be possible to implement basically all of ballot_group and fixed_size_group, minus barrier. Although due to the lack of independent forward progress guarantees I guess could mean that even then certain code might not be portable in the sense it doesn't hang etc.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

Exactly.

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

Intel GPUs don't support "independent forward progress" either, and the extension doesn't require it. Working around the lack of __syncwarp(mask) probably requires some assumptions and/or compiler smarts... since we only need to be able to synchronize the work-items in the active branch, it might be sufficient to run the equivalent of an unmasked __syncwarp, or to do nothing at all. For SPIR-V targets we currently just issue a memory barrier, which seems to work (see here).

I'm still not comfortable with this and I think we need more test-cases to prove definitively whether things are working as expected. But that's why the extension is experimental. 😄

Although due to the lack of independent forward progress guarantees I guess could mean that even then certain code might not be portable in the sense it doesn't hang etc.

I agree, but this is always going to be true. I don't think the presence of non-uniform groups makes this any worse than it is already. Any code that makes assumptions about forward progress will be non-portable, and that's why we're defining the sycl_ext_oneapi_forward_progress extension.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, we can add the aspect then tell programmers it is their responsibility to use the aspect as a check in code that is using non uniform groups?

Exactly.

For HIP AMD I'm not sure that that any of these groups can be fully implemented on existing hardware. The issue is that amd cards don't have independent forward progress (as it is defined here https://developer.nvidia.com/blog/inside-volta/), and they don't support __syncwarp(mask);

Intel GPUs don't support "independent forward progress" either, and the extension doesn't require it. Working around the lack of __syncwarp(mask) probably requires some assumptions and/or compiler smarts... since we only need to be able to synchronize the work-items in the active branch, it might be sufficient to run the equivalent of an unmasked __syncwarp, or to do nothing at all. For SPIR-V targets we currently just issue a memory barrier, which seems to work (see here).

I'm still not comfortable with this and I think we need more test-cases to prove definitively whether things are working as expected. But that's why the extension is experimental. 😄

I don't know for sure how amd devices would behave - we'd just have to experiment and see what happens. I guess that we don't want to support e.g. ballot_group partially in a backend if it can be helped. There is also the point that amd could start supporting these features in the not too distant future.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm going to defer to @gmlueck on this one, because I'm not sure how optional features and aspects actually work in the compiler.

Are you asking me whether all the APIs in "sycl_ext_oneapi_non_uniform_groups" should be "optional kernel features" that are tied to some aspect? Obviously, the extension is easier to use if we can guarantee that it is available for all devices. Do we think that it can be implemented for all devices, and we just haven't completed the implementation on certain backends? Or, do we think these APIs can never be implemented on certain devices?

If it's the former, it would be better not to burden application developers by requiring them to check an aspect. Triggering a static-assert might be reasonable as a short-term solution if we can add the missing support soon.

If it's the later, then we should change the specification to add the aspect and document the APIs as optional device features.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not 100% sure, and I was hoping we could get more implementation experience before having to answer this question. But my guess is that there will be certain devices that cannot implement all of the functionality in the specification as it's written today (or that there might be certain compilers and/or library-only implementations that choose not to do so).

How hard is it for us to add new optional device features? I ask because it's still not clear to me if we would want one aspect (i.e. for non-uniform groups) or multiple aspects (e.g. for each type of non-uniform group, or each combination of non-uniform group and scope). I suppose what I'm asking is: If we started with one aspect today, how angry would people be if we need to change that later?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since this extension is still experimental, we can make API breaking changes in the future. For example, the extension could require all devices to support the APIs now, and we can still add an aspect later if necessary. If we decide that DPC++ can support these APIs on all devices, the extension can expose the APIs as "required device features". When it comes time to adopt them into SYCL-Next, we can still adopt them as "optional features" if other vendors think they will be hard to support.

That said, it is not too hard to add support for an optional feature / aspect. There are two main parts to the implementation:

  • The header file needs to add [[__sycl_detail__::__uses_aspects__(aspect::foo)]] to either the declaration of a function or to the definition of a type as described in the design.

  • We need some backend-specific code that can query a device and decide whether the device supports the aspect.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@JackAKirk Given Greg's response above, would something like an aspect::non_uniform_groups work for you?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it sounds sensible. I can try adding [[__sycl_detail__::__uses_aspects__(aspect::foo)]] to the header and add the aspect and then check the behavior.

#endif
#else
(void)predicate;
throw runtime_error("Non-uniform groups are not supported on host device.",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -158,8 +158,11 @@ get_fixed_size_group(Group group) {
return fixed_size_group<PartitionSize, sycl::sub_group>(
sycl::detail::Builder::createSubGroupMask<ext::oneapi::sub_group_mask>(
bits, loc_size));
#else
#elif defined(__SPIR__)
return fixed_size_group<PartitionSize, sycl::sub_group>();
#else
static_assert(
false, "fixed_size_group is not currently supported on this platform.");
#endif
#else
throw runtime_error("Non-uniform groups are not supported on host device.",
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -147,6 +147,10 @@ inline opportunistic_group get_opportunistic_group() {
sycl::detail::Builder::createSubGroupMask<ext::oneapi::sub_group_mask>(
active_mask, 32);
return opportunistic_group(mask);
#else
static_assert(
false,
"opportunistic_group is not currently supported on this platform.");
#endif
#else
throw runtime_error("Non-uniform groups are not supported on host device.",
Expand Down
5 changes: 5 additions & 0 deletions sycl/include/sycl/ext/oneapi/experimental/tangle_group.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -148,6 +148,11 @@ get_tangle_group(Group group) {
return tangle_group<sycl::sub_group>(mask);
#elif defined(__NVPTX__)
// TODO: Construct from compiler-generated mask
static_assert(false,
"tangle_group is not currently supported on this platform.");
#else
static_assert(false,
"tangle_group is not currently supported on this platform.");
#endif
#else
throw runtime_error("Non-uniform groups are not supported on host device.",
Expand Down