Conversation

@jle-quel (Contributor) commented Jun 1, 2023

This PR adds the following group algorithms for complex and marray types, together with their tests (a usage sketch follows the list):

  • reduce_over_group
  • joint_reduce
  • inclusive_scan_over_group
  • joint_inclusive_scan
  • exclusive_scan_over_group
  • joint_exclusive_scan
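
For illustration, a hedged sketch of how one of these overloads might be used from a kernel. The header name, the sycl::ext::cplx::complex spelling, and the exact namespace of the overload are assumptions about this library, not confirmed by the PR:

#include <sycl/sycl.hpp>
#include "sycl_ext_complex.hpp" // assumed header name for this library

using complexf = sycl::ext::cplx::complex<float>;

// Reduce one complex value per work-item across the work-group.
void reduce_kernel(sycl::nd_item<1> it, const complexf *in, complexf *out) {
  auto g = it.get_group();
  complexf sum = reduce_over_group(g, in[it.get_global_id(0)],
                                   sycl::plus<complexf>());
  if (g.leader())                 // one work-item per group writes the result
    out[it.get_group(0)] = sum;
}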

The test helpers have been modified to support the new tests and to reduce the number of overloads.

New traits have been added and grouped in a file containing all the trait tests.
A new conversion test has also been added to support the modifications made to the test helpers.

This PR has been tested successfully on different backends and devices:

[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) HD Graphics 630 [0x591b] OpenCL 3.0 NEO  [22.35.24055]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz OpenCL 3.0 (Build 0) [2022.14.8.0.04_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) HD Graphics 630 [0x591b] 1.3 [1.3.24055]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.8 [CUDA 12.0]

@jle-quel (Contributor, Author) commented Jun 1, 2023

The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.

What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

@bd4 (Contributor) commented Jun 1, 2023

> The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.
>
> What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

Is there a standard API that could be used here, or is this oneAPI-only? If so, can it be CMake feature-gated so this can at least build with other compilers, and can it be ifdef'd on the oneAPI version so it works on older oneAPI releases, in particular the latest public 2023.01 oneAPI release? I do not think it is worth the effort to support old oneAPI releases, but I think it is important to support at least the latest public one.

Did clang-format-17 produce different formatting? Sometimes adding settings to .clang-format can make different versions behave the same, but it can be tricky to figure out.

@jle-quel (Contributor, Author) commented Jun 2, 2023

> The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.
> What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

> Is there a standard API that could be used here, or is this oneAPI-only? If so, can it be CMake feature-gated so this can at least build with other compilers, and can it be ifdef'd on the oneAPI version so it works on older oneAPI releases, in particular the latest public 2023.01 oneAPI release? I do not think it is worth the effort to support old oneAPI releases, but I think it is important to support at least the latest public one.

The standard way would be to use a plain for loop, but a regression was seen on the CUDA backend when doing so (see intel/llvm#7948).

As you mention, we could ifdef on oneAPI to keep detail::loop there and fall back to a plain for loop when the project is built with another compiler; a sketch of this follows.
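
A minimal sketch of that fallback. The wrapper name cplx_loop is invented here, checking __INTEL_LLVM_COMPILER (defined by icpx/DPC++) is one assumed way to detect intel/llvm, and detail::loop's exact signature is an assumption as well:

#include <cstddef>
#include <sycl/sycl.hpp>

// Invented helper: unrolled intel/llvm loop there, plain loop elsewhere.
template <std::size_t N, typename F>
void cplx_loop(F &&f) {
#ifdef __INTEL_LLVM_COMPILER
  // intel/llvm's fully unrolled helper (dim_loop was renamed to loop in
  // intel/llvm#9108); assumed callable as loop<N>(f).
  sycl::detail::loop<N>(f);
#else
  // Portable fallback for other SYCL implementations (e.g. hipSYCL).
  for (std::size_t s = 0; s < N; ++s)
    f(s);
#endif
}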

@jle-quel (Contributor, Author) commented Jun 2, 2023

> Did clang-format-17 produce different formatting? Sometimes adding settings to .clang-format can make different versions behave the same, but it can be tricky to figure out.

Yes, clang-format-17 produced different formatting than version 14.
I'll use version 14 👍

@TApplencourt (Collaborator) commented Jun 2, 2023

As we want this code to be portable to other implementations (hipSYCL), I indeed vote for the ifdef in the worst case.
But I was naively thinking that replacing detail::loop with

for (int s = 0; s < N; ++s) {
  result[s] = reduce_over_group(g, x[s], binary_op);
}

where x.size() is replaced with N, would be good enough.

@TApplencourt (Collaborator) commented Jun 7, 2023

LGTM! I trust the tests, which pass :)
Does anyone want to review it? If not, I will merge Friday night.

@TApplencourt TApplencourt merged commit 80e1c54 into argonne-lcf:main Jun 12, 2023
@TApplencourt (Collaborator) commented

Thanks again :)

@jle-quel jle-quel deleted the jle-quel/introduce-cplx-group-algorithms branch June 13, 2023 08:07
A review thread on the following excerpt from the diff:

#else
  throw sycl::exception(sycl::make_error_code(sycl::errc::runtime),
                        "Group algorithms are not supported on host.");
#endif
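
For context, a hedged reconstruction of the guarded shape under discussion; the function name, template parameters, and device-side body are placeholders, and only the preprocessor structure mirrors the excerpt:

#include <sycl/sycl.hpp>

// Hypothetical wrapper illustrating the pattern being reviewed.
template <typename Group, typename T, typename BinaryOp>
T reduce_impl(Group g, T x, BinaryOp op) {
#ifdef __SYCL_DEVICE_ONLY__
  return sycl::reduce_over_group(g, x, op); // placeholder device path
#else
  throw sycl::exception(sycl::make_error_code(sycl::errc::runtime),
                        "Group algorithms are not supported on host.");
#endif
}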
Contributor

This is horrendously unportable.

  • The spec defines __SYCL_DEVICE_ONLY__ only for SMCP compilers. For single-pass compilers, like the hipSYCL generic SSCP compiler, it is not defined, in accordance with the SYCL spec.
  • Just throwing whenever __SYCL_DEVICE_ONLY__ is not defined will completely break any library-only host backend, where the kernel is compiled as part of the host pass for CPU.

You should seriously start looking into adding CI and validation for other compilers. That would make such issues blatantly obvious.

Contributor


I'd like to add OpenSYCL CI. Do you have a public Docker image that we could use with the latest version pre-installed?

Regarding avoiding the __SYCL_DEVICE_ONLY__ else-throw pattern, do you have any suggestions for an alternative? This has come up in gtensor as well, where we have a function that can be called from host or device code and we want some error reporting in both cases. Is there a standard, SYCL-safe way to report an error for both device and host calls?

@illuhad (Contributor) commented Jun 21, 2023

We don't have a Docker image at the moment (it's on the lower-priority to-do list to speed up our own CI), but what you need is usually fairly simple:

  1. Install a recent LLVM release from the apt.llvm.org packages, plus the boost packages, cmake, and git.
  2. Clone and install.

See our own CI (which is a bit more complicated because it tests all backends in different versions, etc.): https://github.com/OpenSYCL/OpenSYCL/blob/develop/.github/workflows/linux.yml

> Regarding avoiding the __SYCL_DEVICE_ONLY__ else-throw pattern, do you have any suggestions for an alternative? This has come up in gtensor as well, where we have a function that can be called from host or device code and we want some error reporting in both cases. Is there a standard, SYCL-safe way to report an error for both device and host calls?

I might be able to answer more specifically with more information about your use case, but in general this seems like a difficult or impossible thing to do:

  • There is currently no portable mechanism in SYCL to specialize host/device code.
  • As soon as a function is in the kernel call graph, it counts as device code, and device-code restrictions in principle apply (no exceptions are allowed, or else UB). This also affects the host side, as SYCL implementations might use that code to target CPU and still rely on the absence of exceptions for optimizations.
  • Error handling from within kernels is always very difficult, and there is no mechanism for this in the standard, unless you are willing to pass a sycl::stream everywhere throughout your code (you might not want that, as its mere existence might add overhead depending on the implementation); a small sycl::stream sketch follows this list. Some implementations may support printf/assert, but it's not guaranteed.
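
For illustration, a minimal sketch of the sycl::stream route mentioned above; the stream API is standard SYCL 2020, but the kernel, buffer sizes, and message are invented for the example:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  q.submit([&](sycl::handler &cgh) {
     // The stream must be created in the command group and captured by the
     // kernel; it may add overhead just by existing.
     sycl::stream os(1024 /*total buffer*/, 256 /*per work-item*/, cgh);
     cgh.parallel_for(sycl::range<1>{16}, [=](sycl::id<1> i) {
       if (i[0] == 0) // report from a single work-item
         os << "unsupported configuration reached" << sycl::endl;
     });
   }).wait();
}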

The issue for pre-standardized APIs like the SYCL group algorithms is more difficult because we cannot add to the API. Outside of this, one suggestion would be to modify the API and solve this at the C++ level rather than the SYCL level, using different template instantiations or overloads to distinguish host and device code paths (a sketch of this idea follows).
In a single-pass compilation scenario, host and device code are by nature extremely closely related and potentially generated from the exact same IR (even though we have mechanisms to distinguish them, they are very different from the macro), so device code may not even be a separate thing a priori.
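
One hedged way to read that overload suggestion; all names here are invented, and the call site (which knows its own context) selects the overload explicitly instead of relying on a compiler macro:

#include <stdexcept>
#include <sycl/sycl.hpp>

// Tag types let the caller state whether it runs on host or device.
struct host_context {};
struct device_context {};

template <typename Group, typename T, typename BinaryOp>
T my_reduce(device_context, Group g, T x, BinaryOp op) {
  return reduce_over_group(g, x, op); // placeholder device path
}

template <typename Group, typename T, typename BinaryOp>
T my_reduce(host_context, Group, T, BinaryOp) {
  // Host path: exceptions and full error handling are available here.
  throw std::runtime_error("my_reduce is not supported on host");
}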

EDIT: Philosophically, you can consider single-pass compilers to simply be better at enforcing C++ ODR rules. Having different implementations for host and device is in principle an ODR violation that only happens to work because device code is not put into the same binary object.

Can you describe your use case a bit more: how is your error handling intended to work, and why do you feel the need to raise errors inside the kernel call graph?

EDIT2: Something that would work at least across DPC++ and all of the compilation models in hipSYCL for code specialization would be the following:

#ifdef __HIPSYCL__
// hipSYCL exposes builtins that work across all of its compilation flows.
#define if_target_host(...) __hipsycl_if_target_host(__VA_ARGS__)
#define if_target_device(...) __hipsycl_if_target_device(__VA_ARGS__)
#else
#ifdef __SYCL_DEVICE_ONLY__
// Device pass of an SMCP compiler such as DPC++: keep only device code.
#define if_target_device(...) __VA_ARGS__
#define if_target_host(...)
#else
// Host pass: keep only host code.
#define if_target_device(...)
#define if_target_host(...) __VA_ARGS__
#endif
#endif


// Use like so, assuming kernel_func is a kernel:
void kernel_func(auto id) {
  if_target_device(
    // Put device code here
  );
  if_target_host(
    // Put host code here
  );
}

EDIT3: In the code here, this pattern and the error handling are not even necessary, because the group algorithms already cannot be called outside of kernel code: they take a group argument, and groups are not user-constructible. So the code achieves nothing but restricting portability.
