Conversation

@jle-quel (Contributor) commented Jun 1, 2023

This PR adds the following group algorithms for complex and marray types, together with their tests (a usage sketch follows the list):

  • reduce_over_group
  • joint_reduce
  • inclusive_scan_over_group
  • joint_inclusive_scan
  • exclusive_scan_over_group
  • joint_exclusive_scan
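
For illustration, a hedged sketch of how one of these overloads might be used from a kernel. The header name, the sycl::ext::cplx::complex spelling, and the exact namespace of the overload are assumptions about this library, not confirmed by the PR:

#include <sycl/sycl.hpp>
#include "sycl_ext_complex.hpp" // assumed header name for this library

using complexf = sycl::ext::cplx::complex<float>;

// Reduce one complex value per work-item across the work-group.
void reduce_kernel(sycl::nd_item<1> it, const complexf *in, complexf *out) {
  auto g = it.get_group();
  complexf sum = reduce_over_group(g, in[it.get_global_id(0)],
                                   sycl::plus<complexf>());
  if (g.leader())                 // one work-item per group writes the result
    out[it.get_group(0)] = sum;
}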

The test helpers have been modified to support the new tests and to reduce the number of overloads.

New traits have been added and grouped in a file containing all the trait tests.
A new conversion test has also been added to support the modifications made to the test helpers.

This PR has been tested successfully on different backends and devices:

[opencl:gpu:0] Intel(R) OpenCL HD Graphics, Intel(R) HD Graphics 630 [0x591b] OpenCL 3.0 NEO  [22.35.24055]
[opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) i7-8809G CPU @ 3.10GHz OpenCL 3.0 (Build 0) [2022.14.8.0.04_160000]
[ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) HD Graphics 630 [0x591b] 1.3 [1.3.24055]
[ext_oneapi_cuda:gpu:0] NVIDIA CUDA BACKEND, NVIDIA A100-PCIE-40GB 8.8 [CUDA 12.0]

@jle-quel (Contributor, Author) commented Jun 1, 2023

The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.

What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

@bd4 (Contributor) commented Jun 1, 2023

> The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.
>
> What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

Is there a standard API that could be used here, or is this oneAPI-only? If so, can it be CMake feature-gated so this can at least build with other compilers, and can it be ifdef'd on the oneAPI version so it works on older oneAPI releases, in particular the latest public 2023.01 oneAPI release? I do not think it is worth the effort to support old oneAPI releases, but I think it is important to support at least the latest public one.

Did clang-format-17 produce different formatting? Sometimes adding settings to .clang-format can make different versions behave the same, but it can be tricky to figure out.

@jle-quel (Contributor, Author) commented Jun 2, 2023

> The test-cpu job is failing because it uses icpx, and that version should be in sync with intel/llvm.
> What used to be detail::dim_loop became detail::loop (see intel/llvm#9108)

> Is there a standard API that could be used here, or is this oneAPI-only? If so, can it be CMake feature-gated so this can at least build with other compilers, and can it be ifdef'd on the oneAPI version so it works on older oneAPI releases, in particular the latest public 2023.01 oneAPI release? I do not think it is worth the effort to support old oneAPI releases, but I think it is important to support at least the latest public one.

The standard way would be to use a plain for loop, but a regression was seen on the CUDA backend when doing so (see intel/llvm#7948).

As you mention, we could ifdef on oneAPI to keep detail::loop there and fall back to a plain for loop when the project is built with another compiler; a sketch of this follows.
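
A minimal sketch of that fallback. The wrapper name cplx_loop is invented here, checking __INTEL_LLVM_COMPILER (defined by icpx/DPC++) is one assumed way to detect intel/llvm, and detail::loop's exact signature is an assumption as well:

#include <cstddef>
#include <sycl/sycl.hpp>

// Invented helper: unrolled intel/llvm loop there, plain loop elsewhere.
template <std::size_t N, typename F>
void cplx_loop(F &&f) {
#ifdef __INTEL_LLVM_COMPILER
  // intel/llvm's fully unrolled helper (dim_loop was renamed to loop in
  // intel/llvm#9108); assumed callable as loop<N>(f).
  sycl::detail::loop<N>(f);
#else
  // Portable fallback for other SYCL implementations (e.g. hipSYCL).
  for (std::size_t s = 0; s < N; ++s)
    f(s);
#endif
}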

@jle-quel (Contributor, Author) commented Jun 2, 2023

> Did clang-format-17 produce different formatting? Sometimes adding settings to .clang-format can make different versions behave the same, but it can be tricky to figure out.

Yes, clang-format-17 produced different formatting than version 14.
I'll use version 14 👍

@TApplencourt (Collaborator) commented Jun 2, 2023

As we want this code to be portable to other implementations (hipSYCL), I indeed vote for the ifdef in the worst case.
But I was naively thinking that replacing detail::loop with

for (int s = 0; s < N; ++s) {
  result[s] = reduce_over_group(g, x[s], binary_op);
}

where x.size() is replaced with N, would be good enough.

@TApplencourt (Collaborator) commented Jun 7, 2023

LGTM! I trust the tests, which pass :)
Does anyone want to review it? If not, I will merge Friday night.

@TApplencourt TApplencourt merged commit 80e1c54 into argonne-lcf:main Jun 12, 2023
@TApplencourt (Collaborator) commented

Thanks again :)

@jle-quel jle-quel deleted the jle-quel/introduce-cplx-group-algorithms branch June 13, 2023 08:07
A review thread on the following excerpt from the diff:

#else
  throw sycl::exception(sycl::make_error_code(sycl::errc::runtime),
                        "Group algorithms are not supported on host.");
#endif
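
For context, a hedged reconstruction of the guarded shape under discussion; the function name, template parameters, and device-side body are placeholders, and only the preprocessor structure mirrors the excerpt:

#include <sycl/sycl.hpp>

// Hypothetical wrapper illustrating the pattern being reviewed.
template <typename Group, typename T, typename BinaryOp>
T reduce_impl(Group g, T x, BinaryOp op) {
#ifdef __SYCL_DEVICE_ONLY__
  return sycl::reduce_over_group(g, x, op); // placeholder device path
#else
  throw sycl::exception(sycl::make_error_code(sycl::errc::runtime),
                        "Group algorithms are not supported on host.");
#endif
}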
Contributor

This is horrendously unportable.

  • The spec defines __SYCL_DEVICE_ONLY__ only for SMCP compilers. For single-pass compilers, like the hipSYCL generic SSCP compiler, it is not defined, in accordance with the SYCL spec.
  • Just throwing whenever __SYCL_DEVICE_ONLY__ is not defined will completely break any library-only host backend, where the kernel is compiled as part of the host pass for CPU.

You should seriously start looking into adding CI and validation for other compilers. That would make such issues blatantly obvious.

Contributor


I'd like to add OpenSYCL CI. Do you have a public Docker image that we could use with the latest version pre-installed?

Regarding avoiding the __SYCL_DEVICE_ONLY__ else-throw pattern, do you have any suggestions for an alternative? This has come up in gtensor as well, where we have a function that can be called from host or device code and we want some error reporting in both cases. Is there a standard, SYCL-safe way to report an error for both device and host calls?

@illuhad (Contributor) commented Jun 21, 2023

We don't have a Docker image at the moment (it's on the lower-priority to-do list to speed up our own CI), but what you need is usually fairly simple:

  1. Install a recent LLVM release from the apt.llvm.org packages, plus the boost packages, cmake, and git.
  2. Clone and install.

See our own CI (which is a bit more complicated because it tests all backends in different versions, etc.): https://github.com/OpenSYCL/OpenSYCL/blob/develop/.github/workflows/linux.yml

> Regarding avoiding the __SYCL_DEVICE_ONLY__ else-throw pattern, do you have any suggestions for an alternative? This has come up in gtensor as well, where we have a function that can be called from host or device code and we want some error reporting in both cases. Is there a standard, SYCL-safe way to report an error for both device and host calls?

I might be able to answer more specifically with more information about your use case, but in general this seems like a difficult or impossible thing to do:

  • There is currently no portable mechanism in SYCL to specialize host/device code.
  • As soon as a function is in the kernel call graph, it counts as device code, and device-code restrictions in principle apply (no exceptions are allowed, or else UB). This also affects the host side, as SYCL implementations might use that code to target CPU and still rely on the absence of exceptions for optimizations.
  • Error handling from within kernels is always very difficult, and there is no mechanism for this in the standard, unless you are willing to pass a sycl::stream everywhere throughout your code (you might not want that, as its mere existence might add overhead depending on the implementation); a small sycl::stream sketch follows this list. Some implementations may support printf/assert, but it's not guaranteed.
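
For illustration, a minimal sketch of the sycl::stream route mentioned above; the stream API is standard SYCL 2020, but the kernel, buffer sizes, and message are invented for the example:

#include <sycl/sycl.hpp>

int main() {
  sycl::queue q;
  q.submit([&](sycl::handler &cgh) {
     // The stream must be created in the command group and captured by the
     // kernel; it may add overhead just by existing.
     sycl::stream os(1024 /*total buffer*/, 256 /*per work-item*/, cgh);
     cgh.parallel_for(sycl::range<1>{16}, [=](sycl::id<1> i) {
       if (i[0] == 0) // report from a single work-item
         os << "unsupported configuration reached" << sycl::endl;
     });
   }).wait();
}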

The issue for pre-standardized APIs like the SYCL group algorithms is more difficult because we cannot add to the API. Outside of this, one suggestion would be to modify the API and solve this at the C++ level rather than the SYCL level, using different template instantiations or overloads to distinguish host and device code paths (a sketch of this idea follows).
In a single-pass compilation scenario, host and device code are by nature extremely closely related and potentially generated from the exact same IR (even though we have mechanisms to distinguish them, they are very different from the macro), so device code may not even be a separate thing a priori.
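
One hedged way to read that overload suggestion; all names here are invented, and the call site (which knows its own context) selects the overload explicitly instead of relying on a compiler macro:

#include <stdexcept>
#include <sycl/sycl.hpp>

// Tag types let the caller state whether it runs on host or device.
struct host_context {};
struct device_context {};

template <typename Group, typename T, typename BinaryOp>
T my_reduce(device_context, Group g, T x, BinaryOp op) {
  return reduce_over_group(g, x, op); // placeholder device path
}

template <typename Group, typename T, typename BinaryOp>
T my_reduce(host_context, Group, T, BinaryOp) {
  // Host path: exceptions and full error handling are available here.
  throw std::runtime_error("my_reduce is not supported on host");
}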

EDIT: Philosophically, you can consider single-pass compilers to simply be better at enforcing C++ ODR rules. Having different implementations for host and device is in principle an ODR violation that only happens to work because device code is not put into the same binary object.

Can you describe your use case a bit more: how is your error handling intended to work, and why do you feel the need to raise errors inside the kernel call graph?

EDIT2: Something that would work at least across DPC++ and all of the compilation models in hipSYCL for code specialization would be the following:

#ifdef __HIPSYCL__
// hipSYCL exposes builtins that work across all of its compilation flows.
#define if_target_host(...) __hipsycl_if_target_host(__VA_ARGS__)
#define if_target_device(...) __hipsycl_if_target_device(__VA_ARGS__)
#else
#ifdef __SYCL_DEVICE_ONLY__
// Device pass of an SMCP compiler such as DPC++: keep only device code.
#define if_target_device(...) __VA_ARGS__
#define if_target_host(...)
#else
// Host pass: keep only host code.
#define if_target_device(...)
#define if_target_host(...) __VA_ARGS__
#endif
#endif


// Use like so, assuming kernel_func is a kernel:
void kernel_func(auto id) {
  if_target_device(
    // Put device code here
  );
  if_target_host(
    // Put host code here
  );
}

EDIT3: In the code here, this pattern and the error handling are not even necessary, because the group algorithms already cannot be called outside of kernel code: they take a group argument, and groups are not user-constructible. So the code achieves nothing but restricting portability.
