[SYCL] Add fma_relu extension #5749
= sycl_ext_oneapi_fma_relu

:source-highlighter: coderay
:coderay-linenums-mode: table

// This section needs to be after the document title.
:doctype: book
:toc2:
:toc: left
:encoding: utf-8
:lang: en
:dpcpp: pass:[DPC++]

// Set the default source code type in this document to C++,
// for syntax highlighting purposes. This is needed because
// docbook uses c++ and html5 uses cpp.
:language: {basebackend@docbook:c++:cpp}
== Notice

[%hardbreaks]
Copyright (C) 2022-2022 Intel Corporation. All rights reserved.

Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks
of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by
permission by Khronos.

== Contact

To report problems with this extension, please open a new issue at:

https://github.com/intel/llvm/issues

or contact hugh 'dot' delaney 'at' codeplay 'dot' com.
== Dependencies

This extension is written against the SYCL 2020 revision 4 specification. All
references below to the "core SYCL specification" or to section numbers in the
SYCL specification refer to that revision.

For the `bfloat16` cases this extension depends on the following other SYCL
extension:

* link:./sycl_ext_intel_bf16_conversion.asciidoc[sycl_ext_*_bf16_conversion]

For the `half` cases this extension requires the runtime aspect
`sycl::aspect::fp16`.
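As an illustrative aside (not part of the extension wording), an application targeting the `half` overload might query that aspect at runtime before submitting kernels; a minimal sketch:

```c++
#include <sycl/sycl.hpp>

// Sketch: check the fp16 aspect before dispatching kernels that use the
// sycl::half overload of fma_relu.
bool device_supports_half(const sycl::device &dev) {
  return dev.has(sycl::aspect::fp16);
}
```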
== Contributors

* Hugh Delaney

== Status

This is a proposed extension specification, intended to gather community
feedback. Interfaces defined in this specification may not be implemented yet
or may be in a preliminary state. The specification itself may also change in
incompatible ways before it is finalized. *Shipping software products should
not rely on APIs defined in this specification.*

[NOTE]
====
This extension is currently implemented in {dpcpp} only for GPU devices and
only when using the CUDA backend. Attempting to use this extension in
kernels that run on other devices or backends may result in undefined behavior.
Be aware that the compiler is not able to issue a diagnostic to warn you if
this happens.
====
== Overview

This extension introduces the `fma_relu` function for datatypes `sycl::half`,
`bfloat16` and `bfloat16x2`. `bfloat16` and `bfloat16x2` refer to the bfloat16
class from the `sycl_ext_*_bf16_conversion` extension, and currently use
`uint16_t` and `uint32_t`, respectively, as storage types.

```c++
namespace sycl::ext::oneapi::experimental {

// Available when T is sycl::half, uint16_t (bfloat16) or uint32_t (bfloat16x2)
template <typename T>
T fma_relu(T a, T b, T c);

}
```
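Since `bfloat16x2` appears here only through its `uint32_t` storage type, the following helper is a hypothetical illustration of what that packed representation amounts to; the low-element-first ordering is an assumption, not taken from the extension text:

```c++
#include <cstdint>

// Hypothetical helper: two 16-bit bfloat16 bit patterns packed into one
// uint32_t storage value. The element order shown is an assumption.
inline uint32_t pack_bfloat16x2(uint16_t lo, uint16_t hi) {
  return static_cast<uint32_t>(lo) | (static_cast<uint32_t>(hi) << 16);
}
```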
Contributor:

As part of extending the math functions, you are already adding support for `fma`, `fmax`, etc. for bfloat16/half variants. This `fma_relu` extension introduces two big "new" territories to DPC++:

2. Introducing fusions to DPC++: `fma_relu` tells the compiler that these two functions can be fused together. While this can be important in libraries, is it really necessary for DPC++? The DPC++ compiler can detect that a relu (or another such function) follows an `fma` and trigger the fusion the user intended.

One other open question: if we do decide to have these very ML-specific functions in DPC++, what should the objects that use them be? Scalar, vector, marray? Why is the only vector type here `bfloat16x2`? Should this be put under the joint matrix umbrella as another potential tensor hardware-accelerated function?

Author:

These are valid points. The primary benefit of this sort of extension is that it allows users to concisely target builtins specific to a particular backend. Since the `fma_relu` function is in the CUDA math headers, we think it is appropriate to have it in DPC++ as well, for ease of porting code etc. Our feeling is that since this extension targets just the CUDA backend, it will always remain an extension and will not enter the core spec libraries. A DPC++ extension should (as much as possible) give users access to all of the functionality of the backend API, but not necessarily more. Therefore we do not need to be concerned about making `fma_relu` work for other backends (unless they also have a similar builtin to target).

The question of fusions is an interesting one, and something we will discuss internally. Perhaps in the long run this is the approach that will be used in some instances.

The objects that use the function should be scalar and vector. `bfloat16` has not been vectorized because the vector types for the bfloat16 class have not been implemented yet; once they are, we will add the bfloat16 vector versions of this function. `bfloat16x2` is vectorized because we are relying on an older implementation of bf16x2 which uses `uint32_t` as its storage type.

However, for the time being we are interested in representing backend-specific features in DPC++, and since these features are exposed to the user as free functions in the CUDA headers, we think this is reason enough to bring this function into DPC++ as an extension.

Contributor:

Can you share a link to the CUDA math headers that contains the full list of math/ML functions?

Author:

I can't find a link to the headers online, but you can find

Author:

Sorry, you can actually find it all here:

Author:

What do you think the approach should be with these functions? Should we:

It is worth noting that not all of the functions listed above have their own builtins, but it seems that all of them produce far less PTX than their say
The reason we have added
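For reference, the unfused alternative raised earlier in this thread (let the backend compiler recognize and fuse the pattern) could be written with core SYCL builtins roughly as below; this is only a sketch of that alternative, not part of the proposed extension:

```c++
#include <sycl/sycl.hpp>

// The "let the compiler fuse it" alternative: express the computation with
// the core fma and fmax builtins and rely on the backend compiler to combine
// the multiply-add with the max-with-zero.
inline sycl::half fma_then_relu(sycl::half a, sycl::half b, sycl::half c) {
  return sycl::fmax(sycl::fma(a, b, c), sycl::half{0.0f});
}
```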
Contributor:

Wouldn't it make more sense for this function to take the

Also a nit about the organization of this spec ... the "Specification" section below is the formal specification of your extension. The description of the

Author:

I am following the convention used by all of these bfloat16 PRs: #5748, #5724, which use

Thanks, I have swapped that into the Specification section.

Contributor:

I talked with @dkhaldi about the Matrix API, and she says they will add APIs that take the

Contributor:

Good point, cc @hdelan, we should be able to add bfloat16 implementations of the `fma_relu` functions in this PR provided that #5393 is merged. We do want the bfloat16x2 cases too, but this will require the definition of a bfloat16x2 class / extension doc update first, analogous to bfloat16 in #5393, so the corresponding bfloat16x2 implementations will probably be done in a separate PR. For the joint_matrix API and the other bfloat16 math builtins (fabs, fma, fmin, fmax), the uint16_t implementations are already merged and we are already working on follow-up PRs for the corresponding bfloat16 implementations.
`fma_relu` returns `a * b + c > 0 ? a * b + c : 0`.
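As an illustration only (not part of the formal wording below), a minimal usage sketch might look like the following; it assumes the function is exposed through the usual `<sycl/sycl.hpp>` header and that the selected device is a CUDA-backend GPU providing `sycl::aspect::fp16`:

```c++
#include <sycl/sycl.hpp>

namespace exp = sycl::ext::oneapi::experimental;

int main() {
  sycl::queue q;
  sycl::half *v = sycl::malloc_shared<sycl::half>(4, q);
  v[0] = 2.0f;  // a
  v[1] = -3.0f; // b
  v[2] = 1.0f;  // c

  q.single_task([=] {
    // a * b + c == -5, so the relu clamps the result to 0.
    v[3] = exp::fma_relu(v[0], v[1], v[2]);
  }).wait();

  sycl::free(v, q);
  return 0;
}
```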
== Specification

=== Feature test macro

This extension provides a feature-test macro as described in the core SYCL
specification. An implementation supporting this extension must predefine the
macro `SYCL_EXT_ONEAPI_FMA_RELU` to one of the values defined in the table
below. Applications can test for the existence of this macro to determine if
the implementation supports this feature, or applications can test the macro's
value to determine which of the extension's features the implementation
supports.
If `fma_relu` is to be used with either the `bf16` or `bf16x2` datatypes, then
an implementation must additionally predefine the macro
`SYCL_EXT_INTEL_BF16_CONVERSION`, as detailed in
link:./sycl_ext_intel_bf16_conversion.asciidoc[sycl_ext_*_bf16_conversion].
[%header,cols="1,5"]
|===
|Value
|Description

|1
|The APIs of this experimental extension are not versioned, so the
feature-test macro always has this value.
|===
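As an illustrative aside, application code could guard its use of the extension on the feature-test macro and fall back to the core builtins otherwise; a sketch, assuming the `a * b + c > 0 ? a * b + c : 0` semantics above:

```c++
#include <sycl/sycl.hpp>

// Use fma_relu when the feature-test macro is predefined; otherwise fall back
// to the equivalent combination of core SYCL builtins.
inline sycl::half relu_of_fma(sycl::half a, sycl::half b, sycl::half c) {
#ifdef SYCL_EXT_ONEAPI_FMA_RELU
  return sycl::ext::oneapi::experimental::fma_relu(a, b, c);
#else
  return sycl::fmax(sycl::fma(a, b, c), sycl::half{0.0f});
#endif
}
```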