diff --git a/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc index 468d8b08ab57f..35b9e4a979268 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc @@ -48,6 +48,18 @@ or may be in a preliminary state. The specification itself may also change in incompatible ways before it is finalized. *Shipping software products should not rely on APIs defined in this specification.* +[NOTE] +==== +There is a link:../proposed/sycl_ext_oneapi_graph_fusion.asciidoc[follow-up +proposal] for fusion based on the +https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc[SYCL graph API]. +That proposal continues some of the ideas presented in this proposal, but uses +the more versatile SYCL graphs API to define the sequence of kernels to +execute. + +Once accepted and implemented, the new proposal will supersede this proposal. +==== + [NOTE] ==== This is an experimental extension for the SYCL specification. diff --git a/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc b/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc index bb4402881166d..8a28e97c5f2ad 100644 --- a/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc +++ b/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc @@ -1207,7 +1207,7 @@ The `sycl::ext::codeplay::experimental::property::queue::enable_fusion` property defined by the extension is ignored by queue recording. To enable kernel fusion in a `command_graph` see the -https://github.com/sommerlukas/llvm/blob/proposal/graph-fusion/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph_fusion.asciidoc[sycl_ext_oneapi_graph_fusion extension proposal] +link:../proposed/sycl_ext_oneapi_graph_fusion.asciidoc[sycl_ext_oneapi_graph_fusion extension proposal] which is layered ontop of `sycl_ext_oneapi_graph`. ==== sycl_ext_oneapi_kernel_properties diff --git a/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph_fusion.asciidoc b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph_fusion.asciidoc new file mode 100644 index 0000000000000..91722042c0ada --- /dev/null +++ b/sycl/doc/extensions/proposed/sycl_ext_oneapi_graph_fusion.asciidoc @@ -0,0 +1,779 @@ += sycl_ext_oneapi_graph_fusion + +:source-highlighter: coderay +:coderay-linenums-mode: table + +// This section needs to be after the document title. +:doctype: book +:toc2: +:toc: left +:encoding: utf-8 +:lang: en +:dpcpp: pass:[DPC++] +:sectnums: +:sectnumlevels: 4 + +// Set the default source code type in this document to C++, +// for syntax highlighting purposes. This is needed because +// docbook uses c++ and html5 uses cpp. +:language: {basebackend@docbook:c++:cpp} + + +== Notice + +[%hardbreaks] +Copyright (C) Codeplay Software Limited. All rights reserved. + +Khronos(R) is a registered trademark and SYCL(TM) and SPIR(TM) are trademarks +of The Khronos Group Inc. OpenCL(TM) is a trademark of Apple Inc. used by +permission by Khronos. + + +== Contact + +To report problems with this extension, please open a new issue at: + +https://github.com/intel/llvm/issues + +== Contributors + +Lukas Sommer, Codeplay + +Victor Lomüller, Codeplay + +Victor Perez, Codeplay + +Julian Oppermann, Codeplay + +Ewan Crawford, Codeplay + +Ben Tracy, Codeplay + +John Pennycook, Intel + +Greg Lueck, Intel + + +== Dependencies + +This extension is written against the SYCL 2020 revision 7 specification. All +references below to the "core SYCL specification" or to section numbers in the +SYCL specification refer to that revision. + +This extension builds on top of the experimental SYCL graphs +https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc[extension +proposal]. All references to the "graphs proposal" refer to this proposal. + +In addition, this extension also depends on the following other SYCL extensions: + +* link:../experimental/sycl_ext_oneapi_properties.asciidoc[sycl_ext_oneapi_properties] +extension. +* link:../experimental/sycl_ext_oneapi_annotated_ptr.asciidoc[sycl_ext_oneapi_annotated_ptr] +extension. + +== Status + +This is a proposed extension specification, intended to gather community +feedback. Interfaces defined in this specification may not be implemented yet +or may be in a preliminary state. The specification itself may also change in +incompatible ways before it is finalized. *Shipping software products should +not rely on APIs defined in this specification.* + +== Overview + +The SYCL graph +https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_oneapi_graph.asciidoc[extension +proposal] seeks to reduce the runtime overhead linked to SYCL kernel submission +and expose additional optimization opportunities. + +One of those further optimizations enabled by the graphs proposal is _kernel +fusion_. Fusing two or more kernels executing on the same device into a single +kernel launch can further reduce runtime overhead and enable further kernel +optimizations such as dataflow internalization discussed below. + +This proposal is a continuation of many of the ideas of the initial +https://github.com/intel/llvm/blob/sycl/sycl/doc/extensions/experimental/sycl_ext_codeplay_kernel_fusion.asciidoc[experimental +kernel fusion proposal] for SYCL. However, instead of defining its own +SYCL-based API to record a sequence of kernels to fuse, this proposal builds on +top of the graphs proposal to allow the fusion of graphs. This not only unifies +the APIs, making sure users only need to familiarize themselves with a single +API, but also provides additional advantages. + +The graph proposal defines two APIs to create graphs: a proposal using a +recording mechanism, similar to the initial kernel fusion proposal; and another +one using explicit graph building. Thus, future users will be able to choose +from two different mechanisms to construct the sequence of kernels to fuse. As +there is an explicit step for finalization of graphs before being submitted for +execution, fusion can happen in this step, which also eliminates many of the +synchronization concerns that needed to be covered in the experimental kernel +fusion proposal. + +The aim of this document is to propose a mechanism for users to request the +fusion of two or more kernels in a SYCL graph into a single kernel **at +runtime**. This requires the extension of the runtime with some sort of JIT +compiler to allow for the fusion of kernel functions at runtime. + +== Specification + +=== Feature test macro + +This extension provides a feature-test macro as described in the core SYCL +specification. An implementation supporting this extension must predefine the +macro `SYCL_EXT_ONEAPI_GRAPH_FUSION` to one of the values defined in the +table below. Applications can test for the existence of this macro to determine +if the implementation supports this feature, or applications can test the +macro's value to determine which of the extension's features the implementation +supports. + +[%header,cols="1,5"] +|=== +|Value +|Description + +|1 +|Initial version of this extension. +|=== + +=== API modifications + +==== Properties + +===== Graph Fusion Properties + +The API for `command_graph::finalize()` includes a +`property_list` parameter. The following properties, defined by this extension, +can be added to the property list to indicate that the kernels in the +command-graph can or should be fused. + +```c++ +sycl::ext::oneapi::experimental::property::graph::enable_fusion +``` + +This property is only descriptive, not prescriptive. Implementations are free to +not perform fusion if it is not possible +(see below section <>), fusion is not +supported by the implementation, or the implementation decides not to perform +fusion for other reasons. It is not an error if an implementation does not +perform fusion even though the property is passed. + +Implementations can provide a diagnostic message in case fusion was not +performed through an implementation-specified mechanism, but are not required to +do so. + +```c++ +sycl::ext::oneapi::experimental::property::graph::require_fusion +``` + +This property is prescriptive, i.e., in contrast to the property above, +implementations _must_ perform fusion. If fusion is not supported by the +implementation at all, the implementation must raise an error with error code +`errc::feature_not_supported`. If the implementation is unable to perform fusion +for this graph (see below section <>), the +implementation must raise an error with error code `errc::kernel_not_supported`. + +===== Barrier property + +The following property can be added to the `property_list` of the +`command_graph::finalize()` API. + +```c++ +sycl::ext::oneapi::experimental::property::graph::insert_barriers +``` + +By default, graph fusion will not introduce any additional barriers to the +fused kernel. Existing group barriers inside the code will be retained (see +below). + +If the property list contains this property, additional work-group barriers are +introduced between kernels in the fused kernel (see below section on +synchronization in kernels). + +The property only takes effect if either the +`sycl::ext::oneapi::experimental::property::graph::enable_fusion` +property or the +`sycl::ext::oneapi::experimental::property::graph::require_fusion` property is +also part of the `property_list` of the same invocation of +`command_graph<...>::finalize()`. + +[NOTE] +==== +By adding the `insert_barriers` property, a _work-group barrier_ will be +inserted between the kernels. To achieve a device-wide synchronization, i.e., +a synchronization between different work-groups that is implicit between two +kernels when executed separately, users should leverage the subgraph feature of +the SYCL graph proposal, as device-wide synchronization inside the fused kernel +is not achievable. By creating two subgraphs, fusing each and adding both to +the same graph, a device-wide synchronization between two fused parts can be +achieved if necessary. +==== + +===== Access scope property + +Specializations of the following property template can be passed to three +different APIs, namely: + +* The `accessor` constructor, giving a more granular control. +* The `buffer` constructor, in which case all the `accessors` derived from +this buffer will inherit this property (unless overridden). +* The property list parameter of `annotated_ptr`, to apply the property to a +USM pointer. + +```c++ +namespace sycl::ext::oneapi::experimental::property{ + + template + struct access_scope {}; + + inline constexpr auto access_scope_work_group = + access_scope; + + inline constexpr auto access_scope_work_item = + access_scope; + +} // namespace sycl::ext::oneapi::experimental::property +``` + +Specializations of the `access_scope` property template can be used to express +the access pattern of kernels to a buffer or USM allocation. + +The specializations of the property are an assertion by the application that +each element in the buffer or allocated device memory is at most accessed in +the given memory scope in the kernel submitted by this command-group (in case +the property is specified on an accessor) or in any kernel in the graph (in case +the property is specified on a buffer or an USM pointer). + +More concretely, the two shortcuts express the following semantics: + +* `access_scope_work_group`: Applying this specialization asserts that each +element in the buffer or allocated device memory is accessed by no more than one +work-group. + +* `access_scope_work_item`: Applying this specialization asserts that each +element in the buffer or allocated device memory is accessed by no more than one +work-item. + +Implementations may treat specializations of the access scope property as a +hint to promote the elements of the buffer or allocated device memory to a +different type of memory (see below section on local and private +internalization). + +If different specializations are applied to accessors to the same buffer or +device memory allocation, the resolution rules specified below apply. + +The property is not prescriptive, implementations are free to not perform +internalization and it is no error if they do not perform internalization. +Implementations can provide a diagnostic message in case internalization was +not performed through an implementation-specified mechanism, but are not +required to do so. + +In case the `access_scope` property is attached to `annotated_ptr`, the +properties should be inspected by an implementation when the `annotated_ptr` is +captured by a kernel lambda or otherwise passed as an argument to a kernel +function. Implementations are not required to track internalization-related +information from other USM pointers that may be used by a kernel, such as those +stored inside of structs or other data structures. + +===== Internal memory property + +The following property can be passed to three different APIs, namely: + +* The `accessor` constructor, giving a more granular control. +* The `buffer` constructor, in which case all the `accessors` derived from +this buffer will inherit this property (unless overridden). +* The property list parameter of `annotated_ptr`, to apply the property to a +USM pointer. + +```c++ +sycl::ext::oneapi::experimental::property::fusion_internal_memory +``` + +By applying this property, the application asserts that the updates made to the +buffer or allocated device memory by the kernel submitted by this command-group +(in case the property is specified on an accessor) or in any kernel in the +graph (in case the property is specified on a buffer or an USM pointer) may not +be available for use after the fused kernel completes execution. +Implementations may treat this as a hint to not write back the final result to +global memory. + +The property is not prescriptive, implementations are free to not perform +internalization and it is no error if they do not perform internalization. +Implementations can provide a diagnostic message in case internalization was +not performed through an implementation-specified mechanism, but are not +required to do so. + +In case the `fusion_internal_memory` property is attached to `annotated_ptr`, +the properties should be inspected by an implementation when the +`annotated_ptr` is captured by a kernel lambda or otherwise passed as an +argument to a kernel function. Implementations are not required to track +internalization-related information from other USM pointers that may be used by +a kernel, such as those stored inside of structs or other data structures. + + +==== Device aspect + +To support querying whether a SYCL device and the underlying platform support +kernel fusion for graphs, the following device aspect is added as part of this +extension proposal. + +```c++ +sycl::aspect::ext_oneapi_graph_fusion +``` + +Devices with `aspect::ext_oneapi_graph_fusion` support kernel fusion for graphs. + +=== Linearization + +In order to be able to perform kernel fusion, the commands in a graph must be +arranged in a valid sequential order. + +A valid _linearization_ of the graph is an order of the commands in the graph +such that each command in the linearization depends only on commands that appear +in the sequence before the command itself. + +The exact linearization of the dependency DAG (which generally only implies a +partial order) is implementation defined. The linearization should be +deterministic, i.e., it should yield the same sequence when presented with the +same DAG. + +=== Synchronization in kernels + +Group barriers semantics do not change in the fused kernel and barriers already +in the unfused kernels are preserved in the fused kernel. + +Despite this, it is worth noting that, in order to introduce synchronization +between work items in a same work-group executing a fused kernel, a work-group +barrier can added between each of the kernels being fused by applying the +`insert_barriers` property. + +As the fusion compiler can reason about the access behavior of the different +kernels only in a very limited fashion, **it's the user's responsibility to +make sure no data races occur in the fused kernel**. Data races could in +particular be introduced because the implicit device-wide synchronization +between the execution of two separate kernels is eliminated by fusion. The user +must ensure that the kernels combined during fusion do not rely on this +synchronization or introduce appropriate synchronization. + +Device-wide synchronization can be achieved by splitting the graph into multiple +subgraphs and fusing each separately, as described above. + +=== Limitations + +Some scenarios might require fusion to be cancelled if some undesired scenarios +arise. The required implementation behavior in this case depends on the +property that was used to initiate fusion. + +If the _descriptive_ `enable_fusion` property was used to initiate fusion, it +is not an error for an implementation to cancel fusion in those scenarios. A +valid recovery from such a scenario is to not perform fusion and rather +maintain the original graph, executing the kernels individually rather than in +a single fused kernel. Implementations can provide a diagnostic message in case +fusion was cancelled through an implementation-specified mechanism, but are not +required to do so. + +If, on the other hand, the _prescriptive_ `require_fusion` property was used to +initiate fusion, implementations must raise an error if they need to cancel +fusion in those scenarios. + +The following sections describe a number of scenarios that might require to +cancel fusion. Note that some implementations might be more capable/permissive +and might not abort fusion in all of these cases. + +==== Hierarchical Parallelism + +The extension does not support kernels using hierarchical parallelism. Although +some implementations might want to add support for this kind of kernels. + +==== Incompatible ND-ranges of the kernels to fuse + +Incompatibility of ND-ranges will be determined by the kernel fusion +implementation. All implementations should support fusing kernels with the exact +same ND-ranges, but implementations might cancel fusion as soon as a kernel with +a different ND-range is submitted. + +==== Kernels with different dimensions + +Similar to the previous one, it is implementation-defined whether or not to +support fusing kernels with different dimensionality. + +==== No intermediate representation + +In case any of the kernels to be fused does not come with an accessible +suitable intermediate representation, kernel fusion is canceled. + +==== Explicit memory operations and host tasks + +The graph proposal allows graphs to contain, next to device kernels, explicit +memory operations and host tasks. As both of these types of commands cannot be +integrated into a fused kernel, fusion must be cancelled, unless there is a +valid linearization (see above section on linearization) that allows all memory +operations and host tasks to execute either before or after all device kernels. +It is valid to execute some memory operations and host tasks before all device +kernels and some after all device kernels, as long as that sequence is a valid +linearization. + +==== Multi-device graph + +Attempting to fuse a graph containing device kernels for more than one device +may lead to fusion being cancelled, as kernel fusion across multiple devices +and/or backends is generally not possible. + +=== Internalization + +While avoiding repeated kernel launch overheads will most likely already improve +application performance, kernel fusion can deliver even higher performance gains +when internalizing dataflows. + +In a situation where data produced by one kernel is consumed by another kernel +and the two kernels are fused, the dataflow from the first kernel to the second +kernel can be made internal to the fused kernel. Instead of using time-consuming +reads and writes to/from global memory, the fused kernel can use much faster +mechanisms, e.g., registers or private memory to "communicate" the result. + +To achieve this result during fusion, a fusion compiler must establish some +additional context and information. + +First, the compiler must know that two arguments refer to the same underlying +memory. This can be inferred during runtime, so no additional user input is +required. + +For the remaining information that needs to be established, the necessity of +user-provided input depends on the individual capabilities of the +implementation. + +If the implementation's fusion compiler is not able to initialize the +internalized buffers or memories, elements of the internalized buffer or memory +being read by a kernel must have been written before (either in the same kernel +or in a previous one in the same graph). This behavior can be asserted by the +application by applying the `no_init` property (see +https://registry.khronos.org/SYCL/specs/sycl-2020/html/sycl-2020.html#_properties_2[section +4.7.6.4] of the SYCL specification) to the buffer or allocated device memory. + +To this end, this extension allows the use of the property in more places than +defined in Table 52 in the SYCL specification. More concretely, this extension +allows to use the property in the buffer constructor or the property list +parameter of `annotated_ptr<...>`. In case the `no_init` property is attached to +`annotated_ptr`, the properties should be inspected by an implementation when +the `annotated_ptr` is captured by a kernel lambda or otherwise passed as an +argument to a kernel function. Implementations are not required to track +internalization-related information from other USM pointers that may be used by +a kernel, such as those stored inside of structs or other data structures. + +If the implementation's fusion compiler is not able to guarantee write-back of +the final result after internalization, values stored to an internalized +buffer/memory must not be used by any other kernel not part of the graph, as +the data becomes unavailable to consumers through internalization. This is +knowledge that the compiler cannot deduce. Instead, the fact that the values +stored to an internalized buffer/memory are not used outside the fused kernel +must be provided by the user by applying the `fusion_internal_memory` property +to the buffer or allocated device memory as described above. + +The type of memory that can be used for internalization depends on the memory +access pattern of the fused kernel. Depending on the access pattern, the buffer +or allocated device memory can be classified as: + +* _Privately internalizable_: If not a single element of the buffer/memory is to + be accessed by more than one work-item; + +* _Locally internalizable_: If not a single element of the buffer/memory is to + be accessed by work items of different work groups. + +If the implementation's fusion compiler is not able to deduce the access +pattern, suitable information must be provided by the user. To this end, +specializations of the `access_scope` property template defined in this +proposal can be used to inform the fusion compiler about the access pattern of +the kernels involved in fusion. + +If an `annotated_ptr` is created with any of the properties relating to +internalization and captured by a kernel lambda or otherwise passed as an +argument to a kernel function participating in fusion, the underlying memory +must only be accessed via pointers that are also captured or passed as kernel +argument. Access to the underlying memory via a different pointer, such as +pointers stored inside of structs or other data structures results in undefined +behavior. + +As already stated above, it depends on the implementation's capabilities which +properties need to be applied to a buffer or allocated device memory to enable +dataflow internalization. Implementations should document the necessary +properties required to enable internalization in implementation documentation. + +All internalization-related properties are only _descriptive_, so it is not an +error if an implementation is unable to or for other reasons decides not to +perform internalization based on the specified properties. Implementations can +provide a diagnostic message in case the set of specified properties are not +sufficient to perform internalization, but are not required to do so. + +[NOTE] +==== +The current implementation in DPC++ requires the addition of the `no_init`, +`fusion_internal_memory` and one specialization of the `access_scope` property +to buffers or allocated device memory to enable internalization. +==== + +==== Buffer internalization + +In some cases, the user will specify different access scopes for a +buffer and accessors to such buffer. When incompatible combinations are used, an +`exception` with `errc::invalid` error code is thrown. Otherwise, these +properties must be combined as follows: + +[options="header"] +|=== +|Accessor Access Scope|Buffer Access Scope|Resulting Access Scope + +.3+.^|None +|None +|None + +|Work Group +|Work Group + +|Work Item +|Work Item + +.3+.^|Work Group +|None +|Work Group + +|Work Group +|Work Group + +|Work Item +|*Error* + +.3+.^|Work Item +|None +|Work Item + +|Work Group +|*Error* + +|Work Item +|Work Item +|=== + +In case different internalization targets are used for accessors to the same +buffer or for `annotated_ptr` pointing to the same underlying memory, the +following (commutative and associative) rules are followed: + +[options="header"] +|=== +|Accessor/Ptr~1~ Access Scope|Accessor/Ptr~2~ Access Scope|Resulting Access Scope + +|None +|_Any_ +|None + +.2+.^|Work Group +|Work Group +|Work Group + +|Work Item +|None + +|Work Item +|Work Item +|Work Item +|=== + +If no work-group size is specified or two kernels specify different +work-group sizes when attempting local internalization for any of the +kernels involved in the fusion, no internalization will be +performed. If there is a mismatch between the two accessors (access +range, access offset, number of dimensions, data type), no +internalization is performed. + +== Examples + +=== Buffer-based example + +```c++ +#include + +namespace sycl_ext = sycl::ext::oneapi::experimental; + +struct AddKernel { + sycl::accessor accIn1; + sycl::accessor accIn2; + sycl::accessor accOut; + + void operator()(sycl::id<1> i) const { accOut[i] = accIn1[i] + accIn2[i]; } +}; + +int main() { + constexpr size_t dataSize = 512; + int in1[dataSize], in2[dataSize], in3[dataSize], out[dataSize]; + + sycl::queue q{default_selector_v}; + + { + sycl::buffer bIn1{in1, sycl::range{dataSize}}; + bIn1.set_write_back(false); + sycl::buffer bIn2{in2, sycl::range{dataSize}}; + bIn2.set_write_back(false); + sycl::buffer bIn3{in3, sycl::range{dataSize}}; + bIn3.set_write_back(false); + buffer bTmp1{range{dataSize}}; + // Internalization specified on the buffer + sycl::buffer bTmp2{ + sycl::range{dataSize}, + {sycl_ext::property::access_scope_work_item{}, + sycl_ext::property::fusion_internal_memory{}, + sycl::no_init}}; + // Internalization specified on the buffer + sycl::buffer bTmp3{ + sycl::range{dataSize}, + {sycl_ext::property::access_scope_work_item{}, + sycl_ext::property::fusion_internal_memory{}, + sycl::no_init}}; + sycl::buffer bOut{out, sycl::range{dataSize}}; + bOut.set_write_back(false); + + sycl_ext::command_graph graph{ + q.get_context(), q.get_device(), + sycl_ext::property::graph::assume_buffer_outlives_graph{}}; + + graph.begin_recording(q); + + q.submit([&](sycl::handler &cgh) { + auto accIn1 = bIn1.get_access(cgh); + auto accIn2 = bIn2.get_access(cgh); + // Internalization specified on each accessor. + auto accTmp1 = bTmp1.get_access(cgh, + sycl_ext::property::access_scope_work_item{} + sycl_ext::property::fusion_internal_memory{}, + sycl::no_init); + cgh.parallel_for(dataSize, AddKernel{accIn1, accIn2, accTmp1}); + }); + + q.submit([&](sycl::handler &cgh) { + // Internalization specified on each accessor. + auto accTmp1 = bTmp1.get_access(cgh, + sycl_ext::property::access_scope_work_item{} + sycl_ext::property::fusion_internal_memory{}, + sycl::no_init); + auto accIn3 = bIn3.get_access(cgh); + auto accTmp2 = bTmp2.get_access(cgh); + cgh.parallel_for( + dataSize, [=](sycl::id<1> i) { accTmp2[i] = accTmp1[i] * accIn3[i]; }); + }); + + q.submit([&](sycl::handler &cgh) { + // Internalization specified on each accessor. + auto accTmp1 = bTmp1.get_access(cgh, + sycl_ext::property::access_scope_work_item{} + sycl_ext::property::fusion_internal_memory{}, + sycl::no_init); + auto accTmp3 = bTmp3.get_access(cgh); + cgh.parallel_for( + dataSize, [=](sycl::id<1> i) { accTmp3[i] = accTmp1[i] * 5; }); + }); + + q.submit([&](sycl::handler &cgh) { + auto accTmp2 = bTmp2.get_access(cgh); + auto accTmp3 = bTmp3.get_access(cgh); + auto accOut = bOut.get_access(cgh); + cgh.parallel_for(dataSize, + AddKernel{accTmp2, accTmp3, accOut}); + }); + + graph.end_recording(); + + // Trigger fusion during finalization. + auto exec_graph = + graph.finalize({sycl_ext::property::graph::require_fusion{}}); + + q.ext_oneapi_graph(exec_graph); + + q.wait(); + } + return 0; +} +``` + +=== USM-based example + +```c++ +#include + +namespace sycl_ext = sycl::ext::oneapi::experimental; + +int main() { + constexpr size_t dataSize = 512; + constexpr size_t numBytes = dataSize * sizeof(int); + + int in1[dataSize], in2[dataSize], in3[dataSize], out[dataSize]; + + sycl::queue q{default_selector_v}; + + sycl_ext::command_graph graph{q.get_context(), q.get_device()}; + + int *dIn1, dIn2, dIn3, dTmp, dOut; + + dIn1 = sycl::malloc_device(q, dataSize); + dIn2 = sycl::malloc_device(q, dataSize); + dIn3 = sycl::malloc_device(q, dataSize); + dOut = sycl::malloc_device(q, dataSize); + + // Specify internalization to local memory for an USM pointer + dTmp = sycl::malloc_device(q, dataSize) + auto annotatedTmp = sycl_ext::annotated_ptr( + dTmp, sycl_ext::property::access_scope_work_group{}, + sycl_ext::property::fusion_internal_memory{}, no_init); + + // This explicit memory operation is compatible with fusion, as it can be + // linearized before any device kernel in the graph. + auto copy_in1 = + graph.add([&](sycl::handler &cgh) { cgh.memcpy(dIn1, in1, numBytes); }); + + // This explicit memory operation is compatible with fusion, as it can be + // linearized before any device kernel in the graph. + auto copy_in2 = + graph.add([&](sycl::handler &cgh) { cgh.memcpy(dIn2, in2, numBytes); }); + + auto kernel1 = graph.add( + [&](sycl::handler &cgh) { + cgh.parallel_for( + dataSize, [=](sycl::id<1> i) { annotatedTmp[i] = in1[i] + in2[i]; }); + }, + {sycl_ext::property::node::depends_on(copy_in1, copy_in2)}); + + // This explicit memory operation is compatible with fusion, as it can be + // linearized before any device kernel in the graph. + auto copy_in3 = + graph.add([&](sycl::handler &cgh) { cgh.memcpy(dIn3, in3, numBytes); }); + + auto kernel2 = graph.add( + [&](sycl::handler &cgh) { + cgh.parallel_for( + dataSize, [=](sycl::id<1> i) { out[i] = annotatedTmp[i] * in3[i]; }); + }, + {sycl_ext::property::node::depends_on(copy_in3, kernel1)}); + + // This explicit memory operation is compatible with fusion, as it can be + // linearized after any device kernel in the graph. + auto copy_out = + graph.add([&](sycl::handler &cgh) { cgh.memcpy(out, dOut, numBytes); }, + {sycl_ext::property::node::depends_on(kernel2)}); + + // Trigger fusion during finalization. + auto exec = graph.finalize({sycl_ext::property::graph::require_fusion{}}); + + // use queue shortcut for graph submission + q.ext_oneapi_graph(exec).wait(); + + sycl::free(dIn1, q); + sycl::free(dIn2, q); + sycl::free(dIn3, q); + sycl::free(dOut, q); + sycl::free(dTmp, q); + + return 0; +} +``` + +== Revision History + +[cols="5,15,15,70"] +[grid="rows"] +[options="header"] +|======================================== +|Rev|Date|Authors|Changes +|1|2023-02-16|Lukas Sommer|*Initial draft* +|2|2023-03-16|Lukas Sommer|*Remove reference to outdated `add_malloc_device` API* +|3|2023-04-11|Lukas Sommer|*Update usage examples for graph API changes* +|4|2023-08-17|Lukas Sommer|*Update after graph extension has been merged* +|5|2023-09-01|Lukas Sommer|*Split internalization properties and change barrier* +|6|2023-09-13|Lukas Sommer|*Use annotated_ptr for USM internalization* +|========================================