Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -1,4 +1,4 @@
= sycl_ext_codeplay_enqueue_native_command
= sycl_ext_oneapi_enqueue_native_command

:source-highlighter: coderay
:coderay-linenums-mode: table
Expand Down Expand Up @@ -54,10 +54,10 @@ specification.*
== Backend support status

This extension is currently implemented in {dpcpp} only for GPU devices and
only when using the CUDA or HIP backends. Attempting to use this extension in
kernels that run on other devices or backends may result in undefined
behavior. Be aware that the compiler is not able to issue a diagnostic to
warn you if this happens.
only when using the Level Zero, CUDA, HIP backends. Attempting to use this
extension in kernels that run on other devices or backends may result in
undefined behavior. Be aware that the compiler is not able to issue a
diagnostic to warn you if this happens.


== Overview
Expand All @@ -66,7 +66,7 @@ This extension is derived from the experimental AdaptiveCpp extension,
`enqueue_custom_operation` which is documented
https://github.com/AdaptiveCpp/AdaptiveCpp/blob/develop/doc/enqueue-custom-operation.md[here].

The goal of `ext_codeplay_enqueue_native_command` is to integrate interop
The goal of `ext_oneapi_enqueue_native_command` is to integrate interop
work within the SYCL runtime's creation of the asynchronous SYCL DAG. As such,
the user defined lambda must only enqueue asynchronous, as opposed to
synchronous, backend work within the user lambda. Asynchronous work must only
Expand All @@ -75,27 +75,30 @@ be submitted to the native queue obtained from

=== Differences with `host_task`

A callable submitted to `ext_codeplay_enqueue_native_command` won't wait
A callable submitted to `ext_oneapi_enqueue_native_command` won't wait
on its dependent events to execute. The dependencies passed to an
`ext_codeplay_enqueue_native_command` submission will result in dependencies being
`ext_oneapi_enqueue_native_command` submission will result in dependencies being
implicitly handled in the backend API, using the native queue object associated
with the SYCL queue that the `sycl_ext_codeplay_enqueue_native_command` is
with the SYCL queue that the `sycl_ext_oneapi_enqueue_native_command` is
submitted to. This gives different synchronization guarantees from normal SYCL
`host_task` s, which guarantee that the `host_task` callable will only begin
execution once all of its dependent events have completed.

In this example:

```
```c++
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc_src{buf_src, cgh};
sycl::accessor acc_dst{buf_dst, cgh};
cgh.depends_on(dep_event);
cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle &h) {
printf("This will print before dep_event has completed.\n");
// This stream has been synchronized with dep_event's underlying
// hipEvent_t
hipStream_t stream = h.get_native_queue<sycl::backend::hip>();
hipMemcpyAsync(target_ptr, native_mem, test_size * sizeof(int),
hipMemcpyDeviceToHost, stream);
cgh.ext_oneapi_enqueue_native_command([=](sycl::interop_handle &h) {
printf("This may print before dep_event has completed.\n");
auto src = IH.get_native_mem<backend::ext_oneapi_level_zero>(acc_src);
auto dst = IH.get_native_mem<backend::ext_oneapi_level_zero>(acc_dst);
auto command_list = h.get_native_queue<sycl::backend::level_zero>();
zeCommandListAppendMemoryCopy(
command_list, dst, src, sizeof(T) * acc_src.get_count(), nullptr,
0, nullptr);
});
});
q.wait()
Expand All @@ -112,7 +115,7 @@ will only happen once the host task's dependent events are observed to be
complete on the host.

A SYCL event returned by a submission of a
`ext_codeplay_enqueue_native_command` command is only complete once the
`ext_oneapi_enqueue_native_command` command is only complete once the
asynchronous work enqueued to the native queue obtained through
`interop_handle::get_native_queue()` has completed.

Expand Down Expand Up @@ -147,7 +150,7 @@ class:
```c++
class handler {
template <typename Func>
void ext_codeplay_enqueue_native_command(Func&& interopCallable);
void ext_oneapi_enqueue_native_command(Func&& interopCallable);
};
```

Expand All @@ -156,7 +159,7 @@ parameter of type `interop_handle`.

_Effects_: The `interopCallable` object is called exactly once, and this call
may be made asynchronously even after the calling thread returns from
`ext_codeplay_enqueue_native_command`.
`ext_oneapi_enqueue_native_command`.

The call to `interopCallable` may submit one or more asynchronous tasks to the
native backend object obtained from `interop_handle::get_native_queue`, and
Expand All @@ -181,6 +184,31 @@ any of these things, the behavior is undefined.

== Example

```c++
sycl::queue q;
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc_src{buf_src, cgh};
sycl::accessor acc_dst{buf_dst, cgh};
cgh.ext_oneapi_enqueue_native_command([=](sycl::interop_handle &h) {
// Can extract device mem from accessor
auto src = IH.get_native_mem<backend::ext_oneapi_level_zero>(acc_src);
auto dst = IH.get_native_mem<backend::ext_oneapi_level_zero>(acc_dst);
auto command_list = h.get_native_queue<sycl::backend::level_zero>();

// Can enqueue arbitrary backend operations. This could also be a kernel
// launch or call to a library that enqueues operations on the command
// list etc.
//
// Important: Enqueuing a *synchronous* backend operation results in
// undefined behavior.
zeCommandListAppendMemoryCopy(
command_list, dst, src, sizeof(T) * acc_src.get_count(), nullptr,
0, nullptr);
});
});
q.wait();
```

This example demonstrates how to use this extension to enqueue asynchronous
native tasks on the HIP backend.

Expand All @@ -189,14 +217,14 @@ sycl::queue q;
q.submit([&](sycl::handler &cgh) {
sycl::accessor acc{buf, cgh};

cgh.ext_codeplay_enqueue_native_command([=](sycl::interop_handle &h) {
cgh.ext_oneapi_enqueue_native_command([=](sycl::interop_handle &h) {
// Can extract device pointers from accessors
void *native_mem = h.get_native_mem<sycl::backend::hip>(acc);
// Can extract stream
hipStream_t stream = h.get_native_queue<sycl::backend::hip>();

// Can enqueue arbitrary backend operations. This could also be a kernel
// launch or call to a library that enqueues operations on the stream etc
// launch or call to a library that enqueues operations on the stream etc.
//
// Important: Enqueuing a *synchronous* backend operation results in
// undefined behavior.
Expand All @@ -211,7 +239,7 @@ q.wait();

=== sycl_ext_oneapi_graph

`ext_codeplay_enqueue_native_command`
`ext_oneapi_enqueue_native_command`
cannot be used in graph nodes. A synchronous exception will be thrown with error
code `invalid` if a user tries to add them to a graph.

Original file line number Diff line number Diff line change
Expand Up @@ -1943,10 +1943,10 @@ The kernels loaded using
link:../experimental/sycl_ext_oneapi_kernel_compiler_spirv.asciidoc[sycl_ext_oneapi_kernel_compiler_spirv]
behave as normal when used in graph nodes.

==== sycl_ext_codeplay_enqueue_native_command
==== sycl_ext_oneapi_enqueue_native_command

`ext_codeplay_enqueue_native_command`, defined in
link:../experimental/sycl_ext_codeplay_enqueue_native_command.asciidoc[sycl_ext_codeplay_enqueue_native_command]
`ext_oneapi_enqueue_native_command`, defined in
link:../experimental/sycl_ext_oneapi_enqueue_native_command.asciidoc[sycl_ext_oneapi_enqueue_native_command]
cannot be used in graph nodes. A synchronous exception will be thrown with error
code `invalid` if a user tries to add them to a graph.

Expand Down
6 changes: 3 additions & 3 deletions sycl/include/sycl/ext/oneapi/experimental/graph.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -55,7 +55,7 @@ enum class UnsupportedGraphFeatures {
sycl_ext_oneapi_device_global = 6,
sycl_ext_oneapi_bindless_images = 7,
sycl_ext_oneapi_experimental_cuda_cluster_launch = 8,
sycl_ext_codeplay_enqueue_native_command = 9
sycl_ext_oneapi_enqueue_native_command = 9
};

inline const char *
Expand All @@ -80,8 +80,8 @@ UnsupportedFeatureToString(UnsupportedGraphFeatures Feature) {
return "sycl_ext_oneapi_bindless_images";
case UGF::sycl_ext_oneapi_experimental_cuda_cluster_launch:
return "sycl_ext_oneapi_experimental_cuda_cluster_launch";
case UGF::sycl_ext_codeplay_enqueue_native_command:
return "sycl_ext_codeplay_enqueue_native_command";
case UGF::sycl_ext_oneapi_enqueue_native_command:
return "sycl_ext_oneapi_enqueue_native_command";
}

assert(false && "Unhandled graphs feature");
Expand Down
8 changes: 4 additions & 4 deletions sycl/include/sycl/handler.hpp
Original file line number Diff line number Diff line change
Expand Up @@ -1867,7 +1867,7 @@ class __SYCL_EXPORT handler {
template <typename FuncT>
std::enable_if_t<detail::check_fn_signature<std::remove_reference_t<FuncT>,
void(interop_handle)>::value>
ext_codeplay_enqueue_native_command_impl(FuncT &&Func) {
ext_oneapi_enqueue_native_command_impl(FuncT &&Func) {
throwIfActionIsCreated();

// Need to copy these rather than move so that we can check associated
Expand Down Expand Up @@ -2090,11 +2090,11 @@ class __SYCL_EXPORT handler {
template <typename FuncT>
std::enable_if_t<detail::check_fn_signature<std::remove_reference_t<FuncT>,
void(interop_handle)>::value>
ext_codeplay_enqueue_native_command(FuncT &&Func) {
ext_oneapi_enqueue_native_command(FuncT &&Func) {
throwIfGraphAssociated<
ext::oneapi::experimental::detail::UnsupportedGraphFeatures::
sycl_ext_codeplay_enqueue_native_command>();
ext_codeplay_enqueue_native_command_impl(Func);
sycl_ext_oneapi_enqueue_native_command>();
ext_oneapi_enqueue_native_command_impl(Func);
}

/// Defines and invokes a SYCL kernel function for the specified range and
Expand Down
2 changes: 1 addition & 1 deletion sycl/source/detail/scheduler/commands.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -3196,7 +3196,7 @@ ur_result_t ExecCGCommand::enqueueImpQueue() {
detail::getSyclObjImpl(MQueue->get_device())->getHandleRef(),
UR_DEVICE_INFO_ENQUEUE_NATIVE_COMMAND_SUPPORT_EXP,
sizeof(NativeCommandSupport), &NativeCommandSupport, nullptr);
assert(NativeCommandSupport && "ext_codeplay_enqueue_native_command is not "
assert(NativeCommandSupport && "ext_oneapi_enqueue_native_command is not "
"supported on this device");
MQueue->getPlugin()->call(urEnqueueNativeCommandExp, MQueue->getHandleRef(),
InteropFreeFunc, &CustomOpData, ReqMems.size(),
Expand Down
4 changes: 2 additions & 2 deletions sycl/test-e2e/EnqueueNativeCommand/custom-command-cuda.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -51,7 +51,7 @@ void copy(buffer<DataT, 1> &Src, buffer<DataT, 1> &Dst, queue &Q) {
if (Q.get_backend() != IH.get_backend())
throw;
};
CGH.ext_codeplay_enqueue_native_command(Func);
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

Expand Down Expand Up @@ -85,7 +85,7 @@ void test_ht_buffer(queue &Q) {
Q.submit([&](handler &CGH) {
auto Acc = Buffer.get_access<mode::write>(CGH);
auto Func = [=](interop_handle IH) { /*A no-op */ };
CGH.ext_codeplay_enqueue_native_command(Func);
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

Expand Down
4 changes: 2 additions & 2 deletions sycl/test-e2e/EnqueueNativeCommand/custom-command-hip.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -57,7 +57,7 @@ void copy(buffer<DataT, 1> &Src, buffer<DataT, 1> &Dst, queue &Q) {
if (Q.get_backend() != IH.get_backend())
throw;
};
CGH.ext_codeplay_enqueue_native_command(Func);
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

Expand Down Expand Up @@ -91,7 +91,7 @@ void test_ht_buffer(queue &Q) {
Q.submit([&](handler &CGH) {
auto Acc = Buffer.get_access<mode::write>(CGH);
auto Func = [=](interop_handle IH) { /*A no-op */ };
CGH.ext_codeplay_enqueue_native_command(Func);
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

Expand Down
132 changes: 132 additions & 0 deletions sycl/test-e2e/EnqueueNativeCommand/custom-command-level-zero.cpp
Original file line number Diff line number Diff line change
@@ -0,0 +1,132 @@
// RUN: %{build} %level_zero_options -o %t.out
// RUN: %{run} %t.out

// REQUIRES: level_zero, level_zero_dev_kit

#include <level_zero/ze_api.h>

#include <iostream>
#include <sycl/backend.hpp>
#include <sycl/detail/core.hpp>
#include <sycl/interop_handle.hpp>

using namespace sycl;
using namespace sycl::access;

static constexpr size_t BUFFER_SIZE = 1024;

template <typename T> class Modifier;

template <typename T> class Init;

template <typename BufferT, typename ValueT>
void checkBufferValues(BufferT Buffer, ValueT Value) {
auto Acc = Buffer.get_host_access();
for (size_t Idx = 0; Idx < Acc.size(); ++Idx) {
if (Acc[Idx] != Value) {
std::cerr << "buffer[" << Idx << "] = " << Acc[Idx]
<< ", expected val = " << Value << '\n';
exit(1);
}
}
}

template <typename DataT>
void copy(buffer<DataT, 1> &Src, buffer<DataT, 1> &Dst, queue &Q) {
Q.submit([&](handler &CGH) {
auto SrcA = Src.template get_access<mode::read>(CGH);
auto DstA = Dst.template get_access<mode::write>(CGH);

auto Func = [=](interop_handle IH) {
auto CommandList = IH.get_native_queue<backend::ext_oneapi_level_zero>();
auto SrcMem = IH.get_native_mem<backend::ext_oneapi_level_zero>(SrcA);
auto DstMem = IH.get_native_mem<backend::ext_oneapi_level_zero>(DstA);

// If L0 interop becomes a real use case we should make a new UR entry
// point to propagate events into and out of the the interop func.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this comment mean?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In order to make this entry point more performant on L0 we would need something like the previous host task extensions so we can pass in and out events to the ze call.

As it stands there is no point making this more performant for L0 since no one that we are aware of are using L0 interop in SYCL.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

????

We have had a SYCL Level Zero interop specification for a long time and many people are using it.

if (zeCommandListAppendMemoryCopy(CommandList, DstMem, SrcMem,
sizeof(DataT) * SrcA.size(), nullptr, 0,
nullptr) != ZE_RESULT_SUCCESS)
throw;
if (Q.get_backend() != IH.get_backend())
throw;
};
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

template <typename DataT> void modify(buffer<DataT, 1> &B, queue &Q) {
Q.submit([&](handler &CGH) {
auto Acc = B.template get_access<mode::read_write>(CGH);

auto Kernel = [=](item<1> Id) { Acc[Id] += 1; };

CGH.parallel_for<Modifier<DataT>>(Acc.size(), Kernel);
});
}

template <typename DataT, DataT B1Init, DataT B2Init>
void init(buffer<DataT, 1> &B1, buffer<DataT, 1> &B2, queue &Q) {
Q.submit([&](handler &CGH) {
auto Acc1 = B1.template get_access<mode::write>(CGH);
auto Acc2 = B2.template get_access<mode::write>(CGH);

CGH.parallel_for<Init<DataT>>(BUFFER_SIZE, [=](item<1> Id) {
Acc1[Id] = B1Init;
Acc2[Id] = B2Init;
});
});
}

// Check that a single host-interop-task with a buffer will work.
void test_ht_buffer(queue &Q) {
buffer<int, 1> Buffer{BUFFER_SIZE};

Q.submit([&](handler &CGH) {
auto Acc = Buffer.get_access<mode::write>(CGH);
auto Func = [=](interop_handle IH) { /*A no-op */ };
CGH.ext_oneapi_enqueue_native_command(Func);
});
}

// A test that uses level_zero interop to copy data from buffer A to buffer B,
// by getting level_zero ptrs and calling the cuMemcpyWithAsync. Then run a SYCL
// kernel that modifies the data in place for B, e.g. increment one, then copy
// back to buffer A. Run it on a loop, to ensure the dependencies and the
// reference counting of the objects is not leaked.
void test_ht_kernel_dependencies(queue &Q) {
static constexpr int COUNT = 4;
buffer<int, 1> Buffer1{BUFFER_SIZE};
buffer<int, 1> Buffer2{BUFFER_SIZE};

// Init the buffer with a'priori invalid data.
init<int, -1, -2>(Buffer1, Buffer2, Q);

// Repeat a couple of times.
for (size_t Idx = 0; Idx < COUNT; ++Idx) {
copy(Buffer1, Buffer2, Q);
modify(Buffer2, Q);
copy(Buffer2, Buffer1, Q);
}

checkBufferValues(Buffer1, COUNT - 1);
checkBufferValues(Buffer2, COUNT - 1);
}

void tests(queue &Q) {
test_ht_buffer(Q);
test_ht_kernel_dependencies(Q);
}

int main() {
queue Q([](sycl::exception_list ExceptionList) {
if (ExceptionList.size() != 1) {
std::cerr << "Should be one exception in exception list" << std::endl;
std::abort();
}
std::rethrow_exception(*ExceptionList.begin());
});
tests(Q);
std::cout << "Test PASSED" << std::endl;
return 0;
}
Loading