Conversation
Hey @Kh4L, thanks for submitting the PR!
CI supported jobs: [unix-cpu, centos-gpu, sanity, windows-cpu, miscellaneous, centos-cpu, unix-gpu, windows-gpu, edge, website, clang]
Force-pushed from c94b243 to 259bbb0.
@mxnet-bot run ci [centos-cpu, centos-gpu, edge, unix-cpu, unix-gpu, windows-gpu]
Jenkins CI successfully triggered : [unix-cpu, windows-gpu, edge, unix-gpu, centos-gpu, centos-cpu]
@szha @sandeep-krishnamurthy Do you have someone in your team or elsewhere for a code review?
I plan to take a closer look this week. Could @ptrendx also take a look?
include/mxnet/base.h (outdated)
```cpp
 */
bool is_bulk;
void *event_pool = nullptr;
```
Leaving this comment here just as a way to explain why `is_bulk` is removed. `is_bulk` was a way to mark operations as not needing synchronization when they are part of a single engine bulk. This PR moves the handling of the synchronization to the engine itself, so it is no longer useful.
I will review this PR next week.
src/engine/threaded_engine.cc (outdated)
```cpp
void ThreadedEngine::OnStartCPU(Engine *engine, void *opr_block,
                                const dmlc::Error* error) {
  static bool use_new_dep_engine = dmlc::GetEnv("MXNET_ASYNC_GPU_ENGINE", false);
```
We already have `MXNET_ENGINE_TYPE`, which decides whether `NaiveEngine` (i.e. the synchronous scheduler) is used, so unless there's a strong reason I'd prefer not to introduce new environment variables here.
@apeforest @eric-haibin-lin can you please help review this PR?
@sandeep-krishnamurthy what's the review status?
Thanks for the great work! The PR looks good to me. The synchronization in the new dlpack API #20546 will probably rely on the `sync_obj` introduced here. Looking forward to seeing this new feature merged.
```cpp
if ((*events_per_stream)[event_stream].pool_index < cuda_event.pool_index) {
  (*events_per_stream)[event_stream] = cuda_event;
```
Why should the event be replaced by one with a larger `pool_index`?
Because the pool is per stream, so for two events from the same pool the one with the larger `pool_index` had to be recorded later (and synchronizing with that event will synchronize with both).
```cpp
if (sync_obj.writer_event[0].event.expired()) {
  sync_obj.writer_event.clear();
```
Could you elaborate on why all the writer events should be removed when the first writer event expires?
There can only be 1 writer event active (vs multiple readers).
There can be only one event in the `writer_event` vector, so `clear` here could be replaced by `pop_back`.
```cpp
sync_obj.reader_events.clear();
sync_obj.writer_event.clear();
sync_obj.writer_event.push_back({event, worker_stream->stream_, event_pool_idx});
```
This probably answers my last question. But what's the purpose of clearing all the reader and writer events? Is it because the current event should synchronize with the cleared reader and writer events?
Right: if you are the writer to the variable, you already had to wait for every previous reader and writer, so there is no reason to keep those events anymore.
```cpp
CallbackOnStart on_start = this->CreateOnStart(ThreadedEngine::OnStartStatic,
                                               opr_block);
CallbackOnComplete callback = this->CreateCallback(ThreadedEngine::OnCompleteStatic,
                                                   opr_block);
```
Why use OnStartStatic and OnCompleteStatic here, instead of using the GPU on-start and on-complete callbacks for the GPU case and the static ones for the CPU case?
The mechanism introduced in this PR is in addition to the dependency tracking that already existed, and OnCompleteStatic contains the dependency-update code (so it has to be called from the GPU version as well).
Force-pushed from d863554 to ee5a6bb.
@mxnet-bot run ci [sanity]
Jenkins CI successfully triggered : [sanity]
Signed-off-by: Serge Panev <[email protected]>
@mxnet-bot run ci [unix-gpu]
Jenkins CI successfully triggered : [unix-gpu]
@mxnet-bot run ci [centos-cpu, unix-cpu, windows-gpu]
Jenkins CI successfully triggered : [windows-gpu, centos-cpu, unix-cpu]
@Kh4L glad that the CI passed. @ptrendx @barry-jin could you take a look and see if there's any other concern to be addressed?
LGTM, thanks! My only concern is that `use_new_dep_engine` is probably not tested in CI and the new environment variable is not documented on the website. But I think it's OK to do these in a separate PR.
```cpp
static inline bool IsEngineAsync() {
  std::string type = dmlc::GetEnv("MXNET_ENGINE_TYPE", std::string(""));
  std::string async_engine_tag("Async");
  auto tag_pos = type.find(async_engine_tag);
  return tag_pos != std::string::npos;
}

void ThreadedEngine::OnStartCPU(Engine* engine, void* opr_block, const dmlc::Error* error) {
  static bool use_new_dep_engine = IsEngineAsync();
  if (!use_new_dep_engine) {
    return;
  }
```
I think this part is not tested in CI, because I didn't see any changes to use the Async tag inside the test functions. https://github.com/apache/incubator-mxnet/blob/master/ci/docker/runtime_functions.sh
Why not test it in the CI now? @Kh4L
```diff
@@ -49,7 +49,8 @@ core_logic: {
    custom_steps.test_unix_cpp_package_gpu('gpu'),
    // TODO(szha): fix and reenable the hanging issue. tracked in #18098
    // custom_steps.test_unix_distributed_kvstore_gpu('gpu'),
    custom_steps.test_unix_byteps_gpu('gpu'),
    // TODO(spanev): reenable when byteps is updated with the new dep engine API
```
Could you add an issue to track this?
Good idea, tracked here #20697
LGTM
Description
Changes how GPU operations are synced in the `ThreadedEngine`.

Previously, GPU operators needed to be completed so that the output variables (either `NDArray` or `Resource`) could be tagged as readable. The engine workers needed to `cudaStreamSynchronize` on the GPU stream to know when a variable was ready to read. This prevented optimal overlap of GPU kernels with CPU operations.

This PR introduces a sync mechanism which leverages `cudaEvent` to avoid host synchronization between GPU operations. The GPU writing operators tag the variables with a `cudaEvent`-based sync object, and the operators reading the variables sync or wait on these objects.

Illustration
Note

Use the `MXNET_ENGINE_TYPE` environment variable to enable the new async engine feature, by adding `Async` at the end of the engine type. When not set, GPU operation synchronization is done as before: with a host-device synchronization after the op function.
Detailed changes
The changes primarily concern the `ThreadedEnginePerDevice` and the `ThreadedEnginePooled`.

Removing synchronization

This PR removes all the stream synchronizations that used to be called after the operator function, such as `rctx.get_stream<gpu>()->Wait();`.

SyncObject
A new structure, `SyncObject`, is added to the engine variable. It carries the CUDA events used to defer the data dependency to the CUDA driver. Therefore, each variable (`NDArray` or `Resource`) handled by the engine owns a `SyncObject`.

There are two categories of events: reader events and writer events. A sync object can have 0 or more reader events: several operators might be waiting to read this `Variable`. However, it can carry only 0 or 1 writer event: only one writer can be writing to the variable at a time.
New functions: OnCompleteGPU, OnStartGPU and OnStartCPU

Previously, an OnComplete callback would be called by the engine worker right after executing the op function. Its main purpose was to set the data dependency. This PR adds a new OnComplete override and another callback type, the on-start callback.

3 new hooks were added:

- `OnCompleteGPU`: the existing OnComplete callback gets a new version, OnCompleteGPU. This function captures the kernels scheduled on the stream in an event, and assigns it to the SyncObject of each variable.
- `OnStartGPU`: this hook is called before GPU op functions and iterates through all the events of the variables read and written by the function. It calls `cudaStreamWaitEvent` to guarantee that all the needed variables will be ready before the current function's kernels are scheduled.
- `OnStartCPU`: this hook acts similarly to the previous one, but it is called before CPU op functions and uses host-device synchronisation with `cudaEventSynchronize`.
These changes result in a new signature for the `Engine`'s async function:

Before:

After:
CUDAEventPool

The events are managed by a `CUDAEventPool` that contains a counter. This is needed both for performance and to track the event order. In the `ThreadedEnginePerDevice`, there is one pool per worker. In the `ThreadedEnginePooled`, there is one pool per stream, owned by the stream manager.

Pooled allocator
The `PooledStorageManager` is also aware of the sync objects, in order to recycle them with the data `Chunk`. The `free` function now also recycles the SyncObject, which can later be synchronized at `alloc` time.

WaitAll, WaitToRead and WaitToWrite updated
The `NDArray` synchronisation methods `WaitAll`, `WaitToRead` and `WaitToWrite` now call `cudaEventSynchronize` to guarantee that the GPU data is ready.

TODO:
Co-author: @ptrendx