Cuda rebase #6

Open. Wants to merge 68 commits into base: main.

Commits (68):
7a03ffb  Add GPU packing and unpacking (rolfv, Nov 7, 2014)
55badab  indexed datatype new, bonus stask support. (eddy16112, Nov 14, 2014)
ef41551  RDMA send is now working. (eddy16112, Apr 9, 2015)
9bbed91  Add support for vector datatype. Add pipeline. (eddy16112, Apr 22, 2015)
fe03183  unrestricted GPU. Instead of forcing everything to go on (bosilca, May 7, 2015)
26f2237  Using globally defined indexes lead to several synchronization (bosilca, Jun 18, 2015)
63e10df  Generate the Makefile. It will now be placed in the bindir (bosilca, Jun 18, 2015)
cf7e185  Add a patch from Rolf fixing 2 issues: (bosilca, Jun 30, 2015)
fd91bb9  big changes, now pack is driven by receiver by active message (eddy16112, Aug 22, 2015)
6d837d1  intel test working (eddy16112, Aug 31, 2015)
10a6932  Upon datatype commit create a list of iovec representing a single (bosilca, Sep 15, 2015)
44a64e1  contiguous vs non-contiguous is working (eddy16112, Sep 17, 2015)
f0e8bff  now we are able to pack directly to remote buffer if receiver is (eddy16112, Sep 18, 2015)
cf95914  add ddt_benchmark (eddy16112, Sep 29, 2015)
7a4a10d  modify for matrix transpose (eddy16112, Oct 2, 2015)
bc80b3e  enable vector (eddy16112, Oct 2, 2015)
4d6ebb3  receiver now will send msg back to sender for buffer reuse (eddy16112, Oct 6, 2015)
3525419  offset instead of actual address, and lots of clean up for unused (eddy16112, Oct 22, 2015)
9b9f783  re-write pipeline (eddy16112, Oct 25, 2015)
e77302e  opal_datatype is changed, so we need more space (Oct 27, 2015)
b57f1e5  remove smcuda btl calls from pml ob1 (eddy16112, Oct 28, 2015)
7cdf09e  cuda ddt support is able to turn itself off. Make it support (eddy16112, Oct 29, 2015)
e3c6d36  move ddt kernel support function pointer into opal_datatype_cuda.c (eddy16112, Nov 4, 2015)
66d30fb  support caching datatype (bosilca, Nov 7, 2015)
4e3c5d6  apply loop unroll for pack and unpack kernels (eddy16112, Feb 5, 2016)
f77c382  fix a cuda event bug. cudaStreamWaitEvent is not a blocking call. (eddy16112, Feb 23, 2016)
5a97315  new vector kernel (eddy16112, Feb 26, 2016)
8868914  fix a if CUDA_41 error (eddy16112, Feb 26, 2016)
99f03b3  use cuda event to track the completion of pack and unpack (eddy16112, Mar 1, 2016)
e214c53  make openib support multi-stream (eddy16112, Mar 11, 2016)
9a22660  create a btl function to register convertor to registration handle, n… (eddy16112, Apr 6, 2016)
9b7d28a  fix a bug: we should also track the completion of unpack operation, b… (eddy16112, Apr 14, 2016)
e56bd4f  use multiple cuda stream for P2P, it allows multiple send/recv workin… (eddy16112, Jul 13, 2016)
acc3647  fix the renaming issue after merge (eddy16112, Aug 8, 2016)
dcbd75e  fix a GPU memory leak issue. (eddy16112, Aug 9, 2016)
612d77e  put ompi_datatype_t back to 512 byte, clean up printf and unused func… (eddy16112, Aug 30, 2016)
489cc8d  convertor should be async (eddy16112, Sep 9, 2016)
0017991  revert the ddt_test, will have a separate cuda test later (eddy16112, Sep 17, 2016)
9b9d5b0  set the default of ddt pipeline size to 4M (eddy16112, Sep 21, 2016)
545091b  bug fix, set gpu buffer to NULL when init (eddy16112, Sep 23, 2016)
052c1ce  fix configuration and silence warnings of datatype padding (eddy16112, Oct 7, 2016)
ac184dc  minor fix in makefile (eddy16112, Oct 8, 2016)
5819a55  more fixes in makefile (eddy16112, Oct 9, 2016)
b0a3000  clean up printf (eddy16112, Oct 10, 2016)
e016bcb  disable ddt cuda test (eddy16112, Oct 12, 2016)
8b85c3d  roll back to not use multiple ipc streams (eddy16112, Oct 13, 2016)
39ec7ae  remove unused functions (eddy16112, Oct 19, 2016)
f7e0c5f  more cleanup (eddy16112, Oct 19, 2016)
b2b69e5  add a printf (eddy16112, Oct 19, 2016)
ce34259  A lot of minor changes. (bosilca, Oct 20, 2016)
ad4198d  minor fix to make it work again (eddy16112, Oct 20, 2016)
42c5a4d  add comment for opal_datatype_cuda.cuh, cached device id, remove comm… (eddy16112, Oct 21, 2016)
5c2ce9f  this function is no longer needed (eddy16112, Oct 21, 2016)
99ffb84  merge pack unpack put sig function into one (eddy16112, Oct 21, 2016)
66ac26a  recheck the opal_datatype_cuda_kernel_support (eddy16112, Oct 21, 2016)
b47a5cd  OPAL_DATATYPE_IOV_UNIFIED_MEM is no longer needed (eddy16112, Oct 21, 2016)
023f489  clean up cu files (eddy16112, Oct 21, 2016)
f00ab75  Remove useless #define. Other minor cleanups. (bosilca, Oct 21, 2016)
d4a48d1  Do not modify the PMIx files (there should be no need). (bosilca, Oct 21, 2016)
2006b86  use OPAL_VERBOSE instead of my own DEBUG print (eddy16112, Oct 24, 2016)
269067c  Small updates. (bosilca, Oct 24, 2016)
12f5f83  clean up comments and remove unused define (eddy16112, Oct 25, 2016)
d30cc73  remove NB_GPUS; use mca to set cuda buffer size; use mca to set cuda … (eddy16112, Oct 26, 2016)
fcc9ccb  add some protection for the case there is no mem for pack/unpack (eddy16112, Oct 28, 2016)
1d91384  clean up testing (eddy16112, Oct 31, 2016)
308da38  use convertor->stream to set the stream pack/unpack works on, now we … (eddy16112, Nov 1, 2016)
36d117b  disable cuda ddt test to pass make check (eddy16112, Nov 1, 2016)
7063d19  mca_cuda_convertor_init is called in MPI_Init if using pre-connect, s… (eddy16112, Nov 15, 2016)
16 changes: 16 additions & 0 deletions config/opal_check_cuda.m4
@@ -55,6 +55,8 @@ AS_IF([test "$with_cuda" = "no" || test "x$with_cuda" = "x"],
AC_MSG_ERROR([Cannot continue])],
[AC_MSG_RESULT([found])
opal_check_cuda_happy=yes
opal_cuda_prefix=/usr/local/
opal_cuda_libdir=/usr/local/cuda/lib64
opal_cuda_incdir=/usr/local/cuda/include])],
[AS_IF([test ! -d "$with_cuda"],
[AC_MSG_RESULT([not found])
@@ -66,10 +68,14 @@ AS_IF([test "$with_cuda" = "no" || test "x$with_cuda" = "x"],
AC_MSG_WARN([Could not find cuda.h in $with_cuda/include or $with_cuda])
AC_MSG_ERROR([Cannot continue])],
[opal_check_cuda_happy=yes
opal_cuda_prefix=$with_cuda
opal_cuda_incdir=$with_cuda
opal_cuda_libdir="$with_cuda/lib64"
AC_MSG_RESULT([found ($with_cuda/cuda.h)])])],
[opal_check_cuda_happy=yes
opal_cuda_prefix="$with_cuda"
opal_cuda_incdir="$with_cuda/include"
opal_cuda_libdir="$with_cuda/lib64"
AC_MSG_RESULT([found ($opal_cuda_incdir/cuda.h)])])])])])

dnl We cannot have CUDA support without dlopen support. HOWEVER, at
@@ -119,6 +125,8 @@ if test "$opal_check_cuda_happy" = "yes"; then
CUDA_SUPPORT=1
opal_datatype_cuda_CPPFLAGS="-I$opal_cuda_incdir"
AC_SUBST([opal_datatype_cuda_CPPFLAGS])
opal_datatype_cuda_LDFLAGS="-L$opal_cuda_libdir"
AC_SUBST([opal_datatype_cuda_LDFLAGS])
else
AC_MSG_RESULT([no])
CUDA_SUPPORT=0
@@ -144,6 +152,14 @@ AM_CONDITIONAL([OPAL_cuda_gdr_support], [test "x$CUDA_VERSION_60_OR_GREATER" = "
AC_DEFINE_UNQUOTED([OPAL_CUDA_GDR_SUPPORT],$CUDA_VERSION_60_OR_GREATER,
[Whether we have CUDA GDR support available])

# Checking for nvcc
AC_MSG_CHECKING([nvcc in $opal_cuda_prefix/bin])
if test -x "$opal_cuda_prefix/bin/nvcc"; then
AC_MSG_RESULT([found])
AC_DEFINE_UNQUOTED([NVCC], ["$opal_cuda_prefix/bin/nvcc"], [Path to nvcc binary])
fi

AC_SUBST([NVCC],[$opal_cuda_prefix/bin/nvcc])
])

dnl
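
For reference, the net effect of this fragment is three new configure-time outputs: opal_datatype_cuda_LDFLAGS pointing at $opal_cuda_libdir, plus an nvcc path that is both AC_DEFINE'd and AC_SUBST'd (the substitution feeds the generated opal/datatype/cuda Makefile). A sketch of what lands in opal_config.h, assuming --with-cuda=/usr/local/cuda; the actual values follow the configure arguments:

    /* Illustrative opal_config.h output, assuming --with-cuda=/usr/local/cuda
     * (a sketch, not verbatim configure output). */
    #define NVCC "/usr/local/cuda/bin/nvcc"   /* Path to nvcc binary */
    #define OPAL_CUDA_GDR_SUPPORT 1           /* CUDA >= 6.0 detected at configure time */
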
4 changes: 4 additions & 0 deletions configure.ac
@@ -1416,6 +1416,10 @@ m4_ifdef([project_oshmem],

opal_show_subtitle "Final output"

if test "$OPAL_cuda_support" != "0"; then
AC_CONFIG_FILES([opal/datatype/cuda/Makefile])
fi

AC_CONFIG_FILES([
Makefile

9 changes: 9 additions & 0 deletions ompi/mca/bml/bml.h
@@ -361,6 +361,15 @@ static inline void mca_bml_base_deregister_mem (mca_bml_base_btl_t* bml_btl, mca
btl->btl_deregister_mem (btl, handle);
}

static inline void mca_bml_base_register_convertor (mca_bml_base_btl_t* bml_btl, mca_btl_base_registration_handle_t *handle, opal_convertor_t *convertor)
{
mca_btl_base_module_t* btl = bml_btl->btl;

if (btl->btl_register_convertor != NULL) {
btl->btl_register_convertor (btl, handle, convertor);
}
}

/*
* BML component interface functions and datatype.
*/
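
The NULL guard in the wrapper above means BTLs that do not implement the new hook are unaffected. As a minimal sketch of the BTL side, assuming only the (btl, handle, convertor) signature that this PR establishes; the table and function below are hypothetical, not the actual smcuda implementation:

    /* Hypothetical BTL-side hook: remember which convertor packs into the
     * memory behind a given registration handle, so the receiver-driven
     * pack/unpack path can find it when an RDMA GET on that handle completes. */
    #define EXAMPLE_REG_SLOTS 16

    static struct {
        mca_btl_base_registration_handle_t *handle;
        opal_convertor_t *convertor;
    } example_reg_table[EXAMPLE_REG_SLOTS];

    static void example_btl_register_convertor(struct mca_btl_base_module_t *btl,
                                               mca_btl_base_registration_handle_t *handle,
                                               opal_convertor_t *convertor)
    {
        (void) btl;  /* a real BTL would use its module state, not a global table */
        for (int i = 0; i < EXAMPLE_REG_SLOTS; i++) {
            if (NULL == example_reg_table[i].handle ||
                handle == example_reg_table[i].handle) {
                example_reg_table[i].handle    = handle;    /* claim or refresh slot */
                example_reg_table[i].convertor = convertor;
                return;
            }
        }
        /* Out of slots: a real implementation would grow the table or error out. */
    }
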
2 changes: 1 addition & 1 deletion ompi/mca/pml/ob1/pml_ob1_component.c
@@ -184,7 +184,7 @@ static int mca_pml_ob1_component_register(void)
mca_pml_ob1_param_register_int("free_list_max", -1, &mca_pml_ob1.free_list_max);
mca_pml_ob1_param_register_int("free_list_inc", 64, &mca_pml_ob1.free_list_inc);
mca_pml_ob1_param_register_int("priority", 20, &mca_pml_ob1.priority);
mca_pml_ob1_param_register_sizet("send_pipeline_depth", 3, &mca_pml_ob1.send_pipeline_depth);
mca_pml_ob1_param_register_sizet("send_pipeline_depth", 4, &mca_pml_ob1.send_pipeline_depth);
mca_pml_ob1_param_register_sizet("recv_pipeline_depth", 4, &mca_pml_ob1.recv_pipeline_depth);

/* NTH: we can get into a live-lock situation in the RDMA failure path so disable
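
The send_pipeline_depth default moves from 3 to 4, matching recv_pipeline_depth; later in this PR it also becomes convertor->pipeline_depth for the CUDA copy in/out path, where the GPU staging buffer is sized as pipeline_size * pipeline_depth. A standalone restatement of that sizing rule (the helper name is hypothetical; the logic mirrors the pml_ob1_cuda.c hunk below):

    #include <stddef.h>

    /* Hypothetical helper mirroring the staging-buffer sizing in pml_ob1_cuda.c:
     * a message larger than one pipeline chunk gets pipeline_depth chunks of GPU
     * staging space; a smaller message gets exactly its packed size. */
    static size_t cuda_staging_size(size_t local_size, size_t pipeline_size,
                                    size_t pipeline_depth)
    {
        if (local_size > pipeline_size) {
            return pipeline_size * pipeline_depth;  /* e.g. 4 chunks in flight */
        }
        return local_size;
    }
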
167 changes: 157 additions & 10 deletions ompi/mca/pml/ob1/pml_ob1_cuda.c
@@ -37,11 +37,22 @@
#include "ompi/mca/bml/base/base.h"
#include "ompi/memchecker.h"

#include "opal/datatype/opal_datatype_cuda.h"
#include "opal/mca/common/cuda/common_cuda.h"

size_t mca_pml_ob1_rdma_cuda_btls(
mca_bml_base_endpoint_t* bml_endpoint,
unsigned char* base,
size_t size,
mca_pml_ob1_com_btl_t* rdma_btls);

int mca_pml_ob1_rdma_cuda_btl_register_data(
mca_bml_base_endpoint_t* bml_endpoint,
mca_pml_ob1_com_btl_t* rdma_btls,
uint32_t num_btls_used,
struct opal_convertor_t *pack_convertor);

size_t mca_pml_ob1_rdma_cuda_avail(mca_bml_base_endpoint_t* bml_endpoint);

int mca_pml_ob1_cuda_need_buffers(void * rreq,
mca_btl_base_module_t* btl);
@@ -54,18 +65,21 @@ void mca_pml_ob1_cuda_add_ipc_support(struct mca_btl_base_module_t* btl, int32_t
*/
int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
mca_bml_base_btl_t* bml_btl,
size_t size) {
size_t size)
{
struct opal_convertor_t *convertor = &(sendreq->req_send.req_base.req_convertor);
int rc;
#if OPAL_CUDA_GDR_SUPPORT
/* With some BTLs, switch to RNDV from RGET at large messages */
if ((sendreq->req_send.req_base.req_convertor.flags & CONVERTOR_CUDA) &&
(sendreq->req_send.req_bytes_packed > (bml_btl->btl->btl_cuda_rdma_limit - sizeof(mca_pml_ob1_hdr_t)))) {
return mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}
#endif /* OPAL_CUDA_GDR_SUPPORT */

sendreq->req_send.req_base.req_convertor.flags &= ~CONVERTOR_CUDA;

if (opal_convertor_need_buffers(&sendreq->req_send.req_base.req_convertor) == false) {
#if OPAL_CUDA_GDR_SUPPORT
/* With some BTLs, switch to RNDV from RGET at large messages */
if ((sendreq->req_send.req_bytes_packed > (bml_btl->btl->btl_cuda_rdma_limit - sizeof(mca_pml_ob1_hdr_t)))) {
sendreq->req_send.req_base.req_convertor.flags |= CONVERTOR_CUDA;
return mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}
#endif /* OPAL_CUDA_GDR_SUPPORT */
unsigned char *base;
opal_convertor_get_current_pointer( &sendreq->req_send.req_base.req_convertor, (void**)&base );
/* Set flag back */
@@ -75,6 +89,14 @@ int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
base,
sendreq->req_send.req_bytes_packed,
sendreq->req_rdma))) {

rc = mca_pml_ob1_rdma_cuda_btl_register_data(sendreq->req_endpoint,
sendreq->req_rdma, sendreq->req_rdma_cnt,
convertor);
if (rc != 0) {
OPAL_OUTPUT_VERBOSE((0, mca_common_cuda_output, "Failed to register convertor, rc= %d\n", rc));
return rc;
}
rc = mca_pml_ob1_send_request_start_rdma(sendreq, bml_btl,
sendreq->req_send.req_bytes_packed);
if( OPAL_UNLIKELY(OMPI_SUCCESS != rc) ) {
@@ -91,14 +113,90 @@ int mca_pml_ob1_send_request_start_cuda(mca_pml_ob1_send_request_t* sendreq,
} else {
/* Do not send anything with first rendezvous message as copying GPU
* memory into RNDV message is expensive. */
unsigned char *base;
size_t buffer_size = 0;
sendreq->req_send.req_base.req_convertor.flags |= CONVERTOR_CUDA;

/* cuda kernel support is not enabled */
if (opal_datatype_cuda_kernel_support == 0) {
rc = mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
return rc;
}
/* cuda kernel support is enabled */
if ((bml_btl->btl->btl_cuda_ddt_allow_rdma == 1) &&
(mca_pml_ob1_rdma_cuda_avail(sendreq->req_endpoint) != 0)) {

if (convertor->local_size > bml_btl->btl->btl_cuda_ddt_pipeline_size) {
buffer_size = bml_btl->btl->btl_cuda_ddt_pipeline_size * bml_btl->btl->btl_cuda_ddt_pipeline_depth;
} else {
buffer_size = convertor->local_size;
}
base = opal_cuda_malloc_gpu_buffer(buffer_size, 0);
if (NULL == base) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
convertor->gpu_buffer_ptr = base;
convertor->gpu_buffer_size = buffer_size;
sendreq->req_send.req_bytes_packed = convertor->local_size;
OPAL_OUTPUT_VERBOSE((OPAL_DATATYPE_CUDA_VERBOSE_LEVEL, mca_common_cuda_output,
"RDMA malloc GPU BUFFER %p for pack, local size %lu, "
"pipeline size %lu, depth %d\n",
base, convertor->local_size, bml_btl->btl->btl_cuda_ddt_pipeline_size,
bml_btl->btl->btl_cuda_ddt_pipeline_depth));
if( 0 != (sendreq->req_rdma_cnt = (uint32_t)mca_pml_ob1_rdma_cuda_btls(
sendreq->req_endpoint,
base,
sendreq->req_send.req_bytes_packed,
sendreq->req_rdma))) {

rc = mca_pml_ob1_rdma_cuda_btl_register_data(sendreq->req_endpoint,
sendreq->req_rdma, sendreq->req_rdma_cnt,
convertor);
if (rc != 0) {
OPAL_OUTPUT_VERBOSE((0, mca_common_cuda_output, "Failed to register convertor, rc= %d\n", rc));
return rc;
}
convertor->flags |= CONVERTOR_CUDA_ASYNC;
rc = mca_pml_ob1_send_request_start_rdma(sendreq, bml_btl,
sendreq->req_send.req_bytes_packed);

if( OPAL_UNLIKELY(OMPI_SUCCESS != rc) ) {
mca_pml_ob1_free_rdma_resources(sendreq);
}
return rc; /* ready to return */
} else {
/* We failed to use the last GPU buffer, release it and realloc it with the new size */
opal_cuda_free_gpu_buffer(base, 0);
}
}
/* In all other cases fall back to the copy in/out protocol */
if (bml_btl->btl->btl_cuda_max_send_size != 0) {
convertor->pipeline_size = bml_btl->btl->btl_cuda_max_send_size;
} else {
convertor->pipeline_size = bml_btl->btl->btl_max_send_size;
}
convertor->pipeline_depth = mca_pml_ob1.send_pipeline_depth;
if (convertor->local_size > convertor->pipeline_size) {
buffer_size = convertor->pipeline_size * convertor->pipeline_depth;
} else {
buffer_size = convertor->local_size;
}
base = opal_cuda_malloc_gpu_buffer(buffer_size, 0);
if (NULL == base) {
return OPAL_ERR_OUT_OF_RESOURCE;
}
OPAL_OUTPUT_VERBOSE((OPAL_DATATYPE_CUDA_VERBOSE_LEVEL, mca_common_cuda_output,
"Copy in/out malloc GPU buffer %p, pipeline_size %ld\n",
base, convertor->pipeline_size));
convertor->gpu_buffer_ptr = base;
convertor->gpu_buffer_size = buffer_size;
convertor->pipeline_seq = 0;
rc = mca_pml_ob1_send_request_start_rndv(sendreq, bml_btl, 0, 0);
}

return rc;
}

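/*
 * Aside (not part of the patch): the function above now picks one of four
 * protocols. The enum and helper below are a hypothetical condensation added
 * for illustration; the predicates are the ones used in the code above.
 */
typedef enum {
    CUDA_PROTO_RGET_DIRECT, /* contiguous GPU buffer: GET straight from user memory */
    CUDA_PROTO_RNDV_PLAIN,  /* no CUDA kernel support: ordinary rendezvous */
    CUDA_PROTO_RGET_PACKED, /* pack on the GPU into a staging buffer, then GET */
    CUDA_PROTO_COPY_INOUT   /* pipelined copy in/out through a GPU staging buffer */
} cuda_proto_t;

static inline cuda_proto_t choose_cuda_proto(int need_buffers, int kernel_support,
                                             int ddt_allow_rdma, int rdma_btls_avail)
{
    if (!need_buffers)   return CUDA_PROTO_RGET_DIRECT;
    if (!kernel_support) return CUDA_PROTO_RNDV_PLAIN;
    if (ddt_allow_rdma && rdma_btls_avail > 0) {
        return CUDA_PROTO_RGET_PACKED; /* falls back to COPY_INOUT when no BTL
                                        * accepts the staging buffer */
    }
    return CUDA_PROTO_COPY_INOUT;
}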


size_t mca_pml_ob1_rdma_cuda_btls(
mca_bml_base_endpoint_t* bml_endpoint,
unsigned char* base,
@@ -152,6 +250,55 @@ size_t mca_pml_ob1_rdma_cuda_btls(
return num_btls_used;
}

int mca_pml_ob1_rdma_cuda_btl_register_data(
mca_bml_base_endpoint_t* bml_endpoint,
mca_pml_ob1_com_btl_t* rdma_btls,
uint32_t num_btls_used,
struct opal_convertor_t *pack_convertor)
{
uint32_t i;
for (i = 0; i < num_btls_used; i++) {
mca_btl_base_registration_handle_t *handle = rdma_btls[i].btl_reg;
mca_bml_base_btl_t* bml_btl = mca_bml_base_btl_array_get_index(&bml_endpoint->btl_send, i);
mca_bml_base_register_convertor(bml_btl, handle, pack_convertor);
}
return 0;
}

/* return how many BTLs can have RDMA support */
size_t mca_pml_ob1_rdma_cuda_avail(mca_bml_base_endpoint_t* bml_endpoint)
{
int num_btls = mca_bml_base_btl_array_get_size(&bml_endpoint->btl_send);
double weight_total = 0;
int num_btls_used = 0, n;

/* shortcut when there are no rdma capable btls */
if(num_btls == 0) {
return 0;
}

/* check if GET is supported by the BTL */
for(n = 0;
(n < num_btls) && (num_btls_used < mca_pml_ob1.max_rdma_per_request);
n++) {
mca_bml_base_btl_t* bml_btl =
mca_bml_base_btl_array_get_index(&bml_endpoint->btl_send, n);

if (bml_btl->btl_flags & MCA_BTL_FLAGS_CUDA_GET) {
weight_total += bml_btl->btl_weight;
num_btls_used++;
}
}

/* if we don't use leave_pinned and all BTLs that already have this memory
 * registered amount to less than half of available bandwidth - fall back to
 * pipeline protocol */
if(0 == num_btls_used || (!mca_pml_ob1.leave_pinned && weight_total < 0.5))
return 0;

return num_btls_used;
}

int mca_pml_ob1_cuda_need_buffers(void * rreq,
mca_btl_base_module_t* btl)
{