
Cuda rebase #6

Open

wants to merge 68 commits into main from cuda-rebase
Conversation

eddy16112
Collaborator

No description provided.

@eddy16112 eddy16112 force-pushed the cuda-rebase branch 6 times, most recently from 44d59cf to 42944b2 Compare September 5, 2016 19:35
rolfv and others added 24 commits October 18, 2016 12:09
add cuda stream for submitting multiple kernels.
add support for predefined datatypes.

Conflicts:
	opal/datatype/opal_datatype_unpack.c
	test/datatype/ddt_test.c
Add support for iovec and for pipeline iovec.
a new way to compute nb_block and thread_per_block

Conflicts:
	test/datatype/Makefile.am
Conflicts:
	test/datatype/Makefile.am
Improve the GPU memory management.

Conflicts:
	opal/mca/mpool/gpusm/mpool_gpusm.h
	opal/mca/mpool/gpusm/mpool_gpusm_module.c

fix gpu memory and vector datatype
device 0, we now use the devices already opened.
issues, when 2 peers were doing a send/recv or when multiple senders
were targeting the same receiver. Rolf provided a patch to solve
this issue, by moving the IPC communication index from a global
location onto each endpoint.
and will be populated with all the known information. Beware:
one still has to manually set the CUDA lib and path as they are
not available after configure (unlike the include which is).

Conflicts:
	opal/datatype/cuda/Makefile

This file was certainly not supposed to be here. There is NO valid
reason to have a copy of a locally generated file in the source.

Add the capability to install the generated library and other
minor cleanups.

Open the datatype CUDA library from a default install location.
Various other minor cleanups.
1. the free code did not work right because we were computing the amount we
freed after merging the list
2. we need to store the original malloc'd GPU buffer in an extra place because
the one in the convertor gets changed over time

Conflicts:
	opal/datatype/cuda/opal_datatype_cuda.cu
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu

clean up code in pack and unpack

Conflicts:
	ompi/mca/pml/ob1/pml_ob1_cuda.c
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/mca/btl/smcuda/btl_smcuda.c

fix a bug when the buffer is not big enough for the whole ddt

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

if the data is on a different GPU, instead of copying directly from one to the
other, we do a D2D copy

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	test/datatype/Makefile.am

now we can use cudaMemcpy2D

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

enable zero copy + fix GPU buffer bug

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

put pipeline size into mca
iteration of the datatype based on a NULL pointer. This list will
then contain the displacement and the length of each fragment of
the datatype memory layout and can be used for any packing/unpacking
purpose.
Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/datatype/opal_datatype_unpack.c

Fix pipeline bug
Conflicts:
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu

fix zerocopy
functions

Conflicts:
	opal/datatype/cuda/opal_datatype_cuda.cu
	opal/datatype/cuda/opal_datatype_cuda_internal.cuh
	opal/datatype/cuda/opal_datatype_pack_cuda_kernel.cu
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu
	opal/datatype/cuda/opal_datatype_unpack_cuda_kernel.cu
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/datatype/opal_datatype_gpu.c
rewrite pipeline

s up and running. PUT size in an MCA parameter.

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

Conflicts:
	opal/mca/btl/btl.h

less bugs

Conflicts:
	ompi/mca/pml/monitoring/pml_monitoring_component.c
	opal/mca/mpool/gpusm/mpool_gpusm.h

fix pipelining for non-contiguous to contiguous
reorder datatypes to cache boundaries

silence warnings
this file is not used anymore
multi-GPU when ompi supports multi-GPU in the future

fix a cuda stream bug for iov, remove some stream syncs

in openib, disable rdma for non-contiguous gpu data
Add support for caching the unpacked datatype description
via the opal_convertor_raw_cached function.

cached iov is working for count = 1
checkpoint, use raw_cached, but cuda iov caching is not enabled

checkpoint, split iov into two versions, non-cached and cached

checkpoint, iov cache

another checkpoint

checkpoint, cuda iov is cached, but not used for pack/unpack

checkpoint, ready to use cached cuda iov

checkpoint, cached cuda iov is working with multiple sends, but not for
count > 1

checkpoint, fix a bug for partial unpack

checkpoint, fix unpack size

cache the entire cuda iov
checkpoint, during unpack, cache the entire iov before unpack

another checkpoint

checkpoint, remove unnecessary cuda stream sync

use bit to replace %

rollback to use %, not bit, since it is faster, not sure why

now cuda iov is {nc_disp, c_disp}

clean up kernel, put variables used multiple times into registers

cached cuda iov is working for count > 1

another checkpoint

now convertor->count > 1 is working

move the cuda iov caching into a separate function

these two variables are useless now

fix a bug for ib, current count of convertor should be set in
set_cuda_iov_position

cleanup, move cudamalloc into cache cuda iov

rearrange variables

if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
not work with realloc, so use malloc instead

make sure to check the pointer is not NULL before freeing it

rewrite non cached iov, make it unified with cached iov

checkpoint, rewrite non-cached version

fix for non cached iov

fix the non cached iov, set position should be done first

move ddt iov to cuda iov into a function

merge iov cached and non-cached

for non cached iov, if there is not enough cuda iov space, break
@eddy16112 eddy16112 force-pushed the cuda-rebase branch 2 times, most recently from 69b4614 to 7063d19 Compare March 29, 2017 20:29
thananon pushed a commit that referenced this pull request Oct 27, 2017
dong0321 pushed a commit that referenced this pull request Sep 26, 2018
3 participants