
Cuda rebase #6

Open

wants to merge 68 commits into main from cuda-rebase
Conversation

eddy16112
Collaborator

No description provided.

@eddy16112 eddy16112 force-pushed the cuda-rebase branch 6 times, most recently from 44d59cf to 42944b2 Compare September 5, 2016 19:35
rolfv and others added 24 commits October 18, 2016 12:09
add cuda stream for submitting multiple kernels.
add support for predefined datatypes.

Conflicts:
	opal/datatype/opal_datatype_unpack.c
	test/datatype/ddt_test.c
Add support for iovec and for pipeline iovec.
a new way to compute nb_block and thread_per_block

Conflicts:
	test/datatype/Makefile.am
Conflicts:
	test/datatype/Makefile.am
Improve the GPU memory management.

Conflicts:
	opal/mca/mpool/gpusm/mpool_gpusm.h
	opal/mca/mpool/gpusm/mpool_gpusm_module.c

fix gpu memory and vector datatype
device 0, we now use the devices already opened.
issues, when 2 peers were doing a send/recv or when multiple senders
were targeting the same receiver. Rolf provided a patch to solve
this issue, by moving the IPC communication index from a global
location onto each endpoint.
and will be populated with all the known information. Beware:
one still has to manually set the CUDA lib and path as they are
not available after configure (unlike the include which is).

Conflicts:
	opal/datatype/cuda/Makefile

This file was certainly not supposed to be here. There is NO valid
reason to have a copy of a locally generated file in the source.

Add the capability to install the generated library and other
minor cleanups.

Open the datatype CUDA library from a default install location.
Various other minor cleanups.
1. the free code did not work right because we were computing the amount we
freed after merging the list
2. we need to store the original malloc'd GPU buffer in an extra place because
the one in the convertor gets changed over time

Conflicts:
	opal/datatype/cuda/opal_datatype_cuda.cu
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu

clean up code in pack and unpack

Conflicts:
	ompi/mca/pml/ob1/pml_ob1_cuda.c
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/mca/btl/smcuda/btl_smcuda.c

fix a bug when the buffer is not big enough for the whole ddt

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

if the data is on a different GPU, instead of copying directly from one to the
other, we do a D2D copy

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	test/datatype/Makefile.am

now we can use cudaMemcpy2D

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

enable zero copy + fix GPU buffer bug

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

put pipeline size into mca
iteration of the datatype based on a NULL pointer. This list will
then contain the displacement and the length of each fragment of
the datatype memory layout and can be used for any packing/unpacking
purpose.
Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/datatype/opal_datatype_unpack.c

Fix pipeline bug
Conflicts:
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu

fix zerocopy
functions

Conflicts:
	opal/datatype/cuda/opal_datatype_cuda.cu
	opal/datatype/cuda/opal_datatype_cuda_internal.cuh
	opal/datatype/cuda/opal_datatype_pack_cuda_kernel.cu
	opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu
	opal/datatype/cuda/opal_datatype_unpack_cuda_kernel.cu
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
	opal/datatype/opal_datatype_gpu.c
rewrite pipeline

s up and running. PUT size in an MCA parameter.

Conflicts:
	opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu

Conflicts:
	opal/mca/btl/btl.h

less bugs

Conflicts:
	ompi/mca/pml/monitoring/pml_monitoring_component.c
	opal/mca/mpool/gpusm/mpool_gpusm.h

fix pipelining for non-contiguous to contiguous
reorder datatypes to cache boundaries

silence warnings
this file is not used anymore
multi-GPU when ompi supports multi-GPU in the future

fix a cuda stream bug for iov, remove some stream syncs

in openib, disable rdma for non-contiguous gpu data
Add support for caching the unpacked datatype description
via the opal_convertor_raw_cached function.

cached iov is working for count = 1
checkpoint, use raw_cached, but cuda iov caching is not enabled

checkpoint, split iov into two versions, non-cached and cached

checkpoint, iov cache

another checkpoint

checkpoint, cuda iov is cached, but not used for pack/unpack

checkpoint, ready to use cached cuda iov

checkpoint, cached cuda iov is working with multiple sends, but not for
count > 1

checkpoint, fix a bug for partial unpack

checkpoint, fix unpack size

cache the entire cuda iov
checkpoint, during unpack, cache the entire iov before unpack

another checkpoint

checkpoint, remove unnecessary cuda stream sync

use bit to replace %

rollback to use %, not bit, since it is faster, not sure why

now cuda iov is {nc_disp, c_disp}

clean up kernel, put variables used multiple times into registers

cached cuda iov is working for count > 1

another checkpoint

now convertor->count > 1 is working

move the cuda iov caching into a separate function

these two variables are useless now

fix a bug for ib, current count of convertor should be set in
set_cuda_iov_position

cleanup, move cudamalloc into cache cuda iov

rearrange variables

if cuda_iov is not big enough, use realloc. However, cudaMallocHost does
not work with realloc, so use malloc instead

make sure to check the pointer is not NULL before freeing it

rewrite non cached iov, make it unified with cached iov

checkpoint, rewrite non-cached version

fix for non cached iov

fix the non cached iov, set position should be done first

move ddt iov to cuda iov into a function

merge iov cached and non-cached

for non cached iov, if there is not enough cuda iov space, break
@eddy16112 eddy16112 force-pushed the cuda-rebase branch 2 times, most recently from 69b4614 to 7063d19 Compare March 29, 2017 20:29
thananon pushed a commit that referenced this pull request Oct 27, 2017
dong0321 pushed a commit that referenced this pull request Sep 26, 2018
3 participants