forked from open-mpi/ompi
Cuda rebase #6
Open
eddy16112 wants to merge 68 commits into ICLDisco:main from eddy16112:cuda-rebase
Conversation
eddy16112 force-pushed the cuda-rebase branch 6 times, most recently from 44d59cf to 42944b2 on September 5, 2016 at 19:35
Add a CUDA stream for submitting multiple kernels. Add support for predefined datatypes. Conflicts: opal/datatype/opal_datatype_unpack.c test/datatype/ddt_test.c
Add support for iovec and for pipelined iovec. A new way to compute nb_block and thread_per_block. Conflicts: test/datatype/Makefile.am
Conflicts: test/datatype/Makefile.am
Improve the GPU memory management. Conflicts: opal/mca/mpool/gpusm/mpool_gpusm.h opal/mca/mpool/gpusm/mpool_gpusm_module.c Fix GPU memory and vector datatype.
device 0, we now use the devices already opened.
issues, when 2 peers were doing a send/recv or when multiple senders were targeting the same receiver. Rolf provided a patch to solve this issue by moving the IPC communication index from a global location onto each endpoint.
and will be populated with all the known information. Beware: one still has to manually set the CUDA lib and path as they are not available after configure (unlike the include which is). Conflicts: opal/datatype/cuda/Makefile This file was certainly not supposed to be here. There is NO valid reason to have a copy of a locally generated file in the source. Add the capability to install the generated library and other minor cleanups. Open the datatype CUDA library from a default install location. Various other minor cleanups.
1. The free code did not work right because we were computing the amount we freed after merging the list. 2. We need to store the original malloc'd GPU buffer in an extra place because the one in the convertor gets changed over time. Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu Clean up code in pack and unpack. Conflicts: ompi/mca/pml/ob1/pml_ob1_cuda.c opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
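The first fix above (computing the freed amount after merging the free list) can be illustrated with a minimal, hypothetical free-list sketch in plain C. None of these names come from the OMPI code; the point is only that the block's own size must be captured before coalescing, or the accounting over-counts:

```c
#include <assert.h>
#include <stddef.h>

/* A simplified free-list node: [offset, offset+size) is free. */
typedef struct block {
    size_t offset;
    size_t size;
    struct block *next;
} block_t;

/* Insert a freed block into a sorted free list, coalescing with an
 * adjacent successor.  Returns the amount actually freed by this
 * call: the block's own size is recorded BEFORE merging, otherwise
 * the merged total would be reported. */
static size_t free_and_merge(block_t **head, block_t *blk)
{
    size_t freed = blk->size;          /* record before merging */
    block_t **cur = head;
    while (*cur && (*cur)->offset < blk->offset)
        cur = &(*cur)->next;
    if (*cur && blk->offset + blk->size == (*cur)->offset) {
        /* merge with the following free block */
        blk->size += (*cur)->size;
        blk->next = (*cur)->next;
        *cur = blk;
    } else {
        blk->next = *cur;
        *cur = blk;
    }
    return freed;
}
```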
Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/mca/btl/smcuda/btl_smcuda.c
Fix a bug when the buffer is not big enough for the whole ddt. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
If data is on a different GPU, instead of copying directly from one to the other, we do a D2D copy. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu test/datatype/Makefile.am
Now we can use cudaMemcpy2D. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
Enable zero copy + fix GPU buffer bug. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu
Put pipeline size into an MCA parameter.
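cudaMemcpy2D, mentioned above, copies a width-by-height byte region between two pitched buffers, which lets one call replace a loop of per-block copies for a vector datatype. A plain-C sketch of its pitch semantics (an emulation for illustration, not the CUDA implementation):

```c
#include <stddef.h>
#include <string.h>

/* Emulates the addressing of cudaMemcpy2D on host memory: copy
 * `height` rows of `width` bytes, where consecutive rows in dst/src
 * are `dpitch`/`spitch` bytes apart.  Packing a vector datatype
 * (blocklen bytes every stride bytes) into a contiguous buffer is
 * then a single call with dpitch == width and spitch == stride. */
static void memcpy2d(void *dst, size_t dpitch,
                     const void *src, size_t spitch,
                     size_t width, size_t height)
{
    char *d = (char *)dst;
    const char *s = (const char *)src;
    for (size_t row = 0; row < height; row++)
        memcpy(d + row * dpitch, s + row * spitch, width);
}
```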
iteration of the datatype based on a NULL pointer. This list will then contain the displacement and the length of each fragment of the datatype memory layout and can be used for any packing/unpacking purpose.
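The idea of iterating the datatype against a NULL base pointer, so the resulting list holds pure displacements reusable against any buffer address, can be sketched for a simple vector layout. This is illustrative only; the real opal_convertor_raw_cached interface differs:

```c
#include <stddef.h>

struct iov_entry { ptrdiff_t disp; size_t len; };

/* Describe a vector datatype (count blocks of blocklen bytes,
 * stride bytes apart) as displacement/length fragments relative to
 * a NULL base.  Adding a concrete buffer address to each disp later
 * yields the iovec for that buffer, so one cached list serves every
 * pack/unpack of the same datatype. */
static size_t raw_vector_iov(size_t count, size_t blocklen, size_t stride,
                             struct iov_entry *iov)
{
    for (size_t i = 0; i < count; i++) {
        iov[i].disp = (ptrdiff_t)(i * stride);
        iov[i].len  = blocklen;
    }
    return count;
}
```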
Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_unpack.c Fix pipeline bug
Conflicts: opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu fix zerocopy
functions Conflicts: opal/datatype/cuda/opal_datatype_cuda.cu opal/datatype/cuda/opal_datatype_cuda_internal.cuh opal/datatype/cuda/opal_datatype_pack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_pack_cuda_wrapper.cu opal/datatype/cuda/opal_datatype_unpack_cuda_kernel.cu opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu opal/datatype/opal_datatype_gpu.c
Rewrite pipeline; it is up and running. Put PUT size in an MCA parameter. Conflicts: opal/datatype/cuda/opal_datatype_unpack_cuda_wrapper.cu Conflicts: opal/mca/btl/btl.h Fewer bugs. Conflicts: ompi/mca/pml/monitoring/pml_monitoring_component.c opal/mca/mpool/gpusm/mpool_gpusm.h Fix pipelining for non-contiguous to contiguous.
Reorder datatypes to cache boundaries. Silence warnings.
this file is not used anymore
multi-GPU when OMPI supports multi-GPU in the future. Fix a CUDA stream bug for iov; remove some stream syncs in openib; disable RDMA for non-contiguous GPU data.
Rename some functions. Checkpoint.
Add support for caching the unpacked datatype description via the opal_convertor_raw_cached function.
Cached iov is working for count = 1.
Checkpoint: use raw_cached, but cuda iov caching is not enabled.
Checkpoint: split iov into two versions, non-cached and cached.
Checkpoint: iov cache. Another checkpoint.
Checkpoint: cuda iov is cached, but not used for pack/unpack.
Checkpoint: ready to use cached cuda iov.
Checkpoint: cached cuda iov is working with multiple sends, but not for count > 1.
Checkpoint: fix a bug for partial unpack.
Checkpoint: fix unpack size. Cache the entire cuda iov.
Checkpoint: during unpack, cache the entire iov before unpack. Another checkpoint.
Checkpoint: remove unnecessary cuda stream sync.
Use bit ops to replace %. Rollback to use %, not bit ops, since it is faster; not sure why.
Now cuda iov is {nc_disp, c_disp}.
Clean up kernel; put variables used multiple times into registers.
Cached cuda iov is working for count > 1.
Another checkpoint; now convertor->count > 1 is working.
Move the cuda iov caching into a separate function.
These two variables are useless now.
Fix a bug for ib: the current count of the convertor should be set in set_cuda_iov_position.
Cleanup: move cudaMalloc into the cuda iov cache; rearrange variables.
If cuda_iov is not big enough, use realloc. However, cudaMallocHost does not work with realloc, so use malloc instead.
Make sure the pointer is not NULL before freeing it.
Rewrite non-cached iov, unifying it with the cached iov.
Checkpoint: rewrite non-cached version. Fix for non-cached iov.
Fix the non-cached iov: set position should be done first.
Move ddt iov to cuda iov into a function. Merge cached and non-cached iov.
For non-cached iov, if there is not enough cuda iov space, break.
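The realloc note above reflects a real constraint: memory from cudaMallocHost cannot be grown with realloc, so enlarging a pinned iov cache means allocate, copy, free. A hedged plain-C sketch of that grow pattern, with malloc/free standing in for cudaMallocHost/cudaFreeHost:

```c
#include <stdlib.h>
#include <string.h>

/* Grow a buffer that was NOT obtained from a realloc-compatible
 * allocator (e.g. pinned memory from cudaMallocHost): allocate a
 * new, larger buffer, copy the old contents, free the old buffer.
 * Returns the new buffer, or NULL on failure (old buffer is kept
 * so the caller loses nothing). */
static void *grow_buffer(void *old, size_t old_size, size_t new_size)
{
    void *fresh = malloc(new_size);    /* cudaMallocHost in the real code */
    if (fresh == NULL)
        return NULL;
    if (old != NULL) {
        memcpy(fresh, old, old_size);
        free(old);                     /* cudaFreeHost in the real code */
    }
    return fresh;
}
```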
eddy16112 force-pushed the cuda-rebase branch from c2a29eb to 8b85c3d on October 18, 2016 at 19:13
This is my first complete review of the code. Many things need to be cleaned up, but overall the code looks pretty good.
…datatype support enabled or not; check cuda calls.
…dont have outer_stream
…o do not init kernel support until confirming buffer is gpu buffer.
eddy16112 force-pushed the cuda-rebase branch 2 times, most recently from 69b4614 to 7063d19 on March 29, 2017 at 20:29
thananon pushed a commit that referenced this pull request on Oct 27, 2017. Signed-off-by: Clement Foyer <[email protected]>
No description provided.