
Commit fe830ac

GrCUDA MultiGPU Release (#47)
* Integrating multi-GPU in master [TEST] (#45)
  * Removed deprecation warning for ArityException; updated changelog
  * Updated CUDA benchmark suite with multi-GPU benchmarks
  * Updated plotting code and Python benchmarks for multi-GPU; minor fixes and cleanup
  * Fixed benchmark tests in Python; added temporary multi-GPU options to GrCUDA
  * Added options for multi-GPU support; updated GrCUDAExecutionContext to take GrCUDAOptionMap as input
  * More logging; added more policy enums for multi-GPU
  * Fixed crash when shutting down GrCUDA with fewer GPUs in use than available; disabled cuBLAS on old GPUs with the async scheduler; added multi-GPU API support to the runtime
  * Added multi-GPU option to the context used in tests
  * GRCUDA-56: added optimized interface to load/build kernels on single/multi GPU
  * Tests are properly skipped if the system configuration does not support them
  * Added test for manual multi-GPU selection; fixed manual GPU selection not working with the async scheduler; added tracking of the currently active GPU in the runtime
  * Improved ComputationElement profiling interface; minor naming updates; added default GPU id; removed unnecessary logging of timers
  * Added location tracking inside AbstractArray; added AbstractDevice to distinguish CPU and GPU; added mocked tests to validate AbstractArray location
  * Fixed bug on post-Pascal devices where CPU reads required an unnecessary sync while a read-only GPU kernel was running
  * Replaced 'isLastComputationArrayAccess' with device-level tracking of array updates; added FIXME note on a possible problem with array tracking
  * Fixed scheduling of array writes being skipped while a read-only GPU computation was ongoing
  * Added StreamPolicy class; modified the FIFO retrieval policy to retrieve any stream from the set; added device manager for multi-GPU
* [GrCUDA 96-1] Update Python benchmark suite for multi-GPU (#30)
* [GrCUDA-96-2] Integrate multi-GPU scheduler (#31)
  * Integrated multi-GPU stream policies
  * Replaced default CUDA benchmark with ERR instead of B1; removed unnecessary JS file
  * Updated stream policy infrastructure; added mocked classes; hid current device selection inside StreamPolicy
  * Added stream-aware device selection policy; refactored StreamPolicy/DevicesManager to create streams on multiple devices
  * Added multi-GPU tests: base case with 1 GPU, mocked image pipeline, mocked HITS, stream-aware policy; refactored GPU tests for reuse in multi-GPU tests
  * Added multigpu-disjoint policy for parent stream retrieval
  * Added round-robin and min-transfer-size device selection policies, plus min-min/min-max transfer time policies, with tests
  * Added script to generate the connection graph; added option to manually specify the connection graph location; moved the script to a separate folder; added a test and a test dataset to validate connection graph loading; fixed a CSV parsing error when loading connection_graph
  * Fixed out-of-bounds access and bandwidth computation in the min-transfer-time device selection policy
  * Added interface to restrict device selection to a subset of devices; changed the DeviceList implementation to use a List instead of an array
  * Fixed round-robin with specific devices and added tests for it; added filtered device selection policies
  * Added new parent stream selection policy, with tests for the stream-aware and disjoint-parent policies
  * Fixed out-of-bounds error when reading the connection graph and the number of GPUs to use is smaller than the number of GPUs in the system
* [GrCUDA-96-4] Multi-GPU device management (#33)
* [GrCUDA-96-6] Stream policies for multi-GPU (#36)
  * Fixed minor items for the PR; added connection_graph_test.csv to git
* Minor fix ported from 97-7
* Merge 96-8 on test-96-0 (#41)
  * Moved all mocked computations to a different class
  * Added mocked benchmarks: VEC, B6ML, CG (B9), MMULT
  * Added connection graph dataset with 8 V100s; rounded bandwidth down to reduce randomness; added test for multi-GPU VEC
  * Added partitioned z in B11 and a preconditioning kernel in B9, in both CUDA and Python; fixed the z-partition in B11 CUDA; updated mocked B11 to use the partitioned z
  * Fixed device selection policy not using toString
  * Simplified the round-robin policy: we simply increase the internal state and take it modulo the device list size
  * Added a minimum data threshold to consider a device for selection, with an option to specify it
  * Updated the benchmark wrapper and the Python benchmark suite to the new GrCUDA multi-GPU options; restored options for experiments; updated the nvprof wrapper
  * Added connection graphs for 1, 2, and 4 V100s; updated the path of the 8-V100 dataset; added a command to create the dataset directory in the connection graph script
  * Fixed prefetching in CUDA being always on; fixed Python benchmarks and const flags in the Python multi-GPU benchmarks; fixed the kernel timing option being wrong in the wrapper; updated the wrapper for testing; fixed other wrapper issues
  * Reverted to GraalVM 21.2; fixed a performance regression in DeviceArray access by making the logger static final; replaced logging strings with lambdas
  * Optimized init of B1M in Python; fixed init of B9 for large matrices; fixed benchmark parameters not being reset; fixed init in B1 (irrelevant for benchmark performance)
  * Added GPU bandwidth heatmap plot (now smaller); updated result loading code, loading of new GrCUDA results, and GrCUDA plotting
  * Added A100 options to the benchmark wrapper
* GrCUDA 96-9: more time logging (#42)
  * Added logging for multiple computations on the same deviceId
* Merge 96-11 on test-96-0 (#43)
  * Modified the install script; updated install.sh to compute the interconnection graph
  * Set benchmark_wrapper up for V100; updated benchmark_wrapper to retrieve connection_graph correctly; enabled the min-transfer-size test in benchmark_wrapper
* Merge 96-12 on test-96-0 (#44)
  * Added first version of the scheduling-graph dump functionality, with multi-GPU support
  * Added option to export the scheduling DAG: the option takes as value the path where to place the DAG file; if not specified, the DAG is not exported
  * Fixed a negative-values bug and the export-path handling; refactored using Java 8 streams; various minor fixes and visualization optimizations
  * Added documentation to GraphExport.java; removed hardcoded paths from tests; updated README.md; code cleanup in preparation for the PR
* Removed DAG export from tests (#46)
* {HOTFIX} Updated install.sh, setup_machine_from_scratch.sh (install CUDA toolkit 11.7 instead of 11.4), README.md, and CHANGELOG.md; created setup_graalvm.sh
* Updated bindings and logging documentation
* Support for GraalVM 21.3.2 and 22.1.0; fixes to the setup script
* Moved to the network version of the CUDA installer; removed an unneeded binary
* Updated README and documentation
1 parent a3e7bfe commit fe830ac

File tree

202 files changed: +15370 -4386 lines changed


.gitignore

+1 -1

@@ -60,7 +60,7 @@ examples/tensorrt/cpp/build
 venv
 out/
 *.files
-
+*.csv
 grcuda_token.txt
 projects/demos/image_pipeline/cuda/build
 projects/demos/image_pipeline/img_out

.gitmodules

+4

@@ -2,3 +2,7 @@
 path = grcuda-data
 url = https://github.com/AlbertoParravicini/grcuda-data.git
 branch = master
+[submodule "projects/resources/python/plotting/segretini_matplottini"]
+path = projects/resources/python/plotting/segretini_matplottini
+url = git@github.com:AlbertoParravicini/segretini-matplottini.git
+branch = master

CHANGELOG.md

+71

@@ -1,11 +1,82 @@
+# 2022-06-01
+
+* Added scheduling DAG export functionality. It is now possible to retrieve a graphical version of the execution's scheduling DAG by adding `ExportDAG` to the startup options. The graph is exported in .dot format to the path specified by the user as the option argument.
+* This information can be leveraged to better understand the achieved runtime performance and to compare the schedules derived from different policies. Moreover, poorly written applications will result in DAGs with a low level of task parallelism regardless of the selected policy, suggesting that designers change their application's logic.
+
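To make the entry above concrete, here is a minimal sketch of enabling the export from the GraalVM polyglot API. Only the `ExportDAG` option name comes from this changelog entry; the fully qualified option key `grcuda.ExportDAG`, the output path, and the small workload (built with GrCUDA's `DeviceArray` expression) are assumptions used for illustration.

```java
import org.graalvm.polyglot.Context;
import org.graalvm.polyglot.Value;

public class ExportDagExample {
    public static void main(String[] args) {
        // Assumption: the option is exposed to the polyglot API as "grcuda.ExportDAG",
        // taking as value the path where the .dot file should be written.
        try (Context context = Context.newBuilder()
                .allowAllAccess(true)
                .option("grcuda.ExportDAG", "/tmp/scheduling_dag.dot")
                .build()) {
            // Run some GrCUDA work so the scheduler has a DAG to export.
            Value deviceArray = context.eval("grcuda", "DeviceArray").execute("float", 1000);
            for (int i = 0; i < 1000; i++) {
                deviceArray.setArrayElement(i, 1.0f * i);
            }
        }
        // The exported graph can then be rendered with Graphviz, e.g.:
        //   dot -Tpdf /tmp/scheduling_dag.dot -o scheduling_dag.pdf
    }
}
```
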
+# 2022-04-15
+
+* Updated install.sh to compute the interconnection graph
+* Updated benchmark_wrapper to retrieve connection_graph correctly
+* Enabled min-transfer-size test in benchmark_wrapper
+* Set benchmark_wrapper up for V100
+
+# 2022-02-16
+
+* Added logging for multiple computations (a list of floats) on the same deviceID. This information could be used in future history-based adaptive scheduling policies.
+
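A sketch of one possible shape for such a log, with illustrative names rather than GrCUDA's actual classes: a map from device id to the list of execution times recorded on that device.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class DeviceExecutionLog {
    // deviceId -> execution times (seconds) of the computations scheduled on that device.
    private final Map<Integer, List<Float>> executionTimes = new HashMap<>();

    void record(int deviceId, float seconds) {
        executionTimes.computeIfAbsent(deviceId, id -> new ArrayList<>()).add(seconds);
    }

    List<Float> timesFor(int deviceId) {
        return executionTimes.getOrDefault(deviceId, List.of());
    }
}
```
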
+# 2022-01-26
+
+* Added mocked benchmarks: for each multi-GPU benchmark in our suite there is a mocked version that checks that the GPU assignment is the one we expect. Added utility functions to easily test mocked benchmarks.
+* Simplified the round-robin device selection policy: it works more or less as before, but it is faster to update when using a subset of devices.
+* Added a threshold parameter for data-aware device selection policies. When using min-transfer-size or minmax/min-transfer-time, only devices that already hold at least 10% (or X%) of the requested data are considered. If a device holds only a very small amount of the data, it is not worth preferring it over other devices, and doing so can cause the scheduling to converge on a single device. See B9 and B11, for example.
+* Updated the Python benchmark suite to use the new options, and optimized the initialization of B1 (it is faster now) and B9 (it did not work on matrices with 50K rows, as Python uses 32-bit array indexing).
+* Fixed a performance regression in DeviceArray access. For a simple Python program that writes 160M values to a DeviceArray, performance went from 4 s to 20 s when using GraalVM 21.3 instead of 21.2: reverted to GraalVM 21.2. Using a non-static final Logger in GrCUDAComputationalElement increased the time from 4 s to 130 s (not sure why, as loggers are not created in repeated array accesses): fixed this regression as well.
+
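The two selection rules described above can be sketched as follows; class names, method signatures, and the fallback behavior are illustrative, not GrCUDA's actual implementation.

```java
import java.util.List;

final class RoundRobinDeviceSelection {
    private int nextIndex = 0;

    /** Pick the next device by advancing an internal counter modulo the device list size. */
    synchronized int selectDevice(List<Integer> deviceIds) {
        int device = deviceIds.get(Math.floorMod(nextIndex, deviceIds.size()));
        nextIndex++;
        return device;
    }
}

final class MinTransferSizeSelection {
    // Fraction of the requested data (e.g. 0.1 = 10%) a device must already hold to be considered.
    private final double dataThreshold;

    MinTransferSizeSelection(double dataThreshold) {
        this.dataThreshold = dataThreshold;
    }

    /**
     * bytesAlreadyOnDevice[i] = bytes of the computation's input data already up-to-date on device i.
     * Returns the device holding the most data among those above the threshold,
     * falling back to device 0 if no device qualifies.
     */
    int selectDevice(long[] bytesAlreadyOnDevice, long totalRequestedBytes) {
        int best = 0;
        long bestBytes = -1;
        for (int i = 0; i < bytesAlreadyOnDevice.length; i++) {
            long bytes = bytesAlreadyOnDevice[i];
            if (bytes < dataThreshold * totalRequestedBytes) {
                continue; // too little data already present: not worth preferring this device
            }
            if (bytes > bestBytes) {
                bestBytes = bytes;
                best = i;
            }
        }
        return best;
    }
}
```
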
+# 2022-01-14
+
+* Modified the "new stream creation policy FIFO" to simply reuse an existing free stream, without using a FIFO policy. Using FIFO did not give any benefit (besides a more predictable stream assignment), but it was more complex: we needed both a set and a FIFO, while now we just use a set of free streams.
+* Added a device manager to track devices. It is mostly an abstraction layer over CUDARuntime, and allows retrieving the currently active GPU or a specific device.
+* DeviceManager is only a "getter": it cannot change the state of the system (e.g. it does not allow changing the current GPU).
+* Compared to the original multi-GPU branch, we have a cleaner separation: StreamManager has access to StreamPolicy, and StreamPolicy has access to DeviceManager. StreamManager still has access to the runtime (for event creation, synchronization, etc.), but we might completely hide CUDARuntime inside DeviceManager for even more separation.
+* Re-added the script to build the connection graph. We might want to call it automatically from GrCUDA if the output CSV is not found; otherwise we need to update the documentation to tell users how to use the script.
+
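A minimal sketch of the "reuse any free stream" behavior described in the first bullet above, using a plain set of stream handles; the types and names are placeholders, while the real StreamManager manages CUDA streams through the runtime.

```java
import java.util.HashSet;
import java.util.Iterator;
import java.util.Set;

final class FreeStreamPool {
    // CUDA stream handles that are currently not assigned to any computation.
    private final Set<Long> freeStreams = new HashSet<>();

    /** Reuse any free stream if one exists; return null to signal that a new stream must be created. */
    synchronized Long acquireStream() {
        Iterator<Long> it = freeStreams.iterator();
        if (it.hasNext()) {
            Long stream = it.next();
            it.remove();
            return stream;
        }
        return null;
    }

    /** Called when the computations on a stream have completed: the stream becomes reusable. */
    synchronized void releaseStream(long stream) {
        freeStreams.add(stream);
    }
}
```
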
+# 2022-01-12
+
+* Modified DeviceSelectionPolicy to select a device from a specified list of GPUs, instead of looking at all GPUs. That is useful because, when we want to reuse a parent's stream, we have to choose among the devices used by the parents instead of considering all devices.
+* Added a new SelectParentStreamPolicy that finds the parents' streams that can be reused, and then looks for the best device among the devices where these streams are located, instead of considering all the devices in the system as in the previous policy. The old policy is still available.
+
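As an illustration of the restriction described above, the policy can be modeled as an overload that takes the candidate devices explicitly; the interface and names are placeholders, not GrCUDA's exact signatures.

```java
import java.util.List;

interface DeviceSelectionPolicy {
    /** Select a device considering every GPU in the system. */
    int selectDevice();

    /** Select a device restricted to the given subset, e.g. the devices used by a computation's parents. */
    int selectDevice(List<Integer> candidateDevices);
}
```
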
 # 2021-12-21, Release 2
 
 * Added support for GraalVM 21.3.
 * Removed `ProfilableElement` Boolean flag, as it was always true.
 
+# 2021-12-09
+
+* Replaced the old "isLastComputationArrayAccess" with the new device tracking API.
+* The old isLastComputationArrayAccess was a performance optimization that tracked whether the last computation on an array was an access done by the CPU (the only existing kind of CPU computation), so that scheduling of further CPU array accesses could be skipped.
+* Implicitly, the API tracked whether a certain array was up-to-date on the CPU or on the GPU (for a 1-GPU system).
+* The new API that tracks the locations of arrays completely covers the old API, making it redundant. If an array is up-to-date on the CPU, we can perform reads/writes without any ComputationalElement scheduling.
+* Checking whether an array is up-to-date on the CPU requires a hashset lookup. It could be optimized if necessary, using a tracking flag.
+
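A minimal sketch of the location-tracking idea described above, with placeholder names; the CPU is modeled as one more "device" in the set of up-to-date locations, so the CPU check is exactly the hashset lookup mentioned in the last bullet.

```java
import java.util.HashSet;
import java.util.Set;

final class ArrayLocationTracker {
    // Illustrative sentinel id for the CPU, alongside GPU ids 0..N-1.
    static final int CPU_DEVICE_ID = -1;

    // Devices (CPU or GPUs) on which this array is currently up-to-date.
    private final Set<Integer> upToDateLocations = new HashSet<>();

    ArrayLocationTracker() {
        // Freshly allocated managed memory starts up-to-date on the CPU.
        upToDateLocations.add(CPU_DEVICE_ID);
    }

    /** If true, CPU reads/writes need no ComputationalElement scheduling. */
    boolean isUpToDateOnCpu() {
        return upToDateLocations.contains(CPU_DEVICE_ID);
    }

    /** A read on some device adds it as a valid location; other copies stay valid. */
    void registerRead(int deviceId) {
        upToDateLocations.add(deviceId);
    }

    /** A write on some device invalidates every other copy. */
    void registerWrite(int deviceId) {
        upToDateLocations.clear();
        upToDateLocations.add(deviceId);
    }
}
```
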
+# 2021-12-06
+
+* Fixed a major bug that prevented CPU reads on read-only arrays in use by the GPU. The problem appeared only on devices since Pascal.
+* Started integrating the API to track on which devices a certain array is currently up-to-date. It is slightly modified from the original multi-GPU API.
+
+# 2021-12-05
+
+* Updated options in GrCUDA to support the new multi-GPU flags.
+* Improved initialization of ExecutionContext: it now takes GrCUDAOptionMap as a parameter.
+* Improved GrCUDAOptionMap testing, and integrated preliminary multi-GPU tests.
+* Renamed GrCUDAExecutionContext to AsyncGrCUDAExecutionContext.
+* Integrated multi-GPU features into CUDARuntime.
+* Improved the interface to measure the execution time of ComputationalElements (the role of "ProfilableElement" is now clearer, and execution-time logging has been moved inside ComputationElement instead of using StreamManager).
+* Improved manual selection of the GPU.
+* Unsupported tests (e.g. multi-GPU tests when just 1 GPU is available) are properly skipped, instead of failing or completing successfully without information.
+* Temporary fix for GRCUDA-56: cuBLAS is disabled on pre-Pascal devices if the async scheduler is selected.
+
+# 2021-11-30
+
+* Updated the Python benchmark suite to integrate the multi-GPU code.
+* Minor updates in naming conventions (e.g. using snake_case instead of CamelCase).
+* We might still want to update the Python suite (for example the output dict structure), but for now this should work.
+
 # 2021-11-29
 
 * Removed deprecation warning for Truffle's ArityException.
+* Updated benchmark suite with CUDA multi-GPU benchmarks. Also fixed a GPU out-of-bounds access in B9.
 
 # 2021-11-21
 
