Ci enable distconv #2235
Merged
Commits: changes from 61 of 81 commits
c4dece7
Enable CI testing for DistConv.
bvanessen be4d4a6
Re-introduced the WITH_CLEAN_BUILD flag.
bvanessen b21d54f
Fixed whitespace to avoid change.
bvanessen 3ed74f7
Adapting new tests to the new build script framework with modules.
bvanessen 82e032f
Temporarily disabled existing tests.
bvanessen 68cd36f
Fixed typo
bvanessen 7a1b81e
Increased the time limit for the build on Lassen. Code cleanup.
bvanessen 0303ebb
Removed duplicate get_distconv_environment function in the ci_test
bvanessen 767876c
Changed the default behavior on MIOpen systems to use a local cache
bvanessen 0e056f8
Added back note about existing issue in DiHydrogen.
bvanessen f59772e
Enable CI runs to specify a subset of unit tests to run.
bvanessen 1ce8813
Tweaking the allowed runtimes for tests.
bvanessen 23ad3d6
Debugging the test selection. Increasing some test time limits.
bvanessen 1132524
Added test filter flags to all systems.
bvanessen ab6a653
Increasing time limits
bvanessen 0db91f3
Debugging CI scripts.
bvanessen 15c88f2
Added flags to skip integration tests on distconv CI runs.
bvanessen 5202940
Bumped up pooling time limit.
bvanessen 704b502
Testing out setting a set of MIOpen dB cache directories for CI
bvanessen 0f1fb8a
Adding caching options for Corona and changed how the username is que…
bvanessen 5233b21
Updated CI tests to use common MIOpen caches. Split user and custom
bvanessen 2c0ed65
Fix the lassen multi-stage pipeline to record the spack architecture.
bvanessen 10d8063
Increase the build time limit on Lassen.
bvanessen 02852ce
Fixed the new lassen build to avoid installing pytest through spack.
bvanessen d81401b
Added the clean build flags into the multi-stage pipeline.
bvanessen a938f51
Skip failing tests in distconv.
bvanessen fc7db63
Change the test utils to not set cluster value to unset, but rather N…
bvanessen 4527cf2
Added support for passing in the system cluster name by default if it…
bvanessen 6e74302
Cleanup the paths for the MIOpen caches.
bvanessen fe5fffc
Added a guard to skip inplace test if DistConv is disabled.
bvanessen d007d79
Removing unnecessary variable definitions.
bvanessen fc0f699
ResNet tests should run on Corona.
bvanessen 1ef3fb3
Added support in the data coordinator for explicitly recording the
bvanessen c89f19d
Force lassen to clean build.
bvanessen b2d77bd
Fixed the legacy HDF5 data reader.
bvanessen 261fbc2
Fixed the legacy HDF5 data reader.
bvanessen e3d8220
Fixed the legacy HDF5 data reader.
bvanessen 6200fcd
Increased the timeout for the lassen build and test.
bvanessen 44478bb
Bumped up the time limit on the catch tests for ROCm systems.
bvanessen 6ec057e
Increase the catch test sizes.
bvanessen 4de5502
Trying to avoid forcing static linking when using NVSHMEM.
bvanessen 80c211e
Changed the run catch tests script for flux to use a flux proxy.
bvanessen c5053dc
Export the lbann setup for Lassen unit and integration tests.
bvanessen f275de2
Minimize what is saved from the catch2 unit tests.
bvanessen e2c86d8
Cleaning up the environment variables.
bvanessen 8c025ee
Added a flag to extend the spack env name.
bvanessen 8adf999
Tweaking the flux proxy.
bvanessen 1fc34e6
Change how the NVSHMEM variables are setup so that the .before_script
bvanessen b01a521
Removed the -o cpu-affinity=per-task flag from the flux run commands
bvanessen 3dd4ef7
Tweak the flux commands to resolve hang on Corona catch tests.
bvanessen a636c46
Cleaning up the flux launch commands on Tioga and Corona to help avoi…
bvanessen 9b95e6b
Reenable the CI pipeline. Leave the distconv Lassen testing out for …
bvanessen 4dc8a49
Added a job name suffix variable.
bvanessen d402f4a
Ensure that the spack environment names are unique.
bvanessen 169173e
Tightened up the inclusion of the LBANN Python packages to avoid
bvanessen 1bfa57b
Added support to Pip install into the lbann build directory. Removed
bvanessen c70706b
Updated the baseline modules used on Corona and package versions on
bvanessen 44756f3
Fixed the import
bvanessen 95f3ed5
Removed the star import
bvanessen a02401a
Fixing the allocation flux command for Tioga.
bvanessen 046be1d
Changing it so that only Corona adds the -o pmi=pmix flags to flux.
bvanessen 5360341
Enable module generation for multiple core compilers.
bvanessen afaff50
Making the flux commands consistent.
bvanessen 70d0b43
Applied clang format.
bvanessen 3073a88
Fixed the compiler path on Pascal.
bvanessen 22dca0a
Reenable lassen multi-stage distconv test pipeline.
bvanessen 4b27d58
Fixed how the new Lassen distconv tests are invoked and avoid
bvanessen fd4b5ec
Reenabling logging.
bvanessen 0ea4543
Disabled logging
bvanessen 24752be
Added a second if clause to the integration tests so that there is
bvanessen 5113649
Consolidated the rules clause into a common one.
bvanessen c891e95
Fix the rules regex.
bvanessen b08a262
Added corona numbers for resnet.
bvanessen 68e4290
Tweaking the CI rules to avoid integrations on distconv builds.
bvanessen aa9314d
Fixed test
bvanessen 482d399
Tweaking how the lassen unit tests are called.
bvanessen 9f87e5c
Disable nvshmem build on Lassen. Code cleanup and adding suggestions.
bvanessen bbedf6b
Changed the guard in resnet 50 test
bvanessen d78b217
Disable NVSHMEM environment variables.
bvanessen 593bd7e
Disabled Lassen DistConv unit tests.
bvanessen 3c15855
Apply suggestions from code review
bvanessen
@@ -40,6 +40,19 @@ corona testing:
    strategy: depend
    include: .gitlab/corona/pipeline.yml

corona distconv testing:
  stage: run-all-clusters
  variables:
    JOB_NAME_SUFFIX: _distconv
    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
    SPACK_SPECS: "+rocm +distconv"
    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
    TEST_FLAG: "test_*_distconv.py"
  trigger:
    strategy: depend
    include: .gitlab/corona/pipeline.yml

lassen testing:
  stage: run-all-clusters
  variables:
@@ -49,6 +62,19 @@ lassen testing:
    strategy: depend
    include: .gitlab/lassen/pipeline.yml

# lassen distconv testing:
#   stage: run-all-clusters
#   variables:
#     JOB_NAME_SUFFIX: _distconv
#     SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
#     SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
#     WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
#     WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
#     TEST_FLAG: "test_*_distconv.py"
#   trigger:
#     strategy: depend
#     include: .gitlab/lassen/multi_stage_pipeline.yml

pascal testing:
  stage: run-all-clusters
  variables:
@@ -68,6 +94,19 @@ pascal compiler testing:
    strategy: depend
    include: .gitlab/pascal/pipeline_compiler_tests.yml

pascal distconv testing:
  stage: run-all-clusters
  variables:
    JOB_NAME_SUFFIX: _distconv
    SPACK_SPECS: "%[email protected] +cuda +distconv +fft"
    BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
    TEST_FLAG: "test_*_distconv.py"
  trigger:
    strategy: depend
    include: .gitlab/pascal/pipeline.yml

tioga testing:
  stage: run-all-clusters
  variables:
@@ -76,3 +115,16 @@ tioga testing:
  trigger:
    strategy: depend
    include: .gitlab/tioga/pipeline.yml

tioga distconv testing:
  stage: run-all-clusters
  variables:
    JOB_NAME_SUFFIX: _distconv
    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
    SPACK_SPECS: "+rocm +distconv"
    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
    TEST_FLAG: "test_*_distconv.py"
  trigger:
    strategy: depend
    include: .gitlab/tioga/pipeline.yml
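
Each distconv job above sets TEST_FLAG to the glob `test_*_distconv.py`. The diff does not show how the child pipelines consume that variable, so the following is only a sketch, assuming the value is treated as a shell-style filename pattern. The `select_tests` helper and the test file names are hypothetical and are not taken from the PR.

```shell
#!/bin/bash
# Sketch only: assumes TEST_FLAG is matched as a shell glob against test file names.
TEST_FLAG="test_*_distconv.py"

# Hypothetical helper: print each argument whose name matches the TEST_FLAG pattern.
select_tests() {
    local name
    for name in "$@"; do
        # An unquoted right-hand side of == inside [[ ]] is treated as a glob pattern.
        if [[ "${name}" == ${TEST_FLAG} ]]; then
            echo "${name}"
        fi
    done
}

# Hypothetical file names, for illustration only.
SELECTED=$(select_tests test_unit_layer_convolution_distconv.py \
                        test_unit_layer_relu.py \
                        test_unit_layer_pooling_distconv.py)
echo "${SELECTED}"
```

Under this assumption only the two `*_distconv.py` names are selected, which is consistent with the commit messages about restricting distconv CI runs to a subset of the unit tests.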
@@ -0,0 +1,91 @@
################################################################################
## Copyright (c) 2014-2023, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <[email protected]>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License.  You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

#!/bin/bash
# Just in case
# source ${HOME}/${SPACK_REPO}/share/spack/setup-env.sh
# source spack-ci-env-name.sh

# # Load up the spack environment
# #SPACK_ARCH=$(spack arch)
# #SPACK_ARCH_TARGET=$(spack arch -t)
# spack env activate lbann-${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET}
# spack load lbann@${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET} arch=${SPACK_ARCH}

cd ${LBANN_BUILD_DIR}

# Configure the output directory
OUTPUT_DIR=${CI_PROJECT_DIR}/${RESULTS_DIR}
if [[ -d ${OUTPUT_DIR} ]];
then
    rm -rf ${OUTPUT_DIR}
fi
mkdir -p ${OUTPUT_DIR}

FAILED_JOBS=""

# LBANN_HASH=$(spack find --format {hash:7} lbann@${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET})
# SPACK_BUILD_DIR="spack-build-${LBANN_HASH}"
# cd ${SPACK_BUILD_DIR}
lrun -N 1 -n 1 -W 5 \
     ./unit_test/seq-catch-tests \
     -r JUnit \
     -o ${OUTPUT_DIR}/seq-catch-results.xml
if [[ $? -ne 0 ]]; then
    FAILED_JOBS+=" seq"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
     -T $TEST_TASKS_PER_NODE \
     -W 5 ${TEST_MPIBIND_FLAG} \
     ./unit_test/mpi-catch-tests "exclude:[externallayer]" "exclude:[filesystem]" \
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]]; then
    FAILED_JOBS+=" mpi"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
     -T $TEST_TASKS_PER_NODE \
     -W 5 ${TEST_MPIBIND_FLAG} \
     ./unit_test/mpi-catch-tests "[filesystem]" \
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]];
then
    FAILED_JOBS+=" mpi-filesystem"
fi

# Try to write a semi-useful message to this file since it's being
# saved as an artifact. It's not completely outside the realm that
# someone would look at it.
if [[ -n "${FAILED_JOBS}" ]];
then
    echo "Some Catch2 tests failed:${FAILED_JOBS}" > ${OUTPUT_DIR}/catch-tests-failed.txt
fi

# Return "success" so that the pytest-based testing can run.
exit 0
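
The script above collects the label of each failing Catch2 group in FAILED_JOBS and reports them together at the end instead of stopping at the first failure. A minimal sketch of that accumulation pattern follows, with `true` and `false` standing in for the `lrun` invocations. The `run_group` helper is hypothetical and does not appear in the PR.

```shell
#!/bin/bash
# Sketch only: `true` and `false` stand in for the lrun/Catch2 invocations.
FAILED_JOBS=""

# Hypothetical helper: run a command and append its label to FAILED_JOBS if it fails.
run_group() {
    local label="$1"
    shift
    if ! "$@"; then
        FAILED_JOBS+=" ${label}"
    fi
}

run_group seq true
run_group mpi false
run_group mpi-filesystem false

# Report all failed groups in one message, as the script above does.
if [[ -n "${FAILED_JOBS}" ]]; then
    echo "Some Catch2 tests failed:${FAILED_JOBS}"
fi
```

Because each label is appended with a leading space, the message reads naturally without a separator after the colon. The real script then exits with status 0 regardless, so the failure message file is the only record of these failures for later steps.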
What's the issue here? Arguably this is the most important of the test jobs for this PR...