Ci enable distconv #2235
Force-pushed from d1860a9 to 752e1c6
Looks good, one question about the code.
Force-pushed from 752e1c6 to 7d29a79
Added DistConv CI tests
Added Corona DistConv test and disabled FFT on ROCm
Ensure that DistConv tests keep error signals
Enable NVSHMEM on Lassen
Added a multi-stage pipeline for Lassen
Fixed a typo and disabled other tests.
Added spack environment
Added check stage for the catch tests
Debugging
Added the definition of the RESULTS_DIR environment variable
Added release notes.
Fixed the launcher for catch tests
Removed debugging code
Changed the batch launch commands to be interactive to block completion.
Added a wrapper shell script for launching the unit tests
Added the number of nodes for the unit test.
Cleaning up launching paths
Added execute permissions for unit test script.
Ingest the Spack dependent environment information.
Fixed typo.
Disabled the external test on Lassen with NVSHMEM and DistConv.
Fixing launch command and exclusion of externallayer
Bugfix python
Added number of tasks per node
Added integration tests.
Set some NVSHMEM runtime variables
Cleaned up debugging code
Fixed typos
Run the correct test
Restore the CI testing
Uniquify the CI JOB_NAME fields for DistConv tests.
common python tools. Switched all tests to using the standard contrib args version.
testing, both for normal users and lbannusr.
Should be quick comments.
.gitlab-ci.yml
Outdated
# lassen distconv testing:
#   stage: run-all-clusters
#   variables:
#     JOB_NAME_SUFFIX: _distconv
#     SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
#     SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
#     WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
#     WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
#     TEST_FLAG: "test_*_distconv.py"
#   trigger:
#     strategy: depend
#     include: .gitlab/lassen/multi_stage_pipeline.yml
What's the issue here? Arguably this is the most important of the test jobs for this PR...
erroneously re-setting up the spack environment. Changed the saved spack environment name to SPACK_ENV_NAME. Cleaned up some dead code.
always at least one true clause, so the stage will schedule. Fixed the regex so that the distconv substring doesn't have to come at the start of the string.
Co-authored-by: Tom Benson <[email protected]>
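The regex fix mentioned in the commit message above (letting the distconv substring appear anywhere in the string) can be illustrated with a minimal Python sketch. The pattern strings and job names here are hypothetical, since the actual regex and the strings it is applied to are not shown in this thread; the sketch only contrasts an anchored pattern with an unanchored one.

```python
import re

# Hypothetical job names; the real names used by the CI are not shown in the PR thread.
job_names = ["distconv_unit_tests", "lassen_distconv_integration", "catch_tests"]

# Anchored pattern: matches only when the string starts with "distconv".
anchored = re.compile(r"^distconv")
# Unanchored pattern: matches "distconv" anywhere in the string.
unanchored = re.compile(r"distconv")

anchored_hits = [name for name in job_names if anchored.search(name)]
unanchored_hits = [name for name in job_names if unanchored.search(name)]

print(anchored_hits)
print(unanchored_hits)
```

With the anchored pattern, a name that carries a prefix before distconv is silently skipped, which is the kind of miss the commit describes fixing.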
Added new CI tests for building and checking execution with both DistConv and NVSHMEM enabled as appropriate.