Ci enable distconv #2235

Merged
merged 81 commits into from
Sep 22, 2023

Conversation

bvanessen
Collaborator

Added new CI tests for building and checking execution with both DistConv and NVSHMEM enabled as appropriate.

@bvanessen bvanessen requested a review from benson31 as a code owner March 24, 2023 21:25
@bvanessen bvanessen force-pushed the ci_enable_distconv branch from d1860a9 to 752e1c6 Compare March 25, 2023 20:59
@bvanessen bvanessen requested a review from tbennun March 25, 2023 21:00
@bvanessen bvanessen added the CI Continuous Integration label Mar 25, 2023
Contributor

@tbennun tbennun left a comment

Looks good, one question about the code.

ci_test/common_python/tools.py (outdated, resolved)
@bvanessen bvanessen requested a review from szaman19 June 6, 2023 16:29
Added DistConv CI tests

Added Corona DistConv test and disabled FFT on ROCm

Ensure that DistConv tests keep error signals

Enable NVSHMEM on Lassen

Added a multi-stage pipeline for Lassen

Fixed a typo and disabled other tests.

Added spack environment

Added check stage for the catch tests

Debugging

Added the definition of the RESULTS_DIR environment variable

Added release notes.

Fixed the launcher for catch tests

Removed debugging code

Changed the batch launch commands to be interactive to block completion.

Added a wrapper shell script for launching the unit tests

Added the number of nodes for the unit test.

Cleaning up launching paths

Added execute permissions for unit test script.

Ingest the Spack dependent environment information.

Fixed typo.  Disabled the external test on Lassen with NVSHMEM and DistConv.

Fixing launch command and exclusion of externallayer

Bugfix python

Added number of tasks per node

Added integration tests.  Set some NVSHMEM runtime variables

Cleaned up debugging code

Fixed typos

Run the correct test

Restore the CI testing

Uniquify the CI JOB_NAME fields for DistConv tests.
common python tools.  Switched all tests to using the standard contrib
args version.
testing, both for normal users and lbannusr.
Collaborator

@benson31 benson31 left a comment

Should be quick comments.

.gitlab-ci.yml Outdated
Comment on lines 65 to 76
# lassen distconv testing:
#   stage: run-all-clusters
#   variables:
#     JOB_NAME_SUFFIX: _distconv
#     SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
#     SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
#     WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
#     WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
#     TEST_FLAG: "test_*_distconv.py"
#   trigger:
#     strategy: depend
#     include: .gitlab/lassen/multi_stage_pipeline.yml
Collaborator

What's the issue here? Arguably this is the most important of the test jobs for this PR...
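
For context, the `TEST_FLAG` value in the commented-out job reads like a shell-style glob. How the CI scripts actually consume it is not shown in this conversation, so the following is only a minimal sketch, assuming glob semantics and using hypothetical test file names, of which files such a pattern would select:

```python
from fnmatch import fnmatchcase

# Pattern copied from the TEST_FLAG variable in the commented-out job.
TEST_FLAG = "test_*_distconv.py"


def select_tests(filenames, pattern=TEST_FLAG):
    """Return the file names matching the shell-style pattern (case-sensitive)."""
    return [name for name in filenames if fnmatchcase(name, pattern)]


# Hypothetical file names, for illustration only; they are not taken
# from the repository.
candidates = [
    "test_unit_layer_convolution_distconv.py",
    "test_unit_layer_convolution.py",
    "test_integration_resnet50.py",
    "test_distconv.py",
]

selected = select_tests(candidates)
print(selected)
```

Under that assumption, only names that start with `test_`, end with `_distconv.py`, and have something in between are picked up, so with the job disabled none of the DistConv-specific tests would run through this pipeline.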

python/lbann/contrib/lc/launcher.py (outdated, resolved)
python/lbann/contrib/lc/launcher.py (outdated, resolved)
python/lbann/contrib/olcf/launcher.py (outdated, resolved)
python/lbann/launcher/flux.py (outdated, resolved)
ci_test/integration_tests/test_integration_resnet50.py (outdated, resolved)
ci_test/common_python/tools.py (outdated, resolved)
ci_test/common_python/tools.py (outdated, resolved)
@bvanessen bvanessen dismissed benson31’s stale review September 22, 2023 22:03

Changes have been applied.

@bvanessen bvanessen merged commit b75a718 into LBANN:develop Sep 22, 2023
@bvanessen bvanessen deleted the ci_enable_distconv branch October 10, 2023 17:14
Labels
CI Continuous Integration
3 participants