Ci enable distconv #2235

Merged: 81 commits, Sep 22, 2023
Changes from 61 commits
Commits
c4dece7
Enable CI testing for DistConv.
bvanessen Mar 17, 2023
be4d4a6
Re-introduced the WITH_CLEAN_BUILD flag.
bvanessen Aug 30, 2023
b21d54f
Fixed whitespace to avoid change.
bvanessen Aug 30, 2023
3ed74f7
Adapting new tests to the new build script framework with modules.
bvanessen Aug 30, 2023
82e032f
Temporarily disabled existing tests.
bvanessen Aug 30, 2023
68cd36f
Fixed typo
bvanessen Aug 30, 2023
7a1b81e
Increased the time limit for the build on Lassen. Code cleanup.
bvanessen Aug 30, 2023
0303ebb
Removed duplicate get_distconv_environment function in the ci_test
bvanessen Sep 1, 2023
767876c
Changed the default behavior on MIOpen systems to use a local cache
bvanessen Sep 1, 2023
0e056f8
Added back note about existing issue in DiHydrogen.
bvanessen Sep 1, 2023
f59772e
Enable CI runs to specify a subset of unit tests to run.
bvanessen Sep 1, 2023
1ce8813
Tweaking the allowed runtimes for tests.
bvanessen Sep 1, 2023
23ad3d6
Debugging the test selection. Increasing some test time limits.
bvanessen Sep 1, 2023
1132524
Added test filter flags to all systems.
bvanessen Sep 1, 2023
ab6a653
Increasing time limits
bvanessen Sep 1, 2023
0db91f3
Debugging CI scripts.
bvanessen Sep 1, 2023
15c88f2
Added flags to skip integration tests on distconv CI runs.
bvanessen Sep 1, 2023
5202940
Bumped up pooling time limit.
bvanessen Sep 1, 2023
704b502
Testing out setting a set of MIOpen dB cache directories for CI
bvanessen Sep 6, 2023
0f1fb8a
Adding caching options for Corona and changed how the username is que…
bvanessen Sep 6, 2023
5233b21
Updated CI tests to use common MIOpen caches. Split user and custom
bvanessen Sep 7, 2023
2c0ed65
Fix the lassen multi-stage pipeline to record the spack architecture.
bvanessen Sep 7, 2023
10d8063
Increase the build time limit on Lassen.
bvanessen Sep 10, 2023
02852ce
Fixed the new lassen build to avoid installing pytest through spack.
bvanessen Sep 11, 2023
d81401b
Added the clean build flags into the multi-stage pipeline.
bvanessen Sep 11, 2023
a938f51
Skip failing tests in distconv.
bvanessen Sep 11, 2023
fc7db63
Change the test utils to not set cluster value to unset, but rather N…
bvanessen Sep 11, 2023
4527cf2
Added support for passing in the system cluster name by default if it…
bvanessen Sep 13, 2023
6e74302
Cleanup the paths for the MIOpen caches.
bvanessen Sep 13, 2023
fe5fffc
Added a guard to skip inplace test if DistConv is disabled.
bvanessen Sep 13, 2023
d007d79
Removing unnecessary variable definitions.
bvanessen Sep 13, 2023
fc0f699
ResNet tests should run on Corona.
bvanessen Sep 13, 2023
1ef3fb3
Added support in the data coordinator for explicitly recording the
bvanessen Sep 14, 2023
c89f19d
Force lassen to clean build.
bvanessen Sep 14, 2023
b2d77bd
Fixed the legacy HDF5 data reader.
bvanessen Sep 14, 2023
261fbc2
Fixed the legacy HDF5 data reader.
bvanessen Sep 14, 2023
e3d8220
Fixed the legacy HDF5 data reader.
bvanessen Sep 14, 2023
6200fcd
Increased the timeout for the lassen build and test.
bvanessen Sep 14, 2023
44478bb
Bumped up the time limit on the catch tests for ROCm systems.
bvanessen Sep 14, 2023
6ec057e
Increase the catch test sizes.
bvanessen Sep 14, 2023
4de5502
Trying to avoid forcing static linking when using NVSHMEM.
bvanessen Sep 14, 2023
80c211e
Changed the run catch tests script for flux to use a flux proxy.
bvanessen Sep 14, 2023
c5053dc
Export the lbann setup for Lassen unit and integration tests.
bvanessen Sep 14, 2023
f275de2
Minimize what is saved from the catch2 unit tests.
bvanessen Sep 14, 2023
e2c86d8
Cleaning up the environment variables.
bvanessen Sep 14, 2023
8c025ee
Added a flag to extend the spack env name.
bvanessen Sep 14, 2023
8adf999
Tweaking the flux proxy.
bvanessen Sep 14, 2023
1fc34e6
Change how the NVSHMEM variables are setup so that the .before_script
bvanessen Sep 15, 2023
b01a521
Removed the -o cpu-affinity=per-task flag from the flux run commands
bvanessen Sep 16, 2023
3dd4ef7
Tweak the flux commands to resolve hang on Corona catch tests.
bvanessen Sep 18, 2023
a636c46
Cleaning up the flux launch commands on Tioga and Corona to help avoi…
bvanessen Sep 19, 2023
9b95e6b
Reenable the CI pipeline. Leave the distconv Lassen testing out for …
bvanessen Sep 19, 2023
4dc8a49
Added a job name suffix variable.
bvanessen Sep 19, 2023
d402f4a
Ensure that the spack environment names are unique.
bvanessen Sep 19, 2023
169173e
Tightened up the inclusion of the LBANN Python packages to avoid
bvanessen Sep 20, 2023
1bfa57b
Added support to Pip install into the lbann build directory. Removed
bvanessen Sep 20, 2023
c70706b
Updated the baseline modules used on Corona and package versions on
bvanessen Sep 20, 2023
44756f3
Fixed the import
bvanessen Sep 20, 2023
95f3ed5
Removed the star import
bvanessen Sep 20, 2023
a02401a
Fixing the allocation flux command for Tioga.
bvanessen Sep 21, 2023
046be1d
Changing it so that only Corona adds the -o pmi=pmix flags to flux.
bvanessen Sep 21, 2023
5360341
Enable module generation for multiple core compilers.
bvanessen Sep 21, 2023
afaff50
Making the flux commands consistent.
bvanessen Sep 21, 2023
70d0b43
Applied clang format.
bvanessen Sep 22, 2023
3073a88
Fixed the compiler path on Pascal.
bvanessen Sep 22, 2023
22dca0a
Reenable lassen multi-stage distconv test pipeline.
bvanessen Sep 22, 2023
4b27d58
Fixed how the new Lassen distconv tests are invoked and avoid
bvanessen Sep 22, 2023
fd4b5ec
Reenabling logging.
bvanessen Sep 22, 2023
0ea4543
Disabled logging
bvanessen Sep 22, 2023
24752be
Added a second if clause to the integration tests so that there is
bvanessen Sep 22, 2023
5113649
Consolidated the rules clause into a common one.
bvanessen Sep 22, 2023
c891e95
Fix the rules regex.
bvanessen Sep 22, 2023
b08a262
Added corona numbers for resnet.
bvanessen Sep 22, 2023
68e4290
Tweaking the CI rules to avoid integrations on distconv builds.
bvanessen Sep 22, 2023
aa9314d
Fixed test
bvanessen Sep 22, 2023
482d399
Tweaking how the lassen unit tests are called.
bvanessen Sep 22, 2023
9f87e5c
Disable nvshmem build on Lassen. Code cleanup and adding suggestions.
bvanessen Sep 22, 2023
bbedf6b
Changed the guard in resnet 50 test
bvanessen Sep 22, 2023
d78b217
Disable NVSHMEM environment variables.
bvanessen Sep 22, 2023
593bd7e
Disabled Lassen DistConv unit tests.
bvanessen Sep 22, 2023
3c15855
Apply suggestions from code review
bvanessen Sep 22, 2023
52 changes: 52 additions & 0 deletions .gitlab-ci.yml
@@ -40,6 +40,19 @@ corona testing:
     strategy: depend
     include: .gitlab/corona/pipeline.yml

+corona distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
+    SPACK_SPECS: "+rocm +distconv"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/corona/pipeline.yml
+
 lassen testing:
   stage: run-all-clusters
   variables:
@@ -49,6 +62,19 @@ lassen testing:
     strategy: depend
     include: .gitlab/lassen/pipeline.yml

+# lassen distconv testing:
+#   stage: run-all-clusters
+#   variables:
+#     JOB_NAME_SUFFIX: _distconv
+#     SPACK_ENV_BASE_NAME_MODIFIER: "-multi-stage-distconv"
+#     SPACK_SPECS: "+cuda +distconv +nvshmem +fft"
+#     WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+#     WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+#     TEST_FLAG: "test_*_distconv.py"
+#   trigger:
+#     strategy: depend
+#     include: .gitlab/lassen/multi_stage_pipeline.yml
Collaborator
What's the issue here? Arguably this is the most important of the test jobs for this PR...


 pascal testing:
   stage: run-all-clusters
   variables:
@@ -68,6 +94,19 @@ pascal compiler testing:
     strategy: depend
     include: .gitlab/pascal/pipeline_compiler_tests.yml

+pascal distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_SPECS: "%[email protected] +cuda +distconv +fft"
+    BUILD_SCRIPT_OPTIONS: "--no-default-mirrors"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/pascal/pipeline.yml
+
 tioga testing:
   stage: run-all-clusters
   variables:
@@ -76,3 +115,16 @@ tioga testing:
   trigger:
     strategy: depend
     include: .gitlab/tioga/pipeline.yml
+
+tioga distconv testing:
+  stage: run-all-clusters
+  variables:
+    JOB_NAME_SUFFIX: _distconv
+    SPACK_ENV_BASE_NAME_MODIFIER: "-distconv"
+    SPACK_SPECS: "+rocm +distconv"
+    WITH_WEEKLY: "${LBANN_CI_RUN_WEEKLY}"
+    WITH_CLEAN_BUILD: "${LBANN_CI_CLEAN_BUILD}"
+    TEST_FLAG: "test_*_distconv.py"
+  trigger:
+    strategy: depend
+    include: .gitlab/tioga/pipeline.yml
6 changes: 3 additions & 3 deletions .gitlab/common/common.yml
@@ -29,11 +29,11 @@
 variables:
   # This is based on the assumption that each runner will only ever
   # be able to run one pipeline on a given cluster at one time.
-  SPACK_ENV_BASE_NAME: gitlab-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}
+  SPACK_ENV_BASE_NAME: gitlab${SPACK_ENV_BASE_NAME_MODIFIER}-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}

   # This variable is the name used to identify the job in the Slurm
   # queue. We need this to be able to access the correct jobid.
-  JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}
+  JOB_NAME: ${CI_PROJECT_NAME}_${CI_PIPELINE_ID}${JOB_NAME_SUFFIX}

   # This is needed to ensure that we run as lbannusr.
   LLNL_SERVICE_USER: lbannusr
@@ -137,7 +137,7 @@
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/*.cmake
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/CMakeCache.txt
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/build.ninja
-      - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
       - ${RESULTS_DIR}/*
     exclude:
       - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/**/*.o
+      - builds/lbann_${SYSTEM_NAME}_${SPACK_ENV_BASE_NAME}-*${CI_CONCURRENT_ID}-*/build/unit_test/*
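Note on the two naming changes in common.yml: SPACK_ENV_BASE_NAME_MODIFIER and JOB_NAME_SUFFIX are spliced straight into the existing names, so a pipeline that leaves them unset should keep its previous names. The sketch below approximates that composition in plain shell. The branch, user, token, and pipeline values are made-up placeholders, and `compose_names` is a hypothetical helper, not part of the CI scripts. GitLab performs its own variable expansion, so treat this as an illustration only.

```shell
# Placeholder values standing in for what GitLab CI would supply (assumptions).
CI_COMMIT_BRANCH="develop"
GITLAB_USER_LOGIN="someuser"
SYSTEM_NAME="corona"
SPACK_ENV_BASE_NAME_EXTENSION=""
CI_RUNNER_SHORT_TOKEN="abc123"
CI_PROJECT_NAME="lbann"
CI_PIPELINE_ID="42"

# Hypothetical helper: builds both names from a modifier ($1) and a suffix ($2).
compose_names() {
    SPACK_ENV_BASE_NAME="gitlab${1}-${CI_COMMIT_BRANCH}-${GITLAB_USER_LOGIN}-${SYSTEM_NAME}${SPACK_ENV_BASE_NAME_EXTENSION}-${CI_RUNNER_SHORT_TOKEN}"
    JOB_NAME="${CI_PROJECT_NAME}_${CI_PIPELINE_ID}${2}"
}

# Regular pipeline: both additions are empty, so the names are unchanged.
compose_names "" ""
echo "${SPACK_ENV_BASE_NAME} ${JOB_NAME}"

# DistConv pipeline: values taken from the new jobs in .gitlab-ci.yml.
compose_names "-distconv" "_distconv"
echo "${SPACK_ENV_BASE_NAME} ${JOB_NAME}"
```

With these placeholders the regular and DistConv pipelines get distinct Spack environment names and distinct scheduler job names, which is what lets both run against the same cluster.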
11 changes: 3 additions & 8 deletions .gitlab/common/run-catch-tests-flux.sh
@@ -52,14 +52,9 @@ export LD_LIBRARY_PATH=${ROCM_PATH}/lib:${LD_LIBRARY_PATH}

 cd ${LBANN_BUILD_DIR}

-
-flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=per-task -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort
-
-flux run --label-io -n4 -N2 -g 1 -o cpu-affinity=off -o gpu-affinity=per-task sh -c 'taskset -cp $$; printenv | grep VISIBLE' | sort
-
 echo "Running sequential catch tests"

-flux run -N 1 -n 1 -g 1 -t 5m \
+flux run -N 1 -n 1 --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} -t 5m \
     ./unit_test/seq-catch-tests \
     -r JUnit \
     -o ${OUTPUT_DIR}/seq-catch-results.xml
@@ -71,7 +66,7 @@ echo "Running MPI catch tests with ${LBANN_NNODES} nodes and ${TEST_TASKS_PER_NO

 flux run \
     -N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-    -g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
+    -t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
     ./unit_test/mpi-catch-tests "exclude:[random]" "exclude:[filesystem]"\
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
@@ -83,7 +78,7 @@ echo "Running MPI filesystem catch tests"

 flux run \
     -N ${LBANN_NNODES} -n $((${TEST_TASKS_PER_NODE} * ${LBANN_NNODES})) \
-    -g 1 -t 5m -o gpu-affinity=per-task -o cpu-affinity=per-task -o mpibind=off \
+    -t 5m --exclusive -o nosetpgrp ${EXTRA_FLUX_ARGS} \
     ./unit_test/mpi-catch-tests -s "[filesystem]" \
     -r JUnit \
     -o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
Expand Down
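Note on ${EXTRA_FLUX_ARGS} in the flux script: the variable is expanded unquoted, so an unset or empty value adds no arguments, while the value the Corona pipeline exports in this PR ("-o pmi=pmix") splits into two. A small sketch of that behavior, assuming a POSIX-style shell with default word splitting; `count_args` is a hypothetical stand-in for `flux run` that only reports how many arguments it received.

```shell
# Hypothetical stand-in for `flux run`: prints the number of arguments.
count_args() { echo "$#"; }

# Systems that do not set the variable: the expansion disappears entirely.
EXTRA_FLUX_ARGS=""
n_default=$(count_args -N 1 -n 1 ${EXTRA_FLUX_ARGS} -t 5m)

# Value exported in .gitlab/corona/pipeline.yml: splits into "-o" and "pmi=pmix".
EXTRA_FLUX_ARGS="-o pmi=pmix"
n_corona=$(count_args -N 1 -n 1 ${EXTRA_FLUX_ARGS} -t 5m)

echo "default: ${n_default} args, corona: ${n_corona} args"
```

Quoting the expansion would instead pass a single empty argument in the default case and a single "-o pmi=pmix" argument on Corona, which is presumably why the script leaves it unquoted.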
91 changes: 91 additions & 0 deletions .gitlab/common/run-catch-tests-lsf.sh
@@ -0,0 +1,91 @@
################################################################################
## Copyright (c) 2014-2023, Lawrence Livermore National Security, LLC.
## Produced at the Lawrence Livermore National Laboratory.
## Written by the LBANN Research Team (B. Van Essen, et al.) listed in
## the CONTRIBUTORS file. <[email protected]>
##
## LLNL-CODE-697807.
## All rights reserved.
##
## This file is part of LBANN: Livermore Big Artificial Neural Network
## Toolkit. For details, see http://software.llnl.gov/LBANN or
## https://github.com/LLNL/LBANN.
##
## Licensed under the Apache License, Version 2.0 (the "Licensee"); you
## may not use this file except in compliance with the License. You may
## obtain a copy of the License at:
##
## http://www.apache.org/licenses/LICENSE-2.0
##
## Unless required by applicable law or agreed to in writing, software
## distributed under the License is distributed on an "AS IS" BASIS,
## WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
## implied. See the License for the specific language governing
## permissions and limitations under the license.
################################################################################

#!/bin/bash
# Just in case
# source ${HOME}/${SPACK_REPO}/share/spack/setup-env.sh
# source spack-ci-env-name.sh

# # Load up the spack environment
# #SPACK_ARCH=$(spack arch)
# #SPACK_ARCH_TARGET=$(spack arch -t)
# spack env activate lbann-${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET}
# spack load lbann@${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET} arch=${SPACK_ARCH}

cd ${LBANN_BUILD_DIR}

# Configure the output directory
OUTPUT_DIR=${CI_PROJECT_DIR}/${RESULTS_DIR}
if [[ -d ${OUTPUT_DIR} ]];
then
rm -rf ${OUTPUT_DIR}
fi
mkdir -p ${OUTPUT_DIR}

FAILED_JOBS=""

# LBANN_HASH=$(spack find --format {hash:7} lbann@${SPACK_DEP_ENV_NAME}-${SPACK_ARCH_TARGET})
# SPACK_BUILD_DIR="spack-build-${LBANN_HASH}"
# cd ${SPACK_BUILD_DIR}
lrun -N 1 -n 1 -W 5 \
./unit_test/seq-catch-tests \
-r JUnit \
-o ${OUTPUT_DIR}/seq-catch-results.xml
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" seq"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "exclude:[externallayer]" "exclude:[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]]; then
FAILED_JOBS+=" mpi"
fi

lrun -N ${LBANN_NNODES} -n $(($TEST_TASKS_PER_NODE * ${LBANN_NNODES})) \
-T $TEST_TASKS_PER_NODE \
-W 5 ${TEST_MPIBIND_FLAG} \
./unit_test/mpi-catch-tests "[filesystem]" \
-r JUnit \
-o "${OUTPUT_DIR}/mpi-catch-filesystem-results-rank=%r-size=%s.xml"
if [[ $? -ne 0 ]];
then
FAILED_JOBS+=" mpi-filesystem"
fi

# Try to write a semi-useful message to this file since it's being
# saved as an artifact. It's not completely outside the realm that
# someone would look at it.
if [[ -n "${FAILED_JOBS}" ]];
then
echo "Some Catch2 tests failed:${FAILED_JOBS}" > ${OUTPUT_DIR}/catch-tests-failed.txt
fi

# Return "success" so that the pytest-based testing can run.
exit 0
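Note on the LSF script's error handling: each suite that fails is recorded in FAILED_JOBS rather than aborting the script, and the script always exits 0 so the pytest stages still run. The same pattern in isolation, as a sketch only: `run_and_record` is a hypothetical helper, and `true` and `false` stand in for the three `lrun` invocations.

```shell
FAILED_JOBS=""

# Hypothetical helper: run a command and remember its label if it fails.
run_and_record() {
    label="$1"
    shift
    if ! "$@"; then
        FAILED_JOBS="${FAILED_JOBS} ${label}"
    fi
}

# Stand-ins for the seq, mpi, and mpi-filesystem Catch2 runs.
run_and_record seq true
run_and_record mpi false
run_and_record mpi-filesystem true

# Report failures without changing the exit status of the script.
if [ -n "${FAILED_JOBS}" ]; then
    echo "Some Catch2 tests failed:${FAILED_JOBS}"
fi
```

Because the status is swallowed, a failure is only visible through the JUnit XML files and catch-tests-failed.txt saved as artifacts.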
12 changes: 9 additions & 3 deletions .gitlab/corona/pipeline.yml
@@ -79,6 +79,7 @@ build and install:
     - export TEST_MPIBIND_FLAG="--mpibind=off"
     - export SPACK_ARCH=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch)
     - export SPACK_ARCH_TARGET=$(flux proxy ${JOB_ID} flux mini run -N 1 spack arch -t)
+    - export EXTRA_FLUX_ARGS="-o pmi=pmix"
     - !reference [.setup_lbann, script]
     - flux proxy ${JOB_ID} .gitlab/common/run-catch-tests-flux.sh

@@ -97,7 +98,8 @@ unit tests:
     - export OMP_NUM_THREADS=10
     - "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
     - cd ci_test/unit_tests
-    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml
+    # - echo "Running unit tests with file pattern: ${TEST_FLAG}"
+    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 --junitxml=results.xml ${TEST_FLAG}
   artifacts:
     when: always
     paths:
@@ -114,15 +116,19 @@ integration tests:
   stage: test
   dependencies:
     - build and install
+  rules:
+    - if: $TEST_FLAGS =~ /^distconv/
+      when: never
   script:
     - echo "== RUNNING PYTHON-BASED INTEGRATION TESTS =="
     - echo "Testing $(which lbann)"
     - export OMP_NUM_THREADS=10
     - "export FLUX_JOB_ID=$(flux jobs -no {id}:{name} | grep ${JOB_NAME} | awk -F: '{print $1}')"
     - cd ci_test/integration_tests
     - export WEEKLY_FLAG=${WITH_WEEKLY:+--weekly}
-    - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml"
-    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml
+    # - echo "Running integration tests with file pattern: ${TEST_FLAG}"
+    # - echo "python3 -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}"
+    - flux proxy ${FLUX_JOB_ID} lbann_pfe.sh -m pytest -s -vv --durations=0 ${WEEKLY_FLAG} --junitxml=results.xml ${TEST_FLAG}
   artifacts:
     when: always
     paths:
Expand Down
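Note on WEEKLY_FLAG in the integration tests job: `${WITH_WEEKLY:+--weekly}` expands to `--weekly` whenever WITH_WEEKLY is set to any non-empty string, and to nothing otherwise. A short sketch of that expansion; `weekly_flag` is a hypothetical wrapper used only to exercise it with different inputs.

```shell
# Hypothetical wrapper around the expansion used in the pipeline.
weekly_flag() {
    WITH_WEEKLY="$1"
    echo "${WITH_WEEKLY:+--weekly}"
}

flag_unset=$(weekly_flag "")       # empty value: no flag is passed to pytest
flag_on=$(weekly_flag "1")         # any non-empty value enables the flag
flag_false=$(weekly_flag "false")  # the string "false" is also non-empty

echo "unset='${flag_unset}' on='${flag_on}' false='${flag_false}'"
```

The last case is the one to watch: setting LBANN_CI_RUN_WEEKLY to "false" or "0" would still enable weekly tests under this expansion, so the variable has to be left empty to turn them off.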