[v1.x] Migrate to use ECR as docker cache instead of dockerhub #19654

josephevans · 2020-12-10T23:43:50Z

Description

This PR changes the Docker tag used for build containers so that they are retrieved from and stored in a single new ECR registry (defined in a Jenkins environment variable). Each tag is derived from a hash of the Dockerfile and all files it copies, which prevents build container name collisions between branches.
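A minimal sketch of how such a content-based tag could be computed, assuming SHA-256 over the Dockerfile plus the files it copies. The function names, the 12-character truncation, and the `<registry>:<platform>-<hash>` layout are illustrative assumptions, not the code in `ci/build.py`.

```python
import hashlib
from pathlib import Path
from typing import Iterable


def build_context_hash(dockerfile: Path, copied_files: Iterable[Path]) -> str:
    """Return a short hex digest over the Dockerfile and every file it copies."""
    digest = hashlib.sha256()
    # Sort the copied files so the result does not depend on discovery order.
    for path in [dockerfile, *sorted(copied_files)]:
        # Mix in the file name so that renaming a file also changes the hash.
        digest.update(path.name.encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
    # Truncation length is an arbitrary choice for readable tags.
    return digest.hexdigest()[:12]


def cache_tag(registry: str, platform: str, dockerfile: Path,
              copied_files: Iterable[Path]) -> str:
    """Compose '<registry>:<platform>-<hash>' (hypothetical naming scheme)."""
    return f"{registry}:{platform}-{build_context_hash(dockerfile, copied_files)}"
```

Because the tag depends only on file contents, two branches with identical Dockerfiles and copied scripts resolve to the same cached image, while any change to either produces a new tag.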

This should make CI faster and more stable, because we won't have to rebuild the containers on every CI run. (The master branch already reuses Docker images built nightly, but other branches cannot use them because their Dockerfiles are different.)

A Jenkins pipeline monitors the v1.x branch; on PR merge, it regenerates the Docker images on a restricted node and pushes them to the ECR registry.

There are 60+ stages in the v1.x pipeline, and each stage takes about 15 minutes to build its Docker images, so this would save us about 15 hours of setup time. With 2 executors per slave node, that is about 7.5 hours of actual instance time.
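The estimate can be checked with a few lines of arithmetic. The inputs are the rough figures quoted in the description (60 is the lower bound of "60+"), so the results are approximations rather than measurements.

```python
stages = 60               # lower bound of the "60+" stages in the v1.x pipeline
minutes_per_build = 15    # approximate image build time per stage
executors_per_node = 2    # stages that share one instance

# Total time spent building images across all stages, in hours.
setup_hours = stages * minutes_per_build / 60

# Two stages run concurrently on each instance, so instance time is halved.
instance_hours = setup_hours / executors_per_node

print(setup_hours, instance_hours)
```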

mxnet-bot · 2020-12-10T23:43:54Z

Hey @josephevans, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

To trigger all jobs: @mxnet-bot run ci [all]
To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, centos-gpu, sanity, unix-gpu, windows-cpu, unix-cpu, windows-gpu, website, edge, centos-cpu, miscellaneous]

Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

leezu

We should also ask the ECR team to implement aws/containers-roadmap#876. It doesn't affect this PR, as we use the legacy docker build tool, but it will be needed for #19605.

ci/build.py

…ECR repository, but use the platform and a hash of the dockerfile in the tag name so we can cache across branches. Also push newly built containers up to ECR repo so future CI runs will not have to build entire container.

waytrue17

LGTM, thanks!

szha · 2021-02-11T04:24:09Z

ci/docker_cache.py

+    # extract region from registry
+    region = registry.split(".")[3]
+    logging.info("Logging into ECR region %s using aws-cli..", region)
+    os.system("$(aws ecr get-login --region "+region+" --no-include-email)")


I wonder if this is the recommended way to populate login in Python.

josephevans · 2021-02-11

It looks like you can do it from Python, but it would take a bit of work to rewrite this whole module, since we are already shelling out for the docker commands.
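One way to keep shelling out while avoiding a shell string is sketched below. It validates the region parsed from the registry host and passes list arguments to `subprocess`. The helper names and the hostname regex are assumptions for illustration, not code from `ci/docker_cache.py`; the sketch also uses `aws ecr get-login-password` with `docker login --password-stdin`, the newer replacement for the `aws ecr get-login` call in the diff, so it is a swapped-in technique rather than what the PR does.

```python
import re
import subprocess
from typing import List, Tuple

# Assumed ECR host layout: <12-digit account>.dkr.ecr.<region>.amazonaws.com
_ECR_HOST = re.compile(
    r"^\d{12}\.dkr\.ecr\.([a-z]{2}(?:-[a-z]+)+-\d+)\.amazonaws\.com(?:\.cn)?$"
)


def extract_region(registry: str) -> str:
    """Return the AWS region embedded in an ECR registry name, or raise."""
    host = registry.split("/", 1)[0]  # drop any repository path
    match = _ECR_HOST.match(host)
    if match is None:
        raise ValueError(f"not a recognised ECR registry host: {host!r}")
    return match.group(1)


def login_commands(registry: str) -> Tuple[List[str], List[str]]:
    """Build the two argument lists needed to log Docker in to ECR."""
    region = extract_region(registry)
    host = registry.split("/", 1)[0]
    get_password = ["aws", "ecr", "get-login-password", "--region", region]
    docker_login = ["docker", "login", "--username", "AWS",
                    "--password-stdin", host]
    return get_password, docker_login


def ecr_login(registry: str) -> None:
    """Run both commands without a shell, piping the token into docker."""
    get_password, docker_login = login_commands(registry)
    token = subprocess.run(get_password, check=True,
                           stdout=subprocess.PIPE).stdout
    subprocess.run(docker_login, check=True, input=token)
```

Since no string is ever handed to a shell, a malformed registry value fails validation in `extract_region` instead of being interpreted as a command.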

* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)
* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)
  * Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.
  * Set symlink for python3 to point to newly installed 3.6 version.
  * Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.
  * Setup symlinks in /usr/local/bin, since it comes first in the path.
  * Don't use absolute path for python3 executable, just use python3 from path.

  Co-authored-by: Joe Evans <[email protected]>
* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)

  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)
  * Set default for cache_intermediate.
  * Make sure we sanitize region extracted from registry, since we pass it to os.system.

  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] Address CI failures with docker timeouts (v2) (#19890)
  * Add random sleep only, since retry attempts are already implemented.
  * Reduce random sleep to 2-10 sec.

  Co-authored-by: Joe Evans <[email protected]>
* [v1.x] CI fixes to make more stable and upgradable (#19895)
  * Test moving pipelines from p3 to g4.
  * Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)
  * Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections). See docker/docker-py#2395
  * Remove old files.
  * Fix comment
  * Set default environment variables
  * Fix GPU syntax.
  * Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.
  * Check if codecov works without providing parameters now.
  * Send docker stderr to sys.stderr
  * Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

  Co-authored-by: Joe Evans <[email protected]>
* fix cd
* fix cudnn version for cu10.2 buiuld
* WAR the dataloader issue with forked processes holding stale references (#19924)
* skip some tests
* fix ski[
* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)
  * update cude compt for cd
  * Update Dockerfile.build.ubuntu_gpu_cu102
  * Update Dockerfile.build.ubuntu_gpu_cu102
  * Update Dockerfile.build.ubuntu_gpu_cu110
  * Update runtime_functions.sh
  * Update Dockerfile.build.ubuntu_gpu_cu110
  * Update Dockerfile.build.ubuntu_gpu_cu102
  * update command

Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Przemyslaw Tredak <[email protected]>
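The last item of #19895 in the commit message above (try `--gpus all`, then fall back to `--runtime nvidia`, using list arguments rather than a shell string) can be sketched roughly as follows. The function names and the injectable `runner` parameter are hypothetical and exist only to make the sketch testable without Docker; this is not the code in `ci/build.py`.

```python
import subprocess
from typing import Callable, List, Sequence

# Flag variants to try in order: native GPU support first, then the
# older nvidia runtime configuration.
GPU_FLAG_VARIANTS = (["--gpus", "all"], ["--runtime", "nvidia"])


def docker_run_cmd(image: str, command: Sequence[str],
                   gpu_flags: Sequence[str]) -> List[str]:
    """Build a docker argument list; no shell is involved, so no quoting issues."""
    return ["docker", "run", "--rm", *gpu_flags, image, *command]


def run_with_gpu_fallback(image: str, command: Sequence[str],
                          runner: Callable[[List[str]], int] = subprocess.call
                          ) -> List[str]:
    """Try each GPU flag variant until one exits with status 0.

    Returns the argument list that succeeded, or raises RuntimeError.
    """
    last_rc = 0
    for flags in GPU_FLAG_VARIANTS:
        cmd = docker_run_cmd(image, command, flags)
        last_rc = runner(cmd)
        if last_rc == 0:
            return cmd
    raise RuntimeError(f"docker run failed for all GPU flag variants (rc={last_rc})")
```

One trade-off of this simple form: any non-zero exit triggers the fallback, so a command that fails for reasons unrelated to GPU configuration is run twice.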

josephevans requested review from aaronmarkham and marcoabreu as code owners December 10, 2020 23:43

lanking520 added the pr-work-in-progress label (PR is still work in progress) Dec 10, 2020

leezu reviewed Dec 11, 2020

ci/build.py (outdated, resolved)

josephevans force-pushed the docker_cache_v1.x branch from 42eea93 to cd5834a Compare December 11, 2020 05:53

marcoabreu reviewed Dec 11, 2020

ci/build.py (outdated, resolved)

marcoabreu suggested changes Dec 11, 2020

ci/build.py (outdated, resolved)

ci/build.py (outdated, resolved)

Joe Evans added 8 commits February 8, 2021 19:45

Use new env variable for new ECR registry in order to test.

a2d58de

Fix typo.

bb1436d

Fix variable name and make sure we login to ECR before attempting to pull image.

a10381b

Fix variable.

a4df078

Specify region based on environment variable.

762dc55

Only push image to docker cache when DOCKER_ECR_CACHE is defined.

235390a

Add all copied files to hash context to make sure it's unique.

d1e8cae

josephevans force-pushed the docker_cache_v1.x branch from e85e43c to d1e8cae Compare February 9, 2021 05:26

Joe Evans added 3 commits February 8, 2021 22:28

Move ECR logic to docker_cache.py, don't push containers in normal pipeline - only push in restricted docker cache pipeline.

2e58613

Remove invalid keyword arg docker_binary.

624223b

Merge remote-tracking branch 'upstream/v1.x' into docker_cache_v1.x

ab54767

josephevans changed the title from "[WIP] Migrate to use ECR as docker cache instead of dockerhub" to "[v1.x] Migrate to use ECR as docker cache instead of dockerhub" Feb 10, 2021

lanking520 added the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) and removed the pr-work-in-progress label (PR is still work in progress) Feb 10, 2021

josephevans requested review from marcoabreu and leezu February 10, 2021 06:55

lanking520 added the pr-work-in-progress label (PR is still work in progress) and removed the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) Feb 10, 2021

leezu approved these changes Feb 10, 2021

lanking520 removed the pr-work-in-progress label (PR is still work in progress) Feb 10, 2021

lanking520 added the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) Feb 10, 2021

Remove unused function.

70bb17a

lanking520 added the pr-work-in-progress label (PR is still work in progress) and removed the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) Feb 10, 2021

waytrue17 approved these changes Feb 10, 2021

szha reviewed Feb 11, 2021

szha merged commit 9f3da90 into apache:v1.x Feb 11, 2021

josephevans added a commit to josephevans/mxnet that referenced this pull request Feb 24, 2021

[v1.x] Migrate to use ECR as docker cache instead of dockerhub (apache#19654)

7e38dcb

josephevans mentioned this pull request Feb 24, 2021

[v1.8.x] Backport PRs from v1.x branch #19946

Closed

Zha0q1 pushed a commit to Zha0q1/incubator-mxnet that referenced this pull request Mar 1, 2021

[v1.x] Migrate to use ECR as docker cache instead of dockerhub (apache#19654)

4b911e2
