This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[1.x, BUGFIX] WAR the dataloader issue with forked processes holding stale references #19924

Merged: 1 commit, Feb 20, 2021

ptrendx (Member) commented Feb 19, 2021

Description

Fixes #19918 #19877

For explanation of the problem see #19918 (comment)

@Zha0q1

Checklist

Essentials

  • PR's title starts with a category (e.g. [BUGFIX], [MODEL], [TUTORIAL], [FEATURE], [DOC], etc)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage
  • Code is well-documented

@ptrendx requested a review from szha as a code owner on February 19, 2021, 18:35

@mxnet-bot

Hey @ptrendx, thanks for submitting the PR.
All tests are already queued to run once. If tests fail, you can trigger one or more tests again with the following commands:

  • To trigger all jobs: @mxnet-bot run ci [all]
  • To trigger specific jobs: @mxnet-bot run ci [job1, job2]

CI supported jobs: [clang, centos-gpu, unix-cpu, website, sanity, edge, windows-cpu, unix-gpu, centos-cpu, windows-gpu, miscellaneous]


Note:
Only the following 3 categories can trigger CI: PR Author, MXNet Committer, Jenkins Admin.
All CI tests must pass before the PR can be merged.

@lanking520 added the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) on Feb 19, 2021

Zha0q1 (Contributor) left a comment

Thank you!

@lanking520 added the pr-work-in-progress (PR is still work in progress) and pr-awaiting-testing (PR is reviewed and waiting CI build and test) labels, and removed the pr-awaiting-testing and pr-work-in-progress labels, on Feb 19, 2021

josephevans (Contributor) left a comment

LGTM, thanks!

@lanking520 added the pr-awaiting-merge label (Review and CI is complete. Ready to Merge) and removed the pr-awaiting-testing label (PR is reviewed and waiting CI build and test) on Feb 20, 2021
Zha0q1 merged commit 18d1771 into apache:v1.x on Feb 20, 2021
Zha0q1 pushed a commit to Zha0q1/incubator-mxnet that referenced this pull request Feb 22, 2021
Zha0q1 pushed a commit to Zha0q1/incubator-mxnet that referenced this pull request Feb 24, 2021
Zha0q1 added a commit that referenced this pull request Mar 2, 2021
* [v1.x] Migrate to use ECR as docker cache instead of dockerhub (#19654)

* [v1.x] Update CI build scripts to install python 3.6 from deadsnakes repo (#19788)

* Install python3.6 from deadsnakes repo, since 3.5 is EOL'd and get-pip.py no longer works with 3.5.

* Set symlink for python3 to point to newly installed 3.6 version.

* Setting symlink or using update-alternatives causes add-apt-repository to fail, so instead just set alias in environment to call the correct python version.

* Setup symlinks in /usr/local/bin, since it comes first in the path.

* Don't use absolute path for python3 executable, just use python3 from path.

Co-authored-by: Joe Evans <[email protected]>

* Disable unix-gpu-cu110 pipeline for v1.x build since we now build with cuda 11.0 in windows pipelines. (#19828)

Co-authored-by: Joe Evans <[email protected]>

* [v1.x] For ECR, ensure we sanitize region input from environment variable (#19882)

* Set default for cache_intermediate.

* Make sure we sanitize region extracted from registry, since we pass it to os.system.

Co-authored-by: Joe Evans <[email protected]>

* [v1.x] Address CI failures with docker timeouts (v2) (#19890)

* Add random sleep only, since retry attempts are already implemented.

* Reduce random sleep to 2-10 sec.

Co-authored-by: Joe Evans <[email protected]>

* [v1.x] CI fixes to make more stable and upgradable (#19895)

* Test moving pipelines from p3 to g4.

* Remove fallback codecov command - the existing (first) command works and the second always fails a few times before finally succeeding (and also doesn't support the -P parameter, which causes an error.)

* Stop using docker python client, since it still doesn't support latest nvidia 'gpus' attribute. Switch to using subprocess calls using list parameter (to avoid shell injections).

See docker/docker-py#2395

* Remove old files.

* Fix comment

* Set default environment variables

* Fix GPU syntax.

* Use subprocess.run and redirect output to stdout, don't run docker in interactive mode.

* Check if codecov works without providing parameters now.

* Send docker stderr to sys.stderr

* Support both nvidia-docker configurations, first try '--gpus all', and if that fails, then try '--runtime nvidia'.

Co-authored-by: Joe Evans <[email protected]>

* fix cd

* fix cudnn version for cu10.2 build

* WAR the dataloader issue with forked processes holding stale references (#19924)

* skip some tests

* fix skip

* [v.1x] Attempt to fix v1.x cd by installing new cuda compt package (#19959)

* update cuda compat for cd

* Update Dockerfile.build.ubuntu_gpu_cu102

* Update Dockerfile.build.ubuntu_gpu_cu102

* Update Dockerfile.build.ubuntu_gpu_cu110

* Update runtime_functions.sh

* Update Dockerfile.build.ubuntu_gpu_cu110

* Update Dockerfile.build.ubuntu_gpu_cu102

* update command

Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Joe Evans <[email protected]>
Co-authored-by: Przemyslaw Tredak <[email protected]>
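
Three of the CI changes listed in the commit message above (sanitizing the region before it reaches a command, passing docker arguments as a list to avoid shell injection, and trying '--gpus all' before '--runtime nvidia') can be illustrated with a minimal sketch. This is not the code from the referenced commits: the function names, the injectable `runner` parameter, and the region pattern are all assumptions made for illustration.

```python
import re
import subprocess
import sys

# Assumed pattern for region names such as "us-west-2" (illustrative only).
_REGION_RE = re.compile(r"[a-z]{2}(-[a-z]+)+-\d+")


def sanitize_region(region):
    """Return the region unchanged if it looks like a region name, else raise."""
    if not _REGION_RE.fullmatch(region):
        raise ValueError("unexpected region value: {!r}".format(region))
    return region


def docker_run_args(image, command, gpu_flags=None):
    """Build the docker invocation as a list so no shell parses the values."""
    args = ["docker", "run", "--rm"]
    if gpu_flags:
        args.extend(gpu_flags)
    args.append(image)
    args.extend(command)
    return args


def run_with_gpu_fallback(image, command, runner=subprocess.run):
    """Try '--gpus all' first; if that fails, retry with '--runtime nvidia'."""
    result = None
    for gpu_flags in (["--gpus", "all"], ["--runtime", "nvidia"]):
        result = runner(docker_run_args(image, command, gpu_flags),
                        stdout=sys.stdout, stderr=sys.stderr)
        if result.returncode == 0:
            break
    return result
```

Because the arguments are passed as a list, a value such as "a; b" reaches docker as a single argument instead of being interpreted by a shell. The `runner` parameter exists only so the fallback logic can be exercised without a docker installation.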
mseth10 pushed a commit to mseth10/incubator-mxnet that referenced this pull request Mar 15, 2021
Labels
pr-awaiting-merge Review and CI is complete. Ready to Merge
5 participants