Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. #15762

Merged
merged 7 commits into from
Aug 9, 2019

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Aug 6, 2019

Description

This PR improves initialization of the Library which uses unsafe passing of environment variables to child processes through pthread_atfork handlers. Calling setenv concurrently with getenv leads to random crashes which have been reported in several ocassions but so far haven't been fixed.

This patch avoids using getenv and setenv and it should address those random crashes while improving handling of the library intialization in the LibraryInitializer object.

Fixes #13438
Fixes #14979

Fixes minor issues with #15755

Aggregated dynamic library code in LibraryInitializer.
Verified Dtor is called on exit.

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Instead of changing OMP_NUM_THREADS it directly set_thread_max to 1, even though OMP is disabled.
  • LibraryInitializer is run on (library) construction instead of static initialization so it's run before any other static initializer.
  • Improve documentation of concurrency environment variables.
  • Moved current_process_id to common

@larroy larroy requested a review from szha as a code owner August 6, 2019 01:39
@larroy larroy changed the title Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. [WIP] Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. Aug 6, 2019
@larroy
Copy link
Contributor Author

larroy commented Aug 6, 2019

@mxnet-label-bot add [pr-work-in-progress]

@marcoabreu marcoabreu added the pr-work-in-progress PR is still work in progress label Aug 6, 2019
@larroy
Copy link
Contributor Author

larroy commented Aug 6, 2019

@zhreshold @anirudh2290

Copy link
Member

@zhreshold zhreshold left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Copy link
Contributor

@samskalicky samskalicky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@larroy larroy changed the title [WIP] Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. Refactor LibraryInitializer so it's thread safe. Fixes random sporadical concurrency crashes. Aug 8, 2019
@larroy
Copy link
Contributor Author

larroy commented Aug 8, 2019

@mseth10 @samskalicky

src/common/utils.h Outdated Show resolved Hide resolved
src/initialize.cc Outdated Show resolved Hide resolved
@larroy
Copy link
Contributor Author

larroy commented Aug 8, 2019

@mxnet-label-bot remove [pr-work-in-progress]

@marcoabreu marcoabreu removed the pr-work-in-progress PR is still work in progress label Aug 8, 2019
@larroy
Copy link
Contributor Author

larroy commented Aug 8, 2019

@mxnet-label-bot add [pr-awaiting-review]

@marcoabreu marcoabreu added the pr-awaiting-review PR is waiting for code review label Aug 8, 2019
Copy link
Contributor

@mseth10 mseth10 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@zachgk zachgk merged commit bfd3bb8 into apache:master Aug 9, 2019
anirudhacharya pushed a commit to anirudhacharya/mxnet that referenced this pull request Aug 20, 2019
…cal concurrency crashes. (apache#15762)

* Refactor LibraryInitializer so it's thread safe.
Fixes apache#13438
Fixes apache#14979

* Refactor around lib loading

* Fix lint

* CR

* Add option to choose between OMP implementations

* Fix bug

* Fix from CR
apeforest added a commit that referenced this pull request Aug 27, 2019
… sporadical concurrency crashes. (#15762)"

This reverts commit bfd3bb8.
larroy added a commit to larroy/mxnet that referenced this pull request Dec 5, 2019
… sporadical concurrency crashes. (apache#15762)"

This reverts commit bfd3bb8.
larroy added a commit to larroy/mxnet that referenced this pull request Dec 7, 2019
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before apache#15762 was introduced.

Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially
larroy added a commit to larroy/mxnet that referenced this pull request Dec 7, 2019
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before apache#15762 was introduced.

Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially
anirudh2290 pushed a commit that referenced this pull request Dec 11, 2019
* Prevent after-fork number of OMP threads being bigger than 1.
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before #15762 was introduced.

Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially

* add C++ unit test

* Add comment
ptrendx pushed a commit that referenced this pull request Dec 12, 2019
* Fix test_gluon.py:test_sync_batchnorm when number of GPUS > 4

* Prevent after-fork number of OMP threads being bigger than 1.
This could happen if it was set in the environment. As we are setting engine::OpenMP::Get()->set_enabled(false) in initialize.cc in the child after forking, the behaviour goes back to what it was before #15762 was introduced.

Regions using omp get the threads count from GetRecommendedOMPThreadCount, so if omp is disabled they will get 1 thread and run serially

* add C++ unit test

* Add comment
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
pr-awaiting-review PR is waiting for code review
Projects
None yet
6 participants