Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

A solution to prevent zombie containers locally and in CI #12336

Closed
wants to merge 2 commits into from

Conversation

larroy
Copy link
Contributor

@larroy larroy commented Aug 24, 2018

Description

This PR adds mechanisms on the build scripts to cleanup docker containers when there is a cancellation, either from the user by sending SIGINT / SIGTERM or when the process inside the container ends with an error.

It also propagates the environment variables from Jenkins which are used by the process tree killer to identify runaway processes so processes inside the container are killed when the Jenkins job is stopped.

With this patch we catch SIGINT and SIGTERM and install a handler on atexit which cleans (stops) docker containers which were created during execution of the build.

We also switch to python docker API for better management of running containers.

@aaronmarkham @marcoabreu @lebeg @KellenSunderland

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@KellenSunderland
Copy link
Contributor

LGTM, let's get the host AMIs changed.

@aaronmarkham
Copy link
Contributor

I was hoping to see some usage info. 1) how to make a zombie; 2) how to kill it - I'd be happy to test this out.

@larroy
Copy link
Contributor Author

larroy commented Aug 27, 2018

I added info in the readme. Do you see it in the patch?

Copy link
Contributor

@KellenSunderland KellenSunderland left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you run pylint and mypy on this commit again, I see some issues.

@larroy
Copy link
Contributor Author

larroy commented Aug 28, 2018

Done. I don't know why the PR is not updated.

@larroy
Copy link
Contributor Author

larroy commented Aug 28, 2018

larroy@7bc6596

@larroy
Copy link
Contributor Author

larroy commented Aug 28, 2018

any ideas? is a github problem?

@larroy larroy changed the base branch from master to cython August 28, 2018 15:40
@larroy larroy changed the base branch from cython to master August 28, 2018 15:40
@larroy larroy closed this Aug 28, 2018
@larroy
Copy link
Contributor Author

larroy commented Aug 28, 2018

I can't reopen the PR, had some troubles with PR not updating and Jenkins, new pR:

#12381

@KellenSunderland
Copy link
Contributor

Github is having issues today. https://status.github.com/messages

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants