-
Notifications
You must be signed in to change notification settings - Fork 6.8k
[Don't merge][Review] A solution to prevent zombie containers locally and in CI #12276
Conversation
9d43167
to
55abd3c
Compare
@mxnet-label-bot: [pr-awaiting-review] |
ci/build.py
Outdated
@@ -23,7 +23,7 @@ | |||
""" | |||
|
|||
__author__ = 'Marco de Abreu, Kellen Sunderland, Anton Chernov, Pedro Larroy' | |||
__version__ = '0.1' | |||
__version__ = '0.8' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why bumping to 0.8 directly?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's getting close to 1.0, by features and maturity.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: but I'd suggest 0.2. We can jump straight to 1.0 whenever we like but until then I'd just make minor bumps.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall feedback: can you do a pass with pylint and mypy, and try and separate concerns into different atomic PRs.
ci/build.py
Outdated
|
||
def under_ci() -> bool: | ||
""":return: True if we run in Jenkins.""" | ||
return 'JOB_NAME' in os.environ | ||
|
||
def get_platforms(path: Optional[str] = "docker"): | ||
|
||
def git_cleanup() -> None: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This could remove useful files from a developers desktop silently. I suspect this could surprise many users. Could you potentially lose work if for example you have a new file you haven't checked in that you want tested.
This change should also be in its own PR so that we can review it effectively. I believe we could address docker zombies without making this change.
ci/build.py
Outdated
cmd.extend(["--cache-from", tag,]) | ||
cmd.extend(["-t", tag, get_dockerfiles_path()]) | ||
|
||
@retry(subprocess.CalledProcessError, tries=num_retries) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool but should be in a different PR. Reverting your zombie containers fix would mean we'd have to re-implement your retry refactor and vice-versa.
|
||
return docker_run_cmd | ||
docker_cmd_list.extend(command) | ||
docker_cmd = ' \\\n\t'.join(docker_cmd_list) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would you be open to not putting these on different lines with \ ? I've had to copy and paste these and each time I have to paste it into a text editor, removed the \ and newlines.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This works now as expected with the space, can be cut and pasted no problem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok cool, I'll give it a shot again.
ci/build.py
Outdated
'-e', 'CCACHE_TEMPDIR=/tmp/ccache', # temp dir should be local and not shared | ||
'-e', "CCACHE_DIR=/work/ccache", # this path is inside the container as /work/ccache is mounted | ||
'-e', "CCACHE_LOGFILE=/tmp/ccache.log", # a container-scoped log, useful for ccache verification. | ||
'-ti', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great change! Not specific to zombie containers, should be in its own PR.
ci/build.py
Outdated
@@ -383,7 +604,8 @@ def use_cache(): | |||
|
|||
./build.py -a | |||
|
|||
Builds for all platforms and leaves artifacts in build_<platform> | |||
Builds for all platforms and leaves artifacts in build_<platform>. **WARNING** it performs git |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See above comment on this.
ci/build.py
Outdated
|
||
if ret != 0: | ||
logging.critical("Execution of %s failed with status: %d", command, ret) | ||
return(ret) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
redundant ()
ci/build.py
Outdated
logging.warning("Cleaning up containers") | ||
else: | ||
return | ||
docker_client = docker.from_env() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is docker_client used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
⚡️
ci/build.py
Outdated
docker_client = docker.from_env() | ||
try: | ||
stop_timeout = int(os.environ.get("DOCKER_STOP_TIMEOUT", self.docker_stop_timeout)) | ||
except Exception as e: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
e unused
dry_run: bool = False, | ||
interactive: bool = False) -> str: | ||
interactive: bool = False) -> int: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
where is interactive used?
ci/build.py
Outdated
command=command, docker_registry=args.docker_registry, | ||
local_ccache_dir=args.ccache_dir, cleanup=cleanup) | ||
|
||
if ret != 0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we return 0 on success?
Can you provide any usage/testing info? |
Depends on PR #12296 |
123e39f
to
633ce92
Compare
Made the change smaller with @KellenSunderland in a different PR: #12336 |
* Separate minor refactoring from #12276 in a prior PR * Address CR comments
* Separate minor refactoring from apache#12276 in a prior PR * Address CR comments
Description
This PR adds mechanisms on the build scripts to cleanup docker containers when there is a cancellation, either from the user by sending SIGINT / SIGTERM or when the process inside the container ends with an error.
It also propagates the environment variables from Jenkins which are used by the process tree killer to identify runaway processes so processes inside the container are killed when the Jenkins job is stopped.
With this patch we catch SIGINT and SIGTERM and install a handler on atexit which cleans (stops) docker containers which were created during execution of the build.
We also switch to python docker API for better management of running containers.
@aaronmarkham @marcoabreu @lebeg @KellenSunderland
Checklist
Essentials
Please feel free to remove inapplicable items for your PR.