Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CD] Adds python docker image pipeline #16435

Closed

Conversation

perdasilva
Copy link
Contributor

Description

Extends the CD pipeline for publishing the python docker images

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at https://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

@perdasilva perdasilva force-pushed the cd_python_docker_pipeline branch 3 times, most recently from 0dba0ad to 60b5f31 Compare October 11, 2019 06:27
@perdasilva perdasilva changed the title [WIP][CD] Adds python docker image pipeline [CD] Adds python docker image pipeline Oct 11, 2019
ENV MXNET_COMMIT_ID=${MXNET_COMMIT_ID}

RUN mkdir -p /mxnet
COPY wheel_build/dist/*.whl /mxnet/.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You might want to put the copy behind the install instruction because of the cache layers

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

FROM ${BASE_IMAGE}

ARG MXNET_COMMIT_ID
ENV MXNET_COMMIT_ID=${MXNET_COMMIT_ID}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Move that down as well

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

# specific language governing permissions and limitations
# under the License.
#
# Python MXNet Dockerfile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the point of this file? It feels like you're replicating mechanisms we have in the regular CI dockerfiles

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a base image for the tests. Because we mount the local fs we need to use that same user. Same problem as in CI. I'll use the ubuntu_adduser.sh script, though...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done


# Builds runtime image
build() {
docker build -t "${image_name}" --build-arg PYTHON_CMD=${python_cmd} --build-arg BASE_IMAGE="${base_image}" --build-arg MXNET_COMMIT_ID=${GIT_COMMIT} -f ${resources_path}/Dockerfile .
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Raw docker access is highly discouraged. Please use our Python Docker wrapper. Otherwise you are doing to run into all sorts of leaks

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will leak here? This assumes docker login has already been called

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Docker build and docker run tend to leak resources and thus leave behind unclaimed containers. It took us a lot of time to get it right

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I get it - but it's way to tightly integrated with the CI stuff. I don't think it should be done here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As a compromise, I've added a trap to the script. This does basically what the build.py does.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We spent months on fixing all the problems that arose, so please spend a bit of time on integrating your stuff. We don't want the same issues to arise again - the current solution is proven to work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately, I don't have the time to refactor build.py and extract that functionality and make it general purpose. If you are willing, I'm happy to refactor this stuff once it is done...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The time saved now is paid by increased maintenance overhead and risk due to two separate solutions. So this is a time trade, where the increased maintenance overhead will then have to be handled by the community - sorry, but I don't like that stance.

from botocore.exceptions import ClientError


def docker_login():
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We already have docker login code, please don't duplicate

Copy link
Contributor Author

@perdasilva perdasilva Oct 11, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fair call - I'll refactor the login stuff out of the cache script and use that

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@marcoabreu
Copy link
Contributor

It feels like the wheel was reinvented a few times. Please use the existing mechanisms

@perdasilva
Copy link
Contributor Author

perdasilva commented Oct 11, 2019

@marcoabreu which mechanism should I be using?

@marcoabreu
Copy link
Contributor

The build.py or docker cache functions (the stuff we use to generate our docker cache). In the end, you're doing a similar job as the docker cache generation job, so you could adopt that

@perdasilva
Copy link
Contributor Author

perdasilva commented Oct 11, 2019

Sorry, I misunderstood you. I equated wheel with wheel file =P

But I don't necessarily agree with you. I don't think they are solving the same problem. My understanding of CI is that, as much as possible, we don't want to rebuild the images. This isn't so much of an issue here given the low level of complexity of the images, i.e. we just install python and copy over the wheel.

If you feel like you can refactor the stuff in build and cache to be more generic and not so tightly integrated with CI, I'm happy to use it.

What seems like an issue to me though, are the zombie containers. As a mid term solution, I'm trapping the SIGTERM and SIGINT signals and cleaning up the container - which is exactly what the python code does anyway - except in a way that is tightly bound to CI...

@marcoabreu
Copy link
Contributor

Sorry I'm not following. Both build docker images and then publish them. If you're bothered by the usage of catching, you can disable incremental builds in the function.

Sorry, but I don't really see much reason to go with a hack if we already got a working solution that you just have to integrate into. You were aware of the need to integrate your stuff with the existing code from the beginning when you decided you'd like to work off-branch, so this request shouldn't be a surprise.

@perdasilva
Copy link
Contributor Author

The problem is the tight integration with the CI stuff. At some point we had something, it's from months ago and I don't have time to merge it with the current state of things.

Again, if you want to do it, I'm happy to hook up this stuff to whatever you come up with...

@perdasilva
Copy link
Contributor Author

At the end of the day, I'm on my own time here and I don't have much of it. Are you doing to work with me to push this through, or should I close the PR?

@marcoabreu
Copy link
Contributor

Maybe you can find somebody else in the community who'd be able to assist you here. I'm just the reviewer here.

@perdasilva
Copy link
Contributor Author

perdasilva commented Oct 11, 2019

No worries. Understood. In the future I would suggest being more flexible and working with the submitter to mitigate any issues as a follow up. Thus, fostering actual community building. Now, the community can spend time recreating this functionality + lose another contributer...wish you guys all the best!

@perdasilva perdasilva closed this Oct 11, 2019
@marcoabreu
Copy link
Contributor

Since I know of the priority shift and the allocated time budget for future work on this particular project, I don't believe that a follow up would have happened. The risk this PR contains weighs over its benefit. No doubt that this feature would have been quite helpful to MXNet, but we already went through a lot of trouble with regards to zombie containers and I consider the stability of the whole CI more important. I made sure to bring up this particular point (integrating the solution with the existing tools and interfaces) right from the inception of CD - if the time budget wasn't allocated properly to accommodate this, then it's a pity, but it's a hard requirement. As development of a feature is part of a project, integration is as well - but that's often forgotten. I consider a standalone solution (like the one presented here) a proof of concept and the necessary next step is thus to align it with the existing infrastructure. If there's no time for that, then it has to stop at the POC stage. I'm generally open towards incremental changes, but not when a PR brings a risk to the entire system.

@perdasilva
Copy link
Contributor Author

I mitigated the risk in the PR. The integration could have been done in a follow up PR refactoring the functionality out of the scripts...as I said I've been working on this on my own time and would have continued to do so. But the consistent lack of flexibility and the constant trading of continuous improvement for delayed perfection makes this a hard project to contribute to. Wish you guys all the best!

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants