Skip to content
This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

[CD] Add COMMIT_ID param to release job #16202

Merged
merged 8 commits into from
Sep 24, 2019

Conversation

perdasilva
Copy link
Contributor

@perdasilva perdasilva commented Sep 18, 2019

Description

Currently, if a new commit on master comes between the different stages, the CD pipeline will break. The reason is, the libxmet binary will be posted against one commit id, but the release pipeline (e.g. for PyPI packages), will be kicked off with the latest commit id.

This PR proposes to fix this issue by introducing a COMMIT_ID parameter to the release job definition. This solves the issue. This means we can use the commit id as the branch specifier in the 'Pipeline' section of the job configuration and ensure that the specified commit_id will be used for the build. This comes with some drawbacks, namely that release job will assume the state of the last job it ran. This can change the pipeline definition and lead to hard to track errors.

Screen Shot 2019-09-18 at 8 05 47 PM

Checklist

Essentials

Please feel free to remove inapplicable items for your PR.

  • The PR title starts with [MXNET-$JIRA_ID], where $JIRA_ID refers to the relevant JIRA issue created (except PRs with tiny changes)
  • Changes are complete (i.e. I finished coding on this PR)
  • All changes have test coverage:
  • Unit tests are added for small changes to verify correctness (e.g. adding a new operator)
  • Nightly tests are added for complicated/long-running ones (e.g. changing distributed kvstore)
  • Build tests will be added for build configuration changes (e.g. adding a new build option with NCCL)
  • Code is well-documented:
  • For user-facing API changes, API doc string has been updated.
  • For new C++ functions in header files, their functionalities and arguments are documented.
  • For new examples, README.md is added to explain the what the example does, the source of the dataset, expected performance on test set and reference to the original paper if applicable
  • Check the API doc at http://mxnet-ci-doc.s3-accelerate.dualstack.amazonaws.com/PR-$PR_ID/$BUILD_ID/index.html
  • To the my best knowledge, examples are either not affected by this change, or have been fixed to be compatible with this change

Changes

  • Feature1, tests, (and when applicable, API doc)
  • Feature2, tests, (and when applicable, API doc)

Comments

  • If this change is a backward incompatible change, why must this change be made.
  • Interesting edge cases to note here

@perdasilva
Copy link
Contributor Author

@zachgk could you please have a look at this - the other PR is blocked by it

@perdasilva
Copy link
Contributor Author

dev job

@perdasilva
Copy link
Contributor Author

@zachgk if this is all good, please merge - the pipeline works. The failures are either due to flakyness or a persistent problem that I've noticed and created an issue for: 16208

Copy link
Contributor

@aaronmarkham aaronmarkham left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just putting a hold on this for a bit since it might make a merge conflict with the website release. Please bear with me.
Can you explain removing the git_sha param? I didn't see it in use in the init_git function anyways.
Can you describe what steps to take when you do get stuck because of this change? Like if I were to try to emulate the process with a build.py run on a local instance, what would I need to do differently?

@perdasilva
Copy link
Contributor Author

@aaronmarkham
Can you explain removing the git_sha param?
That was left-over from a previous CD PR. We originally had the functionality to clone a particular commit id, but then decided to remove it. It was too messy. I guess we forgot to remove everything...

Can you describe what steps to take when you do get stuck because of this change?
So, here I think it's important to first understand the source of the issue. The CD framework is made up of two Jenkins jobs: one that defines the overall CD pipeline (e.g. static lib release -> then, in parallel, pipelines for releasing the different artifacts (pypi, docker images, etc.) that depend on the static library). The other, is a parameterized job that executes the pipelines that build, validate, and publish a particular artifact (e.g. binary, pypi package, docker image, etc.). Essentially, the CD pipeline orchestrates calls to the parameterized job by calling it with the parameters that define the pipeline for each of the artifacts.

Now, the CD pipeline is basically broken in two parts: the first builds and quality assures the libmxnet.so files and pushes them to the artifact repository indexed by the commit id. Then, the pipelines for each delivery channel take the binaries from the artifact repository, packages, tests and ships them. Our issue occurs if there are new commits to the branch from which that CD pipeline is being executed. That is, different (downstream) jobs could end up checking out different states of the repository, and there will be a failure because the binary for a commit id won't be found (since the libmxnet.so build pipeline worked off and got posted with a different commit id).

This PR introduces the commit id parameter to the parameterized job to guarantee that the commit id used to checkout the CD pipeline will be used throughout the entire process. The only caveat here is that the interface, (i.e. the parameters) to the parameterized job is defined in the Jenkinsfile using pipeline syntax.
This means, that every time that job runs, the current state (interface) could change depending on what is defined in the Jenkinsfile for the job at the supplied commit id. As an example, if we supply a commit id that is pre- adding the commit id parameter, if you try to re-execute the job, the parameter won't be there.

This isn't a huge deal since this interface shouldn't change often at all.

Can you describe what steps to take when you do get stuck because of this change?
The job will need to be manually updated (by just running it and then stopping it). We could also possibly run a dummy parameterized job (that does nothing) as part of the CD pipeline to update the parameterized job's to that of the commit id being processed state.

Like if I were to try to emulate the process with a build.py run on a local instance, what would I need to do differently?

It doesn't quite translate to the process on the local instance because the binary is compiled there.

Does any of this make sense? It's kinda convoluted to explain. I'll make a quick change and add this dummy run. Then we can avoid the problem altogether.

@perdasilva
Copy link
Contributor Author

@aaronmarkham also - no problem with you blocking it. I know you want to get this website released hehehe

@perdasilva perdasilva force-pushed the cd_commit_id_param branch 6 times, most recently from a222c60 to 8d91a8f Compare September 19, 2019 19:19
@perdasilva perdasilva force-pushed the cd_commit_id_param branch 4 times, most recently from aa4c9ee to 715d1e3 Compare September 19, 2019 19:46
@perdasilva
Copy link
Contributor Author

@aaronmarkham any movement on this?

@aaronmarkham aaronmarkham merged commit cbbb96a into apache:master Sep 24, 2019
drivanov pushed a commit to drivanov/incubator-mxnet that referenced this pull request Sep 26, 2019
* Removes unnecessary parameter

* Adds revision parameter to release job

* Update documentation

* Changes RELEASE_JOB_TYPE to string parameter

* Adds variant and release job type checks

* Adds release job update to CD pipeline

* Disable concurrent builds for CD pipeline

* Updates documentation with new release job mechanics
larroy pushed a commit to larroy/mxnet that referenced this pull request Sep 28, 2019
* Removes unnecessary parameter

* Adds revision parameter to release job

* Update documentation

* Changes RELEASE_JOB_TYPE to string parameter

* Adds variant and release job type checks

* Adds release job update to CD pipeline

* Disable concurrent builds for CD pipeline

* Updates documentation with new release job mechanics
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants