-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Improve travis.yml config for faster CI #2278
Conversation
For comparison of build performance, here's a job with the existing Travis config that ran essentially no tests (typo fix PR) and took a total of 49 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113105919 With this PR, running with an empty commit (so the cache is warm, and also running no tests) it takes a total of 37.5 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113164615 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.
Yay!
- factored out bazel settings into a
ci/bazelrc
file- factored out bazel downloading logic into
ci/download_bazel.sh
and streamlined it slightly- added bazelrc common
--curses=no
to suppress attempted progress bar spam
Yaaay! :-)
.travis.yml
Outdated
# We need to pass the PATH from our virtualenv down into our tests, | ||
# which is non-hermetic and so disabled by default in Bazel 0.21.0+. | ||
- echo "test --action_env=PATH" >>~/.bazelrc | ||
- elapsed() { TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" "$@"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps you mean "$*"
here?
$ bash -c 'TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" one two'
Time 00:00:00 one
bash: line 0: printf: two: invalid number
Time 00:00:00
(Or, from usage, maybe just "${1-}"
?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Changed to just ${1}
.
ci/download_bazel.sh
Outdated
|
||
temp_dest="$(mktemp)" | ||
|
||
mirror_url="https://mirror.bazel.build/github.com/bazelbuild/bazel/releases/download/${version}/bazel-${version}-linux-x86_64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use http://mirror.tensorflow.org/
instead?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
ci/download_bazel.sh
Outdated
} | ||
|
||
if [ "$#" -ne 3 ]; then | ||
die "Usage: ${0} <version> <sha256sum> <destination-file>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Google style is to omit braces on positional parameters:
https://google.github.io/styleguide/shell.xml?showone=Variable_expansion#Variable_expansion
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
.travis.yml
Outdated
- pip install yamllint==1.5.0 | ||
# TensorBoard deps. | ||
- pip install futures==3.1.1 | ||
- pip install grpcio==1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the version downgrade? Was grpcio==1.6.3
before.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted, thanks for the catch. I have no idea how that line ended up changing... I can't trace it to any other change. I must have just clobbered it somehow.
.travis.yml
Outdated
# When TensorFlow is not installed, run a restricted subset of tests. | ||
if [ -z "${TF_VERSION_ID}" ]; then | ||
test_tag_filters=support_notf | ||
# Run tests (only a restricted subset if TensorFlow is not installed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just a note that this makes the string “bazel test
… exited with 3” no
longer show up on one line of the build log, which was the original
reason for pulling out test_tag_filters
:
#2075 (comment)
If you still prefer it this way, fine with me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, so that's why it's structured like this. I reverted the change but moved the variable setting up to before_script
so that if that part fails the build will error, rather than proceeding with the test command.
@@ -29,97 +24,55 @@ env: | |||
- TF_VERSION_ID= # Do not install TensorFlow in this case | |||
|
|||
cache: | |||
pip: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, it looks like this just caches the Pip cache directory and not the
actual virtualenv contents, which is good. (If it cached the virtualenv,
we might be relatively safe anyway, given that all our Pip dependencies
are either pinned or installed with -I
… but we could still be hit by
any Pip problems w.r.t. reinstalling packages in the same env.)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment, but yes this is fine, though we do get weird pip cache deserialization errors fairly often for reasons that are unclear to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ಠ_ಠ
🙈
lgtm
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The set of changes left after removing all of them from view trivially looks good, huh?
pip install "absl-py>=0.7.0" | ||
pip install "numpy<2.0,>=1.14.5" | ||
# Requirements typically found through TensorFlow. | ||
pip install "absl-py>=0.7.0" \ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ooh, good catch. Would be nice to have some kind of --chain-lint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
PTAL
@@ -29,97 +24,55 @@ env: | |||
- TF_VERSION_ID= # Do not install TensorFlow in this case | |||
|
|||
cache: | |||
pip: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a comment, but yes this is fine, though we do get weird pip cache deserialization errors fairly often for reasons that are unclear to me.
.travis.yml
Outdated
# We need to pass the PATH from our virtualenv down into our tests, | ||
# which is non-hermetic and so disabled by default in Bazel 0.21.0+. | ||
- echo "test --action_env=PATH" >>~/.bazelrc | ||
- elapsed() { TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" "$@"; } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Changed to just ${1}
.
.travis.yml
Outdated
- pip install yamllint==1.5.0 | ||
# TensorBoard deps. | ||
- pip install futures==3.1.1 | ||
- pip install grpcio==1.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reverted, thanks for the catch. I have no idea how that line ended up changing... I can't trace it to any other change. I must have just clobbered it somehow.
.travis.yml
Outdated
# When TensorFlow is not installed, run a restricted subset of tests. | ||
if [ -z "${TF_VERSION_ID}" ]; then | ||
test_tag_filters=support_notf | ||
# Run tests (only a restricted subset if TensorFlow is not installed). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah ok, so that's why it's structured like this. I reverted the change but moved the variable setting up to before_script
so that if that part fails the build will error, rather than proceeding with the test command.
ci/download_bazel.sh
Outdated
} | ||
|
||
if [ "$#" -ne 3 ]; then | ||
die "Usage: ${0} <version> <sha256sum> <destination-file>" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
ci/download_bazel.sh
Outdated
|
||
temp_dest="$(mktemp)" | ||
|
||
mirror_url="https://mirror.bazel.build/github.com/bazelbuild/bazel/releases/download/${version}/bazel-${version}-linux-x86_64" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Lovely; thanks!
@@ -29,97 +24,55 @@ env: | |||
- TF_VERSION_ID= # Do not install TensorFlow in this case | |||
|
|||
cache: | |||
pip: true |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ಠ_ಠ
🙈
lgtm
Summary: This was removed in #2278, but is actually pretty helpful. wchargin-branch: ci-pip-freeze
Summary: This was removed in #2278, but is actually pretty helpful. wchargin-branch: ci-pip-freeze
Summary: The Pip HTTP cache stores all downloaded wheels and is never evicted. With new `tf-nightly` wheels every day, this adds up quickly. We last cleared our Travis caches about a month ago, and they’re up to 14.3 GB. Investigation shows that the Pip HTTP cache accounts for the majority of the cache (about 70% after about a month of cache accrual), and also that jobs with larger caches have significantly longer startup times, with delta on the order of 8 minutes (again, after about a month). Also, uploading large caches at the end of a job can take minutes, and Travis doesn’t report success until this finishes. Fetching `tf-nightly` should be comparatively cheap. This reverts part of #2278. Test Plan: This PR reduces the “before install” time (i.e., time spent by Travis internals before it gets to our script, including restoring cache) from 9m40s to 4m59s, a 48% improvement. The “install” time is increased from 3m36s to 3m56s, which seems acceptable. wchargin-branch: ci-drop-pip-http-cache
This streamlines and improves our .travis.yml configuration and caching behavior in hopes of reducing our painfully slow Travis builds. In my testing so far this drops the total time to execute by ~30% (so from 50m total across 3 machines to more like 35m), by shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.
Here's the list of changes I implemented that I felt overall improved speed (or were mixed, but seemed like better practice):
bazel fetch
resultsAnd then I did some cleanup:
ci/bazelrc
fileci/download_bazel.sh
and streamlined it slightlycommon --curses=no
to suppress attempted progress bar spamFinally I added some simple instrumentation showing the elapsed time since the shell started at various points during the job, which makes it a bit easier to get a sense of where the slow parts are without having to sum up the time taken by many individual commands (also, the travis time indicators are slightly flaky and don't always get parsed correctly by the fancy UI).
I was originally hoping for a bigger speedup by combining all the testing for the different TF versions into a single job, since then we could reuse the build step and avoid rebuilding each time. The problem I ran into was that bazel test result caching falls apart at that point - I wasn't aware of a simple way to tell bazel to treat each run of the tests as a separate set of cacheable test results (rather than re-using the same cache, which due to the non-hermeticity could fail to rerun tests that need to be rerun). Trying to work around this via cache munging would erase a lot of the benefits of the simplification above, and that approach was already much more complicated than what I have here.
Down the road, I think further improvements could be: