Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve travis.yml config for faster CI #2278

Merged
merged 4 commits into from
May 30, 2019
Merged

Conversation

nfelt
Copy link
Contributor

@nfelt nfelt commented May 24, 2019

This streamlines and improves our .travis.yml configuration and caching behavior in hopes of reducing our painfully slow Travis builds. In my testing so far this drops the total time to execute by ~30% (so from 50m total across 3 machines to more like 35m), by shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.

Here's the list of changes I implemented that I felt overall improved speed (or were mixed, but seemed like better practice):

  • changed caching approach to use Bazel's properly designed caching APIs
    • repository caching for bazel fetch results
    • remote caching API for build results/artifacts, using local disk backend
    • con: fetch is slower (60ish seconds instead of 15) - not entirely sure why, but perhaps it doesn't cache all the downloaded results of fetch (some indication of this on Better documentation/controls to allow CI caching bazelbuild/bazel#6869). We may be able to tweak fetching behavior to improve this.
    • con: build is slower (90ish seconds instead of 60) - I think with the new caching we're rebuilding more things, e.g. now there are a bunch of closure warnings. We may be able to improve this by restructuring some of our build rules to be more cacheable, or generally speeding up the build.
    • pro: fetch/store operations for cache are much faster (60s fetch + 140s store reduced to 15s fetch + 60-75s store), which basically makes up for the above slowdowns
    • pro: no more need for hacky code to manually "prune" the output directory
  • added caching for pip packages
    • pro: pip installs are perhaps 30% or so faster, which also speeds up the smoke test (I'm not sure why they aren't even faster than this; I think sometimes incorrectly skip the cache or in some cases the packages run setup.py operations that still take time)
    • pro: less at the whim of pypi latency; less uninteresting logging of package downloads
    • con: some extra time to fetch/store cache but not much (maybe 15s)
  • removed warning-only flake8 run
    • pro: 10 seconds saved for every build job
    • con: no notice of warnings, but we never looked at this routinely anyway
  • changed bazel --action_env=PATH to apply to build+test, not just test
    • pro: can re-use package analysis from build step in test step, saving a few seconds

And then I did some cleanup:

  • factored out bazel settings into a ci/bazelrc file
  • factored out bazel downloading logic into ci/download_bazel.sh and streamlined it slightly
  • added bazelrc common --curses=no to suppress attempted progress bar spam
  • removed aggressive use of bazel --host_jvm_args that I didn't see justification for
  • removed redundant travis.yml config elements like sudo and os (Fixes .travis.yml: The 'sudo' tag is now deprecated in Travis CI #2113)
  • slightly reworked some other misc travis steps for clarity

Finally I added some simple instrumentation showing the elapsed time since the shell started at various points during the job, which makes it a bit easier to get a sense of where the slow parts are without having to sum up the time taken by many individual commands (also, the travis time indicators are slightly flaky and don't always get parsed correctly by the fancy UI).


I was originally hoping for a bigger speedup by combining all the testing for the different TF versions into a single job, since then we could reuse the build step and avoid rebuilding each time. The problem I ran into was that bazel test result caching falls apart at that point - I wasn't aware of a simple way to tell bazel to treat each run of the tests as a separate set of cacheable test results (rather than re-using the same cache, which due to the non-hermeticity could fail to rerun tests that need to be rerun). Trying to work around this via cache munging would erase a lot of the benefits of the simplification above, and that approach was already much more complicated than what I have here.

Down the road, I think further improvements could be:

  • making the actual fetch and build steps faster / caching them more effectively
  • figuring out a way to incorporate the TF version into Bazel's hermeticized understanding of the world so that it caches test results for different TF versions differently, but somehow still re-uses the build outputs (I have no idea how to do this but maybe we could ask the Bazel team)
  • reducing our TF-dependent tests to a smaller set, and then only running a much more limited set of tests against each TF version we test against (though of course once we bump versions to 2.0 we could stop testing against TF 1.x)
  • waiting for Python 2 to finally die in 2020 and then no longer running those tests either

@nfelt
Copy link
Contributor Author

nfelt commented May 25, 2019

For comparison of build performance, here's a job with the existing Travis config that ran essentially no tests (typo fix PR) and took a total of 49 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113105919

With this PR, running with an empty commit (so the cache is warm, and also running no tests) it takes a total of 37.5 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113164615

Copy link
Contributor

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.

Yay!

  • factored out bazel settings into a ci/bazelrc file
  • factored out bazel downloading logic into ci/download_bazel.sh and streamlined it slightly
  • added bazelrc common --curses=no to suppress attempted progress bar spam

Yaaay! :-)

.travis.yml Outdated
# We need to pass the PATH from our virtualenv down into our tests,
# which is non-hermetic and so disabled by default in Bazel 0.21.0+.
- echo "test --action_env=PATH" >>~/.bazelrc
- elapsed() { TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" "$@"; }
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps you mean "$*" here?

$ bash -c 'TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" one two'
Time 00:00:00 one
bash: line 0: printf: two: invalid number
Time 00:00:00 

(Or, from usage, maybe just "${1-}"?)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changed to just ${1}.


temp_dest="$(mktemp)"

mirror_url="https://mirror.bazel.build/github.com/bazelbuild/bazel/releases/download/${version}/bazel-${version}-linux-x86_64"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use http://mirror.tensorflow.org/ instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

}

if [ "$#" -ne 3 ]; then
die "Usage: ${0} <version> <sha256sum> <destination-file>"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Google style is to omit braces on positional parameters:
https://google.github.io/styleguide/shell.xml?showone=Variable_expansion#Variable_expansion

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

.travis.yml Outdated
- pip install yamllint==1.5.0
# TensorBoard deps.
- pip install futures==3.1.1
- pip install grpcio==1.0
Copy link
Contributor

@wchargin wchargin May 24, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the version downgrade? Was grpcio==1.6.3 before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted, thanks for the catch. I have no idea how that line ended up changing... I can't trace it to any other change. I must have just clobbered it somehow.

.travis.yml Outdated
# When TensorFlow is not installed, run a restricted subset of tests.
if [ -z "${TF_VERSION_ID}" ]; then
test_tag_filters=support_notf
# Run tests (only a restricted subset if TensorFlow is not installed).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a note that this makes the string “bazel test … exited with 3” no
longer show up on one line of the build log, which was the original
reason for pulling out test_tag_filters:
#2075 (comment)

If you still prefer it this way, fine with me.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so that's why it's structured like this. I reverted the change but moved the variable setting up to before_script so that if that part fails the build will error, rather than proceeding with the test command.

@@ -29,97 +24,55 @@ env:
- TF_VERSION_ID= # Do not install TensorFlow in this case

cache:
pip: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, it looks like this just caches the Pip cache directory and not the
actual virtualenv contents, which is good. (If it cached the virtualenv,
we might be relatively safe anyway, given that all our Pip dependencies
are either pinned or installed with -I… but we could still be hit by
any Pip problems w.r.t. reinstalling packages in the same env.)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, but yes this is fine, though we do get weird pip cache deserialization errors fairly often for reasons that are unclear to me.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ಠ_ಠ

🙈

lgtm

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The set of changes left after removing all of them from view trivially looks good, huh?

pip install "absl-py>=0.7.0"
pip install "numpy<2.0,>=1.14.5"
# Requirements typically found through TensorFlow.
pip install "absl-py>=0.7.0" \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ooh, good catch. Would be nice to have some kind of --chain-lint.

Copy link
Contributor Author

@nfelt nfelt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PTAL

@@ -29,97 +24,55 @@ env:
- TF_VERSION_ID= # Do not install TensorFlow in this case

cache:
pip: true
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a comment, but yes this is fine, though we do get weird pip cache deserialization errors fairly often for reasons that are unclear to me.

.travis.yml Outdated
# We need to pass the PATH from our virtualenv down into our tests,
# which is non-hermetic and so disabled by default in Bazel 0.21.0+.
- echo "test --action_env=PATH" >>~/.bazelrc
- elapsed() { TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" "$@"; }
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. Changed to just ${1}.

.travis.yml Outdated
- pip install yamllint==1.5.0
# TensorBoard deps.
- pip install futures==3.1.1
- pip install grpcio==1.0
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reverted, thanks for the catch. I have no idea how that line ended up changing... I can't trace it to any other change. I must have just clobbered it somehow.

.travis.yml Outdated
# When TensorFlow is not installed, run a restricted subset of tests.
if [ -z "${TF_VERSION_ID}" ]; then
test_tag_filters=support_notf
# Run tests (only a restricted subset if TensorFlow is not installed).
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah ok, so that's why it's structured like this. I reverted the change but moved the variable setting up to before_script so that if that part fails the build will error, rather than proceeding with the test command.

}

if [ "$#" -ne 3 ]; then
die "Usage: ${0} <version> <sha256sum> <destination-file>"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.


temp_dest="$(mktemp)"

mirror_url="https://mirror.bazel.build/github.com/bazelbuild/bazel/releases/download/${version}/bazel-${version}-linux-x86_64"
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

Copy link
Contributor

@wchargin wchargin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lovely; thanks!

@@ -29,97 +24,55 @@ env:
- TF_VERSION_ID= # Do not install TensorFlow in this case

cache:
pip: true
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ಠ_ಠ

🙈

lgtm

@nfelt nfelt merged commit 234fb36 into tensorflow:master May 30, 2019
@nfelt nfelt deleted the travis-speedup branch May 30, 2019 01:18
wchargin added a commit that referenced this pull request Aug 4, 2019
Summary:
This was removed in #2278, but is actually pretty helpful.

wchargin-branch: ci-pip-freeze
wchargin added a commit that referenced this pull request Aug 5, 2019
Summary:
This was removed in #2278, but is actually pretty helpful.

wchargin-branch: ci-pip-freeze
wchargin added a commit that referenced this pull request Jan 23, 2020
Summary:
The Pip HTTP cache stores all downloaded wheels and is never evicted.
With new `tf-nightly` wheels every day, this adds up quickly. We last
cleared our Travis caches about a month ago, and they’re up to 14.3 GB.

Investigation shows that the Pip HTTP cache accounts for the majority of
the cache (about 70% after about a month of cache accrual), and also
that jobs with larger caches have significantly longer startup times,
with delta on the order of 8 minutes (again, after about a month). Also,
uploading large caches at the end of a job can take minutes, and Travis
doesn’t report success until this finishes. Fetching `tf-nightly` should
be comparatively cheap.

This reverts part of #2278.

Test Plan:
This PR reduces the “before install” time (i.e., time spent by Travis
internals before it gets to our script, including restoring cache) from
9m40s to 4m59s, a 48% improvement. The “install” time is increased from
3m36s to 3m56s, which seems acceptable.

wchargin-branch: ci-drop-pip-http-cache
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

.travis.yml: The 'sudo' tag is now deprecated in Travis CI
3 participants