Improve travis.yml config for faster CI #2278

nfelt · 2019-05-24T23:10:41Z

This streamlines and improves our .travis.yml configuration and caching behavior in hopes of reducing our painfully slow Travis builds. In my testing so far this drops the total time to execute by ~30% (so from 50m total across 3 machines to more like 35m), by shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.

Here's the list of changes I implemented that I felt overall improved speed (or were mixed, but seemed like better practice):

- changed caching approach to use Bazel's properly designed caching APIs
  - repository caching for bazel fetch results
  - remote caching API for build results/artifacts, using local disk backend
  - con: fetch is slower (60ish seconds instead of 15) - not entirely sure why, but perhaps it doesn't cache all the downloaded results of fetch (there is some indication of this in bazelbuild/bazel#6869, "Better documentation/controls to allow CI caching"). We may be able to tweak fetching behavior to improve this.
  - con: build is slower (90ish seconds instead of 60) - I think with the new caching we're rebuilding more things, e.g. now there are a bunch of closure warnings. We may be able to improve this by restructuring some of our build rules to be more cacheable, or generally speeding up the build.
  - pro: fetch/store operations for the cache are much faster (60s fetch + 140s store reduced to 15s fetch + 60-75s store), which basically makes up for the above slowdowns
  - pro: no more need for hacky code to manually "prune" the output directory
- added caching for pip packages
  - pro: pip installs are perhaps 30% or so faster, which also speeds up the smoke test (I'm not sure why they aren't even faster than this; I think we sometimes incorrectly skip the cache, or in some cases the packages run setup.py operations that still take time)
  - pro: less at the whim of pypi latency; less uninteresting logging of package downloads
  - con: some extra time to fetch/store cache but not much (maybe 15s)
- removed warning-only flake8 run
  - pro: 10 seconds saved for every build job
  - con: no notice of warnings, but we never looked at this routinely anyway
- changed bazel --action_env=PATH to apply to build+test, not just test
  - pro: can re-use package analysis from build step in test step, saving a few seconds
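
As a rough illustration of the caching and `--action_env` changes listed above, here is a minimal sketch that writes a bazelrc with comparable settings. The `write_ci_bazelrc` helper and the cache paths are hypothetical and are not the contents of this PR's ci/bazelrc; `--repository_cache`, `--disk_cache`, and `--action_env` are existing Bazel flags.

```shell
#!/usr/bin/env bash
# Sketch only: writes an illustrative bazelrc. The helper name and the cache
# paths are assumptions, not the actual ci/bazelrc from this PR.

write_ci_bazelrc() {
  local dest="$1"
  local cache_root="$2"  # hypothetical directory that the CI service would persist
  cat >"${dest}" <<EOF
# Reuse downloaded external dependencies across runs.
fetch --repository_cache=${cache_root}/repo
build --repository_cache=${cache_root}/repo
# Reuse build outputs through the remote caching API with a local disk backend.
build --disk_cache=${cache_root}/disk
# Options on "build" are inherited by "test", so both steps see the same PATH.
build --action_env=PATH
EOF
}

bazelrc="$(mktemp)"
write_ci_bazelrc "${bazelrc}" "${HOME:-/tmp}/.cache/bazel-ci-example"
cat "${bazelrc}"
```

Putting `--action_env=PATH` on `build` rather than `test` is what lets the test step reuse the analysis done by the build step, since both commands then run with identical options.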

And then I did some cleanup:

- factored out bazel settings into a ci/bazelrc file
- factored out bazel downloading logic into ci/download_bazel.sh and streamlined it slightly
- added bazelrc common --curses=no to suppress attempted progress bar spam
- removed aggressive use of bazel --host_jvm_args that I didn't see justification for
- removed redundant travis.yml config elements like sudo and os (fixes #2113, ".travis.yml: The 'sudo' tag is now deprecated in Travis CI")
- slightly reworked some other misc travis steps for clarity

Finally I added some simple instrumentation showing the elapsed time since the shell started at various points during the job, which makes it a bit easier to get a sense of where the slow parts are without having to sum up the time taken by many individual commands (also, the travis time indicators are slightly flaky and don't always get parsed correctly by the fancy UI).
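
A minimal sketch of this kind of instrumentation, assuming bash 4.2 or newer (needed for the `%(...)T` printf format). It relies on bash's built-in `SECONDS` counter; the step labels are made up for illustration.

```shell
#!/usr/bin/env bash
# Sketch of elapsed-time instrumentation based on bash's SECONDS counter.
# Requires bash 4.2+ for the %(...)T printf format.

elapsed() {
  # SECONDS counts seconds since the shell started; formatting it as a UTC
  # timestamp renders it as HH:MM:SS.
  TZ=UTC printf "Time %(%T)T %s\n" "${SECONDS}" "${1-}"
}

elapsed "start"        # hypothetical step label
sleep 1
elapsed "after sleep"  # hypothetical step label
```

Calling `elapsed` between the phases of a job gives a running clock in the log without needing to add up per-command timings.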

I was originally hoping for a bigger speedup by combining all the testing for the different TF versions into a single job, since then we could reuse the build step and avoid rebuilding each time. The problem I ran into was that bazel test result caching falls apart at that point - I wasn't aware of a simple way to tell bazel to treat each run of the tests as a separate set of cacheable test results (rather than re-using the same cache, which due to the non-hermeticity could fail to rerun tests that need to be rerun). Trying to work around this via cache munging would erase a lot of the benefits of the simplification above, and that approach was already much more complicated than what I have here.

Down the road, I think further improvements could be:

- making the actual fetch and build steps faster / caching them more effectively
- figuring out a way to incorporate the TF version into Bazel's hermeticized understanding of the world so that it caches test results for different TF versions differently, but somehow still re-uses the build outputs (I have no idea how to do this but maybe we could ask the Bazel team)
- reducing our TF-dependent tests to a smaller set, and then only running a much more limited set of tests against each TF version we test against (though of course once we bump versions to 2.0 we could stop testing against TF 1.x)
- waiting for Python 2 to finally die in 2020 and then no longer running those tests either

nfelt · 2019-05-25T01:05:37Z

For comparison of build performance, here's a job with the existing Travis config that ran essentially no tests (typo fix PR) and took a total of 49 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113105919

With this PR, running with an empty commit (so the cache is warm, and also running no tests) it takes a total of 37.5 minutes: https://travis-ci.com/tensorflow/tensorboard/builds/113164615
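
For reference, the relative saving between these two builds can be worked out from the totals quoted above. This is plain arithmetic; the `reduction` helper is just an illustrative name.

```shell
#!/usr/bin/env bash
# Relative reduction in total build time, using the two totals quoted above.

reduction() {
  # $1 = minutes before, $2 = minutes after
  awk -v b="$1" -v a="$2" \
    'BEGIN { printf "saved %.1f min (%.1f%%)\n", b - a, (b - a) / b * 100 }'
}

reduction 49 37.5   # prints: saved 11.5 min (23.5%)
```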

wchargin

> shaving a little over 2 minutes off of each of the 6 jobs in our test matrix.

Yay!

> factored out bazel settings into a ci/bazelrc file
>
> factored out bazel downloading logic into ci/download_bazel.sh and streamlined it slightly
>
> added bazelrc common --curses=no to suppress attempted progress bar spam

Yaaay! :-)

wchargin · 2019-05-24T23:24:27Z

.travis.yml

-  # We need to pass the PATH from our virtualenv down into our tests,
-  # which is non-hermetic and so disabled by default in Bazel 0.21.0+.
-  - echo "test --action_env=PATH" >>~/.bazelrc
+  - elapsed() { TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" "$@"; }


Perhaps you mean "$*" here?

    $ bash -c 'TZ=UTC printf "Time %(%T)T %s\n" "$SECONDS" one two'
    Time 00:00:00 one
    bash: line 0: printf: two: invalid number
    Time 00:00:00

(Or, from usage, maybe just "${1-}"?)

Good point. Changed to just ${1}.
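
The underlying behavior is that `printf` re-applies its format string until every argument is consumed, so forwarding `"$@"` into a fixed format can trigger extra passes. A small sketch of the three forwarding styles discussed here, with helper names that are purely illustrative:

```shell
#!/usr/bin/env bash
# printf reuses its format string while arguments remain, so the way
# arguments are forwarded decides how many lines come out.

label_each()  { printf "[%s]\n" "$@"; }      # one format pass per argument
label_all()   { printf "[%s]\n" "$*"; }      # arguments joined into one word
label_first() { printf "[%s]\n" "${1-}"; }   # only the first argument is used

label_each one two    # two lines: [one] then [two]
label_all one two     # one line:  [one two]
label_first one two   # one line:  [one]
```

With a format containing a numeric or time specifier, as in `elapsed`, the extra pass feeds a label into that specifier, which is where the "invalid number" error comes from.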

wchargin · 2019-05-24T23:29:12Z

ci/download_bazel.sh

+
+temp_dest="$(mktemp)"
+
+mirror_url="https://mirror.bazel.build/github.com/bazelbuild/bazel/releases/download/${version}/bazel-${version}-linux-x86_64"


Use http://mirror.tensorflow.org/ instead?

wchargin · 2019-05-24T23:32:49Z

ci/download_bazel.sh

+}
+
+if [ "$#" -ne 3 ]; then
+  die "Usage: ${0} <version> <sha256sum> <destination-file>"


nit: Google style is to omit braces on positional parameters:
https://google.github.io/styleguide/shell.xml?showone=Variable_expansion#Variable_expansion
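
A sketch of how the argument check and checksum verification in a script like this might fit together, following the unbraced positional-parameter style suggested above. The network download is deliberately left out, and the body of `die` plus the names `verify_sha256` and `main` are assumptions rather than the code in ci/download_bazel.sh.

```shell
#!/usr/bin/env bash
# Sketch: argument validation plus SHA-256 verification for a fetched binary.
# The download itself is omitted; function bodies are illustrative assumptions.

die() {
  printf >&2 "%s\n" "$1"
  exit 1
}

verify_sha256() {
  local file="$1"
  local expected="$2"
  local actual
  actual="$(sha256sum "${file}" | awk '{ print $1 }')"
  [ "${actual}" = "${expected}" ]
}

main() {
  if [ "$#" -ne 3 ]; then
    die "Usage: $0 <version> <sha256sum> <destination-file>"
  fi
  local version="$1"
  local sha256="$2"
  local dest="$3"
  local temp_dest
  temp_dest="$(mktemp)"
  # A real script would download the release for ${version} into ${temp_dest} here.
  verify_sha256 "${temp_dest}" "${sha256}" \
    || die "checksum mismatch for version ${version}"
  mv "${temp_dest}" "${dest}"
  chmod +x "${dest}"
}

# Demo without network access: an empty file has a well-known SHA-256.
empty_sha="e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
demo_dest="$(mktemp)"
main "demo-version" "${empty_sha}" "${demo_dest}"
echo "installed to ${demo_dest}"
```

Verifying into a temporary file and moving it into place only after the checksum matches means a failed or corrupted download never ends up at the destination path.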

wchargin · 2019-05-24T23:38:37Z

.travis.yml

  - pip install yamllint==1.5.0
+  # TensorBoard deps.
+  - pip install futures==3.1.1
+  - pip install grpcio==1.0


Why the version downgrade? Was grpcio==1.6.3 before.

Reverted, thanks for the catch. I have no idea how that line ended up changing... I can't trace it to any other change. I must have just clobbered it somehow.

wchargin · 2019-05-25T00:57:14Z

.travis.yml

-    # When TensorFlow is not installed, run a restricted subset of tests.
-    if [ -z "${TF_VERSION_ID}" ]; then
-      test_tag_filters=support_notf
+    # Run tests (only a restricted subset if TensorFlow is not installed).


Just a note that this makes the string “bazel test … exited with 3” no
longer show up on one line of the build log, which was the original
reason for pulling out test_tag_filters:
#2075 (comment)

If you still prefer it this way, fine with me.

Ah ok, so that's why it's structured like this. I reverted the change but moved the variable setting up to before_script so that if that part fails the build will error, rather than proceeding with the test command.
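
A sketch of the resulting structure, with the filter chosen up front so the later test command stays on one line of the log. The `select_test_tag_filters` helper and the `//tensorboard/...` target pattern are assumptions for illustration; `--test_tag_filters` is an existing Bazel flag.

```shell
#!/usr/bin/env bash
# Sketch: decide the tag filter early (for example in before_script) so that
# the later "bazel test" invocation is a single, greppable command.

select_test_tag_filters() {
  # $1 = TF_VERSION_ID; empty means TensorFlow is not installed, in which
  # case only tests tagged support_notf should run.
  if [ -z "${1-}" ]; then
    echo "support_notf"
  else
    echo ""
  fi
}

test_tag_filters="$(select_test_tag_filters "${TF_VERSION_ID-}")"

# The test step would then be one command along these lines (echoed here
# instead of executed, since this sketch does not assume Bazel is installed):
echo "bazel test //tensorboard/... --test_tag_filters=${test_tag_filters}"
```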

wchargin · 2019-05-25T01:10:10Z

.travis.yml

@@ -29,97 +24,55 @@ env:
    - TF_VERSION_ID=  # Do not install TensorFlow in this case

 cache:
+  pip: true


Okay, it looks like this just caches the Pip cache directory and not the
actual virtualenv contents, which is good. (If it cached the virtualenv,
we might be relatively safe anyway, given that all our Pip dependencies
are either pinned or installed with -I… but we could still be hit by
any Pip problems w.r.t. reinstalling packages in the same env.)

Added a comment, but yes this is fine, though we do get weird pip cache deserialization errors fairly often for reasons that are unclear to me.

ಠ_ಠ

🙈

lgtm

The set of changes left after removing all of them from view trivially looks good, huh?

wchargin · 2019-05-25T01:17:18Z

.travis.yml

-      pip install "absl-py>=0.7.0"
-      pip install "numpy<2.0,>=1.14.5"
+      # Requirements typically found through TensorFlow.
+      pip install "absl-py>=0.7.0" \


Ooh, good catch. Would be nice to have some kind of --chain-lint.

nfelt

PTAL

wchargin

Lovely; thanks!

Summary: The Pip HTTP cache stores all downloaded wheels and is never evicted. With new `tf-nightly` wheels every day, this adds up quickly. We last cleared our Travis caches about a month ago, and they’re up to 14.3 GB. Investigation shows that the Pip HTTP cache accounts for the majority of the cache (about 70% after about a month of cache accrual), and also that jobs with larger caches have significantly longer startup times, with delta on the order of 8 minutes (again, after about a month). Also, uploading large caches at the end of a job can take minutes, and Travis doesn’t report success until this finishes. Fetching `tf-nightly` should be comparatively cheap. This reverts part of #2278.

Test Plan: This PR reduces the “before install” time (i.e., time spent by Travis internals before it gets to our script, including restoring cache) from 9m40s to 4m59s, a 48% improvement. The “install” time is increased from 3m36s to 3m56s, which seems acceptable.

wchargin-branch: ci-drop-pip-http-cache

Improve travis.yml config for faster CI

5b7a3b5

nfelt added type:build/install type:cleanup labels May 24, 2019

nfelt requested a review from wchargin May 24, 2019 23:10

empty commit to rerun travis now that cache is populated

0e0b0f0

wchargin reviewed May 25, 2019

abhinavsagar approved these changes May 27, 2019

nfelt added 2 commits May 29, 2019 16:51

CR: revert change in how we set --test_tag_filters

8da3709

CR: misc one-liner fixes

213d8cc

nfelt commented May 29, 2019

wchargin approved these changes May 30, 2019

nfelt merged commit 234fb36 into tensorflow:master May 30, 2019

nfelt deleted the travis-speedup branch May 30, 2019 01:18

wchargin mentioned this pull request May 30, 2019

ci: change VM base from Trusty to Xenial #2293

Closed

wchargin added a commit that referenced this pull request Aug 4, 2019

ci: print pip freeze output, for debugging

b88c51f

Summary: This was removed in #2278, but is actually pretty helpful. wchargin-branch: ci-pip-freeze

wchargin mentioned this pull request Aug 4, 2019

ci: print pip freeze output, for debugging #2495

Merged

wchargin added a commit that referenced this pull request Aug 5, 2019

ci: print pip freeze output, for debugging (#2495)

f39f71c

wchargin mentioned this pull request Jan 23, 2020

ci: don’t store Pip HTTP cache #3167

Merged
