
[CI] Use GCS buckets for bazel remote caching #131345

Merged
brianseeders merged 14 commits into elastic:main from brianseeders:bazel-remote-cache on May 3, 2022

Conversation

@brianseeders
Contributor

@brianseeders brianseeders commented May 2, 2022

TL;DR: Use GCS buckets for bazel remote caching in CI for a cheap and easy bootstrap speed boost. The remote cache used for local development is currently unchanged. All CI steps can both read from and write to the cache.

Bootstrap times in CI vary a lot over the course of the day and can get quite long on smaller machines whenever changes invalidate the local cache baked into the agent images (which are refreshed daily, in the morning). We would like to share a cache across CI in a way that is both performant and cost-effective.

I trialed:

  • Using bazel-remote, with gRPC, hosted on an instance in our GCP project
  • Using a single "multi-regional" GCS bucket located in the U.S.
  • Using single-region GCS buckets in every GCP region where we run CI (there are currently 5)

bazel-remote notes:

  • Probably the most expensive and least performant option, at least as I had it configured. Performance was worse for instances farther from where the service was hosted
  • We could host a separate instance in each region to improve performance across regions, but that would be 5x as expensive
  • No high-availability (HA) solution
  • Would have to be hosted, maintained, upgraded, and monitored by us
  • With 100 instances running bootstrap against an empty cache (a worst-case scenario), CPU load spiked to around 4
  • Example with 100 workers, no disk cache, full remote cache: 2 to 4 minutes for bootstrap, depending on the agent's region, with a lot of variability within the same region as well

GCS single bucket:

  • Uses HTTPS instead of gRPC
  • Was faster than bazel-remote in the worst-case scenario
  • As with bazel-remote, it's much slower for instances far from the U.S.
  • Cheap and zero maintenance
  • Object retention is set to 48 hours: we only need to cache objects that are missing from the local cache in the agent images, which is refreshed every 24 hours
  • Example with 100 workers, no disk cache, full remote cache: 2 to 4 minutes for bootstrap, depending on the agent's region
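
The 48-hour retention maps naturally onto a GCS Object Lifecycle rule. The following is a minimal sketch, not the configuration used in this PR: the helper name `cache_lifecycle_config` is hypothetical, and it assumes retention is enforced with a lifecycle `Delete` rule whose `age` condition (measured in whole days since object creation) is set to 2.

```python
import json


def cache_lifecycle_config(retention_hours: int = 48) -> dict:
    """Hypothetical helper: build a GCS lifecycle config that deletes
    cache objects older than `retention_hours`.

    GCS "age" conditions are expressed in whole days, so the retention
    must be a positive multiple of 24 hours (48 hours -> age 2).
    """
    if retention_hours <= 0 or retention_hours % 24 != 0:
        raise ValueError("GCS age conditions use whole days; pass a multiple of 24 hours")
    return {
        "rule": [
            {
                "action": {"type": "Delete"},
                "condition": {"age": retention_hours // 24},
            }
        ]
    }


if __name__ == "__main__":
    # Print the JSON so it can be redirected into a file for gsutil.
    print(json.dumps(cache_lifecycle_config(48), indent=2))
```

Usage: `python make_lifecycle.py > lifecycle.json`, then `gsutil lifecycle set lifecycle.json gs://<bucket>` for each bucket (the script and bucket names are placeholders). Because `age` counts from creation rather than last access, even frequently hit objects expire after two days and are re-uploaded on the next miss; lifecycle deletions also run asynchronously, so objects may linger somewhat past the threshold.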

GCS bucket-per-region:

  • Same as the GCS single bucket, except:
  • Storage will cost 5x as much as a single bucket, but the cost is small. I'm not sure how much storage 48 hours' worth of objects will take (it should be small, probably in the MB range), but 1TB costs about $20/month
  • Objects have to be cached separately in each region, but this will generally happen during the on-merge pipeline, via the Jest jobs, or the FTR jobs if anything is still missing by then
  • All regions are fast
  • Bandwidth costs should be lower, since objects are stored in the same region as the agents
  • Example with 100 workers, no disk cache, full remote cache: about 2 minutes for bootstrap
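
Routing each agent to the bucket in its own region could look roughly like the sketch below. It is an illustration under stated assumptions, not the PR's implementation: the bucket naming scheme (`example-bazel-cache-<region>`) and all helper names are hypothetical, and it assumes the agent's zone is read from the GCE metadata server. The Bazel flags themselves (`--remote_cache` pointing at `https://storage.googleapis.com/<bucket>`, `--google_default_credentials`, `--remote_upload_local_results`) are standard Bazel options for a GCS-backed HTTP cache.

```python
import urllib.request

# Hypothetical naming scheme: one bucket per CI region, "<prefix>-<region>".
# The real bucket names are not part of this PR.
BUCKET_PREFIX = "example-bazel-cache"

# GCE metadata endpoint that returns "projects/<number>/zones/<zone>".
METADATA_ZONE_URL = "http://metadata.google.internal/computeMetadata/v1/instance/zone"


def fetch_zone() -> str:
    """Ask the GCE metadata server which zone this agent runs in."""
    req = urllib.request.Request(METADATA_ZONE_URL, headers={"Metadata-Flavor": "Google"})
    with urllib.request.urlopen(req, timeout=2) as resp:
        return resp.read().decode()


def region_from_zone(zone_path: str) -> str:
    """'projects/123/zones/us-central1-a' -> 'us-central1'."""
    zone = zone_path.rsplit("/", 1)[-1]
    # A zone name is its region plus a "-<letter>" suffix.
    return zone.rsplit("-", 1)[0]


def bazelrc_lines(region: str) -> list[str]:
    """Build .bazelrc lines pointing Bazel at the bucket for `region`."""
    bucket = f"{BUCKET_PREFIX}-{region}"
    return [
        f"build --remote_cache=https://storage.googleapis.com/{bucket}",
        # Authenticate with the agent's Application Default Credentials.
        "build --google_default_credentials",
        # Explicit (this is Bazel's default): every CI step writes results back.
        "build --remote_upload_local_results=true",
    ]
```

On an agent, `bazelrc_lines(region_from_zone(fetch_zone()))` would produce lines to append to a `.bazelrc`. Deriving the region from the zone keeps the mapping automatic when a new CI region is added, provided a matching bucket exists.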

Given all of this, the last option (GCS bucket-per-region) seems like the best choice. This may change in the future as we use bazel for more of the build, and we can always reassess.
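
The storage-cost argument in the bucket-per-region notes is easy to sanity-check. This back-of-the-envelope sketch uses only the figure quoted in the PR (about $20/month per TB) and treats the cache size as a hypothetical input, since the PR does not measure it; egress and per-operation charges are ignored.

```python
# Figure quoted in the PR: roughly $20/month per TB of storage.
USD_PER_TB_MONTH = 20


def monthly_storage_cost(gb_per_bucket: float, buckets: int,
                         usd_per_tb: float = USD_PER_TB_MONTH) -> float:
    """Storage-only monthly cost, using decimal units (1 TB = 1000 GB)."""
    return gb_per_bucket * buckets * usd_per_tb / 1000


if __name__ == "__main__":
    for gb in (1, 50, 1000):  # hypothetical cache sizes per bucket
        single = monthly_storage_cost(gb, buckets=1)
        per_region = monthly_storage_cost(gb, buckets=5)
        print(f"{gb:>5} GB: single ${single:.2f}/mo, per-region ${per_region:.2f}/mo")
```

Even at a generous 50 GB per bucket, five regional buckets come to $5/month under these assumptions, which supports the point that the 5x storage multiplier is not the deciding factor.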

As a side note: we could also use this for the local remote cache if we wanted. We just wouldn't get the UIs, statistics, historical tracking, etc. that BuildBuddy provides.

@brianseeders brianseeders added Feature:CI Continuous integration release_note:skip Skip the PR/issue when compiling release notes v8.3.0 Team:Operations Kibana-Operations Team labels May 2, 2022
@brianseeders
Contributor Author

buildkite build this

@brianseeders
Contributor Author

buildkite build this

@brianseeders brianseeders added v8.2.1 v7.17.4 auto-backport Deprecated - use backport:version if exact versions are needed labels May 3, 2022
@brianseeders brianseeders changed the title from "Trying our own bazel remote cache" to "[CI] Use GCS buckets for bazel remote caching" on May 3, 2022
@brianseeders brianseeders marked this pull request as ready for review May 3, 2022 20:12
@brianseeders brianseeders requested a review from a team as a code owner May 3, 2022 20:12
@elasticmachine
Contributor

Pinging @elastic/kibana-operations (Team:Operations)

@brianseeders brianseeders requested a review from mistic May 3, 2022 20:12
Contributor

@spalger spalger left a comment


🎉🎉🎉 This is awesome!! 🎉🎉🎉

@kibana-ci

💚 Build Succeeded

Metrics [docs]

✅ unchanged

History

To update your PR or re-run it, just comment with:
@elasticmachine merge upstream

@mistic
Contributor

mistic commented May 3, 2022

I believe it is also worth trying https://github.com/znly/bazel-cache in the future

@brianseeders brianseeders merged commit 3bc9c42 into elastic:main May 3, 2022
@brianseeders brianseeders deleted the bazel-remote-cache branch May 3, 2022 20:49
kibanamachine pushed a commit that referenced this pull request May 3, 2022
kibanamachine pushed a commit that referenced this pull request May 3, 2022
@kibanamachine
Contributor

💚 All backports created successfully

Status Branch Result
8.2
7.17

Note: Successful backport PRs will be merged automatically after passing CI.

Questions?

Please refer to the Backport tool documentation

kibanamachine added a commit that referenced this pull request May 3, 2022
(cherry picked from commit 3bc9c42)

Co-authored-by: Brian Seeders <brian.seeders@elastic.co>
kibanamachine added a commit that referenced this pull request May 3, 2022
(cherry picked from commit 3bc9c42)

Co-authored-by: Brian Seeders <brian.seeders@elastic.co>
academo added a commit that referenced this pull request May 5, 2022
* Add severity field to create API and migration

* Adds integration test for severity field migration

* remove exclusive test

* Change severity levels

* Update integration tests for post case

* Add more integration tests

* Fix all cases list test

* Fix some server test

* Fix util server test

* Fix client util test

* Convert event log's duration from number to string in Kibana (keep as "long" in Elasticsearch) (#130819)

* Convert event.duration to string in TypeScript, keep as long in Elasticsearch

* Fix jest test

* Fix functional tests

* Add ecsStringOrNumber to event log schema

* Fix jest test

* Add utility functions to event log plugin

* Use new event log utility functions

* PR fixes

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

* filter o11y rule aggregations (#131301)

* [Cloud Posture] Display and save rules per benchmark (#131412)

* Adding aria-label for discover data grid select document checkbox (#131277)

* Update API docs (#130999)

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

* [CI] Use GCS buckets for bazel remote caching (#131345)

* [Actionable Observability] Add license modal to rules table (#131232)

* Add fix license link

* fix localization

* fix CI error

* fix more translation issues

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>

* [RAM] Add shareable rule status filter (#130705)

* rule state filter

* turn off experiment

* [CI] Auto-commit changed files from 'node scripts/eslint --no-cache --fix'

* Status filter API call

* Fix tests

* rename state to status, added tests

* Address comments and fix tests

* Revert experiment flag

* Remove unused translations

* Addressed comments

Co-authored-by: kibanamachine <42973632+kibanamachine@users.noreply.github.com>

* [storybook] Watch for changes in packages (#131467)

* [storybook] Watch for changes in packages

* Update default_config.ts

* Improve saved objects migrations failure errors and logs (#131359)

* [Unified observability] Add tour step to guided setup (#131149)

* [Lens] Improved interval input (#131372)

* [Vega] Adjust vega doc for usage of ems files (#130948)

* adjust vega doc

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

* Update docs/user/dashboard/vega-reference.asciidoc

Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: Nick Peihl <nickpeihl@gmail.com>

* Excess intersections

* Create severity user action

* Add severity to create_case user action

* Fix and add integration tests

* Minor improvements

Co-authored-by: Mike Côté <mikecote@users.noreply.github.com>
Co-authored-by: Kibana Machine <42973632+kibanamachine@users.noreply.github.com>
Co-authored-by: mgiota <panagiota.mitsopoulou@elastic.co>
Co-authored-by: Jordan <51442161+JordanSh@users.noreply.github.com>
Co-authored-by: Bhavya RM <bhavya@elastic.co>
Co-authored-by: Thomas Neirynck <thomas@elastic.co>
Co-authored-by: Brian Seeders <brian.seeders@elastic.co>
Co-authored-by: Jiawei Wu <74562234+JiaweiWu@users.noreply.github.com>
Co-authored-by: Clint Andrew Hall <clint.hall@elastic.co>
Co-authored-by: Christiane (Tina) Heiligers <christiane.heiligers@elastic.co>
Co-authored-by: Alejandro Fernández Gómez <alejandro.fernandez@elastic.co>
Co-authored-by: Joe Reuter <johannes.reuter@elastic.co>
Co-authored-by: Nick Peihl <nickpeihl@gmail.com>
Co-authored-by: Christos Nasikas <christos.nasikas@elastic.co>
kertal pushed a commit to kertal/kibana that referenced this pull request May 24, 2022
kertal pushed a commit to kertal/kibana that referenced this pull request May 24, 2022



6 participants