[rush] Build cache feature. #2393
Comments
I love this, because I've tried to implement this twice myself, via docker registry images and also npm packages. In the end, it would fit much better within rush. Maybe it would be a good idea to provide some kind of hooks in a js file so that people can upload/download the cache files in whatever way they choose? The connection string is a little hard to apply for more than a few use cases, unless I'm missing something. |
I've thought about that. We've put together a pretty flexible plugin system in Heft, but so far I haven't been able to come up with a way that would cleanly work in Rush. Maybe a
Azure Storage's connection strings are a simple way to specify a storage endpoint (i.e. - an account name) and an access token. I think we'll want to support some sort of interactive and AAD authentication, and probably some way to pass credentials in via env variable. I think those will be features that we'll need to figure out after some community feedback because what works for my team at Microsoft is almost definitely not going to work for everyone else. What are your concerns here? It'd be good to try to be aware of other scenarios as early as possible. |
I don't know if this is a bad idea, or too much work, but what about this:
You could also have a separate package.json script for each action/hook type, to avoid the parameter/env passing.
My org currently uses Amazon S3 to store artifacts, and we intend to use the S3 buckets to serve the files through a CDN (CloudFront), so that is one thing to consider. We also have a widely established method of uploading complete .tar.gz files of each monorepo project to Artifactory, which we later use to serve the files (unpacked) through a CMS. I'm not very familiar with the details you mention (like AAD authentication), so I'm not sure how these components would fit your proposal and ideas. |
Launching a separate process will probably impact perf for a 100% cached repo quite a bit. I also don't think the cache provider should be configurable per-project.
Oh yeah I absolutely want to also support S3 in addition to Azure Storage. I want to get that in soon after the initial implementation is checked in. The official AWS package is pretty enormous, so we'll probably want to call its APIs via REST, or use a trimmed down package.
Would you want to use Artifactory as your cache store? That sounds more like a post-build |
That's a very good point! Ok, if S3 is getting first class support then that's great! Actually that's what a colleague of mine mentioned, I was not sure if you'd be up for that but that's great to hear. Regarding artifactory, yes, first I thought it could be one and the same, but build caches probably deserve their own storage solution. |
Here's my initial feedback:
|
Symlinks cannot be stored easily in a tarball. Rush should probably report an error if a symlink is encountered while archiving a file. |
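The symlink check suggested above could look roughly like the following. This is a minimal sketch, not Rush's implementation: the function name `collectFilesRejectingSymlinks` is hypothetical, and it assumes the files to archive are gathered by walking the output folder before the tarball is written.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper (not part of Rush): list every regular file under an
// output folder and fail fast if any entry is a symbolic link.
export function collectFilesRejectingSymlinks(rootFolder: string): string[] {
  const files: string[] = [];
  const pending: string[] = [rootFolder];

  while (pending.length > 0) {
    const folder = pending.pop() as string;
    // withFileTypes reports the entry's own type; symlinks are not followed.
    for (const entry of fs.readdirSync(folder, { withFileTypes: true })) {
      const entryPath = path.join(folder, entry.name);
      if (entry.isSymbolicLink()) {
        throw new Error(`Cannot archive symlink: ${entryPath}`);
      }
      if (entry.isDirectory()) {
        pending.push(entryPath);
      } else if (entry.isFile()) {
        files.push(entryPath);
      }
    }
  }

  return files.sort();
}
```

Failing before any archive is written avoids producing a cache entry that would silently restore a different folder layout than the one that was built.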
@octogonz - here are some answers to your questions:
I've thought about this. I think in the MVP of the multiphase feature, we'll just cache the output of all of each project's phases, and consider all of the project's phases complete in the event of a cache hit. Because both of these features will be released as opt-in "experiments," we can tweak this design as we get more experience with it.
Right now the MVP supports only a single cache provider. It'd be pretty easy to change the "filesystem" provider to be another option (something like
I think I agree here - it probably should be an option in the per-project config, and those should have rig support. I think I'd like to make that change in a follow-up PR after the initial checkin. Should be a simple change, but it'll be easier to reason with as a standalone change.
Yes, it is a security credential. I've removed this property and, instead, created a file called
Yes, Azure supports that.
I've updated the design to stick the Azure-specific stuff in a |
I've made some updates to the design after reviewing feedback. Configuration Files
|
Really impressed with the dedication here. My org will be testing the S3 provider as soon as something is ready to test. 🙂 |
Merged in the initial implementation PR. Summary of changes since my last post:
|
This has been published in Rush version 5.36.1. It's behind the |
I hacked together an S3 implementation (https://github.com/raihle/rushstack/tree/aws-build-cache), but there are a lot of issues with it:
Should I open a PR to discuss, or do we keep it here? |
Finally had the chance to test out the build cache experiment. I was hoping, that because the new mechanism includes the command being run, that Rush could handle the following scenario in a smart way:
Last command could be skipped because b and c were already built with this build state including the additional parameter
Update: I've just realized that the build cache feature only kicks in if there is a rush-project.json (locally or via rig) with at least one "projectOutputFolder" (as indicated in the Rush log output...). Now the scenario from above is working, and I love it! 😍
Currently I don't see any way to use just the skip feature without actually caching the project outputs. My use case: I’m building several packages with emscripten. Based on the "command-line.json" parameters --build-type (which affects the optimization) and --threading-model (which includes pthread/offscreen support) a dedicated build workspace ( All dependents will include the .wasm file based on the dist folder derived from their parameters. With this structure all output will reside in the projects |
Amazon S3 support has been released in Rush 5.42.0. Many thanks to @raihle for putting this together! |
@iclanton with I agree with @octogonz 's comment; The S3 and Azure dependencies should be excluded from Rush..?! Wouldn't it be nice to have a way to provide external/custom cache implementations? The S3/Azure implementations would go to a dedicated rushstack package, and I could provide an exotic implementation (like the null cache proposed above) as a package inside the repo. |
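As a rough illustration of the pluggable-provider idea raised in this thread, here is a sketch of what a provider contract and a "null cache" could look like. Every name here (`CacheProvider`, `tryGet`, `trySet`, `NullCacheProvider`, `InMemoryCacheProvider`) is hypothetical and is not Rush's actual API.

```typescript
// Hypothetical contract for an external cache implementation.
export interface CacheProvider {
  // Resolves to the cached archive bytes, or undefined on a cache miss.
  tryGet(cacheId: string): Promise<Uint8Array | undefined>;
  // Resolves to true if the entry was stored.
  trySet(cacheId: string, entry: Uint8Array): Promise<boolean>;
}

// A "null cache": always misses and never stores anything.
export class NullCacheProvider implements CacheProvider {
  public async tryGet(cacheId: string): Promise<Uint8Array | undefined> {
    return undefined;
  }

  public async trySet(cacheId: string, entry: Uint8Array): Promise<boolean> {
    return false;
  }
}

// An in-memory provider, included only to show a second implementation
// of the same contract.
export class InMemoryCacheProvider implements CacheProvider {
  private readonly _entries: Map<string, Uint8Array> = new Map();

  public async tryGet(cacheId: string): Promise<Uint8Array | undefined> {
    return this._entries.get(cacheId);
  }

  public async trySet(cacheId: string, entry: Uint8Array): Promise<boolean> {
    this._entries.set(cacheId, entry);
    return true;
  }
}
```

With a contract like this, the S3 and Azure code could live in separate packages, and a repo could supply its own implementation, at the cost of Rush needing some way to locate and load it.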
|
@iclanton did the schema update? I get the following error using
From searching github it looks like we should be using |
Yeah it's |
I'm currently playing with the S3 bucket support. One thing I haven't quite figured out is how Rush's credential cache is supposed to interact with typical S3 user and CI flows.
For the CI environment, I would expect it's quite unusual to have a user (with AWS Key/Token credentials). Normally in enterprise you would assign a role to the machine that is running the build, and then the AWS SDK automatically discovers that role. You're avoiding use of the SDK, which is OK -- you could recommend that users instead make an AssumeRole call early in the build and save temporary credentials. However, these credentials would typically live in
For individual users, in some cases you'll have actual AWS users with credentials, but you'll often have federated access. In this case the users don't have cacheable credentials -- they've obtained credentials (usually lasting 8-12 hours) through a process involving MFA or perhaps Okta login, and those temporary credentials, again, go in
(What I'm trying to figure out is the easiest way to make rush build work in both local and CI situations, without having to write lots of little helpers that copy credentials to/from Rush's credential cache and the normal AWS credentials file.)
EDIT: Minutes after posting this I discovered |
About amazon provider: I see from discussion that it's been implemented using HTTPS because It only knows how to talk to s3 and is tree-shakable AFAIK. Not sure if it's worth a rewrite, but wanted to make you aware. One definite gain would be to easily support "Normal" ways to authenticate with AWS |
I am testing the functionality with AWS S3 and I was wondering if it would be possible to override the The behaviour that we would like to get is that we use an S3 bucket in our CI environment but developers continue using a local build cache ( |
Is there any interest in adding support for customizing the I'm imagining support for some kind of environment variable token, which would be interpolated when Rush generates the cache entry key. Something like so:
/* build-cache.json */
{
  "cacheEntryNamePattern": "[projectName:normalized]-[hash]-[envVar:NODE_VERSION]"
}
And then the parsePattern function would need to support looking for an This would help cover situations like the one I ran into, where the project hasn't changed, but the context around it in the CI environment has changed.
For us, the build hadn't changed, but a change to the Node version used to build it led to new errors thrown in a Rush command used in the build. That command used the incremental build feature, so when the cache key didn't change between the two builds compiled with different Node versions, it didn't run that incremental command for all of the Rush packages in the repo. It wasn't until the repo saw more changes later in the day that the incremental command ran on all the packages and surfaced a new failure. Being able to factor the Node version (or any other arbitrary environment variable) into the cache key would have helped avoid that mistake. |
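To make the proposal concrete, here is a minimal sketch of how such a pattern might be expanded. It is not Rush's parsePattern: the function `expandCacheEntryName`, the `CacheEntryContext` shape, and the `[envVar:NAME]` token handling are all hypothetical, and the `normalized` rule used here (replacing `/` with `+`) is an assumption made for illustration.

```typescript
// Hypothetical inputs for expanding a cache entry name pattern.
export interface CacheEntryContext {
  projectName: string;
  hash: string;
  env: Record<string, string | undefined>;
}

// Expands tokens of the form [name] or [name:argument].
export function expandCacheEntryName(pattern: string, context: CacheEntryContext): string {
  return pattern.replace(
    /\[([A-Za-z]+)(?::([^\]]+))?\]/g,
    (token: string, name: string, argument: string | undefined): string => {
      switch (name) {
        case "projectName":
          // Assumed normalization: make scoped names safe as file names.
          return argument === "normalized"
            ? context.projectName.replace(/\//g, "+")
            : context.projectName;
        case "hash":
          return context.hash;
        case "envVar": {
          const value = argument === undefined ? undefined : context.env[argument];
          if (value === undefined) {
            // Refuse to build a key from a missing variable.
            throw new Error(`Environment variable for token ${token} is not set`);
          }
          return value;
        }
        default:
          throw new Error(`Unrecognized token: ${token}`);
      }
    }
  );
}
```

Throwing on an unset variable is one possible design choice; silently substituting an empty string would let two different environments share a key, which is the failure mode described above.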
Double the request of the @robmosca-travelperk had similar thoughts. I just set up S3 caching and it works wonders and on CI it speeds up everything A LOT. Thanks! Although, few issues with dangerous cache hits, more below :) An issue that I encountered, that seems to be similar to @istateside, is that the output of certain commands can change with the environment variable. So, let's say, for my libraries, it will be absent, as libs are not dependent nor inject any envs into resulting
{
"operationName": "build",
"outputFolderNames": ["dist"],
"breakingEnvVars": ["MODE"]
}
So if it runs with I tried to get around that with some Also, disabling cache with env or parameter will be nice. So if you have such a situation like above, you can just |
Here is how Turborepo handles that. Unsure about how global it is (I like the colocation of |
I think in Rush that rather than modifying the cache key, we'd want to model a dependency on environment variables as a dependency proper -- it would change the hash and the hashes of all other dependent projects in the monorepo. For example, perhaps |
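One way to picture treating environment variables as a hash input rather than a key suffix is sketched below. This is an assumption-laden illustration, not Rush's hashing code: `hashWithEnvironment` is a hypothetical function, and it assumes each operation declares the variable names it depends on.

```typescript
import { createHash } from "crypto";

// Hypothetical: fold the values of declared environment variables into a
// project's state hash, so that dependents inherit the change as well.
export function hashWithEnvironment(
  baseHash: string,
  envVarNames: readonly string[],
  env: Record<string, string | undefined>
): string {
  const hasher = createHash("sha1");
  hasher.update(baseHash);

  // Sort the names so the declaration order does not affect the result.
  for (const name of [...envVarNames].sort()) {
    const value = env[name];
    // Distinguish "unset" from "set to an empty string".
    hasher.update(value === undefined ? `\0${name}\0unset` : `\0${name}\0set\0${value}`);
  }

  return hasher.digest("hex");
}
```

Because the result would replace the project's own hash, any project that depends on it would pick up a new hash too, while operations that declare no variables keep their cache entries across environments.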
Just wanted to +1 the need for environment variables affecting cache. Not having this is pretty sad for frontend projects that bake environment into output. |
Just want to emphasize that it should be per command, and not a global thing per project/package. Only one command may be sensitive to envs, while the rest may be fine with any envs, so their cache can be reused across different environments, e.g. local development, CI/CD etc. |
What is the easiest way to get the current/most-recent build-cache key hits after a successful run? I would like to be able to purge tarballs that are no longer relevant. Our build-cache dir can get quite big... |
If we're talking about buckets in the cloud, you can set a policy to remove files that haven't been accessed for some time. If we're talking about a local cache, I think you can write a script that goes through the files, reads their created dates, and deletes them accordingly. |
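For the local case, such a script could be as small as the sketch below. It is only an illustration: `pruneOldCacheEntries` is a hypothetical name, and it uses each file's modification time as a stand-in for "how recently this entry was relevant", which may or may not match how your cache entries are touched.

```typescript
import * as fs from "fs";
import * as path from "path";

// Hypothetical helper: delete files in a local build cache folder whose
// modification time is older than maxAgeMs. Returns the removed file names.
export function pruneOldCacheEntries(
  cacheFolder: string,
  maxAgeMs: number,
  now: number = Date.now()
): string[] {
  const removed: string[] = [];

  for (const entry of fs.readdirSync(cacheFolder, { withFileTypes: true })) {
    if (!entry.isFile()) {
      continue; // Leave subfolders and anything unexpected alone.
    }
    const entryPath = path.join(cacheFolder, entry.name);
    const ageMs = now - fs.statSync(entryPath).mtimeMs;
    if (ageMs > maxAgeMs) {
      fs.unlinkSync(entryPath);
      removed.push(entry.name);
    }
  }

  return removed.sort();
}
```

For example, `pruneOldCacheEntries("common/temp/build-cache", 14 * 24 * 60 * 60 * 1000)` would remove entries older than two weeks, assuming the cache lives in that folder.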
@RIP21 I totally could, but I also think it would be finicky script; I would personally opt for grepping around the rush-lib/rushstack-repo to find the bits and pieces I need to calculate the build-cache keys post run. It seems like it would be relatively easy to have rush dump the cache-hit keys after a build-run. I found the s3 build-caching to be ok (tho I tried it quite a while ago) We/me are/is using composite github workflows w/ github-actions-cache to push and pull the build-cache. |
Summary
Rush has an incremental build feature, where projects whose non-gitignored files and external dependencies haven't changed since the last build aren't rebuilt. This feature has several limitations: it doesn't ensure the output files haven't changed, only the previous build is accounted for, and incremental builds are confined to a single copy of a repo.
I propose a more robust build caching feature with the following structure:
common/config/temp folder, a remote object store like Azure Storage, some machine-wide shared store, or a combination of those
Feedback is welcome.
Proposed Implementation
I've put together an initial implementation in this pull request: #2390
I propose the following configuration files:
common/config/rush/build-cache.json
This file describes the cache store and the default project folders that are cached. In this initial implementation, I've decided upon two cache stores: a per-repo filesystem location and Azure Storage.
Here is an example per-repo filesystem location common/config/rush/build-cache.json file that stores the cache in common/temp/build-cache
Here is an example Azure Storage common/config/rush/build-cache.json file that stores the cache in an Azure Storage account called example-account, a container called files, and with an example/ prefix for each cache entry:
<project root>/.rush/build-cache.json
This file describes per-project overrides for build cache configuration.
Here is an example <project root>/.rush/build-cache.json file that adds a lib-commonjs cache folder:
Here is an example <project root>/.rush/build-cache.json file that caches a lib-commonjs folder and ignores the projectOutputFolderNames property in common/config/rush/build-cache.json: