Garbage collection for cache directory #71

aomarks · 2022-04-03T17:41:39Z

The .wireit/<script>/cache directory currently can grow indefinitely. We should implement a garbage collection strategy to cap the size of this directory.

An LRU cache with a configurable maximum number of entries seems like what we want.

We will want to make sure we have an efficient way to maintain the cache hit rate data which scales well with the size of the cache. We will probably want some kind of on-disk index file that lets us read/write cache hit rates efficiently, to determine which cache entry needs to be deleted when the cap is hit. A doubly-linked-list implemented in the filesystem itself with symlinks (or just files containing SHAs) could also be an interesting way to do this.

The text was updated successfully, but these errors were encountered:

joshkraft · 2023-01-29T23:14:16Z

Hi @aomarks - this issue piqued my curiosity. I am wondering what the current status is? I'm interested in giving it a shot, but it has been sitting for awhile so I figured I would check before digging in.

aomarks · 2023-01-30T00:42:39Z

Hi @joshkraft! Contributions would be more than welcome.

Lately I've been wondering if local caching is something we really want in general. One thing it's nice for is when switching back and forth between git branches, or undos. But I'm not sure the complexity and additional latency of maintaining the local cache is worthwhile. What do you think? Have you been relying on local caching? Did you find that the cache folder was getting too big?

joshkraft · 2023-02-01T20:32:55Z

That is a great question. I actually just started using wireit, so unfortunately I have no real opinion on the utility of the local caching.

If the goal is keeping additional complexity/latency low, one interesting approach I noticed in another project is to inject some randomness into the process. Rather than keeping track of every get/put to the cache, each execution calls a random() function that returns true 1%/5%/etc. of the time. If it returns true, kick off a 'cache cleaning' process.

Cache cleaning could be done in a number of ways:

delete random cache entries until we are beneath the desired limit (could also consider the 2-random approach)
delete oldest cache entries (based on stats.mtimeMs) until we are beneath the desired limit
delete all cache entries that are older than some time (could be user defined)

I am curious if you think this approach would be a good fit here? I'm happy to do some experimentation and see what it would look like in this project.

joshkraft · 2023-03-22T01:45:52Z

Hi @aomarks,

I've had a chance to start using Wireit personally and I have found local caching to be useful, as someone who is a serial branch switcher.

I did some POC work implementing a few different eviction mechanisms. My guess is that the additional latency introduced here would largely be driven by the time it takes to delete evicted entries from the cache. I'm curious if you have suggestions on some parameters I can use to benchmark the different approaches? For example:

What would be a reasonable cache limit to default to? 100, 1000?
What might a 'practical worst-case scenario' output folder look like? Interested in size, structure, etc.

Also, feel free to let me know if adding this functionality doesn't align with the project's goals. I'm happy to investigate some options and report back with benchmarking results, but if the value added here isn't worth the added complexity I'm happy to direct my effort towards another issue.

Westbrook · 2023-04-18T01:48:43Z

How is the mangement of the default cache and the cache that GitHub Actions might you different currently? In particular, I'm leveraging CircleCI, for which I actually turn off the default CI cache mechanism, because I couldn't get it to actually build/save a cache in that context, however that cache does grow indefintiely. Is there something about the GH Actions cache approach that could be leveraged in "local" mode? If not, then raising the priority of this issue would be really nice to see and support expanded consumption of Wireit across projects I work with at Adobe.

joshkraft · 2023-09-26T22:41:54Z

Hi @aomarks || @rictic - I'm thinking about doing some work on this. Do you think this would still be a useful addition to wireit, or not worth the added complexity?

rictic · 2023-09-26T23:51:35Z

I do think this is helpful. for cache size, one idea would be for the size target to be a multiple of the size of the live entries in the cache. something like 3x the size, to comfortably fit a few branches (likely more than 3, because they'll share many cache entries)

WDYT about a design for this @aomarks? I'm thinking a wireit vacuum command that would analyze the current package and each referenced package and do GC over each of them in parallel

aomarks · 2023-09-27T00:23:07Z

Hi @aomarks || @rictic - I'm thinking about doing some work on this. Do you think this would still be a useful addition to wireit, or not worth the added complexity?

Yeah, I do think that we should either implement garbage collection, or get rid of local caching.

I am a little unsure if local caching is worthwhile, so I'd be interested to hear @joshkraft and @rictic 's thoughts on that. Remote caching for CI is clearly useful, and it does its own GC. Local caching feels a bit less useful, and it comes with a cost (latency & storage).

WDYT about a design for this @aomarks? I'm thinking a wireit vacuum command that would analyze the current package and each referenced package and do GC over each of them in parallel

That sounds cool. Are you thinking that it's up the user to run the vacuum command? Would we also run it automatically? Or maybe prompt the user to run it sometimes?

rictic · 2023-09-27T01:56:38Z

@aomarks in a large monorepo like lit I get a lot of mileage out of local caching. Being able to run npm test at the top level and it only runs tests that potentially changed since my last run saves a good bit of time

rictic · 2023-09-27T01:57:26Z

Are you thinking that it's up the user to run the vacuum command? Would we also run it automatically? Or maybe prompt the user to run it sometimes?

Yeah, start manual, and then as we gather experience and performance info we could run it automatically sometimes (maybe randomly, every Nth run?)

joshkraft · 2023-09-28T00:15:19Z

+1, I also enjoy having the local caching when working in beefy monorepos. Starting with the manual approach seems like a good idea before weaving GC deeper into the workflow.

On the UX side, how would users invoke this command manually? Are we thinking something like npx wireit vacuum, or is there a more elegant approach?

aomarks · 2023-09-28T01:07:30Z

On the UX side, how would users invoke this command manually? Are we thinking something like npx wireit vacuum, or is there a more elegant approach?

Maybe it could be similar to --watch? E.g. npm run foo --vacuum would mean "vacuum the cache for foo and all of its transitive dependencies" (would it also mean run it? I don't know, that's maybe why it could be confusing).

npx wireit vacuum could be good too, I suppose that would mean "vacuum the cache for everything in this package"? But would it also apply to all transitive dependencies?

joshkraft · 2023-09-29T01:04:22Z

In the short term, I was personally thinking that a simple solution would be to just have the vacuum/eviction command work at the project level, cleaning up all caches present in the .wireit folders. Perhaps delete all entries that are older than N days?

As far as ongoing cache management, I'm intrigued by the idea of basing size on a multiple of live entries. I'm also wondering if having some sort of user-defined TTL (w/ a sane default) would effective and perhaps simpler... in my use of wireit, I've found that older cache entries are rarely used, though 'older' is relative so it might be nice to have a configurable value.

ObliviousHarmony · 2023-12-20T21:29:07Z

We've currently got developers with multi-gigabyte caches so I'm 100% on board with a pruning feature.

I am a little unsure if local caching is worthwhile

The configurability of wireit's cache and associated fingerprint are one of the major drivers for our adoption in our monorepo. It is very useful when switching branches and significantly reduces build times. Removing local caching would definitely be a deal breaker for us.

Maybe it could be similar to --watch? E.g. npm run foo --vacuum would mean "vacuum the cache for foo and all of its transitive dependencies" (would it also mean run it? I don't know, that's maybe why it could be confusing).

I like --vacuum (or --prune-cache/--evict-cache or something? You usually "prune" or "evict" caches, not vacuum them, but maybe the terminology I'm imagining is ancient 😅) because, like --watch, it's a modifier to the behavior of the command and would do it prior to running the command. This also aligns with the possibility of making this the default behavior in the future since you could use --no-vacuum to do the inverse.

On the subject of how to handle eviction, it's probably fine to start with something time-based to at least get a solution in-place. We could go back and forth on how best to decide what to evict but ultimately this can be changed since the internals aren't as important to consumers as the fact that it evicts the cache at all.

To add to the chaos of how to evict though, what if we took the (configurable) branch into consideration? It's very likely the case that you'd want the cache of your main branch to persist preferentially, for example.

agawley · 2024-03-14T16:04:06Z

Is there likely to be any movement on this soon? I notice the issue is 2 years old (nearly) and am getting increasingly frustrated with remembering to delete my wireit cache every few weeks to avoid it gobbling my whole hard drive. No worries if it's not a priority. Just trying to get a feel for fix timelines.

aomarks self-assigned this Apr 3, 2022

aomarks mentioned this issue Apr 5, 2022

Local caching #78

Merged

google-admin unassigned aomarks Apr 5, 2022

Westbrook mentioned this issue Apr 20, 2023

ci: create an empty cache by including extra key data when loading cache adobe/spectrum-web-components#3131

Merged

9 tasks

rajsite mentioned this issue May 31, 2023

Use Wireit for nimble-components build ni/nimble#1267

Closed

1 task

jonathansadowski mentioned this issue Jan 30, 2024

Create means to clean up wireit cache woocommerce/woocommerce#44203

Open

2 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Garbage collection for cache directory #71

Garbage collection for cache directory #71

aomarks commented Apr 3, 2022 •

edited

Loading

joshkraft commented Jan 29, 2023

aomarks commented Jan 30, 2023 •

edited

Loading

joshkraft commented Feb 1, 2023

joshkraft commented Mar 22, 2023

Westbrook commented Apr 18, 2023 •

edited

Loading

joshkraft commented Sep 26, 2023

rictic commented Sep 26, 2023

aomarks commented Sep 27, 2023

rictic commented Sep 27, 2023

rictic commented Sep 27, 2023

joshkraft commented Sep 28, 2023

aomarks commented Sep 28, 2023 •

edited

Loading

joshkraft commented Sep 29, 2023

ObliviousHarmony commented Dec 20, 2023

agawley commented Mar 14, 2024

Garbage collection for cache directory #71

Garbage collection for cache directory #71

Comments

aomarks commented Apr 3, 2022 • edited Loading

joshkraft commented Jan 29, 2023

aomarks commented Jan 30, 2023 • edited Loading

joshkraft commented Feb 1, 2023

joshkraft commented Mar 22, 2023

Westbrook commented Apr 18, 2023 • edited Loading

joshkraft commented Sep 26, 2023

rictic commented Sep 26, 2023

aomarks commented Sep 27, 2023

rictic commented Sep 27, 2023

rictic commented Sep 27, 2023

joshkraft commented Sep 28, 2023

aomarks commented Sep 28, 2023 • edited Loading

joshkraft commented Sep 29, 2023

ObliviousHarmony commented Dec 20, 2023

agawley commented Mar 14, 2024

aomarks commented Apr 3, 2022 •

edited

Loading

aomarks commented Jan 30, 2023 •

edited

Loading

Westbrook commented Apr 18, 2023 •

edited

Loading

aomarks commented Sep 28, 2023 •

edited

Loading