How can appending a date stamp work, and what is it for? #189
Comments
Now I'm starting to understand one possible reason for the date stamp: GitHub won't replace an existing cache with the same key, which they don't warn us about. But then, if you allow people to turn off the date stamp, shouldn't you also automatically clear the existing entry as shown here?
Nice finding, highly appreciated! There have been so many fruitless debates about the timestamp; see the long discussion in #138. If you dig into issues and PRs you will find more. Your work collects the facts, and I agree it's best if we put that into the README.

After long debate I agreed to accept a PR to make the timestamp optional. I have never been convinced of its usefulness and therefore I am also not using it, like most users of this action. However, in the interest of the few who have a strong opinion, I merged the option to disable the timestamp. It is optional, use at your own risk (well, that is true anyway ;-) ). Later in #138 there are reports about problems, especially when actions run in parallel. That would actually also be my concern regarding the workaround you mention.

Again, I am not convinced about disabling the timestamp, as I don't see what problem it solves. IMO it does not solve the problem of running out of space, which is definitely a problem for some large projects. However, I think this must be solved in a different way. But if anyone from the non-timestamp users is willing to implement the clear functionality, I will happily merge it.

As you have spent a lot of time on it: feel free to put your findings into a PR for the README. I am super busy at the moment, sorry for the lagging response. But if you or anyone wants to pick this up and improve the docs, I will do my best to review and merge ASAP.
Sure, if two jobs running in parallel try to write the same cache, there's a race condition. It seems like doing that is a “buyer beware” situation that any programmer should be aware of. If you wanted to protect the naïve user from this problem you could always inject a job identifier into the cache key. I guess if GitHub's implementation was very poor it might be possible to corrupt caches, but “surely” they would make cache writes functionally atomic, right?!
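As a rough illustration of that suggestion (a sketch, not something this action does automatically: the key value and matrix below are hypothetical, and the action's `key` input is used as the extra key component):

```yaml
# Sketch: fold the job name and matrix element into the cache key so parallel
# jobs never try to write the same cache entry.
jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: hendrikmuhs/ccache-action@v1
        with:
          key: ${{ github.job }}-${{ matrix.os }}   # job + matrix id as the key component
```

With a per-job key, two matrix elements can never race to save the same cache entry, so the write-side race described above cannot occur.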
I don't understand why you would think it doesn't help prevent running out of space. The default behavior is to accumulate new caches and let GitHub clear up the older ones when it notices you're over the limit, which is not immediate AFAICT; I've often seen the message "Approaching total cache storage limit (27.3 GB of 10 GB Used)". So by default projects are constantly going over the limit. My project was, and disabling the timestamp and using clear is preventing that. I seldom go over the limit at all anymore, because I'm not leaving old caches around; they get replaced.

But exceeding the cache limit wouldn't be a problem at all (except for GitHub's own resource usage concerns) if it weren't for the fact that old caches get marked as used when the updated cache is written. I have a broad matrix build, with each element of the matrix using and contributing cache. When one element finishes running, its new cache will be older than the old cache from another element of the matrix.

Maybe the problem you're referring to is that one single cache can exceed the limit on its own. But you could cap the max cache size at 10 GB and issue an error or warning if the user tries to select a larger cache size, so that's easily addressed, no?

I'm happy to submit a PR for the README once you and I come to some consensus here. I've never developed a GitHub action and don't have a clue how they're tested, so actually implementing the clear functionality is a bigger lift for me.
That's what the
That's not what I wrote. I wrote: "IMO it does not solve the problem of running out of space". In other words: you can prevent running out of disk space this way, the same way you can prevent running out of space by not using the cache at all. I think disabling the timestamp is not a good solution to the problem. It moves cache eviction into the client, but I think cache eviction should happen in the cache implementation itself (server-side). If you run over the limit, GitHub cleans up for you; GitHub does not reject new entries. That's how it should work. Note that GitHub has a much better view of the cache than a runner in a workflow.
Sounds like a configuration problem to me. If you already have problems with short-lived cache entries, I don't understand how disabling the timestamp helps.

Are you sharing caches between different matrix elements? I think that might be the problem. I am using a different key per matrix element.

Again, I am not against disabling the timestamp; use cases are different, and if you are happy with it - together with your manual clear - that's fine for me. In my projects I don't run into the problems you describe; my approach would be:
I guess I'm to blame for parts of the long discussion in #138, and now I continue (or repeat myself) here. Sorry.

Our use case is a normal PR workflow: PRs are reviewed and merged to master. For simplicity, imagine that the build matrix has 4 entries and each cache has size 250 MB. Then 1 GB of storage is used for the master branch. Each PR uses these caches and saves its own updated copies of the caches restricted to the PR branch. Each PR uses 1 GB of storage per pushed update to the branch, because the caches (per branch) aren't replaced, and you quickly exceed 10 GB of cache storage. If you have some active PRs with several updates, you very easily end up with the caches for the master branch being the oldest and evicted. Then a new PR starts with no cache ...

@hendrikmuhs, I understand that you prefer that eviction is handled by the "cache implementation itself", but it just doesn't work very well for the scenario above, which should be quite common. I also understand that timestamps are needed because of the limitation of the cache action posted above - you can't update a cache. (I tested without timestamps earlier and it didn't work for me.) @dabrahams has already pointed out that you can improve cache storage usage by disabling timestamps and clearing caches manually (if I understood it correctly).

The solution I plan to implement in our workflows keeps the timestamps for caches:
I think this should be quite easy after playing with it.

I wonder: could a general solution / improvement be that this action continues to use timestamps for caches, but at the same time makes sure that there is just one cache per ref - the last one - independent of which type of ref? That would probably make my plan above unnecessary. Should I try to create a PR solving this?
For me, I would prefer the ability to save only one copy per key (and automatically delete the old cache if required):

Restore cache => Build => Delete old cache => Save the new cache

For ccache, depending on the size of the repo, the cache can get pretty large. It would kill other caches like vcpkg/msys.
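A minimal sketch of that restore / build / delete / save flow, assuming the separate `actions/cache/restore` and `actions/cache/save` steps plus the GitHub CLI; the key, path, and build command are placeholders, and the workflow's `GITHUB_TOKEN` needs `actions: write` permission for the delete:

```yaml
# Sketch: keep exactly one cache entry per key by deleting the old entry
# before saving the new one under the same key.
steps:
  - uses: actions/checkout@v4

  - uses: actions/cache/restore@v4
    with:
      path: ~/.ccache                      # placeholder cache directory
      key: ccache-${{ matrix.os }}         # hypothetical key, no timestamp appended

  - name: Build
    run: ./build.sh                        # placeholder build step

  - name: Delete the old cache with the same key
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
    run: |
      # 'gh cache delete' fails when no cache matches the key, hence '|| true'
      gh cache delete "ccache-${{ matrix.os }}" -R "${{ github.repository }}" || true

  - uses: actions/cache/save@v4
    with:
      path: ~/.ccache
      key: ccache-${{ matrix.os }}
```

As noted earlier in the thread, this only behaves predictably if no two jobs write the same key concurrently; a concurrency group or a per-job key avoids that.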
@jianmingyong what you describe can be done by setting the option that disables the timestamp.

@hansfn if you have an idea, go ahead. I just checked; I think what we can look into is limiting the amount of cache pollution from PR branches. If we can interact with the cache like you describe, the action could delete branch caches more aggressively, e.g. delete all but 1 before upload, so allowing only up to 2 cache entries per PR branch. Just an idea, we can play around with the parameters.
But, but ... I thought this didn't work because "GitHub won't replace an existing cache with the same key" (as posted at the start of this thread). At least when I tested, the timestamp is needed if you want an updated cache.
No, but that would actually be nice and very consistent. I don't think we should wait for GitHub though ...
Yes, something like that would be very useful for everyone using PR workflows. I currently do it the opposite way - removing the old cache so there is just one per branch. My ugly workflow code is roughly like below. It's a bit conservative - just deleting the oldest cache for the PR (if it exists).
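A general sketch of that idea rather than the exact workflow referenced above: list the caches for the current PR ref via the REST API and delete the oldest one whose key starts with a (hypothetical) `ccache-` prefix, while still saving new timestamped entries as usual.

```yaml
# Sketch: before saving a new timestamped cache for this PR, delete the oldest
# existing cache entry for the same ref (conservative: at most one deletion).
- name: Delete oldest ccache entry for this PR ref
  if: github.event_name == 'pull_request'
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
  run: |
    REF="${{ github.ref }}"
    # List caches for this ref whose key starts with 'ccache-', oldest first,
    # and take the id of the first one (empty if none exist yet).
    OLD_ID=$(gh api \
      "repos/${{ github.repository }}/actions/caches?ref=${REF}&key=ccache-&sort=created_at&direction=asc" \
      --jq '.actions_caches[0].id // empty')
    if [ -n "$OLD_ID" ]; then
      gh api -X DELETE "repos/${{ github.repository }}/actions/caches/${OLD_ID}"
    fi
```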
I did it without this action. Restore cache, compile, and hash the folder (excluding the stats file) to see if the cache changed (if the source code didn't change, the hash won't change); if the hash changed, I delete the old cache before saving the new one. It fits what I want: only one cache per matrix entry, capped at a certain size.

This trick only works if you don't do it in parallel with multiple commits running at the same time, or it will get really messy afterwards. Setting up a concurrency id would be better.

My use case is probably what some people might want, but it is not necessarily required: only storing one cache per key, which can be updated when needed. Though I could make my own action, this is a fairly unique use case with side effects if the same cache is used concurrently, since it's unclear which cache would end up being saved.
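A hedged sketch of that hash-and-compare trick; the cache directory, the `stats*` exclusion, and the key are assumptions, since ccache's on-disk layout differs between versions:

```yaml
# Sketch: hash the ccache directory before and after the build and only replace
# the stored cache when its contents actually changed.
- name: Hash ccache before build
  id: prehash
  run: |
    # ccache's statistics files change on every run, so exclude them from the hash.
    echo "hash=$(find "$HOME/.ccache" -type f ! -name 'stats*' -print0 \
      | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"

- name: Build
  run: ./build.sh                          # placeholder build step

- name: Hash ccache after build
  id: posthash
  run: |
    echo "hash=$(find "$HOME/.ccache" -type f ! -name 'stats*' -print0 \
      | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"

- name: Delete the old cache so the new one can be saved under the same key
  if: steps.prehash.outputs.hash != steps.posthash.outputs.hash
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
  run: gh cache delete "ccache-${{ matrix.os }}" -R "${{ github.repository }}" || true
```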
Yes, I agree. I don't think it works reliably. I think the default - which is appending the date - is the better option. I just answered the question from @jianmingyong, which comes up regularly. Although I think the default is the better option, I don't want to keep people from making their own investigations.

@jianmingyong, if you want to further discuss your use case, feel free to open a new issue or use https://github.com/hendrikmuhs/ccache-action/discussions.
From reading GitHub's docs on the underlying cache mechanism, it seems like the appending could only work if you were treating what we put in the `key:` option as one of the `restore-keys:`, because `key`s are matched exactly. But then, you offer a `restore-keys:` option too. So are you concatenating our specified `key` with the `restore-keys`?

Also, it's not clear what kind of scenario benefits from appending a date stamp. It seems like appending a date stamp can be counterproductive for cache invalidation, because when an old cache is matched it is marked "used" at the end of the run, so it is roughly as new as the copy with the new date stamp. Because there is limited space for caches and GitHub throws away the oldest ones, the matched old cache can become newer than something that is still needed, even though in almost every scenario the copy with the new date stamp should take precedence. Can you explain why anyone would want this option enabled (and eventually put the answer in the README)?
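For reference, a sketch of how a unique (e.g. timestamped) key interacts with `restore-keys` in the underlying `actions/cache`; the key layout this action actually generates may differ, so the names below are illustrative:

```yaml
# Sketch: the unique suffix makes the key new on every run, so a fresh cache
# entry is always saved; restore-keys does a prefix match and restores the
# most recently created entry whose key starts with the prefix.
- uses: actions/cache@v4
  with:
    path: ~/.ccache
    key: ccache-ubuntu-latest-${{ github.run_id }}   # run_id stands in for a date stamp
    restore-keys: |
      ccache-ubuntu-latest-
```

In other words, the exact `key` is not expected to hit on later runs; the `restore-keys` prefix is what finds the newest previous cache, and the unique suffix is what allows a new entry to be saved at all, since an existing key cannot be overwritten.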