How can appending a date stamp work, and what is it for? #189
Comments
Now I'm starting to understand one possible reason for the date stamp: GitHub won't replace an existing cache with the same key, which they don't warn us about. But then, if you allow people to turn off the date stamp, shouldn't you also automatically clear the existing entry as shown here?
Nice finding, highly appreciated! There have been so many fruitless debates about the timestamp; see the long discussion in #138. If you dig into issues and PRs you will find more. Your work collects the facts, and I agree it's best if we put that into the README.

After long debate I agreed to accept a PR to make the timestamp optional. I have never been convinced of its usefulness and therefore I am also not using it, like most users of this action. However, in the interest of the few who have a strong opinion, I merged the option to disable the timestamp. It is optional, use at your own risk (well, that is true anyway ;-) ). Later in #138 there are reports about problems, especially when actions run in parallel. That would actually also be my concern regarding the workaround you mention.

Again, I am not convinced about disabling the timestamp, as I don't see what problem it solves. IMO it does not solve the problem of running out of space, which is definitely a problem for some large projects. However, I think this must be solved in a different way. But if anyone from the non-timestamp users is willing to implement the clear functionality, I will happily merge it.

As you have spent a lot of time on it: feel free to put your findings into a PR for the README. I am super busy at the moment, sorry for the lagging response. But if you or anyone wants to pick this up and improve the docs, I will do my best to review and merge ASAP.
Sure, if two jobs running in parallel try to write the same cache, there's a race condition. It seems like doing that is a “buyer beware” situation that any programmer should be aware of. If you wanted to protect the naïve user from this problem you could always inject a job identifier into the cache key. I guess if GitHub's implementation was very poor it might be possible to corrupt caches, but “surely” they would make cache writes functionally atomic, right?!
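As a rough illustration of that suggestion (a sketch, not something this action does automatically: the key value and matrix below are hypothetical, and the action's `key` input is used as the extra key component):

```yaml
# Sketch: fold the job name and matrix element into the cache key so parallel
# jobs never try to write the same cache entry.
jobs:
  build:
    strategy:
      matrix:
        os: [ubuntu-latest, macos-latest]
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - uses: hendrikmuhs/ccache-action@v1
        with:
          key: ${{ github.job }}-${{ matrix.os }}   # job + matrix id as the key component
```

With a per-job key, two matrix elements can never race to save the same cache entry, so the write-side race described above cannot occur.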
I don't understand why you would think it doesn't help prevent running out of space. The default behavior is to accumulate new caches and let GitHub clear up the older ones when it notices you're over the limit, which is not immediate AFAICT; I've often seen the message "Approaching total cache storage limit (27.3 GB of 10 GB Used)". So by default projects are constantly going over the limit. My project was, and disabling the timestamp and using clear is preventing that. I seldom go over the limit at all anymore, because I'm not leaving old caches around; they get replaced.

But exceeding the cache limit wouldn't be a problem at all (except for GitHub's own resource usage concerns) if it weren't for the fact that old caches get marked as used when the updated cache is written. I have a broad matrix build, with each element of the matrix using and contributing cache. When one element finishes running, its new cache will be older than the old cache from another element of the matrix.

Maybe the problem you're referring to is that one single cache can exceed the limit on its own. But you could cap the max cache size at 10 GB and issue an error or warning if the user tries to select a larger cache size, so that's easily addressed, no?

I'm happy to submit a PR for the README once you and I come to some consensus here. I've never developed a GitHub action and don't have a clue how they're tested, so actually implementing the clear functionality is a bigger lift for me.
That's what the
That's not what I wrote. I wrote: "IMO it does not solve the problem of running out of space". In other words: you can prevent running out of disk space this way, the same way you can prevent running out of space by not using the cache at all. I think disabling the timestamp is not a good solution to the problem. It moves cache eviction into the client, but I think cache eviction should happen in the cache implementation itself (server-side). If you run over the limit, GitHub cleans up for you; GitHub does not reject new entries. That's how it should work. Note that GitHub has a much better view of the cache than a runner in a workflow.
Sounds like a configuration problem to me. If you already have problems with short-lived cache entries, I don't understand how disabling the timestamp helps.

Are you sharing caches between different matrix elements? I think that might be the problem. I am using a different key per matrix element.

Again, I am not against disabling the timestamp; use cases are different, and if you are happy with it - together with your manual clear - that's fine for me. In my projects I don't run into the problems you describe; my approach would be:
I guess I'm to blame for parts of the long discussion in #138, and now I continue (or repeat myself) here. Sorry.

Our use case is a normal PR workflow: PRs are reviewed and merged to master. For simplicity, imagine that the build matrix has 4 entries and each cache has size 250 MB. Then 1 GB of storage is used for the master branch. Each PR uses these caches and saves its own updated copies of the caches restricted to the PR branch. Each PR uses 1 GB of storage per pushed update to the branch, because the caches (per branch) aren't replaced, and you quickly exceed 10 GB of cache storage. If you have some active PRs with several updates, you very easily end up with the caches for the master branch being the oldest and evicted. Then a new PR starts with no cache ...

@hendrikmuhs, I understand that you prefer that eviction is handled by the "cache implementation itself", but it just doesn't work very well for the scenario above, which should be quite common. I also understand that timestamps are needed because of the limitation of the cache action posted above - you can't update a cache. (I tested without timestamps earlier and it didn't work for me.) @dabrahams has already pointed out that you can improve cache storage usage by disabling timestamps and clearing caches manually (if I understood it correctly).

The solution I plan to implement in our workflows keeps the timestamps for caches:
I think this should be quite easy after playing with it.

I wonder: could a general solution / improvement be that this action continues to use timestamps for caches, but at the same time makes sure that there is just one cache per ref - the last one - independent of which type of ref? That would probably make my plan above unnecessary. Should I try to create a PR solving this?
For me, I would prefer the ability to save only one copy per key (and automatically delete the old cache if required):

Restore cache => Build => Delete old cache => Save the new cache

For ccache, depending on the size of the repo, the cache can get pretty large. It would kill other caches like vcpkg/msys.
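A minimal sketch of that restore / build / delete / save flow, assuming the separate `actions/cache/restore` and `actions/cache/save` steps plus the GitHub CLI; the key, path, and build command are placeholders, and the workflow's `GITHUB_TOKEN` needs `actions: write` permission for the delete:

```yaml
# Sketch: keep exactly one cache entry per key by deleting the old entry
# before saving the new one under the same key.
steps:
  - uses: actions/checkout@v4

  - uses: actions/cache/restore@v4
    with:
      path: ~/.ccache                      # placeholder cache directory
      key: ccache-${{ matrix.os }}         # hypothetical key, no timestamp appended

  - name: Build
    run: ./build.sh                        # placeholder build step

  - name: Delete the old cache with the same key
    env:
      GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
    run: |
      # 'gh cache delete' fails when no cache matches the key, hence '|| true'
      gh cache delete "ccache-${{ matrix.os }}" -R "${{ github.repository }}" || true

  - uses: actions/cache/save@v4
    with:
      path: ~/.ccache
      key: ccache-${{ matrix.os }}
```

As noted earlier in the thread, this only behaves predictably if no two jobs write the same key concurrently; a concurrency group or a per-job key avoids that.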
@jianmingyong what you describe can be done by setting the option that disables the timestamp.

@hansfn if you have an idea, go ahead. I just checked; I think what we can look into is limiting the amount of cache pollution from PR branches. If we can interact with the cache like you describe, the action could delete branch caches more aggressively, e.g. delete all but 1 before upload, so allowing only up to 2 cache entries per PR branch. Just an idea, we can play around with the parameters.
But, but ... I thought this didn't work because "GitHub won't replace an existing cache with the same key" (as posted at the start of this thread). At least when I tested, the timestamp is needed if you want an updated cache.
No, but that would actually be nice and very consistent. I don't think we should wait for GitHub though ...
Yes, something like that would be very useful for everyone using PR workflows. I currently do it the opposite way - removing the old cache so there is just one per branch. My ugly workflow code is roughly like below. It's a bit conservative - just deleting the oldest cache for the PR (if it exists).
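A general sketch of that idea rather than the exact workflow referenced above: list the caches for the current PR ref via the REST API and delete the oldest one whose key starts with a (hypothetical) `ccache-` prefix, while still saving new timestamped entries as usual.

```yaml
# Sketch: before saving a new timestamped cache for this PR, delete the oldest
# existing cache entry for the same ref (conservative: at most one deletion).
- name: Delete oldest ccache entry for this PR ref
  if: github.event_name == 'pull_request'
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
  run: |
    REF="${{ github.ref }}"
    # List caches for this ref whose key starts with 'ccache-', oldest first,
    # and take the id of the first one (empty if none exist yet).
    OLD_ID=$(gh api \
      "repos/${{ github.repository }}/actions/caches?ref=${REF}&key=ccache-&sort=created_at&direction=asc" \
      --jq '.actions_caches[0].id // empty')
    if [ -n "$OLD_ID" ]; then
      gh api -X DELETE "repos/${{ github.repository }}/actions/caches/${OLD_ID}"
    fi
```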
I did it without this action. Restore cache, compile, and hash the folder (excluding the stats file) to see if the cache changed (if the source code didn't change, the hash won't change); if the hash changed, I delete the old cache before saving the new one. It fits what I want: only one cache per matrix entry, capped at a certain size.

This trick only works if you don't do it in parallel with multiple commits running at the same time, or it will get really messy afterwards. Setting up a concurrency id would be better.

My use case is probably what some people might want, but it is not necessarily required: only storing one cache per key, which can be updated when needed. Though I could make my own action, this is a fairly unique use case with side effects if the same cache is used concurrently, since it's unclear which cache would end up being saved.
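A hedged sketch of that hash-and-compare trick; the cache directory, the `stats*` exclusion, and the key are assumptions, since ccache's on-disk layout differs between versions:

```yaml
# Sketch: hash the ccache directory before and after the build and only replace
# the stored cache when its contents actually changed.
- name: Hash ccache before build
  id: prehash
  run: |
    # ccache's statistics files change on every run, so exclude them from the hash.
    echo "hash=$(find "$HOME/.ccache" -type f ! -name 'stats*' -print0 \
      | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"

- name: Build
  run: ./build.sh                          # placeholder build step

- name: Hash ccache after build
  id: posthash
  run: |
    echo "hash=$(find "$HOME/.ccache" -type f ! -name 'stats*' -print0 \
      | sort -z | xargs -0 sha256sum | sha256sum | cut -d' ' -f1)" >> "$GITHUB_OUTPUT"

- name: Delete the old cache so the new one can be saved under the same key
  if: steps.prehash.outputs.hash != steps.posthash.outputs.hash
  env:
    GH_TOKEN: ${{ secrets.GITHUB_TOKEN }}    # token needs actions: write
  run: gh cache delete "ccache-${{ matrix.os }}" -R "${{ github.repository }}" || true
```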
Yes, I agree. I don't think it works reliably. I think the default - which is appending the date - is the better option. I just answered the question from @jianmingyong, which comes up regularly. Although I think the default is the better option, I don't want to keep people from making their own investigations.

@jianmingyong, if you want to further discuss your use case, feel free to open a new issue or use https://github.com/hendrikmuhs/ccache-action/discussions.
From reading GitHub's docs on the underlying cache mechanism, it seems like the appending could only work if you were treating what we put in the `key:` option as one of the `restore-keys:`, because `key`s are matched exactly. But then, you offer a `restore-keys:` option too. So are you concatenating our specified `key` with the `restore-keys`?

Also, it's not clear what kind of scenario benefits from appending a date stamp. It seems like appending a date stamp can be counterproductive for cache invalidation, because when an old cache is matched it is marked "used" at the end of the run, so it is roughly as new as the copy with the new date stamp. Because there is limited space for caches and GitHub throws away the oldest ones, the matched old cache can become newer than something that is still needed, even though in almost every scenario the copy with the new date stamp should take precedence. Can you explain why anyone would want this option enabled (and eventually put the answer in the README)?
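For reference, a sketch of how a unique (e.g. timestamped) key interacts with `restore-keys` in the underlying `actions/cache`; the key layout this action actually generates may differ, so the names below are illustrative:

```yaml
# Sketch: the unique suffix makes the key new on every run, so a fresh cache
# entry is always saved; restore-keys does a prefix match and restores the
# most recently created entry whose key starts with the prefix.
- uses: actions/cache@v4
  with:
    path: ~/.ccache
    key: ccache-ubuntu-latest-${{ github.run_id }}   # run_id stands in for a date stamp
    restore-keys: |
      ccache-ubuntu-latest-
```

In other words, the exact `key` is not expected to hit on later runs; the `restore-keys` prefix is what finds the newest previous cache, and the unique suffix is what allows a new entry to be saved at all, since an existing key cannot be overwritten.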