Skip to content

Conversation

@akien-mga
Copy link
Member

@akien-mga akien-mga commented Mar 13, 2025

Including the commit sha in the cache key means that we create a new unique cache for every PR commit push or merge event. Previous caches for obsolete commits don't get cleaned up until after 7 days and so our cache quota gets filled extremely fast, occasionally leading to losing our master branch cache.

Through some tests, I found that it works fine to just use the branch ref as part of the cache key, and thus expect that:

  • Only the latest commit in the master or other dev branches has a cache.
  • Only the latest commit in a PR has a cache.

So cache keys take the form:

linux-template-minimal|master
linux-template-minimal|104076/merge

When e.g. merging a PR in master, its CI workflow will reuse the existing linux-template-minimal|master cache, and then replace it with an updated cache using the same key (so the old cache is discarded).

There's a potential problem though, which is that when restoring a cache, GitHub Actions puts a "reserved" lock on it to prevent concurrent access from other workflows. This lock is removed when saving the cache in the same job, but if an intermediate step fails and terminates the job before that, the cache stays stuck in a reserved state (at least for one hour).
Subsequent workflows can't save to it, instead getting an error like this:

Failed to save: Unable to reserve cache with key linux-template-minimal|104076/merge, another job may be creating this cache.

We work it around by ensuring that the cache/save action always run even if previous steps failed, so that the lock is removed, if there's any.

This should drastically reduce our cache congestion.

Comment on lines +65 to +69
- name: Restore Godot build cache
uses: ./.github/actions/godot-cache-restore
with:
cache-name: ${{ matrix.cache-name }}
continue-on-error: true
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just moved the cache restore and save as close as possible to the compiling step to minimal the time where the cache is locked.

scons-cache-limit: ${{ matrix.cache-limit }}

- name: Save Godot build cache
if: always()
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes it run even if previous steps fail, which ensures that if the cache was locked, it gets unlocked.

If the job fails before cache/restore, then this step still runs and passes fine with just warnings:

Warning: Path Validation Error: Path(s) specified in the action for caching do(es) not exist, hence no cache is being saved.
Warning: Cache save failed.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What I haven't tested though is what happens if the job is canceled (e.g. by pushing a new commit to the PR branch). Let's test this in this PR and makes sure it also runs this steps and unlocks the cache key.

@AThousandShips AThousandShips self-requested a review March 13, 2025 15:49
@akien-mga

This comment was marked as resolved.

@AThousandShips
Copy link
Member

Once the CI is done let's push some unrelated change to confirm it replaces the cache entry correctly

@akien-mga
Copy link
Member Author

I'll see if I can add this here as a second commit, I'm just not too familiar with how to format this as an action instead of a workflow.

I've actually put that one in its own PR as it can be merged independently from this: #104077.

Once the CI is done let's push some unrelated change to confirm it replaces the cache entry correctly

Feel free to push commits to my PR branch to do tests, I probably won't have too much time to do it myself.

…eserved

Including the commit sha in the cache key means that we create a new unique
cache for every PR commit push or merge event. Previous caches for obsolete
commits don't get cleaned up until after 7 days and so our cache quota gets
filled extremely fast, occasionally leading to losing our `master` branch
cache.

Through some tests, I found that it works fine to just use the branch ref
as part of the cache key, and thus expect that:
- Only the latest commit in the `master` or other dev branches has a cache.
- Only the latest commit in a PR has a cache.

So cache keys take the form:
```
linux-template-minimal|master
linux-template-minimal|104076/merge
```

When e.g. merging a PR in `master`, its CI workflow will reuse the existing
`linux-template-minimal|master` cache, and then replace it with an updated
cache using the same key (so the old cache is discarded).

There's a potential problem though, which is that when restoring a cache,
GitHub Actions puts a "reserved" lock on it to prevent concurrent access
from other workflows. This lock is removed when saving the cache in the
same job, but if an intermediate step fails and terminates the job before
that, the cache stays stuck in a reserved state (at least for one hour).
Subsequent workflows can't save to it, instead getting an error like this:
```
Failed to save: Unable to reserve cache with key linux-template-minimal|104076/merge, another job may be creating this cache.
```

We work it around by ensuring that the `cache/save` action always run even
if previous steps failed, so that the lock is removed, if there's any.

This should drastically reduce our cache congestion.
@akien-mga akien-mga force-pushed the ci-cache-key-no-sha branch from 5529d07 to 553fbeb Compare March 13, 2025 16:03
@AThousandShips
Copy link
Member

@Repiteo
Copy link
Contributor

Repiteo commented Mar 13, 2025

The SHA was required because caches are immutable. If we wanted to save with the exact same name, it'd need to be immediately preceeded by deleting the existing cache.

@akien-mga
Copy link
Member Author

The SHA was required because caches are immutable. If we wanted to save with the exact same name, it'd need to be immediately preceeded by deleting the existing cache.

Hm, seems like I misread my results in another repo where I experimented with this. It looked to me like it was able to just replace the cache by a new entry with the same name, but now I see indeed that it's failing and still using a 2 days old cache from when I first did that change.

This isn't going to work then :(

@akien-mga akien-mga closed this Mar 13, 2025
@akien-mga akien-mga removed this from the 4.x milestone Mar 13, 2025
@akien-mga akien-mga deleted the ci-cache-key-no-sha branch March 13, 2025 19:15
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants