GHProxy: Cleanup old caches #23621

Merged
k8s-ci-robot merged 1 commit into kubernetes:master from alvaroaleman:cleanup
Sep 17, 2021

Conversation

@alvaroaleman
Member

Currently, ghproxy never cleans up caches. This can lead to inode
exhaustion relatively quickly when apps auth is used, as it results in
many relatively short-lived caches (1h).

This change adds pruning for those caches, which works as follows:

  • The github client will add an expiry header if it sends a request with
    a token that expires
  • Ghproxy will write the expiry time into a metadata file at the root of
    the cache partition
  • A background routine in ghproxy will iterate over all cache partitions
    and delete them when they have expired
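
The three steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code from this PR: the metadata file name, the JSON layout, and the helper names `writeMetadata`, `prune`, and `runDemo` are all hypothetical, and the header handling on the client side is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"os"
	"path/filepath"
	"time"
)

// metadataFileName is a hypothetical name; the PR only says the metadata
// file lives at the root of the cache partition.
const metadataFileName = ".partition-metadata.json"

// partitionMetadata is an assumed on-disk layout for the expiry time.
type partitionMetadata struct {
	ExpiresAt time.Time `json:"expiresAt"`
}

// writeMetadata records when the token behind a cache partition expires.
func writeMetadata(partitionDir string, expiresAt time.Time) error {
	data, err := json.Marshal(partitionMetadata{ExpiresAt: expiresAt})
	if err != nil {
		return err
	}
	return os.WriteFile(filepath.Join(partitionDir, metadataFileName), data, 0o644)
}

// prune removes every partition under baseDir whose metadata says it has
// expired at or before now. Partitions without a metadata file are kept.
func prune(baseDir string, now time.Time) ([]string, error) {
	entries, err := os.ReadDir(baseDir)
	if err != nil {
		return nil, err
	}
	var removed []string
	for _, entry := range entries {
		if !entry.IsDir() {
			continue
		}
		dir := filepath.Join(baseDir, entry.Name())
		raw, err := os.ReadFile(filepath.Join(dir, metadataFileName))
		if os.IsNotExist(err) {
			continue // no expiry recorded, so never evict
		}
		if err != nil {
			return removed, err
		}
		var md partitionMetadata
		if err := json.Unmarshal(raw, &md); err != nil {
			return removed, err
		}
		if md.ExpiresAt.After(now) {
			continue // still valid
		}
		if err := os.RemoveAll(dir); err != nil {
			return removed, err
		}
		removed = append(removed, entry.Name())
	}
	return removed, nil
}

// runDemo builds three partitions in a temp dir and prunes them once.
func runDemo() ([]string, error) {
	base, err := os.MkdirTemp("", "ghproxy-prune-sketch")
	if err != nil {
		return nil, err
	}
	defer os.RemoveAll(base)
	for _, name := range []string{"expired", "fresh", "pat"} {
		if err := os.Mkdir(filepath.Join(base, name), 0o755); err != nil {
			return nil, err
		}
	}
	now := time.Now()
	if err := writeMetadata(filepath.Join(base, "expired"), now.Add(-time.Minute)); err != nil {
		return nil, err
	}
	if err := writeMetadata(filepath.Join(base, "fresh"), now.Add(time.Hour)); err != nil {
		return nil, err
	}
	return prune(base, now)
}

func main() {
	removed, err := runDemo()
	if err != nil {
		panic(err)
	}
	fmt.Println("removed:", removed) // removed: [expired]
}
```

In this sketch the `pat` partition has no metadata file and is therefore never deleted, which mirrors the intent that caches for non-expiring tokens are left alone.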

Fixes #23407

@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 15, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: alvaroaleman

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added area/ghproxy Issues or PRs related to code in /ghproxy approved Indicates a PR has been approved by an approver from all required OWNERS files. area/prow Issues or PRs related to prow sig/testing Categorizes an issue or PR as relevant to SIG Testing. labels Sep 15, 2021
}

go func() {
for range time.NewTicker(cachePruneInterval).C {
Contributor

iirc there's some gotcha with this construct that leaks tickers?

Contributor

if ghproxy uses interrupts, prefer interrupts.Tick()

Member Author

Yeah, the Ticker never gets garbage collected; however, it runs as long as the binary, so that doesn't matter.
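
For cases where the loop should not outlive its caller, the usual standard-library pattern is to keep a handle on the ticker, stop it on exit, and select on a cancellation signal. The sketch below uses only `time` and `context`; it is not the `interrupts.Tick()` helper mentioned above, and `runPeriodically` and `demoRuns` are hypothetical names. (Since Go 1.23 unreferenced tickers are garbage collected even without `Stop`, so the leak concern applies to older toolchains.)

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// runPeriodically calls fn on every tick until ctx is cancelled and
// returns how many times fn ran. The ticker is stopped on return.
func runPeriodically(ctx context.Context, interval time.Duration, fn func()) int {
	ticker := time.NewTicker(interval)
	defer ticker.Stop()
	runs := 0
	for {
		select {
		case <-ctx.Done():
			return runs
		case <-ticker.C:
			fn()
			runs++
		}
	}
}

// demoRuns ticks every 10ms for roughly 200ms and reports the run count.
func demoRuns() int {
	ctx, cancel := context.WithTimeout(context.Background(), 200*time.Millisecond)
	defer cancel()
	return runPeriodically(ctx, 10*time.Millisecond, func() {})
}

func main() {
	fmt.Println("prune ran", demoRuns(), "times before shutdown")
}
```

For a routine that genuinely runs for the lifetime of the binary, as here, the simpler `for range time.NewTicker(...).C` form behaves the same in practice.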


func writecachePartitionMetadata(basePath, tempDir string, expiresAt time.Time) error {
if expiresAt.IsZero() {
return nil
Contributor

Won't this lead to leaks? Why not failsafe to writing metadata that expires at time.Now().Add(time.Hour) or something?

Member Author

No, it won't, and we can't do that. The whole reason the expiry information is passed on from the client is that token validity varies, and in the case of a PAT, it never expires, which is represented by an empty expiresAt. If we unconditionally added 1h here, we would always delete all caches after one hour; we can't do that.

Contributor

Ah, I see - this is to handle PAT. Perhaps a comment would be good to clarify this since it's implicit behavior?

Contributor

I guess I would have expected "no expiry header" -> "no call to writing metadata" rather than "no expiry header" -> "pass an invalid date" -> "do nothing" but maybe that's just me being confused by it all

Contributor

Also I guess it would not hurt to have a default cache TTL for PAT entries, too, since they could hit the same issues that apps auth hits, on a smaller machine or with fewer inodes free? Setting the TTL to a week or something should not cause adverse effects.

Member Author

Added a comment and made it a pointer to further clarify that this might not be set. The TTL is for the entire cache, not individual entries, so we can never evict a PAT cache.
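
A rough sketch of the "pointer means it might not be set" idea: a nil `*time.Time` stands for a token without expiry (the PAT case), so no metadata gets written and the cache is never pruned. The header name `X-Sketch-Token-Expiry`, the RFC 3339 encoding, and the helper names are assumptions for illustration only; the PR text does not state what ghproxy actually uses.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// expiryHeader is a hypothetical header name used only in this sketch.
const expiryHeader = "X-Sketch-Token-Expiry"

// tokenExpiry returns nil when the request carries no expiry header,
// which is how a non-expiring token (such as a PAT) shows up here.
func tokenExpiry(h http.Header) (*time.Time, error) {
	raw := h.Get(expiryHeader)
	if raw == "" {
		return nil, nil
	}
	t, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return nil, fmt.Errorf("parsing %s: %w", expiryHeader, err)
	}
	return &t, nil
}

// describe spells out what a pruner would do with the given expiry.
func describe(expiresAt *time.Time) string {
	if expiresAt == nil {
		return "no expiry: keep partition forever"
	}
	return "expires at " + expiresAt.UTC().Format(time.RFC3339)
}

// describeHeaderValue wires the two helpers together for a raw value.
func describeHeaderValue(raw string) (string, error) {
	h := http.Header{}
	if raw != "" {
		h.Set(expiryHeader, raw)
	}
	exp, err := tokenExpiry(h)
	if err != nil {
		return "", err
	}
	return describe(exp), nil
}

func main() {
	for _, raw := range []string{"", "2021-09-17T12:00:00Z"} {
		out, err := describeHeaderValue(raw)
		if err != nil {
			panic(err)
		}
		fmt.Println(out)
	}
}
```

Using a pointer rather than a zero `time.Time` makes the "not set" case explicit at the call site, which is the clarification asked for earlier in this thread.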

@stevekuznetsov
Contributor

/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Sep 17, 2021
@k8s-ci-robot k8s-ci-robot merged commit b279425 into kubernetes:master Sep 17, 2021
@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Sep 17, 2021
@spiffxp
Contributor

spiffxp commented Sep 17, 2021

I just caught a flake over in #23656 (comment) that I suspect is related to this, only because I've never seen ghproxy flake on me before and this just landed recently

@alvaroaleman
Member Author

> I just caught a flake over in #23656 (comment) that I suspect is related to this, only because I've never seen ghproxy flake on me before and this just landed recently

Thanks. I suspect that is because the test relies on timings, and if things go too slowly, it will fail like this. I'll try to improve this through a fake clock.
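
One common way to take wall-clock timing out of such a test is to put time behind a small interface and advance it by hand. The following is a generic sketch of that technique, not the follow-up change itself; the `clock` interface, `fakeClock`, `isExpired`, and `demo` are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// clock is the only view of time the code under test gets.
type clock interface {
	Now() time.Time
}

// realClock is what production code would pass in.
type realClock struct{}

func (realClock) Now() time.Time { return time.Now() }

// fakeClock only moves when the test tells it to.
type fakeClock struct {
	now time.Time
}

func (f *fakeClock) Now() time.Time          { return f.now }
func (f *fakeClock) Advance(d time.Duration) { f.now = f.now.Add(d) }

// isExpired reports whether expiresAt is at or before the clock's now.
func isExpired(c clock, expiresAt time.Time) bool {
	return !expiresAt.After(c.Now())
}

// demo checks expiry before and after advancing the fake clock by 2h.
func demo() (before, after bool) {
	fc := &fakeClock{now: time.Date(2021, 9, 17, 12, 0, 0, 0, time.UTC)}
	expiresAt := fc.Now().Add(time.Hour)
	before = isExpired(fc, expiresAt)
	fc.Advance(2 * time.Hour)
	after = isExpired(fc, expiresAt)
	return before, after
}

func main() {
	before, after := demo()
	fmt.Println(before, after) // false true
	// The same check against the real clock, for comparison.
	fmt.Println(isExpired(realClock{}, time.Now().Add(time.Hour))) // false
}
```

Because the fake clock advances instantly, the test result no longer depends on how fast the machine running it is.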

@alvaroaleman alvaroaleman deleted the cleanup branch September 18, 2021 20:01
Development

Successfully merging this pull request may close these issues.

GH cache should prune its storage to avoid hitting limits