Set retention for staging images #525

Open
stp-ip opened this issue Dec 19, 2019 · 32 comments
Assignees
Labels
lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra.
Milestone

Comments

@stp-ip
Member

stp-ip commented Dec 19, 2019

We currently set a 60d retention on staging storage and staging gcb storage, but don't enforce any retention for images.

Use of staging images should be discouraged, so adding a retention policy would help set the right expectations and keep our storage needs lower in the long run.

I am proposing the same 60d retention to keep things the same across all staging retention settings. Happy for other suggestions.

Additional notes:
Currently GCR itself doesn't provide retention settings. We could set the retention on the GCR-created bucket, but I assume this could lead to weird issues.
The other option would be to run a prow job every week to clean up older images.
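
As a rough illustration of what such a cleanup job would have to decide, here is a minimal sketch of the age check, assuming the job already has a list of images with their upload times. The record shape (`digest`, `uploaded`) and the function name are hypothetical and not taken from any existing tool.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 60  # the retention proposed in this issue


def stale_digests(images, now, retention_days=RETENTION_DAYS):
    """Return digests of images uploaded more than retention_days before now.

    `images` is a list of dicts with the hypothetical keys "digest" and
    "uploaded" (a timezone-aware datetime). An image uploaded exactly
    retention_days ago is kept.
    """
    cutoff = now - timedelta(days=retention_days)
    return [img["digest"] for img in images if img["uploaded"] < cutoff]
```

Passing `now` in explicitly, rather than reading the clock inside the function, keeps the check easy to test and makes a dry run reproducible.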

"Manual" removal script example: https://gist.github.com/ahmetb/7ce6d741bd5baa194a3fac6b1fec8bb7

@rajibmitra
Member

I would like to work on a prow job that will clean up the older images.

@rajibmitra
Member

rajibmitra commented Feb 15, 2020

/assign fiorm

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label May 18, 2020
@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Jun 17, 2020
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@spiffxp
Member

spiffxp commented Aug 3, 2020

/reopen

@k8s-ci-robot k8s-ci-robot reopened this Aug 3, 2020
@k8s-ci-robot
Contributor

@spiffxp: Reopened this issue.

In response to this:

/reopen


@spiffxp
Member

spiffxp commented Aug 3, 2020

I agree 60d for GCR is reasonable. Staging GCR repos have images older than 60d. They should not, or people are going to assume they can use them in perpetuity.

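This came up because @msau42 mentioned CSI images were close to 60d and was concerned they would expire and break Kubernetes CI. They won't: the bucket backing GCR (artifacts.k8s-staging-csi.appspot.com) has no lifecycle configuration, unlike the other two staging buckets.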

$ export prj=k8s-staging-csi; for b in $prj $prj-gcb artifacts.$prj.appspot.com; do echo $b: $(gsutil lifecycle get gs://$b); done
k8s-staging-csi: {"rule": [{"action": {"type": "Delete"}, "condition": {"age": 60}}]}
k8s-staging-csi-gcb: {"rule": [{"action": {"type": "Delete"}, "condition": {"age": 60}}]}
artifacts.k8s-staging-csi.appspot.com: gs://artifacts.k8s-staging-csi.appspot.com/ has no lifecycle configuration.
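
For anyone who wants to check this across many buckets, the output above can be interpreted mechanically. The following is a minimal sketch, assuming each bucket's `gsutil lifecycle get` output has already been captured as a string; the function name `delete_age_days` is hypothetical.

```python
import json


def delete_age_days(lifecycle_output):
    """Return the smallest "age" among Delete rules, or None if there is none.

    Non-JSON output, such as the "has no lifecycle configuration." message,
    is treated as "no retention configured".
    """
    try:
        config = json.loads(lifecycle_output)
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return None
    if not isinstance(config, dict):
        return None
    ages = [
        rule["condition"]["age"]
        for rule in config.get("rule", [])
        if rule.get("action", {}).get("type") == "Delete"
        and "age" in rule.get("condition", {})
    ]
    return min(ages) if ages else None
```

A result of `None` for a bucket means nothing in it is deleted by age, which is the case for the `artifacts.*` bucket shown above.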

We should give SIG Storage time to promote their images to k8s.gcr.io and update tests to use them. Then I think we should implement this.

/assign @thockin @dims @bartsmykla
as a heads up

@thockin thockin removed the lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. label Aug 3, 2020
@thockin
Member

thockin commented Aug 3, 2020

I don't object in theory. There isn't a good mechanism to do it, short of writing our own daily thing that loops over every staging repo and nukes old images.
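
To make the shape of such a loop concrete, here is a sketch that only plans the deletions and never runs them. It assumes the stale digests per image have already been determined by some other step. The input mapping and the function name are hypothetical. `gcloud container images delete` with `--force-delete-tags` and `--quiet` is the same command that the "manual" removal gist linked in the issue description relies on.

```python
def plan_deletions(stale_by_image):
    """Build one `gcloud container images delete` command per stale digest.

    `stale_by_image` maps an image path (for example a hypothetical
    "gcr.io/k8s-staging-example/some-image") to a list of digests.
    Commands are returned as argument lists; nothing is executed here.
    """
    commands = []
    for image in sorted(stale_by_image):
        for digest in stale_by_image[image]:
            commands.append([
                "gcloud", "container", "images", "delete",
                f"{image}@{digest}",
                "--force-delete-tags",  # also removes tags pointing at the digest
                "--quiet",              # skip the interactive confirmation
            ])
    return commands
```

Returning the commands instead of executing them allows a dry-run mode in which the job just logs what it would delete, which seems prudent before pointing it at every staging repo.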

@bartsmykla
Contributor

I can help with our own solution :-)

@msau42
Member

msau42 commented Aug 5, 2020

Is there a recommended way to do canary testing with the 60d removal? For example, in csi, our periodic canary testing tests multiple repos' canary images in one job. But some repos are more active than others, and the inactive ones may not have any merges for > 60d. Is there a way we can keep the canary images around to facilitate this workflow without having to promote the canary tag?

@thockin
Member

thockin commented Aug 5, 2020 via email

@msau42
Member

msau42 commented Aug 5, 2020

Canary in our case is actually images with a "canary" tag. Every time a PR merges, we build and repush the "canary" tag. We have a specific canary job that's configured to test using images with the canary tag. We do end up promoting those images with official release tags, and we have separate jobs that test using release images, but we will still have a canary job that tests against head of everything.

@thockin
Member

thockin commented Aug 5, 2020 via email

@msau42
Member

msau42 commented Aug 6, 2020

If there's another way we could achieve a "test against latest for all images, even if some of the repos are inactive", I'm open to suggestions.

@pohly
Contributor

pohly commented Aug 6, 2020

Perhaps we can add a periodic job which rebuilds canary images once a month? The added bonus is that we'll notice if something breaks in the build environment (shouldn't happen, but one never knows...) before actually trying to build a proper release.

@thockin
Member

thockin commented Aug 6, 2020 via email

@msau42
Member

msau42 commented Aug 6, 2020

I think a daily snapshot also works. It's similar to Patrick's suggestion but more frequent. We'll need to redo our tooling to be able to query/find the daily rc, but it's feasible.

I'm curious what other projects are doing, because I can't imagine we're the only ones.

@thockin
Member

thockin commented Aug 6, 2020 via email

@pohly
Contributor

pohly commented Aug 6, 2020

We'll need to redo our tooling to be able to query/find the daily rc, but it's feasible.

So you want "canary-" tags because then a test failure can be reproduced locally with the exact same images? I'm not sure how important that is. If it's a genuine issue that hasn't been fixed yet, a more recent "canary" should still expose it, and if it doesn't, would we really dig into such a failure when it no longer occurs?

@pohly
Contributor

pohly commented Aug 6, 2020

or they are releasing all components together

Built from the same repo? That works because the tip-of-branch components can all be built in the same test job.

But Kubernetes-CSI uses several different repos and then needs to collect the output from different build jobs for a combined test job.

@pohly
Contributor

pohly commented Aug 7, 2020

I guess we could set up a canary job which checks out all of the relevant repos and then builds everything anew each time it runs. 🤷

@thockin
Member

thockin commented Aug 7, 2020 via email

@msau42
Member

msau42 commented Aug 7, 2020

We do promote our canaries. The main challenge here is we have two test jobs, one configured to only run against canaries, and one configured to run against release images. The one configured to run against canary is going to be prone to expiration of staging images for inactive repos unless we have something that will periodically build new images.
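
One way to see which repos are exposed to this is to compare each repo's last canary push against the retention window. This is a sketch only, assuming the last push dates are available from somewhere. The repo names in the usage note, the `warn_days` margin, and the function name are all hypothetical.

```python
from datetime import date, timedelta


def canaries_at_risk(last_push, today, retention_days=60, warn_days=14):
    """Return repos whose canary image expires within warn_days, or already has.

    `last_push` maps a repo name to the date its "canary" tag was last pushed.
    """
    at_risk = []
    for repo, pushed in sorted(last_push.items()):
        expires = pushed + timedelta(days=retention_days)
        if expires - today <= timedelta(days=warn_days):
            at_risk.append(repo)
    return at_risk
```

For example, with `today = date(2020, 8, 7)`, a repo last pushed on `date(2020, 5, 1)` is reported, while one pushed on `date(2020, 8, 1)` is not. A periodic rebuild job would only need to refresh the repos this returns.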

@spiffxp
Member

spiffxp commented Oct 27, 2020

/lifecycle frozen

@k8s-ci-robot k8s-ci-robot added the lifecycle/frozen Indicates that an issue or PR should not be auto-closed due to staleness. label Oct 27, 2020
@pohly
Contributor

pohly commented Oct 28, 2020

We do promote our canaries. ... The one configured to run against canary is going to be prone to expiration of staging images for inactive repos

I think you meant "we don't promote our canaries", right?

unless we have something that will periodically build new images

Here's a PR which tentatively defines a job which refreshes "canary" for one repo: kubernetes/test-infra#19734

Release candidates are still problematic. We sometimes need those while preparing new sidecars for an upcoming Kubernetes release. On the other hand, the time period where we do need them might be small enough that the normal retention period is okay, so this might not be a problem?

@msau42
Member

msau42 commented Oct 28, 2020

What I meant was we do promote canary builds to official release version tags. We don't promote the "canary" tag.

Yes, I think we can treat release candidates separately. We don't want to promote release candidates or merge any tests that depend on release candidates in k/k.

@spiffxp spiffxp added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Jan 22, 2021
@spiffxp
Member

spiffxp commented Feb 18, 2021

From https://cloud.google.com/container-registry/docs/managing#deleting_images:

  • "Do not apply Cloud Storage retention policies to storage buckets used by Container Registry. These policies do not work for managing images in Container Registry storage buckets." - so we're following the recommended guidance by not setting these
  • https://github.com/sethvargo/gcr-cleaner is a not-official-Google-product that could accomplish this

@spiffxp
Member

spiffxp commented Jul 16, 2021

/milestone v1.23

@k8s-ci-robot k8s-ci-robot added this to the v1.23 milestone Jul 16, 2021
@k8s-ci-robot k8s-ci-robot added sig/k8s-infra Categorizes an issue or PR as relevant to SIG K8s Infra. and removed wg/k8s-infra labels Sep 29, 2021
@ameukam
Member

ameukam commented Dec 14, 2021

/milestone clear

@k8s-ci-robot k8s-ci-robot removed this from the v1.23 milestone Dec 14, 2021
@dims dims assigned ameukam and unassigned dims Jan 31, 2022
@ameukam
Member

ameukam commented Mar 3, 2024

/milestone v1.32

@k8s-ci-robot k8s-ci-robot added this to the v1.32 milestone Mar 3, 2024