cluster-launch-installer-e2e: Set expirationDate tags #1103
Conversation
Should I add --utc to this to make searching easier?
probably
Done with 1d4385c -> f64b8ea.
For four hours in the future. Successful runs are currently taking around 40 minutes [1], so this gives us a reasonable buffer without leaving leaked resources around forever.

[1]: https://deck-ci.svc.ci.openshift.org/?job=pull-ci-origin-installer-e2e-aws&state=success
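For reference, a tag value four hours out can be computed along these lines (a sketch; the template's exact date invocation may differ):

  $ date --utc -d '+4 hours' '+%Y-%m-%dT%H:%M+0000'

which yields values like the 2018-08-17T22:05+0000 tag quoted in a later commit message below.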
/lgtm
I'm not familiar enough with grafiti to want to run the old clean-aws.sh in production. This commit lands a collection of per-resource cleaners, as well as an 'all' script that ties them all together in what seems like a reasonable order.

I'm generally trying to use the resource creation date with a four-hour expiration to decide if a resource is stale. And for resources that don't expose a creation date, I'm assuming they're stale if they're still around after an hour (our Prow tasks are running in their own CI-only AWS account, so we don't have to worry about long-running resources).

I'm also using jq for filtering, because that's what I was familiar with. It would be better to use JMESPath and --query to perform these filters. It will be much more reliable (and faster) if we can migrate these to use the new expirationDate tag from openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103), but I haven't gotten around to that yet.

I'd also like to use something like:

  $ aws resourcegroupstaggingapi get-resources --query "ResourceTagMappingList[?Tags[? Key == 'expirationDate' && Value < '$(date --utc -d '4 hours' '+%Y-%m-%dT%H:%M')']].ResourceARN" --output text

to get all expired resources in one go regardless of type, but haven't had time to get that settled either.
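For concreteness, here is a minimal sketch of the creation-date staleness check described above, using EBS volumes as an arbitrary resource type (the commit's per-resource cleaners may structure this differently):

  $ aws ec2 describe-volumes --output json |
      jq -r --arg cutoff "$(date --utc -d '-4 hours' '+%Y-%m-%dT%H:%M')" \
        '.Volumes[] | select(.CreateTime < $cutoff) | .VolumeId'

The string comparison works because both timestamps are ISO 8601 in UTC, so lexicographic order matches chronological order.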
Avoiding [1]:

  Invoking installer ...
  time="2018-08-17T18:05:11Z" level=fatal msg="failed to get configuration from file \"inputs.yaml\": inputs.yaml is not a valid config file: yaml: line 75: found unexpected ':'"

due to:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/artifacts/e2e-aws/installer/inputs.yaml | sed -n '75p;76p'
  # Example: `{ "key" = "value", "foo" = "bar" }`
  extraTags: {"expirationDate" = "2018-08-17T22:05+0000"}

The typo is from f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift#1103).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/build-log.txt
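Presumably the fix is to use YAML's colon syntax inside the flow mapping (the = appears to have been copied from the Terraform-style example in the comment), i.e. something like:

  extraTags: {"expirationDate": "2018-08-17T22:05+0000"}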
For four hours in the future. Successful runs are currently taking around 30 minutes [1], so this gives us a reasonable buffer without leaving leaked resources around forever.

Adding an explicit +0000 offset makes it obvious that the times are in UTC (even for folks viewing the tag without knowledge of the generating script). And it matches what we're doing for the e2e-aws tests since openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103).

The empty line between the sys and yaml imports conforms to PEP 8's [2]:

  Imports should be grouped in the following order:

  1. Standard library imports.
  2. Related third party imports.
  3. Local application/library specific imports.

  You should put a blank line between each group of imports.

[1]: https://jenkins-tectonic-installer.prod.coreos.systems/job/openshift-installer/view/change-requests/job/PR-166/
[2]: https://www.python.org/dev/peps/pep-0008/#imports
Convert from per-resource scripts to ARN-based cleanup, now that we're setting expirationDate on our CI clusters (openshift/release@f64b8eaf, cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103). With this approach, it's more obvious when we are missing support for deleting a given resource type, although not all resource types are taggable [1].

I'm using jq to delete resource record sets from our hosted zones. Apparently Amazon doesn't give those an ID (!), and we need to delete them before we can delete the hosted zone itself. Some of the deletions may fail. For example:

  An error occurred (InvalidChangeBatch) when calling the ChangeResourceRecordSets operation: A HostedZone must contain at least one NS record for the zone itself.

but we don't really care, because delete_route54 will succeed or fail based on the final delete-hosted-zone.

The 'Done' status I'm excluding is part of the POSIX spec for 'jobs' [2], so it should be portable. When called without operands, 'wait' exits zero after all jobs have exited, regardless of the job exit statuses [3]. But tracking the PIDs so we could wait on them all individually is too much trouble at the moment.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-ec2-resources-table
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wait.html
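As a rough illustration of the record-set pass described above (not the commit's actual code; assume ZONE_ID holds the hosted-zone ID, and note that the NS/SOA deletions are the ones expected to fail with the InvalidChangeBatch error quoted above):

  $ aws route53 list-resource-record-sets --hosted-zone-id "${ZONE_ID}" --output json |
      jq -c '.ResourceRecordSets[]' |
      while read -r record; do
        # Per-record DELETE; tolerate failures and let delete-hosted-zone decide overall success.
        aws route53 change-resource-record-sets --hosted-zone-id "${ZONE_ID}" \
          --change-batch "{\"Changes\": [{\"Action\": \"DELETE\", \"ResourceRecordSet\": ${record}}]}" || true
      done
  $ aws route53 delete-hosted-zone --id "${ZONE_ID}"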
Using CI's expirationDate tags, originally from openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103), to determine which VPCs have expired. Then extract the lookup tags from those VPCs and use them to create metadata.json. Feed that metadata.json back into the installer to delete the orphaned clusters.
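A minimal sketch of the expiry check (assuming the expirationDate tag format from the earlier commits in this thread; the script's tag extraction and metadata.json assembly are not shown):

  $ aws ec2 describe-vpcs --output json |
      jq -r --arg now "$(date --utc '+%Y-%m-%dT%H:%M+0000')" \
        '.Vpcs[] | select(any(.Tags[]?; .Key == "expirationDate" and .Value < $now)) | .VpcId'

Each matching VPC's tags would then feed a metadata.json for something like openshift-install destroy cluster.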
Bumping the timeout will reduce the frequency of a few problem
classes:
* Jobs which take a while to acquire a Boskos lease, and then aren't left
with enough ProwJob time remaining to complete their test. For
example, [1]:
INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
...
INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
...
INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}
[1] is an openshift/origin job, and I'm only bumping the timeout for
the chained-update periodics, but those periodics are exposed to
this same problem class.
* Jobs where the test is slow enough that it does not fit into the
ProwJob time, regardless of how quickly they acquire a Boskos
lease. For example, [2]:
INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
...
INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
...
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}
Downsides to raising the timeout are mostly increases to CI cost and
reduced CI capacity due to resource consumption. Possible mechanisms
include:
* Jobs that actually hang might consume resources for longer. These
can be mitigated with lower-level timeouts, e.g. on the steps [3].
* Jobs that were aborted without completing teardown. These can be
mitigated by improving workflows to make it more likely that the
teardown step gets completed when an earlier step times out [4].
These can also be mitigated by per-job knobs on leaked resource
reaping, like AWS's expirationDate, which we've set since
f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
2018-07-26, openshift#1103), and which is used by
core-services/ipi-deprovision/aws.sh. In this commit, I'm raising
the expirationDate offset to 8h for step workflows (see the sketch
after this commit message), because those have access to the step
timeouts [3]. But I'm leaving it at 4h for
the templates, because those do not. Slow AWS jobs which continue
to use templates should migrate to steps.
The GCP installer currently lacks the ability to create that sort of
resource tag, so I'm adjusting some hard-coded defaults in that file
to keep up with this commit.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
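For the expirationDate bullet above, the two offsets boil down to something like the following (a sketch with a hypothetical helper name, not the repo's actual code):

  # Hypothetical helper: compute an expirationDate tag value N hours in the future.
  expiration_date() {
      date --utc -d "+${1} hours" '+%Y-%m-%dT%H:%M+0000'
  }
  expiration_date 8   # step workflows, which also have per-step timeouts [3]
  expiration_date 4   # template-based jobs, which do not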
On a default virtual env, the master node running metal3 will only have 3G free; give us some extra headroom so that we can play with ironic if we need to. The underlying qcow files are thinly provisioned, so they shouldn't use up much extra space unless it's needed.
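To illustrate the thin-provisioning point (generic qemu-img usage, not the exact commands in this change):

  $ qemu-img create -f qcow2 extra.qcow2 40G   # 40G virtual size
  $ qemu-img info extra.qcow2                  # 'disk size' stays small until data is actually written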
CC @smarterclayton.