cluster-launch-installer-e2e: Set expirationDate tags #1103
Conversation
Should I add --utc to this to make searching easier?
probably
Done with 1d4385c -> f64b8ea.
For four hours in the future. Successful runs are currently taking around 40 minutes [1], so this gives us a reasonable buffer without leaving leaked resources around forever.

[1]: https://deck-ci.svc.ci.openshift.org/?job=pull-ci-origin-installer-e2e-aws&state=success
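For reference, a tag value four hours out can be computed along these lines (a sketch; the template's exact date invocation may differ):

  $ date --utc -d '+4 hours' '+%Y-%m-%dT%H:%M+0000'

which yields values like the 2018-08-17T22:05+0000 tag quoted in a later commit message below.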
/lgtm
I'm not familiar enough with grafiti to want to run the old clean-aws.sh in production. This commit lands a collection of per-resource cleaners, as well as an 'all' script that ties them all together in what seems like a reasonable order.

I'm generally trying to use the resource creation date with a four-hour expiration to decide if a resource is stale. And for resources that don't expose a creation date, I'm assuming they're stale if they're still around after an hour (our Prow tasks are running in their own CI-only AWS account, so we don't have to worry about long-running resources).

I'm also using jq for filtering, because that's what I was familiar with. It would be better to use JMESPath and --query to perform these filters. It will be much more reliable (and faster) if we can migrate these to use the new expirationDate tag from openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103), but I haven't gotten around to that yet.

I'd also like to use something like:

  $ aws resourcegroupstaggingapi get-resources --query "ResourceTagMappingList[?Tags[? Key == 'expirationDate' && Value < '$(date --utc -d '4 hours' '+%Y-%m-%dT%H:%M')']].ResourceARN" --output text

to get all expired resources in one go regardless of type, but haven't had time to get that settled either.
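For concreteness, here is a minimal sketch of the creation-date staleness check described above, using EBS volumes as an arbitrary resource type (the commit's per-resource cleaners may structure this differently):

  $ aws ec2 describe-volumes --output json |
      jq -r --arg cutoff "$(date --utc -d '-4 hours' '+%Y-%m-%dT%H:%M')" \
        '.Volumes[] | select(.CreateTime < $cutoff) | .VolumeId'

The string comparison works because both timestamps are ISO 8601 in UTC, so lexicographic order matches chronological order.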
Avoiding [1]:

  Invoking installer ...
  time="2018-08-17T18:05:11Z" level=fatal msg="failed to get configuration from file \"inputs.yaml\": inputs.yaml is not a valid config file: yaml: line 75: found unexpected ':'"

due to:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/artifacts/e2e-aws/installer/inputs.yaml | sed -n '75p;76p'
  # Example: `{ "key" = "value", "foo" = "bar" }`
  extraTags: {"expirationDate" = "2018-08-17T22:05+0000"}

The typo is from f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift#1103).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/build-log.txt
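Presumably the fix is to use YAML's colon syntax inside the flow mapping (the = appears to have been copied from the Terraform-style example in the comment), i.e. something like:

  extraTags: {"expirationDate": "2018-08-17T22:05+0000"}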
For four hours in the future. Successful runs are currently taking around 30 minutes [1], so this gives us a reasonable buffer without leaving leaked resources around forever.

Adding an explicit +0000 offset makes it obvious that the times are in UTC (even for folks viewing the tag without knowledge of the generating script). And it matches what we're doing for the e2e-aws tests since openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103).

The empty line between the sys and yaml imports conforms to PEP 8's [2]:

  Imports should be grouped in the following order:

  1. Standard library imports.
  2. Related third party imports.
  3. Local application/library specific imports.

  You should put a blank line between each group of imports.

[1]: https://jenkins-tectonic-installer.prod.coreos.systems/job/openshift-installer/view/change-requests/job/PR-166/
[2]: https://www.python.org/dev/peps/pep-0008/#imports
Convert from per-resource scripts to ARN-based cleanup, now that we're setting expirationDate on our CI clusters (openshift/release@f64b8eaf, cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103). With this approach, it's more obvious when we are missing support for deleting a given resource type, although not all resource types are taggable [1].

I'm using jq to delete resource record sets from our hosted zones. Apparently Amazon doesn't give those an ID (!), and we need to delete them before we can delete the hosted zone itself. Some of the deletions may fail. For example:

  An error occurred (InvalidChangeBatch) when calling the ChangeResourceRecordSets operation: A HostedZone must contain at least one NS record for the zone itself.

but we don't really care, because delete_route54 will succeed or fail based on the final delete-hosted-zone.

The 'Done' status I'm excluding is part of the POSIX spec for 'jobs' [2], so it should be portable. When called without operands, 'wait' exits zero after all jobs have exited, regardless of the job exit statuses [3]. But tracking the PIDs so we could wait on them all individually is too much trouble at the moment.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-ec2-resources-table
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wait.html
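As a rough illustration of the record-set pass described above (not the commit's actual code; assume ZONE_ID holds the hosted-zone ID, and note that the NS/SOA deletions are the ones expected to fail with the InvalidChangeBatch error quoted above):

  $ aws route53 list-resource-record-sets --hosted-zone-id "${ZONE_ID}" --output json |
      jq -c '.ResourceRecordSets[]' |
      while read -r record; do
        # Per-record DELETE; tolerate failures and let delete-hosted-zone decide overall success.
        aws route53 change-resource-record-sets --hosted-zone-id "${ZONE_ID}" \
          --change-batch "{\"Changes\": [{\"Action\": \"DELETE\", \"ResourceRecordSet\": ${record}}]}" || true
      done
  $ aws route53 delete-hosted-zone --id "${ZONE_ID}"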
Using CI's expirationDate tags, originally from openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26, openshift/release#1103), to determine which VPCs have expired. Then extract the lookup tags from those VPCs and use them to create metadata.json. Feed that metadata.json back into the installer to delete the orphaned clusters.
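A minimal sketch of the expiry check (assuming the expirationDate tag format from the earlier commits in this thread; the script's tag extraction and metadata.json assembly are not shown):

  $ aws ec2 describe-vpcs --output json |
      jq -r --arg now "$(date --utc '+%Y-%m-%dT%H:%M+0000')" \
        '.Vpcs[] | select(any(.Tags[]?; .Key == "expirationDate" and .Value < $now)) | .VpcId'

Each matching VPC's tags would then feed a metadata.json for something like openshift-install destroy cluster.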
Bumping the timeout will reduce the frequency of a few problem
classes:
* Jobs which take a while to acquire a Boskos lease, and then aren't left
with enough ProwJob time remaining to complete their test. For
example, [1]:
INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
...
INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
...
INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}
[1] is an openshift/origin job, and I'm only bumping the timeout for
the chained-update periodics, but those periodics are exposed to
this same problem class.
* Jobs where the test is slow enough that it does not fit into the
ProwJob time, regardless of how quickly they acquire a Boskos
lease. For example, [2]:
INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
...
INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
...
{"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}
Downsides to raising the timeout are mostly increases to CI cost and
reduced CI capacity due to resource consumption. Possible mechanisms
include:
* Jobs that actually hang might consume resources for longer. These
can be mitigated with lower-level timeouts, e.g. on the steps [3].
* Jobs that were aborted without completing teardown. These can be
mitigated by improving workflows to make it more likely that the
teardown step gets completed when an earlier step times out [4].
These can also be mitigated by per-job knobs on leaked resource
reaping, like AWS's expirationDate, which we've set since
f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
2018-07-26, openshift#1103), and which is used by
core-services/ipi-deprovision/aws.sh. In this commit, I'm raising
the expirationDate offset to 8h for step workflows (see the sketch
after this commit message), because those have access to the step
timeouts [3]. But I'm leaving it at 4h for
the templates, because those do not. Slow AWS jobs which continue
to use templates should migrate to steps.
The GCP installer currently lacks the ability to create that sort of
resource tag, so I'm adjusting some hard-coded defaults in that file
to keep up with this commit.
[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
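For the expirationDate bullet above, the two offsets boil down to something like the following (a sketch with a hypothetical helper name, not the repo's actual code):

  # Hypothetical helper: compute an expirationDate tag value N hours in the future.
  expiration_date() {
      date --utc -d "+${1} hours" '+%Y-%m-%dT%H:%M+0000'
  }
  expiration_date 8   # step workflows, which also have per-step timeouts [3]
  expiration_date 4   # template-based jobs, which do not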
On a default virtual env, the master node running metal3 will only have 3G free; give us some extra headroom so that we can play with ironic if we need to. The underlying qcow files are thinly provisioned, so they shouldn't use up much extra space unless it's needed.
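To illustrate the thin-provisioning point (generic qemu-img usage, not the exact commands in this change):

  $ qemu-img create -f qcow2 extra.qcow2 40G   # 40G virtual size
  $ qemu-img info extra.qcow2                  # 'disk size' stays small until data is actually written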
CC @smarterclayton.