cluster-launch-installer-e2e: Set expirationDate tags (openshift/release#1103)

Conversation

@wking (Member) commented Jul 26, 2018

For four hours in the future. Successful runs are currently taking around 40 minutes, so this gives us a reasonable buffer without leaving leaked resources around forever.

CC @smarterclayton.

@openshift-ci-robot added the size/XS label (denotes a PR that changes 0-9 lines, ignoring generated files) on Jul 26, 2018
@wking (Member, Author) commented:

Should I add --utc to this to make searching easier?

A contributor replied:

probably

@wking (Member, Author) replied:

> probably

Done with 1d4385c -> f64b8ea.

For four hours in the future.  Successful runs are currently taking
around 40 minutes [1], so this gives us a reasonable buffer without
leaving leaked resources around forever.

[1]: https://deck-ci.svc.ci.openshift.org/?job=pull-ci-origin-installer-e2e-aws&state=success
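
For reference, generating and applying the tag might look like this
minimal sketch (assuming GNU date; $instance_id is a hypothetical
placeholder, not the template's actual wiring):

  $ expiration="$(date --utc -d '4 hours' '+%Y-%m-%dT%H:%M')"
  $ aws ec2 create-tags --resources "$instance_id" \
      --tags Key=expirationDate,Value="$expiration"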
@wking force-pushed the expiration-date-tags branch from 1d4385c to f64b8ea on July 26, 2018 21:34
@smarterclayton (Contributor) commented:

/lgtm

@openshift-ci-robot added the lgtm label (indicates that a PR is ready to be merged) on Jul 27, 2018
@openshift-merge-robot merged commit 8a29ab3 into openshift:master on Jul 27, 2018
@wking deleted the expiration-date-tags branch on July 27, 2018 15:38
wking added a commit to wking/openshift-installer that referenced this pull request Aug 3, 2018
I'm not familiar enough with grafiti to want to run the old
clean-aws.sh in production.  This commit lands a collection of
per-resource cleaners, as well as an 'all' script that ties them all
together in what seems like a reasonable order.  I'm generally trying
to use the resource creation date with a four-hour expiration to
decide if a resource is stale.  And for resources that don't expose a
creation date, I'm assuming they're stale if they're still around
after an hour (our Prow tasks are running in their own CI-only AWS
account, so we don't have to worry about long-running resources).  I'm
also using jq for filtering, because that's what I was familiar with.
It would be better to use JMESPath and --query to perform these
filters.

It will be much more reliable (and faster) if we can migrate these to
use the new expirationDate tag from openshift/release@f64b8eaf
(cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26,
openshift/release#1103), but I haven't gotten around to that yet.  I'd
also like to use something like:

  $ aws resourcegroupstaggingapi get-resources --query "ResourceTagMappingList[?Tags[? Key == 'expirationDate' && Value < '$(date --utc -d '4 hours' '+%Y-%m-%dT%H:%M')']].ResourceARN" --output text

to get all expired resources in one go regardless of type, but haven't
had time to get that settled either.
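
The creation-date filtering might look roughly like this for EC2
instances (a sketch assuming GNU date and jq; the actual per-resource
cleaners differ in detail):

  $ # instances launched more than four hours ago are considered stale
  $ aws ec2 describe-instances --output json |
      jq -r --arg limit "$(date --utc -d '4 hours ago' '+%Y-%m-%dT%H:%M')" \
        '.Reservations[].Instances[] | select(.LaunchTime < $limit) | .InstanceId'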
wking added a commit to wking/openshift-release that referenced this pull request Aug 17, 2018
Avoiding [1]:

  Invoking installer ...
  time="2018-08-17T18:05:11Z" level=fatal msg="failed to get configuration from file \"inputs.yaml\": inputs.yaml is not a valid config file: yaml: line 75: found unexpected ':'"

due to:

  $ curl -s https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/artifacts/e2e-aws/installer/inputs.yaml | sed -n '75p;76p'
    # Example: `{ "key" = "value", "foo" = "bar" }`
    extraTags: {"expirationDate" = "2018-08-17T22:05+0000"}

The typo is from f64b8ea (cluster-launch-installer-e2e: Set
expirationDate tags, 2018-07-26, openshift#1103).

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/openshift_installer/142/pull-ci-origin-installer-e2e-aws/523/build-log.txt
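
For reference, YAML flow mappings use a colon rather than an equals
sign, so the fixed line reads:

  extraTags: {"expirationDate": "2018-08-17T22:05+0000"}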
wking added a commit to wking/openshift-installer that referenced this pull request Aug 24, 2018
For four hours in the future.  Successful runs are currently taking
around 30 minutes [1], so this gives us a reasonable buffer without
leaving leaked resources around forever.

Adding an explicit +0000 offset makes it obvious that the times are in
UTC (even for folks viewing the tag without knowledge of the
generating script).  And it matches what we're doing for the e2e-aws
tests since openshift/release@f64b8eaf (cluster-launch-installer-e2e:
Set expirationDate tags, 2018-07-26, openshift/release#1103).

The empty line between the sys and yaml imports conforms to PEP 8's
[2]:

  Imports should be grouped in the following order:

  1. Standard library imports.
  2. Related third party imports.
  3. Local application/library specific imports.

  You should put a blank line between each group of imports.

[1]: https://jenkins-tectonic-installer.prod.coreos.systems/job/openshift-installer/view/change-requests/job/PR-166/
[2]: https://www.python.org/dev/peps/pep-0008/#imports
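
In shell terms, the tag format with the explicit offset is (a sketch
assuming GNU date; the commit itself builds the string in Python):

  $ date --utc -d '4 hours' '+%Y-%m-%dT%H:%M+0000'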
wking added a commit to wking/openshift-installer that referenced this pull request Nov 6, 2018
Convert from per-resource scripts to ARN-based cleanup, now that we're
setting expirationDate on our CI clusters (openshift/release@f64b8eaf,
cluster-launch-installer-e2e: Set expirationDate tags, 2018-07-26,
openshift/release#1103).  With this approach, it's more obvious when
we are missing support for deleting a given resource type, although
not all resource types are taggable [1].

I'm using jq to delete resource record sets from our hosted zones.
Apparently Amazon doesn't give those an ID (!), and we need to delete
them before we can delete the hosted zone itself.  Some of the
deletions may fail.  For example:

  An error occurred (InvalidChangeBatch) when calling the
  ChangeResourceRecordSets operation: A HostedZone must contain at
  least one NS record for the zone itself.

but we don't really care, because delete_route53 will succeed or fail
based on the final delete-hosted-zone.

The 'Done' status I'm excluding is part of the POSIX spec for 'jobs'
[2], so it should be portable.

When called without operands, 'wait' exits zero after all jobs have
exited, regardless of the job exit statuses [3].  But tracking the
PIDs so we could wait on them all individually is too much trouble at
the moment.

[1]: https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/Using_Tags.html#tag-ec2-resources-table
[2]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/jobs.html
[3]: http://pubs.opengroup.org/onlinepubs/9699919799/utilities/wait.html
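
The record-set deletion might look something like this sketch
(hypothetical zone ID; unlike the commit, this filters out the NS and
SOA records up front instead of tolerating the InvalidChangeBatch
failure):

  $ zone=Z0000000EXAMPLE  # hypothetical hosted-zone ID
  $ # assumes at least one non-NS/SOA record remains in the zone
  $ aws route53 list-resource-record-sets --hosted-zone-id "$zone" --output json |
      jq '{Changes: [.ResourceRecordSets[]
            | select(.Type != "NS" and .Type != "SOA")
            | {Action: "DELETE", ResourceRecordSet: .}]}' > /tmp/changes.json
  $ aws route53 change-resource-record-sets --hosted-zone-id "$zone" \
      --change-batch file:///tmp/changes.json
  $ aws route53 delete-hosted-zone --id "$zone"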
wking added a commit to wking/openshift-installer that referenced this pull request Feb 5, 2019
Using CI's expirationDate tags, originally from
openshift/release@f64b8eaf (cluster-launch-installer-e2e: Set
expirationDate tags, 2018-07-26, openshift/release#1103) to determine
which VPCs have expired.  Then extract the lookup tags from those VPCs
and use them to create metadata.json.  Feed that metadata.json back
into the installer to delete the orphaned clusters.
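
A sketch of the expiry check (assuming the expirationDate format
above; the commit goes on to turn the expired VPCs' tags into
metadata.json):

  $ aws ec2 describe-vpcs --output json |
      jq -r --arg now "$(date --utc '+%Y-%m-%dT%H:%M+0000')" \
        '.Vpcs[]
         | select(any(.Tags[]?; .Key == "expirationDate" and .Value < $now))
         | .VpcId'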
wking added a commit to wking/openshift-release that referenced this pull request Jun 24, 2021
Bumping the timeout will reduce the frequency of a few problem
classes:

* Jobs which take a while to acquire a Boskos lease, and then aren't
  left with enough ProwJob time remaining to complete their test.  For
  example, [1]:

    INFO[2021-06-16T13:55:36Z] ci-operator version v20210616-eab9ae5
    ...
    INFO[2021-06-16T15:58:16Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-1--aws-quota-slice-05]
    ...
    INFO[2021-06-16T17:55:07Z] Running step e2e-aws-fips-gather-extra.
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-16T17:55:36Z"}

  [1] is an openshift/origin job, and I'm only bumping the timeout for
  the chained-update periodics, but those periodics are exposed to
  this same problem class.

* Jobs where the test is slow enough that it does not fit into the
  ProwJob time, regardless of how quickly they acquire a Boskos
  lease.  For example, [2]:

    INFO[2021-06-23T14:38:04Z] ci-operator version v20210623-b3b4289
    ...
    INFO[2021-06-23T14:39:43Z] Acquired 1 lease(s) for aws-quota-slice: [us-east-2--aws-quota-slice-03]
    ...
    {"component":"entrypoint","file":"prow/entrypoint/run.go:165","func":"k8s.io/test-infra/prow/entrypoint.Options.ExecuteProcess","level":"error","msg":"Process did not finish before 4h0m0s timeout","severity":"error","time":"2021-06-23T18:38:03Z"}

Downsides to raising the timeout are mostly increased CI cost and
reduced CI capacity due to resource consumption.  Mechanisms by which
that can happen include:

* Jobs that actually hang might consume resources for longer.  These
  can be mitigated with lower-level timeouts, e.g. on the steps [3].

* Jobs that were aborted without completing teardown.  These can be
  mitigated by improving workflows to make it more likely that the
  teardown step gets completed when an earlier step times out [4].
  These can also be mitigated by per-job knobs on leaked resource
  reaping, like AWS's expirationDate, which we've set since
  f64b8ea (cluster-launch-installer-e2e: Set expirationDate tags,
  2018-07-26, openshift#1103), and which is used by
  core-services/ipi-deprovision/aws.sh.  In this commit, I'm raising
  the expirationDate offset to 8h for step workflows, because those
  have access to the step timeouts [3].  But I'm leaving it at 4h for
  the templates, because those do not.  Slow AWS jobs which continue
  to use templates should migrate to steps.

  The GCP installer currently lacks the ability to create that sort of
  resource tag, so I'm adjusting some hard-coded defaults in that file
  to keep up with this commit.

[1]: https://prow.ci.openshift.org/view/gs/origin-ci-test/pr-logs/pull/26235/pull-ci-openshift-origin-master-e2e-aws-fips/1405162123687890944
[2]: https://prow.ci.openshift.org/view/gs/origin-ci-test/logs/release-openshift-origin-installer-e2e-aws-upgrade-4.6-to-4.7-to-4.8-to-4.9-ci/1407709521396109312
[3]: https://docs.ci.openshift.org/docs/architecture/timeouts/#step-registry-test-process-timeouts
[4]: https://docs.ci.openshift.org/docs/architecture/timeouts/#how-interruptions-may-be-handled
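
In tag terms, the split being described amounts to (a sketch assuming
GNU date; the actual templates and steps wire this up differently):

  $ date --utc -d '8 hours' '+%Y-%m-%dT%H:%M+0000'  # step workflows
  $ date --utc -d '4 hours' '+%Y-%m-%dT%H:%M+0000'  # templates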
derekhiggins added a commit to derekhiggins/release that referenced this pull request Oct 24, 2023
On a default virtual env the master node running metal3 will only
have 3G free; give us some extra head room so that we can play with
ironic if we need to.

The underlying qcow files are thinly provisioned, so they shouldn't
use up much extra space unless it's needed.
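
Thin provisioning means a larger virtual size costs almost nothing
until blocks are actually written; for example (hypothetical path and
size):

  $ qemu-img create -f qcow2 /tmp/master-0.qcow2 40G
  $ qemu-img info /tmp/master-0.qcow2  # 'disk size' stays small until the guest writes data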