Conversation

@wking (Member) commented Sep 30, 2019

So destroy cluster gets you all the way back to a clean slate, even if the cluster in question died during infrastructure provisioning (leaking mentioned here and here).

This will obviously not clean up the original asset directory in workflows where the user copies their metadata.json over into a new directory and runs destroy cluster in the new directory. But since we have existing asset-state removal code that also behaves that way, I don't think it's a big deal.

cmd/openshift-install/destroy.go is a convenient place to put this now, while all of our providers are Terraform-based. If, in the future, we move some providers off of Terraform (or add new, non-Terraform providers), we can push this down into the platform-specific destroyers. We could also leave it here, because terraform.tfstate is unlikely to both exist in the asset directory for a non-Terraform provider and be something the user wants to keep around, so the risk of a false-positive removal is low.
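
For anyone reading along, the change under discussion amounts to something like the following. This is a minimal sketch based on the review comments below; the helper name and surrounding wiring are assumptions, not the exact diff:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/pkg/errors"
)

// removeTerraformState deletes a leftover terraform.tfstate from the asset
// directory. A missing file is not an error: clusters that never reached (or
// never finished) Terraform provisioning have nothing to clean up.
func removeTerraformState(assetDir string) error {
	tfStateFilePath := filepath.Join(assetDir, "terraform.tfstate")
	if err := os.Remove(tfStateFilePath); err != nil && !os.IsNotExist(err) {
		return errors.Wrap(err, "failed to remove Terraform state")
	}
	return nil
}

func main() {
	// Example: clean the current directory, as 'destroy cluster' would for its asset directory.
	if err := removeTerraformState("."); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}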

@openshift-ci-robot added the size/XS and approved labels on Sep 30, 2019
@wking force-pushed the remove-terraform-state-on-destroy branch from 8e18cbb to 06a836c on September 30, 2019 19:00
@openshift-ci-robot added the size/S label and removed the size/XS label on Sep 30, 2019
@patrickdillon (Contributor) commented Sep 30, 2019

You might be able to tighten this up a little:

if err = os.Remove(tfStateFilePath); !os.IsNotExist(err) {
	return errors.Wrap(err, "failed to remove Terraform state")
}

(stolen from here)

@wking (Member, author)

if err = os.Remove(tfStateFilePath); !os.IsNotExist(err) {

We don't want to error on nil, and IsNotExist(nil) is false. I like separating the call from the error-handling conditionals, but I can squash down to:

if err = os.Remove(tfStateFilePath); err != nil && !os.IsNotExist(err) {

if folks see that as a blocker ;).
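
A minimal, standalone illustration of that distinction (not from the PR; the file name is arbitrary):

package main

import (
	"fmt"
	"os"
)

func main() {
	// os.IsNotExist reports whether an error says a file is missing. For a nil
	// error it returns false, so a bare !os.IsNotExist(err) condition is true
	// even when Remove succeeded, and the early return would be taken.
	fmt.Println(os.IsNotExist(nil)) // false

	err := os.Remove("no-such-file") // a path that does not exist
	fmt.Println(os.IsNotExist(err))  // true: safe to ignore

	// The combined check only fires on real failures (e.g. permission errors):
	if err != nil && !os.IsNotExist(err) {
		fmt.Println("unexpected error:", err)
	}
}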

@patrickdillon (Contributor)

We don't want to error on nil

Oh of course. I left out the most important condition. Nvm!

@patrickdillon (Contributor) commented Sep 30, 2019

Should https://github.com/openshift/installer/blob/master/cmd/openshift-install/destroy.go#L69 already be removing the tfstate file or is that referring to a different "state file"?

@wking (Member, author) commented Sep 30, 2019

Should https://github.com/openshift/installer/blob/master/cmd/openshift-install/destroy.go#L69 already be removing the tfstate file or is that referring to a different "state file"?

Different (that DestroyState call is about installer assets/state, not Terraform state).

@patrickdillon (Contributor)

This lgtm but I will give others a chance to review.

@abhinavdahiya (Contributor)

How is the tfstate not part of the asset graph, compared to the tfvars files?

Can we define the case where the tfstate is left behind first, instead of deleting this file blindly in destroy?

@wking (Member, author) commented Sep 30, 2019

how is the tfstate not part of the asset graph, compared to the tfvars files?

Hmm, yeah. We should be adding it to the state here. I'll hunt through CI to try and find a reproducer.

@wking (Member, author) commented Sep 30, 2019

Here is a job that leaked terraform.tfstate. destroy cluster wrapped up with:

time="2019-09-30T20:21:40Z" level=debug msg="search for untaggable resources"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Terraform Variables\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Kubeconfig Admin Client\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Kubeadmin Password\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Certificate (journal-gatewayd)\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Metadata\" from disk"

But an asset is only committed to the store if its Generate succeeds. The Cluster.Generate call failed, so it (and its Terraform state file) were never committed to the store. Do we want to reroll the store to record the output of a failed Generate call? I'd expect the Cluster asset to be the only case where we'd want that, and in other cases we'd actively not want it to avoid polluting the store with broken content. I'm comfortable handling this Terraform state file (the only file in the Cluster asset) directly as I have here instead of trying to find a way to wedge it into the asset-store framework. I'd also be ok special-casing Cluster somehow to get it committed to the store regardless of success. Thoughts?

@patrickdillon (Contributor)

I'd also be ok special-casing Cluster somehow to get it committed to the store regardless of success. Thoughts?

I'm still getting familiar with this code, but this approach of adding tfstate to the store looks cleaner and less of a special case than direct handling of the state file. I'm not sure of the exact definition of assets, but I think it should encompass this file (something written to disk and needed by the installer).

@abhinavdahiya (Contributor)

@wking

So for the leaked run, was the terraform.tfstate file created in the install_dir without the help of the asset graph? Or did we copy the state file from tmp_terraform_workspace to install_dir but not report it as part of the asset output?

Because if it's the latter, that seems like the bug to me.

@wking (Member, author) commented Oct 1, 2019

So for the leaked run the terraform.tfstate file was created in the install_dir but not with the help of the asset graph??

We created it through the asset graph in Cluster.Generate(). But because that Generate call returned an error (Terraform failing), we exited here without inserting the asset into the store here. So as far as the asset store is concerned, there was no terraform.tfstate under management. As I said above, not adding assets to the store after a failed install is a good thing for all the other assets. But it's not a good thing for the Cluster asset, if the goal is having DestroyState on the store clean up terraform.tfstate. We can fix it with 06a836c0546 as it stands, or with something like:

if err := a.Generate(parents); err != nil {
  if a was the Cluster asset {
    assetState.asset = a
    assetState.source = generatedSource
  }
  return errors.Wrapf(err, "failed to generate asset %q", a.Name())
}
assetState.asset = a
assetState.source = generatedSource

I'm fine either way.
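
A more concrete sketch of that special-case option, with stand-in types for the installer's store internals (illustrative only; the real store and Cluster asset are more involved):

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// Stand-ins for the installer's asset machinery; names are illustrative only.
type Parents map[string]Asset

type Asset interface {
	Name() string
	Generate(Parents) error
}

// Cluster stands in for the asset whose Generate runs Terraform and writes terraform.tfstate.
type Cluster struct{}

func (c *Cluster) Name() string { return "Cluster" }

func (c *Cluster) Generate(Parents) error { return errors.New("terraform apply failed") }

type source int

const generatedSource source = iota

type assetStateEntry struct {
	asset  Asset
	source source
}

// fetch sketches the store path under discussion: commit the Cluster asset
// even when its Generate fails, so a later DestroyState can purge
// terraform.tfstate from disk.
func fetch(a Asset, parents Parents, entry *assetStateEntry) error {
	if err := a.Generate(parents); err != nil {
		if _, isCluster := a.(*Cluster); isCluster {
			entry.asset = a
			entry.source = generatedSource
		}
		return errors.Wrapf(err, "failed to generate asset %q", a.Name())
	}
	entry.asset = a
	entry.source = generatedSource
	return nil
}

func main() {
	var entry assetStateEntry
	err := fetch(&Cluster{}, nil, &entry)
	fmt.Println(err)                // failed to generate asset "Cluster": terraform apply failed
	fmt.Println(entry.asset != nil) // true: the failed Cluster asset was still recorded
}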

@patrickdillon (Contributor)

If assets are only added to the asset store as a result of Generate(), it seems like the best solution would be to delegate that action to the assets themselves. In this case, I think that would mean expanding Generate to accept assetState.

That is probably more code than we want to write to fix this (though not that bad I think), but worth discussing.
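
For the sake of discussion, that might look something like the following hypothetical signature, reusing the stand-in names from the sketch above (not a concrete API proposal):

// Hypothetical variant of the asset interface: Generate receives the store's
// state entry, so an asset can commit partial output (such as
// terraform.tfstate) itself even when it returns an error.
type Asset interface {
	Dependencies() []Asset
	Name() string
	Generate(parents Parents, state *assetStateEntry) error
}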

@wking (Member, author) commented Oct 2, 2019

If assets are only added to the asset store as a result of Generate(), it seems like the best solution would be to delegate that action to the assets themselves.

My asset-graph opinions are in #556, which is far enough from what we have now that I don't have opinions about minor pivots ;). Restructuring the asset framework to allow assets to decide if/when to save themselves would work, but it's a larger pivot than either of the two alternatives I gave here. But folks should pick an approach, and then tell me and I'll implement it ;).

@jstuever (Contributor)

This looks like a good quick fix. We should create a card to revisit if/how to do it using asset-graph.
/lgtm

@openshift-ci-robot added the lgtm label on Oct 11, 2019
@abhinavdahiya (Contributor)

/hold

@openshift-ci-robot added the do-not-merge/hold label on Oct 11, 2019
@wking (Member, author) commented Oct 11, 2019

/retest

Before we reopen discussion here, I want to see if this thing actually compiles ;)

@jstuever (Contributor)

You put a hold on it.
/assign @abhinavdahiya
/unassign @jstuever

@wking (Member, author) commented Jan 16, 2020

/retitle Bug 1791400: cmd/openshift-install/destroy: Remove terraform.tfstate in 'destroy cluster'

@abhinavdahiya (Contributor)

/approve
/lgtm

/hold cancel

@openshift-ci-robot added the lgtm label and removed the do-not-merge/hold label on Feb 4, 2020
@abhinavdahiya removed the needs-rebase label on Feb 4, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jstuever

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@wking (Member, author) commented Feb 5, 2020

Update job 4676 failed with:

Cluster did not complete upgrade: timed out waiting for the condition: Working towards 0.0.1-2020-02-04-234838: 13% complete

Can't be related to my teardown change.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot merged commit 552f107 into openshift:master on Feb 5, 2020
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1791400 has been moved to the MODIFIED state.


In response to this:

Bug 1791400: cmd/openshift-install/destroy: Remove terraform.tfstate in 'destroy cluster'

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                       Commit                                      Rerun command
ci/prow/e2e-aws-disruptive      06a836c0546e1bc038e59892ece8accc46e42b09    /test e2e-aws-disruptive
ci/prow/e2e-libvirt             1000fd4                                     /test e2e-libvirt
ci/prow/e2e-aws-scaleup-rhel7   1000fd4                                     /test e2e-aws-scaleup-rhel7
ci/prow/e2e-ovirt               1000fd4                                     /test e2e-ovirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking deleted the remove-terraform-state-on-destroy branch on February 12, 2020 00:40
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.
