Conversation

@wking (Member) commented Sep 30, 2019

So destroy cluster gets you all the way back to a clean slate, even if the cluster in question died during infrastructure provisioning (leaking mentioned here and here).

This will obviously not clean up the original asset directory in workflows where the user copies their metadata.json over into a new directory and runs destroy cluster in the new directory. But since we have existing asset-state removal code that also behaves that way, I don't think it's a big deal.

cmd/openshift-install/destroy.go is a convenient place to put this now, while all of our providers are Terraform-based. If, in the future, we move some providers off of Terraform (or add new, non-Terraform providers), we can push this down into the platform-specific destroyers. We could also leave it here, because terraform.tfstate is unlikely to both exist in the asset directory for a non-Terraform provider and be something the user wants to keep around, so the risk of a false-positive removal is low.
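
For anyone reading along, the change under discussion amounts to something like the following. This is a minimal sketch based on the review comments below; the helper name and surrounding wiring are assumptions, not the exact diff:

package main

import (
	"fmt"
	"os"
	"path/filepath"

	"github.com/pkg/errors"
)

// removeTerraformState deletes a leftover terraform.tfstate from the asset
// directory. A missing file is not an error: clusters that never reached (or
// never finished) Terraform provisioning have nothing to clean up.
func removeTerraformState(assetDir string) error {
	tfStateFilePath := filepath.Join(assetDir, "terraform.tfstate")
	if err := os.Remove(tfStateFilePath); err != nil && !os.IsNotExist(err) {
		return errors.Wrap(err, "failed to remove Terraform state")
	}
	return nil
}

func main() {
	// Example: clean the current directory, as 'destroy cluster' would for its asset directory.
	if err := removeTerraformState("."); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}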

@openshift-ci-robot added the size/XS and approved labels on Sep 30, 2019
@wking force-pushed the remove-terraform-state-on-destroy branch from 8e18cbb to 06a836c on September 30, 2019 19:00
@openshift-ci-robot added the size/S label and removed the size/XS label on Sep 30, 2019
@patrickdillon (Contributor) commented Sep 30, 2019

You might be able to tighten this up a little:

if err = os.Remove(tfStateFilePath); !os.IsNotExist(err) {
	return errors.Wrap(err, "failed to remove Terraform state")
}

(stolen from here)

@wking (Member, author)

if err = os.Remove(tfStateFilePath); !os.IsNotExist(err) {

We don't want to error on nil, and IsNotExist(nil) is false. I like separating the call from the error-handling conditionals, but I can squash down to:

if err = os.Remove(tfStateFilePath); err != nil && !os.IsNotExist(err) {

if folks see that as a blocker ;).
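
A minimal, standalone illustration of that distinction (not from the PR; the file name is arbitrary):

package main

import (
	"fmt"
	"os"
)

func main() {
	// os.IsNotExist reports whether an error says a file is missing. For a nil
	// error it returns false, so a bare !os.IsNotExist(err) condition is true
	// even when Remove succeeded, and the early return would be taken.
	fmt.Println(os.IsNotExist(nil)) // false

	err := os.Remove("no-such-file") // a path that does not exist
	fmt.Println(os.IsNotExist(err))  // true: safe to ignore

	// The combined check only fires on real failures (e.g. permission errors):
	if err != nil && !os.IsNotExist(err) {
		fmt.Println("unexpected error:", err)
	}
}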

@patrickdillon (Contributor)

We don't want to error on nil

Oh of course. I left out the most important condition. Nvm!

@patrickdillon (Contributor) commented Sep 30, 2019

Should https://github.com/openshift/installer/blob/master/cmd/openshift-install/destroy.go#L69 already be removing the tfstate file or is that referring to a different "state file"?

@wking (Member, author) commented Sep 30, 2019

Should https://github.com/openshift/installer/blob/master/cmd/openshift-install/destroy.go#L69 already be removing the tfstate file or is that referring to a different "state file"?

Different (that DestroyState call is about installer assets/state, not Terraform state).

@patrickdillon (Contributor)

This lgtm but I will give others a chance to review.

@abhinavdahiya (Contributor)

How is the tfstate not part of the asset graph, compared to the tfvars files?

Can we define the case where the tfstate is left behind first, instead of deleting this file blindly in destroy?

@wking (Member, author) commented Sep 30, 2019

how is the tfstate not part of the asset graph, compared to the tfvars files?

Hmm, yeah. We should be adding it to the state here. I'll hunt through CI to try and find a reproducer.

@wking (Member, author) commented Sep 30, 2019

Here is a job that leaked terraform.tfstate. destroy cluster wrapped up with:

time="2019-09-30T20:21:40Z" level=debug msg="search for untaggable resources"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Terraform Variables\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Kubeconfig Admin Client\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Kubeadmin Password\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Certificate (journal-gatewayd)\" from disk"
time="2019-09-30T20:21:40Z" level=debug msg="Purging asset \"Metadata\" from disk"

But an asset is only committed to the store if its Generate succeeds. The Cluster.Generate call failed, so it (and its Terraform state file) were never committed to the store. Do we want to reroll the store to record the output of a failed Generate call? I'd expect the Cluster asset to be the only case where we'd want that, and in other cases we'd actively not want it to avoid polluting the store with broken content. I'm comfortable handling this Terraform state file (the only file in the Cluster asset) directly as I have here instead of trying to find a way to wedge it into the asset-store framework. I'd also be ok special-casing Cluster somehow to get it committed to the store regardless of success. Thoughts?

@patrickdillon (Contributor)

I'd also be ok special-casing Cluster somehow to get it committed to the store regardless of success. Thoughts?

I'm still getting familiar with this code, but this approach of adding tfstate to the store looks cleaner and less of a special case than direct handling of the state file. I'm not sure of the exact definition of assets, but I think it should encompass this file (something written to disk and needed by the installer).

@abhinavdahiya (Contributor)

@wking

So for the leaked run, was the terraform.tfstate file created in the install_dir without the help of the asset graph? Or did we copy the state file from tmp_terraform_workspace to install_dir but not report it as part of the asset output?

Because if it's the latter, that seems like the bug to me.

@wking (Member, author) commented Oct 1, 2019

So for the leaked run the terraform.tfstate file was created in the install_dir but not with the help of the asset graph??

We created it through the asset graph in Cluster.Generate(). But because that Generate call returned an error (Terraform failing), we exited here without inserting the asset into the store here. So as far as the asset store is concerned, there was no terraform.tfstate under management. As I said above, not adding assets to the store after a failed install is a good thing for all the other assets. But it's not a good thing for the Cluster asset, if the goal is having DestroyState on the store clean up terraform.tfstate. We can fix it with 06a836c0546 as it stands, or with something like:

if err := a.Generate(parents); err != nil {
  if a was the Cluster asset {
    assetState.asset = a
    assetState.source = generatedSource
  }
  return errors.Wrapf(err, "failed to generate asset %q", a.Name())
}
assetState.asset = a
assetState.source = generatedSource

I'm fine either way.
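
A more concrete sketch of that special-case option, with stand-in types for the installer's store internals (illustrative only; the real store and Cluster asset are more involved):

package main

import (
	"fmt"

	"github.com/pkg/errors"
)

// Stand-ins for the installer's asset machinery; names are illustrative only.
type Parents map[string]Asset

type Asset interface {
	Name() string
	Generate(Parents) error
}

// Cluster stands in for the asset whose Generate runs Terraform and writes terraform.tfstate.
type Cluster struct{}

func (c *Cluster) Name() string { return "Cluster" }

func (c *Cluster) Generate(Parents) error { return errors.New("terraform apply failed") }

type source int

const generatedSource source = iota

type assetStateEntry struct {
	asset  Asset
	source source
}

// fetch sketches the store path under discussion: commit the Cluster asset
// even when its Generate fails, so a later DestroyState can purge
// terraform.tfstate from disk.
func fetch(a Asset, parents Parents, entry *assetStateEntry) error {
	if err := a.Generate(parents); err != nil {
		if _, isCluster := a.(*Cluster); isCluster {
			entry.asset = a
			entry.source = generatedSource
		}
		return errors.Wrapf(err, "failed to generate asset %q", a.Name())
	}
	entry.asset = a
	entry.source = generatedSource
	return nil
}

func main() {
	var entry assetStateEntry
	err := fetch(&Cluster{}, nil, &entry)
	fmt.Println(err)                // failed to generate asset "Cluster": terraform apply failed
	fmt.Println(entry.asset != nil) // true: the failed Cluster asset was still recorded
}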

@patrickdillon (Contributor)

If assets are only added to the asset store as a result of Generate(), it seems like the best solution would be to delegate that action to the assets themselves. In this case, I think that would mean expanding Generate to accept assetState.

That is probably more code than we want to write to fix this (though not that bad I think), but worth discussing.
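
For the sake of discussion, that might look something like the following hypothetical signature, reusing the stand-in names from the sketch above (not a concrete API proposal):

// Hypothetical variant of the asset interface: Generate receives the store's
// state entry, so an asset can commit partial output (such as
// terraform.tfstate) itself even when it returns an error.
type Asset interface {
	Dependencies() []Asset
	Name() string
	Generate(parents Parents, state *assetStateEntry) error
}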

@wking (Member, author) commented Oct 2, 2019

If assets are only added to the asset store as a result of Generate(), it seems like the best solution would be to delegate that action to the assets themselves.

My asset-graph opinions are in #556, which is far enough from what we have now that I don't have opinions about minor pivots ;). Restructuring the asset framework to allow assets to decide if/when to save themselves would work, but it's a larger pivot than either of the two alternatives I gave here. But folks should pick an approach, and then tell me and I'll implement it ;).

@jstuever (Contributor)

This looks like a good quick fix. We should create a card to revisit if/how to do it using asset-graph.
/lgtm

@openshift-ci-robot added the lgtm label on Oct 11, 2019
@abhinavdahiya (Contributor)

/hold

@openshift-ci-robot added the do-not-merge/hold label on Oct 11, 2019
@wking (Member, author) commented Oct 11, 2019

/retest

Before we reopen discussion here, I want to see if this thing actually compiles ;)

@jstuever (Contributor)

You put a hold on it.
/assign @abhinavdahiya
/unassign @jstuever

@wking (Member, author) commented Jan 16, 2020

/retitle Bug 1791400: cmd/openshift-install/destroy: Remove terraform.tfstate in 'destroy cluster'

@abhinavdahiya (Contributor)

/approve
/lgtm

/hold cancel

@openshift-ci-robot added the lgtm label and removed the do-not-merge/hold label on Feb 4, 2020
@abhinavdahiya removed the needs-rebase label on Feb 4, 2020
@openshift-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: abhinavdahiya, jstuever

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of the required OWNERS files.

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

6 similar comments

@wking (Member, author) commented Feb 5, 2020

Update job 4676 failed with:

Cluster did not complete upgrade: timed out waiting for the condition: Working towards 0.0.1-2020-02-04-234838: 13% complete

Can't be related to my teardown change.

@openshift-bot (Contributor)

/retest

Please review the full test history for this PR and help us cut down flakes.

12 similar comments

@openshift-merge-robot merged commit 552f107 into openshift:master on Feb 5, 2020
@openshift-ci-robot (Contributor)

@wking: All pull requests linked via external trackers have merged. Bugzilla bug 1791400 has been moved to the MODIFIED state.


In response to this:

Bug 1791400: cmd/openshift-install/destroy: Remove terraform.tfstate in 'destroy cluster'

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@openshift-ci-robot (Contributor)

@wking: The following tests failed, say /retest to rerun all failed tests:

Test name                       Commit                                      Rerun command
ci/prow/e2e-aws-disruptive      06a836c0546e1bc038e59892ece8accc46e42b09    /test e2e-aws-disruptive
ci/prow/e2e-libvirt             1000fd4                                     /test e2e-libvirt
ci/prow/e2e-aws-scaleup-rhel7   1000fd4                                     /test e2e-aws-scaleup-rhel7
ci/prow/e2e-ovirt               1000fd4                                     /test e2e-ovirt

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@wking deleted the remove-terraform-state-on-destroy branch on February 12, 2020 00:40
Labels

approved: Indicates a PR has been approved by an approver from all required OWNERS files.
bugzilla/valid-bug: Indicates that a referenced Bugzilla bug is valid for the branch this PR is targeting.
lgtm: Indicates that a PR is ready to be merged.
size/S: Denotes a PR that changes 10-29 lines, ignoring generated files.
