clean up azure temporary managed os disk by alexeldeib · Pull Request #10713 · hashicorp/packer

alexeldeib · 2021-03-02T21:05:18Z

related to #9559

Closes #10268
Closes #10686
Closes #10416
Closes #9615

testing was slightly awkward. I tried re-introducing the code from #9559, but it seems because the VM cleanup doesn't happen until the DeployTemplate clean up step, we need to do it ~around the same time as deleting disks (after the VM has been deleted). I also tried running it as a separate step after Additional Disk cleanup, but that seemed to move error handling that should belong in the core builder?

this is necessary any time you create a VM with managed disk without creating it separately. It won't be deployed as part of the template, but WILL be created as a dependency of the VM, so you have to delete it after creating the VM.

we don't need this when the disk is in a temporary RG, since we delete the whole RG.

this is probably also necessary for VHD builds, actually? but i'm less sure about that. I'll test one out when I have a chance.

codecov · 2021-03-02T21:13:24Z

Codecov Report

Merging #10713 (a2957d2) into master (6ff6916) will increase coverage by 0.00%.
The diff coverage is 16.66%.

Impacted Files	Coverage Δ
builder/azure/arm/step_deploy_template.go	`16.07% <16.66%> (+5.60%)`	⬆️

alexeldeib · 2021-03-03T19:58:49Z

cc @nywilken -- looks like you were working on these bits recently :)

nywilken

@alexeldeib apologies for the delay here, and thank you for taking on this fix. I did touch the file last and have had some trouble figuring out the right path. So happy to see new eyes on the issue.

I gave your changes a test and they work as expected. I found it a little confusing at first to see us having to handle the deletion of the OS in the StepDeleteAdditionalDisk. But your comments in the description provided some insight.

While testing I ran into an issue with one of my templates and ultimately found that I made a mistake in #9559. I left a comment on the PR. After my findings I made a slight change to the original code to see if it would solve the problem as well and I think it does I would like to get your thoughts on it. I think we can improve things further in your PR to avoid potential bugs with the call to StepDeployTemplate#Cleanup.

Here is a gist with my changes https://gist.github.com/nywilken/de296af15f55fe3c8c2267735120321b

Look forward to reading your feedback.

builder/azure/arm/step_delete_additional_disks.go

nywilken · 2021-03-04T20:54:55Z

builder/azure/arm/step_delete_additional_disks.go


-	if isManagedDisk && !isExistingResourceGroup {
-		s.say(fmt.Sprintf(" -> Additional Disk : skipping, managed disk was used..."))
+	if isManagedDisk {


I think part of this confusion around where we can delete this disk is that I moved the call to NewStepDeleteAddtionalDisk into the StepDeployTemplate#Cleanup but in reality it should be its own thing.

While testing this today with a Windows OS build, non-managed, I was getting failures due to NewStepDeleteAddtionalDisk being called twice. After further investigation I found that for Windows builds NewStepDeployTemplate happens twice: one for the key vault deployment, and one for the virtual machine deployment. This issue only occurs for VHD builds.

Your solution for deleting the OS disk works, but still ends in a failed build because of the double call. I know what I'm pointing out is a different issue than the one you are fixing, but I believe they are related and if we move NewStepDeleteAddtionalDisk out of StepDeployTemplate#Cleanup then we can use your logic for deleting the OS disk or even the original code where we capture the os type and uri to be used for deletion in the cleanup function.

https://github.com/hashicorp/packer/pull/9559/files#diff-75e07109af7dc9db6f9e087b685e273810d56c6c48f04c153c2300e2d17c094eR134-R144

I moved NewStepDeleteAddtionalDisk into the Deployment cleanup step because it was previously being called after the Publish to SIG step, and if that failed the deletion would not happen. So my thought was to ensure the cleanup always happens where they are created - in StepDeployTemplate.

Here's a gist of the template I am using.

Ah, nice! That takes care of the ordering issue between the cleanup steps very cleanly. I'll incorporate your patch and give it a try on my end 👍

Co-authored-by: Wilken Rivera <dev@wilkenrivera.com>

nywilken · 2021-03-09T14:44:16Z

Hi @alexeldeib, did the changes work as expected for you?

I see you applied the path I sent your way. I posted a link to the binaries with the potential fix on the reporting issues for additional testing. If the patch is working as expected I think we can clean up the tests and see about adding an acceptance test that validates that all resources have been deleted. Let me know what you think.

alexeldeib · 2021-03-09T18:22:48Z

Sorry for the delay! Yep, things are working well for me now. I still need to update the tests and then I think we will be good to go?

alexeldeib · 2021-03-11T02:01:07Z

I added a few unit tests, is something like that what you had in mind? I hit some null refs in the test setup as well, so tweaked the step a bit further.

see about adding an acceptance test that validates that all resources have been deleted

are there some other e2e/integration tests you were thinking about? or more sophisticated mocking/resource tracking here? I saw https://github.com/hashicorp/packer/tree/master/test but those looked fairly minimal (albeit missing an azure version). What are you thinking 🙂 ?

From the report in the other issue, we still should wait a bit anyway to validate this does what's expected. It seems to for me, and I tested both linux/win and temp/pre-existing resource group before the last commit.

alexeldeib · 2021-03-11T23:34:14Z

there was a bit of a gotcha. we skip cleaning up the template resources if packer created the resource group, because we'll delete the resource group. however using unmanaged disks creates a copy not in the temporary RG, but in the storage account for the images.

this makes it difficult to run as a separate step, but we could unconditionally always clean up the individual resources in packer? It shouldn't actually be much worse than what Azure does internally for RG delete -- fire off deletes with retries on each of the individual resources, basically ignoring any dependencies. Then when we actually do the RG delete in the last step, it should be nearly immediate.

This should have the added benefit of reducing the number of times the RG delete times out, which I see surprisingly often.

SwampDragons · 2021-03-16T16:30:44Z

This should have the added benefit of reducing the number of times the RG delete times out, which I see surprisingly often.

That sounds like a big win

nywilken

@alexeldeib awesome work. Thank you for going so deep and for the extra info on the temp resource group. I left a small suggestion and a question. But this is otherwise good and it appears to have resolved the open issues.

Re:testing I was thinking of adding an acceptance test like

packer/builder/azure/arm/builder_acc_test.go

Line 49 in 0b5f890

    
           func TestBuilderAcc_ManagedDisk_Windows_Build_Resource_Group_Additional_Disk(t *testing.T) {

with a Check that we could use to validate that all of the resources for a deployment, including the OS disk, have been deleted. But I dont know how feasible that is since some of the items are deleted async. I also don't have insight on the effort to track all the items created and store them so that we can compare in the check.

builder/azure/arm/step_deploy_template.go

nywilken · 2021-03-16T17:12:49Z

builder/azure/arm/step_deploy_template.go

 	step.delete = step.deleteDeploymentResources
 	step.disk = step.getImageDetails
 	step.deleteDisk = step.deleteImage
+	step.deleteDeployment = step.deleteDeploymentObject


Nice add. The name provides more details for sure.

builder/azure/arm/step_deploy_template_test.go

Co-authored-by: Wilken Rivera <dev@wilkenrivera.com>

alexeldeib · 2021-03-17T20:59:15Z

@nywilken what do you think of the approach in this commit for acceptance checks?

It's not pretty, and pulls out some of the private internals of the builder, but it seems to work okay. I wanted to see if this makes sense before adding anything else (or maybe just leave it here for this PR).

I didn't see a good way to track resources deployed in the template and verify their deletion at the end, but we generally have enough information between the test case itself, and the builder internals to handle manually validating that XYZ resources were deleted. The main ones I think worth checking would be removing the VHD from the storage account, since it's not in the same resource group, and doing proper cleanup when using a pre-existing RG. I added the former, but not the latter.

…ace/removeOsDisk

builder/azure/arm/builder_acc_test.go

alexeldeib · 2021-03-17T21:43:38Z

(also happy to rebase and clean up the history before merging, just wanted to give you a chance to review the diffs)

alexeldeib · 2021-03-18T02:12:23Z

builder/azure/arm/step_deploy_template.go

+	}
+
+	var wg sync.WaitGroup
+	wg.Add(len(resources))


soo funny story. without the synchronization, Azure was returning the deployed resources in an order that happened to Just Work(TM) for deletion, incidentally ordering dependencies nicely.

Adding this basically replicates what Azure does internally, but it does make you more likely to see some transient deletion failures during cleanup...I hope that's an acceptable compromise? The alternative would be sorting by resource type to make a semi-DAG of dependencies, which feels like a bit bigger change

how it looks right now with retries on failures

(Incidentally, that means this logic was always racy and potentially broken, but somehow the likelihood of triggering it in a broken way was fairly low when doing serial deletion)

nywilken

One extra quote, but otherwise good to go. If you can fix and rebase I will merge immediately.

--- PASS: TestBuilderAcc_Blob_Linux (340.37s)
--- PASS: TestBuilderAcc_Blob_Windows (476.38s)
PASS
ok      github.com/hashicorp/packer/builder/azure/arm   476.398s

Thanks again for the great work here.

nywilken · 2021-03-18T01:56:49Z

builder/azure/arm/builder_acc_test.go


 	  "storage_account": "{{user ` + "`storage_account`" + `}}",
-	  "resource_group_name": "packer-acceptance-test",
+	  "resource_group_name": "packer-acceptance-test"",


One extra quote

Suggested change

"resource_group_name": "packer-acceptance-test"",

"resource_group_name": "packer-acceptance-test",

alexeldeib · 2021-03-18T02:31:31Z

sorry for all the last minute changes -- I think I'm actually done 😄 that bit with the deletion ordering threw me for a loop.

nywilken · 2021-03-18T16:09:44Z

Results with the latest changes 👍 Thanks again for such a great and detailed PR.

PASS
ok      github.com/hashicorp/packer/builder/azure/arm   2682.121s

ghost · 2021-04-18T01:50:33Z

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

clean up temporary managed os disk

4388695

alexeldeib requested a review from paulmey as a code owner March 2, 2021 21:05

CecileRobertMichon mentioned this pull request Mar 3, 2021

Update Packer to 1.6.6 deis/docker-go-dev#263

Closed

nywilken reviewed Mar 4, 2021

View reviewed changes

alexeldeib and others added 3 commits March 4, 2021 13:56

improve message for skipping disk deletion

ab88422

Co-authored-by: Wilken Rivera <dev@wilkenrivera.com>

re-arrange osdisk/additional disk cleanup

6ef856a

Co-authored-by: Wilken Rivera <dev@wilkenrivera.com>

remove additional disk cleanup

3847c86

This was referenced Mar 5, 2021

packer doesn't clean up disk after failure #10268

Closed

[azure-arm] OS disk preserves after packer build in versions 1.6.2 and above #10416

Closed

Packer build not deleting OS disks in Azure after completing the build. #10686

Closed

add some validation on scenarios

d93d0ac

alexeldeib added 2 commits March 11, 2021 17:30

alway clean up resources inside template cleanup

a007898

tidy to master

0e769ca

nywilken approved these changes Mar 16, 2021

View reviewed changes

CecileRobertMichon mentioned this pull request Mar 16, 2021

Updates packer patch version. kubernetes-sigs/image-builder#548

Merged

alexeldeib and others added 2 commits March 16, 2021 18:33

clarify naming and comments

c385f24

Co-authored-by: Wilken Rivera <dev@wilkenrivera.com>

test: add acceptance testing for azure cleanup

a96d4d4

Merge branch 'ace/removeOsDisk' of github.com:alexeldeib/packer into …

ac10575

…ace/removeOsDisk

alexeldeib commented Mar 17, 2021

View reviewed changes

builder/azure/arm/builder_acc_test.go Outdated Show resolved Hide resolved

alexeldeib added 2 commits March 17, 2021 14:22

revert template changes

74f6e1f

fix err check

bae16d9

delete resources in parallel with retry to avoid serial deletion

1b2c36a

alexeldeib commented Mar 18, 2021

View reviewed changes

nywilken approved these changes Mar 18, 2021

View reviewed changes

alexeldeib added 3 commits March 17, 2021 19:24

nit: improve log message for transient deletion errors

bc39c32

fix: typo

2f72108

remove unused func to make lint happy

a2957d2

nywilken merged commit 3227d3d into hashicorp:master Mar 18, 2021

ghost locked as resolved and limited conversation to collaborators Apr 18, 2021

	"resource_group_name": "packer-acceptance-test"",
	"resource_group_name": "packer-acceptance-test",

Conversation

alexeldeib commented Mar 2, 2021 • edited by SwampDragons Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

codecov bot commented Mar 2, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

alexeldeib commented Mar 3, 2021

Uh oh!

nywilken left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nywilken Mar 4, 2021

Choose a reason for hiding this comment

Uh oh!

alexeldeib Mar 4, 2021

Choose a reason for hiding this comment

Uh oh!

nywilken commented Mar 9, 2021

Uh oh!

alexeldeib commented Mar 9, 2021

Uh oh!

alexeldeib commented Mar 11, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

alexeldeib commented Mar 11, 2021

Uh oh!

SwampDragons commented Mar 16, 2021

Uh oh!

nywilken left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

nywilken Mar 16, 2021

Choose a reason for hiding this comment

Uh oh!

Uh oh!

alexeldeib commented Mar 17, 2021 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Uh oh!

alexeldeib commented Mar 17, 2021

Uh oh!

alexeldeib Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

alexeldeib Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

alexeldeib Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

nywilken left a comment

Choose a reason for hiding this comment

Uh oh!

nywilken Mar 18, 2021

Choose a reason for hiding this comment

Uh oh!

alexeldeib commented Mar 18, 2021

Uh oh!

nywilken commented Mar 18, 2021

Uh oh!

ghost commented Apr 18, 2021

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

alexeldeib commented Mar 2, 2021 •

edited by SwampDragons

Loading

codecov bot commented Mar 2, 2021 •

edited

Loading

alexeldeib commented Mar 11, 2021 •

edited

Loading

alexeldeib commented Mar 17, 2021 •

edited

Loading