BackupController: do as much as possible #250
Conversation
Force-pushed from 6740070 to db22f6a
@skriss PTAL at these changes and let me know if you think this is minimally sufficient. If so, I'll fix the tests.
Changes LGTM. Two peripheral things to think about: I think the backup service's ...
Is it worth uploading the metadata for Failed backups? That would mean they would sync from object storage to kube.
I'm OK putting this PR in knowing that yours will take care of that.
I don't see any downside to uploading the metadata - do you? I think it makes sense. I guess the alternative could be to put failed backup logs in a special top-level dir that we ignore when listing backups.
I'll go ahead and upload it.
Force-pushed from db22f6a to 8ef213a
I'll give it a shot this afternoon.
When running a backup, try to do as much as possible, collecting errors along the way, and return an aggregate at the end. This way, if a backup fails for most reasons, we'll be able to upload the backup log file to object storage, which wasn't happening before. Signed-off-by: Andy Goldstein <[email protected]>
Force-pushed from 8ef213a to 1e581f1
e2e test on my local machine passed. Relevant console output snipped:
The branch of the e2e tests I ran didn't grab the backup logs, but I can look at adding that tomorrow.
// upload tar file
if err := br.seekAndPutObject(bucket, getBackupContentsKey(backupName), backup); err != nil {
	// try to delete the metadata file since the data upload failed
	deleteErr := br.objectStore.DeleteObject(bucket, metadataKey)
Probably makes sense to leave the metadata file here, right? For the same reason that we're uploading it for failed backups.
Yeah, I was planning to ask you about that. I'm happy to remove this code!
How do we handle the situation where the backup completed successfully, we were able to upload the metadata, but uploading the tarball failed for some reason? What you'd see is a completed backup, with logs, but no ability to restore it... TODO for later, or fix now?
Wouldn't you see Failed on the API object? This func would return an error to the controller, then runBackup would return an error to processBackup, which would log it and mark the backup as failed. If that's true, I think it's still not ideal, but reasonably obvious, so no further changes would be needed for now.
In my scenario, the JSON file in object storage has Completed.
Right, but the backup object in etcd has Failed, right? So ark backup get would show Failed?
I'd have to go back through the various places where status is changed to failed to confirm. Also, if you were to sync from object storage into a new kube cluster, it would come in as completed...
True. I'm not sure - what do you think makes sense? We could remove everything from object storage on error, or re-upload the metadata with a Failed status, or...
thoughts on where to leave this for now?
Forgot about this thread. Let me page it back in and think about it.
	metadata:      newStringReadSeeker("foo"),
	metadataError: errors.New("md"),
	log:           newStringReadSeeker("baz"),
	expectedErr:   "md",
},
{
	name: "error on data upload deletes metadata",
see prev comment re: whether to delete metadata file in this case
Don't merge yet... with one small change locally, this is what you see in a backup's log if you have a plugin that returns an error on every single item in the backup: [snip]
[snip]
Some things to point out:
Do you think it makes sense to log the succeeded/failed status after each custom action runs? How do you think we should handle the error that the item backupper returns when a custom action fails? Should we include more details (group, resource, namespace, name, plugin name)? Just print out the description so we don't show the RPC details? What is the best way to show the errors to users so they're easy to find and discern?
Signed-off-by: Andy Goldstein <[email protected]>
I'd say it makes sense to immediately log on an error (at the error level). I think logging success might make the log too noisy.
I think it makes sense to include more context in the errors (we can Wrapf the error from the custom action with the additional fields). I'm not overly concerned about the RPC details as long as we include the additional context.
Agreed.
If a backup item action errors, log the error as soon as it occurs, so it's clear when the error happened. Also include information about the groupResource, namespace, and name of the item in the error. Signed-off-by: Andy Goldstein <[email protected]>
LGTM.