BackupController: do as much as possible #250
Conversation
Force-pushed from 6740070 to db22f6a
@skriss PTAL at these changes and let me know if you think this is minimally sufficient. If so, I'll fix the tests.
Changes LGTM. Two peripheral things to think about: I think the backup service's ...
Is it worth uploading the metadata for Failed backups? That would mean they would sync from object storage to kube.
I'm OK putting this PR in knowing that yours will take care of that.
I don't see any downside to uploading the metadata - do you? I think it makes sense. I guess the alternative could be to put failed backup logs in a special top-level dir that we ignore when listing backups.
I'll go ahead and upload it.
Force-pushed from db22f6a to 8ef213a
I'll give it a shot this afternoon.
When running a backup, try to do as much as possible, collecting errors along the way, and return an aggregate at the end. This way, if a backup fails for most reasons, we'll be able to upload the backup log file to object storage, which wasn't happening before. Signed-off-by: Andy Goldstein <[email protected]>
Force-pushed from 8ef213a to 1e581f1
e2e test on my local machine passed. Relevant console output snipped:
The branch of the e2e tests I ran didn't grab the backup logs, but I can look at adding that tomorrow.
// upload tar file
if err := br.seekAndPutObject(bucket, getBackupContentsKey(backupName), backup); err != nil {
	// try to delete the metadata file since the data upload failed
	deleteErr := br.objectStore.DeleteObject(bucket, metadataKey)
Probably makes sense to leave the metadata file here, right? For the same reason that we're uploading it for failed backups.
Yeah, I was planning to ask you about that. I'm happy to remove this code!
How do we handle the situation where the backup completed successfully, we were able to upload the metadata, but uploading the tarball failed for some reason? What you'd see is a completed backup, with logs, but no ability to restore it... TODO for later, or fix now?
Wouldn't you see Failed on the API object? This func would return an error to the controller, then runBackup would return an error to processBackup, which would log it and mark the backup as failed. If that's true, I think it's still not ideal, but reasonably obvious, so no further changes would be needed for now.
In my scenario, the JSON file in object storage has Completed.
Right, but the backup object in etcd has Failed, right? So ark backup get would show Failed?
I'd have to go back through the various places where status is changed to failed to confirm. Also, if you were to sync from object storage into a new kube cluster, it would come in as completed...
True. I'm not sure - what do you think makes sense? We could remove everything from object storage on error, or re-upload the metadata with a Failed status, or...
thoughts on where to leave this for now?
Forgot about this thread. Let me page it back in and think about it.
	metadata:      newStringReadSeeker("foo"),
	metadataError: errors.New("md"),
	log:           newStringReadSeeker("baz"),
	expectedErr:   "md",
},
{
	name: "error on data upload deletes metadata",
see prev comment re: whether to delete metadata file in this case
Don't merge yet... with one small change locally, this is what you see in a backup's log if you have a plugin that returns an error on every single item in the backup: [snip]
[snip]
Some things to point out:
Do you think it makes sense to log the succeeded/failed status after each custom action runs? How do you think we should handle the error that the item backupper returns when a custom action fails? Should we include more details (group, resource, namespace, name, plugin name)? Just print out the description so we don't show the RPC details? What is the best way to show the errors to users so they're easy to find and discern?
Signed-off-by: Andy Goldstein <[email protected]>
I'd say it makes sense to immediately log on an error (at the error level). I think logging success might make the log too noisy.
I think it makes sense to include more context in the errors (we can Wrapf the error from the custom action with the additional fields). I'm not overly concerned about the RPC details as long as we include the additional context.
Agreed.
If a backup item action errors, log the error as soon as it occurs, so it's clear when the error happened. Also include information about the groupResource, namespace, and name of the item in the error. Signed-off-by: Andy Goldstein <[email protected]>
LGTM.