Save logs per backup #40
Conversation
EDIT: now using pre-signed URLs from object storage instead of a web server
@jrnt30 this isn't the cleanest, but I've tested that it works at least for minio. I need to test aws/azure/gcp, but that'll have to wait for the weekend or next week. The way the gcp library does pre-signed URLs is a bit odd b/c it requires the service account email and private key directly, unlike anywhere else we talk to gcp. I need to look around and see if there's a better way, or if switching to the newer/higher-level gcp api makes more sense. tl;dr: still a WIP; needs more unit tests, more godoc, some cleanup, and manual e2e testing.
Thanks, I'll see what merging that into the download backup looks like and, if time permits, look at some AWS testing.
So I took a bit of a look into this. For AWS specifically, there is a max TTL of one week for presigned requests: http://docs.aws.amazon.com/AmazonS3/latest/API/sigv4-query-string-auth.html. I haven't looked at the other providers yet, but I imagine they will be similar. Depending on how we want to deal with TTLs, we may need to be more explicit about the URLs and go the BackupDownloadRequest route, but that does complicate the interaction with the CLI in general, since we then have a resource that needs to propagate before we can actually pull the signed URL. Another possibility is to add additional TTL attributes (similar to the expiry) for the presigned links themselves and then augment the gc collector (or create a new LinkSync controller) to be responsible for upserting that information. There may be a few other options (like a child CRD of the Backup with its own refresh lifecycle, defaulting to the min of the configured TTL and the provider's limit), but I would need to give that some more thought. I will work on updating the Download command based on what is here, as I think it will be fairly transparent whichever way we go.
@jrnt30 I have decided to go with a DownloadRequest.
@jrnt30 now with DownloadRequest. PTAL. I've gotta go for now, but I'll keep working on it tonight/tomorrow.
pkg/cmd/cli/backup/logs.go (outdated)

```go
listOptions := metav1.ListOptions{
	// TODO: once kube-apiserver http://issue.k8s.io/51046 is fixed, uncomment
	// FieldSelector: "metadata.name=" + args[0],
```
This should probably be dr.ObjectMeta.Name in the comment, since you are appending the timestamp.
Yes, thanks. I also need to make sure the name of the DownloadRequest received from w.ResultChan matches that too.
Thanks for the updates. I have started an integration with what you have, with a bit of refactoring to extract some very common things, which you can see at https://github.com/ncdc/ark/compare/backup-logs...jrnt30:backup-download-request?expand=1. It's rough at this point, and I just ran into the issue with Minio not being supported again due to AZ validation, so I am throwing in the towel for tonight. In general I think this should work pretty well and be extensible. There were a few small test/build things I had to fix, but I'm sure you would have found those as well.
Force-pushed from 820c8fd to a7d66d2.
@jrnt30 the latest code now extracts the download request streaming into a helper function (I put it in pkg/cmd/util/downloadrequest because we're also going to use it for restore logs), and I've fixed the test/compile issues. @skriss is wrapping up #43, which will fix the minio issue. I've tested this locally with minio (commenting out the AZ validation) and on AWS. Next up, I'll be testing GCP. @skriss is going to test Azure for us.
👍 This looks good; the util w/ test is a superior solution. I did a local rebase and can work in these changes. I'm going to hold off until it stabilizes a bit, as the force pushes make the rebase particularly annoying.
Force-pushed from d49aea0 to 4262e42.
examples/gcp/00-ark-config.yaml (outdated)

```yaml
metadata:
  name: default
persistentVolumeProvider:
  gcp:
    credentialsFile: /credentials/cloud
```
is this actually needed for the PVProvider?
Yes and no. The PVProvider doesn't use it directly; it needs to have GOOGLE_APPLICATION_CREDENTIALS set to this path. I feel like this is getting a bit messy... In my branch now, you could have your backup provider be something other than GCE, but if you want to handle PVs, then you have to specify the credentialsFile for the gcp PVProvider, because the code looks for that and sets the env var for you (so you don't have to put it in the deployment.yaml).
I feel like this is getting messy and I'm getting more and more in favor of having a config that looks like this, although I know @jbeda doesn't like it:
```yaml
clouds:
  myCoolGCP:
    credentialsFile: /credentials/cloud
    project: foo
    zone: us-east1-b
  someOtherCloud:
    someDetails: goHere
persistentVolumeProvider:
  cloud: myCoolGCP
backupStorageProvider:
  cloud: someOtherCloud
  bucket: mybucket
```
This way, we load each cloud provider one time, instead of twice like we're doing now...
Let's discuss when you're in.
pkg/apis/ark/v1/download_request.go (outdated)

```go
import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

type DownloadRequestSpec struct {
```
this file needs godoc
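The review ask here just means standard doc comments on every exported identifier. A sketch of what the commented types might look like (the field names and shapes are guesses from this thread, not the actual file; DownloadTargetKindBackupLog does appear in the CLI code below):

```go
package main

import "fmt"

// DownloadTargetKind identifies the kind of artifact a DownloadRequest
// points at, e.g. the log file for a backup.
type DownloadTargetKind string

// DownloadTargetKindBackupLog requests the log file for a backup.
const DownloadTargetKindBackupLog DownloadTargetKind = "BackupLog"

// DownloadTarget names the object-storage artifact being requested.
type DownloadTarget struct {
	// Kind is the type of artifact to download.
	Kind DownloadTargetKind `json:"kind"`
	// Name is the name of the Ark resource the artifact belongs to.
	Name string `json:"name"`
}

// DownloadRequestSpec is the specification for a DownloadRequest.
type DownloadRequestSpec struct {
	// Target is the artifact for which a pre-signed URL is requested.
	Target DownloadTarget `json:"target"`
}

func main() {
	spec := DownloadRequestSpec{Target: DownloadTarget{Kind: DownloadTargetKindBackupLog, Name: "my-backup"}}
	fmt.Println(spec.Target.Kind, spec.Target.Name) // BackupLog my-backup
}
```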
```go
	"reflect"
	"sync"
	"time"

	"golang.org/x/oauth2/google"
```
remove blank line
I'm going to change my stance on imports. I don't want us to spend time thinking about them. If the code compiles, that's good enough for me, regardless of order, whitespace, grouping, etc.
```go
	return c.downloadRequestClient.DownloadRequests(downloadRequest.Namespace).Delete(downloadRequest.Name, nil)
}

func (c *downloadRequestController) resync() {
```
It's slightly confusing that the only way items with Phase = Processed actually get processed is that they're added to the queue in the resync. WDYT about not filtering them out in the informer AddFunc and having an UpdateFunc? Else, it's probably at least worth a comment somewhere.
I did it this way because I intend for the ark client to manage the lifespan of a DownloadRequest:

- client creates DownloadRequest
- client waits for it to be Processed
- controller processes it
- client sees updated DownloadRequest and downloads the URL
- client deletes the DownloadRequest

The resync is a failsafe that is meant to clean up any DownloadRequests that the client failed to delete. I'm happy to put in a comment to this effect. Does this make sense?
Signed-off-by: Andy Goldstein <[email protected]>
Switch to dep. Update the following: - azure-sdk-for-go to 10.2.1-beta - go-autorest to 8.1.1 - client-go to 4.0.0 Signed-off-by: Andy Goldstein <[email protected]>
In general LGTM. A few logging things we can probably remove.
I have a few general questions that are more for educating myself. If this isn't a good forum for that, I can take it elsewhere next time.
pkg/backup/backup.go (outdated)

```go
func (kb *kubernetesBackupper) Backup(backup *api.Backup, data io.Writer) error {
	gzw := gzip.NewWriter(data)
	defer gzw.Close()
func (kb *kubernetesBackupper) Backup(backup *api.Backup, data, log io.Writer) error {
```
Slightly picky, but personally, seeing log makes me assume it's actually the vehicle this func should be using to log its output. Took me a few passes to figure it out via the context. Perhaps results?
I don't really like "results". We could call them "outputFile" and "logFile" if that helps (even though they're io.Writers)?
```go
}

func (l *logger) log(msg string, args ...interface{}) {
	// TODO use a real logger that supports writing to files
```
Since the logger is provided an io.Writer, is this TODO still valid?
Yes, because we'll be switching to logrus for 0.5.0
pkg/backup/backup_test.go (outdated)

```go
@@ -439,7 +444,8 @@ func TestBackupMethod(t *testing.T) {
	require.NoError(t, err)

	output := new(bytes.Buffer)
	err = backupper.Backup(backup, output)
	log := new(bytes.Buffer)
```
More for my own education, but is this somewhere we could have used ioutil.Discard as well?
Yeah that's better. I'll switch.
```go
@@ -240,7 +281,7 @@ func (kb *kubernetesBackupper) backupResource(
	} else {
		other = appsDeploymentsResource
	}
	glog.V(4).Infof("Skipping resource %q because it's a duplicate of %q", grString, other)
	ctx.log("Skipping resource %q because it's a duplicate of %q", grString, other)
```
There is a mixture of V(2) and V(4) mappings here. Are we looking to just simplify logging and err toward providing more verbose logging?
We plan to revisit log levels as part of the switch to logrus (another motivating factor).
pkg/cloudprovider/backup_service.go (outdated)

```go
	metadataFileFormatString string = "%s/ark-backup.json"
	backupFileFormatString   string = "%s/%s.tar.gz"
	metadataFileFormatString = "%s/ark-backup.json"
	backupFileFormatString   = "%s/%s.tar.gz"
```
Since both of these are actually tar.gz, should we distinguish via the base filename? %s/%s-data.tar.gz & %s/%s-logs.tar.gz?
We could do that. I'm also planning on putting the restore log files in here, probably restore-<restore name>-log.gz.
pkg/cloudprovider/backup_service.go (outdated)

```go
	var errs []error
	for _, key := range objects {
		glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
		fmt.Printf("Trying to delete bucket=%s, key=%s\n", bucket, key)
```
The fmt.Printf should be removed.
thx
pkg/cloudprovider/backup_service.go (outdated)

```go
	for _, key := range objects {
		glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
		fmt.Printf("Trying to delete bucket=%s, key=%s\n", bucket, key)
		if err := br.objectStorage.DeleteObject(bucket, key); err != nil {
```
Would it be worth adding a DeleteDir to the ObjectStorage interface? I know for AWS at least you can add the recursive option so you don't have N+1 calls. I suppose at this point it's such a small number of nested files that we may not have an issue.
I didn't see a recursive option for aws - where is that?
Looking closer, this didn't make the cut for the GO SDK version and the others are doing something similar already... this is not relevant, sorry
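Without a provider-side recursive delete, "delete a backup dir" ends up as list-then-delete with error collection, roughly the shape of the loop in the diff above. A stdlib-only sketch with a fake in-memory store (all names illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// objectStorage is a minimal stand-in for the ObjectStorage interface
// used in backup_service.go.
type objectStorage interface {
	ListObjects(bucket, prefix string) ([]string, error)
	DeleteObject(bucket, key string) error
}

// deleteBackupDir lists every object under the backup's prefix and deletes
// them one by one, collecting errors instead of stopping at the first failure.
func deleteBackupDir(store objectStorage, bucket, backupName string) []error {
	keys, err := store.ListObjects(bucket, backupName+"/")
	if err != nil {
		return []error{err}
	}
	var errs []error
	for _, key := range keys {
		if err := store.DeleteObject(bucket, key); err != nil {
			errs = append(errs, fmt.Errorf("delete %s/%s: %v", bucket, key, err))
		}
	}
	return errs
}

// memStore is an in-memory fake for demonstration.
type memStore map[string]struct{}

func (m memStore) ListObjects(bucket, prefix string) ([]string, error) {
	var keys []string
	for k := range m {
		if strings.HasPrefix(k, prefix) {
			keys = append(keys, k)
		}
	}
	return keys, nil
}

func (m memStore) DeleteObject(bucket, key string) error {
	delete(m, key)
	return nil
}

func main() {
	store := memStore{
		"nginx/ark-backup.json":   {},
		"nginx/nginx.tar.gz":      {},
		"nginx/nginx-logs.tar.gz": {},
	}
	errs := deleteBackupDir(store, "mybucket", "nginx")
	fmt.Println(len(errs), len(store)) // 0 0
}
```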
```go
@@ -84,6 +92,36 @@ func (op *objectStorageAdapter) ListCommonPrefixes(bucket string, delimiter stri
	return ret, nil
}

func (op *objectStorageAdapter) ListObjects(bucket, prefix string) ([]string, error) {
	res, err := op.gcs.Objects.List(bucket).Prefix(prefix).Do()
```
It looks like the max results here is 1k, which is the limit we added paging support to S3 to get past. Do we want to implement something here? Azure looks to have a default of 5k. Is there a threshold where we just say "N is good enough"?
I'd like to support paging everywhere we can
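The paging shape is the same across providers: keep feeding the returned page token back until it comes back empty (S3's continuation token, GCS's nextPageToken, Azure's marker). A provider-agnostic sketch of that loop (all names illustrative):

```go
package main

import "fmt"

// page is one provider response: some keys plus an opaque token for the
// next page ("" means done), mirroring S3 continuation tokens and GCS's
// nextPageToken.
type page struct {
	keys []string
	next string
}

// listAll drives the paging loop: keep requesting pages until the provider
// stops returning a token.
func listAll(fetch func(token string) page) []string {
	var all []string
	token := ""
	for {
		p := fetch(token)
		all = append(all, p.keys...)
		if p.next == "" {
			return all
		}
		token = p.next
	}
}

func main() {
	// Fake provider with two pages of results.
	pages := map[string]page{
		"":   {keys: []string{"a", "b"}, next: "t1"},
		"t1": {keys: []string{"c"}, next: ""},
	}
	keys := listAll(func(tok string) page { return pages[tok] })
	fmt.Println(keys) // [a b c]
}
```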
```go
	arkClient, err := f.Client()
	cmd.CheckError(err)

	err = downloadrequest.Stream(arkClient.ArkV1(), args[0], v1.DownloadTargetKindBackupLog, os.Stdout, timeout)
```
Very useful abstraction, thanks!
```go
	return nil
}

const signedURLTTL = 10 * time.Minute
```
One general question I have, which I can test if you aren't sure: if the TTL expires on the actual signed URL in the middle of an in-progress download (slow connection, large download, etc.), will it proceed or be terminated?
I don't know
LGTM!
```go
	key := fmt.Sprintf(backupFileFormatString, backupName, backupName)
	glog.V(4).Infof("Trying to delete bucket=%s, key=%s", bucket, key)
	return br.objectStorage.DeleteObject(bucket, key)

func (br *backupService) DeleteBackupDir(bucket, backupName string) error {
```
We want to delete metadata file last here (and only if all other deletes succeed), right?
I've waffled on this, trying to think about what happens in different scenarios. If we manage to delete the data tarball but fail to delete a logfile, we're left with a logfile and the metadata file. You can't restore, but you could view the logs (not super useful). I'm wondering if it's worth the effort to special case the metadata file?
The thinking around deleting the metadata file last was to not orphan any Ark-created object-storage files (or snapshots). As it is now, we could theoretically fail to delete both the tarball and log file, but delete the metadata file, leaving garbage around with no refs to it. I don't think it's a huge deal because the likelihood of this scenario is pretty small IMHO.
Maybe for GC we also need to call ListCommonPrefixes and try to delete anything in object storage where the metadata file is missing?
let's add an issue and keep moving.
Filed #77
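The metadata-last ordering discussed here amounts to: delete everything except the metadata file, then delete the metadata file only if nothing else failed, so a partially-deleted backup always keeps the ref that GC can find. A stdlib-only sketch (names illustrative, not the actual DeleteBackupDir):

```go
package main

import (
	"fmt"
	"strings"
)

const metadataFile = "ark-backup.json"

// deleteBackupDirMetadataLast deletes every object under the backup prefix
// except the metadata file, and removes the metadata file only if everything
// else was deleted successfully.
func deleteBackupDirMetadataLast(keys []string, del func(key string) error) []error {
	var errs []error
	metadataKey := ""
	for _, key := range keys {
		if strings.HasSuffix(key, "/"+metadataFile) {
			metadataKey = key
			continue
		}
		if err := del(key); err != nil {
			errs = append(errs, err)
		}
	}
	if len(errs) == 0 && metadataKey != "" {
		if err := del(metadataKey); err != nil {
			errs = append(errs, err)
		}
	}
	return errs
}

func main() {
	var order []string
	del := func(key string) error { order = append(order, key); return nil }
	keys := []string{"nginx/ark-backup.json", "nginx/nginx.tar.gz", "nginx/nginx-logs.tar.gz"}
	deleteBackupDirMetadataLast(keys, del)
	fmt.Println(order[len(order)-1]) // nginx/ark-backup.json
}
```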
```go
	}
	defer resp.Body.Close()

	if resp.StatusCode != http.StatusOK {
```
should we log the status code here as well?
I don't think it's necessary. Then we'd print out something like "request failed (401 Unauthorized): AccessDenied". WDYT?
OK, I'm fine with it as-is.
@ncdc finished reviewing; the object storage file deletion order/dependency is the only material question I have.
```go
@@ -90,12 +91,13 @@ func NewBlockStorageAdapter(location string, apiTimeout time.Duration) (cloudpro
	disksClient := disk.NewDisksClient(cfg[azureSubscriptionIDKey])
	snapsClient := disk.NewSnapshotsClient(cfg[azureSubscriptionIDKey])

	disksClient.Authorizer = spt
	snapsClient.Authorizer = spt
	authorizer := autorest.NewBearerAuthorizer(&spt.Token)
```
This should just be authorizer := autorest.NewBearerAuthorizer(spt) -- hit this bug when testing Azure.
thanks
Is it just this 1 line change? Any other Azure changes?
tested backup with snapshots, and GC of said backup, with this change -- all looks good
Due to a change in glog, if you don't call flag.Parse, every log line prints out 'ERROR: logging before flag.Parse'. This works around that change. Ultimately we need to switch to a different logging library. Signed-off-by: Andy Goldstein <[email protected]>
Delete all objects in backup "dir" when deleting a backup, instead of hard-coding individual file names/types. This way, we'll be able to delete log files and anything else we add without having to update our deletion code. Signed-off-by: Andy Goldstein <[email protected]>
Update nmap-ncat line to clean up; necessary in some envs
Save logs per backup in object storage. Add `ark backup logs` command for retrieval. Fixes #39

TODO: dep