Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pga: safe downloads #69

Merged
merged 3 commits into from
Jul 20, 2018
Merged

pga: safe downloads #69

merged 3 commits into from
Jul 20, 2018

Conversation

smola
Copy link
Contributor

@smola smola commented Jun 15, 2018

  • pga: download files to temporary file first, fixes Make PGA downloader safe to cancel #39
  • make get/list cancellable: Control+C will make pga to shutdown gracefully and clean up after itself, leaving no traces of temporary downloads

@smola smola requested a review from campoy June 15, 2018 09:57
@smola
Copy link
Contributor Author

smola commented Jun 15, 2018

Note that I have not tested this with HDFS yet.
@bzz If you happen to use pga for a new download, could you test from this branch?

@bzz
Copy link
Contributor

bzz commented Jun 15, 2018

Sounds awesome, going to start full download using this branch's pga get to the same cluster and then let's compare results after a single run and report the results here on Mon 18.06.2018

@bzz
Copy link
Contributor

bzz commented Jun 18, 2018

High-level overview

hdfs dfs -du -h hdfs://hdfs-namenode/pga2
2.5 T  hdfs://hdfs-namenode/pga2/siva

root@hadoop-alex:/# hdfs dfs -du -h hdfs://hdfs-namenode/pga
2.4 T  hdfs://hdfs-namenode/pga/siva

😕 and seems like a pod has been re-started so there are no logs ....

@smola number of files seems to be different, when using this branch

$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga2 | grep -c "\.siva$"
239733

$ hdfs dfs -ls -R hdfs://hdfs-namenode/pga/ | grep -c "\.siva$"
239807

but number of .siva file in index is

zgrep -o "[0-9a-z]*\.siva" ~/.pga/latest.csv.gz | sort | uniq | wc -l
239807

@bzz
Copy link
Contributor

bzz commented Jun 18, 2018

So, the second run using this branch resulted in:

  • overall bigger results size in FS
  • 74 .siva files missing and were not downloaded

Copy link
Contributor

@campoy campoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm concerned with the way you're handling the errors from the deferred functions.

Seems a bit too "smart" for me.

@@ -97,3 +143,9 @@ func getIndex() (io.ReadCloser, error) {
}
return gzip.NewReader(f)
}

func checkClose(name string, c io.Closer, err *error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure about this piece of code, tbh.
This is changing the value of the error in an almost magical way.

I understand why you're doing this, but I'd rather check have slightly more verbose code than this subtle interaction.

}

func cancelableCopy(ctx context.Context, dst io.Writer, src io.Reader) (
int64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

put all in a line

default:
}

w, err := io.CopyN(dst, src, 512*1024)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why 512*1024? should this be a named constant?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, that was an arbitrary number for testing. I'll change it.

@smola
Copy link
Contributor Author

smola commented Jun 19, 2018

@campoy Done.

#69 (comment)

The CheckClose idiom is something we use frequently in our codebase to ensure we always handle close errors properly. Handling it explicitly can be quite cumbersome and error prone. Anyway, for this case, since there's not so many calls, I've changed to explicit handling.

const copyBufferSize = 512 * 1024

func cancelableCopy(ctx context.Context, dst io.Writer, src io.Reader) (int64, error) {

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

drop blank line


const copyBufferSize = 512 * 1024

func cancelableCopy(ctx context.Context, dst io.Writer, src io.Reader) (int64, error) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm almost tempted to propose this function to Go's stdlib.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That would be great, but I doubt it gets accepted, since nothing in io or io/ioutil is cancellable. Also, there are prominent voices against that usage: Context isn’t for cancellation.

Copy link
Contributor

@campoy campoy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Small detail, but otherwise LGTM

if err := wc.Close(); err != nil {
return fmt.Errorf("could not close %s: %v", dest.Abs(name), err)

if err := rc.Close(); err != nil {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd say that failing to close a reader is not important, so it's ok to simply do defer rc.Close(),
but I'll let you decide on this.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That depends on the reader implementation. Some of them signal unexpected EOF on Close. You could argue that it is their fault to not use UnexpectedEOF or some other errors on Read though...

Control+C will make pga to shutdown gracefully and clean up after itself, leaving no traces of temporary downloads.

Signed-off-by: Santiago M. Mola <[email protected]>
Signed-off-by: Santiago M. Mola <[email protected]>
@smola
Copy link
Contributor Author

smola commented Jul 20, 2018

Merging, since we already verified that with this PR we avoid corrupt files.

@smola smola merged commit ce68993 into src-d:master Jul 20, 2018
@smola smola deleted the safe-downloads branch July 20, 2018 09:36
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make PGA downloader safe to cancel
3 participants