Skip to content

Conversation

@rdettai
Copy link
Contributor

@rdettai rdettai commented Apr 15, 2020

A fix to 7681 that does not use nightly (as oposed to #6280).

@github-actions
Copy link

@rdettai
Copy link
Contributor Author

rdettai commented Apr 16, 2020

I would like to add a test on a parquet file with many lines because my first commit passed all tests but was buggy... Where should I add this file ?

@paddyhoran
Copy link
Contributor

Where should I add this file ?

The short answer is here but you should mention it on the mailing list first. There is reluctance to adding binary data I believe. The preference is to make progress on the writer implementation so that any test files can be generated. There has been some initial progress on this here.

If there is reluctance to add the file you can comment out the test and open up a JIRA to follow up once the writer is implemented I guess.

@paddyhoran
Copy link
Contributor

paddyhoran commented Apr 16, 2020

Changes look reasonable to me BTW, but they need to be reviewed by @sunchao or @sadikovi in general as they are the experts on Parquet.

@rdettai
Copy link
Contributor Author

rdettai commented Apr 16, 2020

In the end I did not go for the end to end test with a Parquet file as this is kind of a more general concern. I added a "byte level" test though to maintain good coverage!

Hope the MacOS CI will not go nuts this time ;-)

Copy link
Member

@sunchao sunchao left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry just noticed this! LGTM with a few nits

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is not super useful as it requires ppl to jump into BufReader to check the documentation - maybe add additional comments on what these two fields are for?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: maybe we can extract this into a util function?

@rdettai
Copy link
Contributor Author

rdettai commented Apr 23, 2020

I'll work on your comments today.

What about the problem we are trying to fix here? Do you agree with the benefits of this fix ?

Also, I'm not sure why a Mutex was used here. Do we agree that Refcell is better as the whole reader is not thread safe, right ?

@sunchao
Copy link
Member

sunchao commented Apr 24, 2020

Yes I think it is beneficial to avoid dropping buffers with seek, although it will be nice if the seek_relative will be stabilized soon so we can just use that.

Also, I'm not sure why a Mutex was used here. Do we agree that Refcell is better as the whole reader is not thread safe, right ?

Originally we designed it this way so that we can concurrently read multiple column chunks after obtaining file handle from a single row group. Since the file handle is shared between these we wanted to provide thread safety on top of it.

@rdettai
Copy link
Contributor Author

rdettai commented Apr 24, 2020

Originally we designed it this way so that we can concurrently read multiple column chunks after obtaining file handle from a single row group. Since the file handle is shared between these we wanted to provide thread safety on top of it.

I'm not sure to understand how this could be the concern of the FileSource. It's the reader (file handle) that is passed to it that should be thread safe, and FileSource should not create the Mutex himself.

@sunchao
Copy link
Member

sunchao commented Apr 24, 2020

It's the reader (file handle) that is passed to it that should be thread safe

Is file thread-safe? it's not obvious when reading the doc. Plus, here type parameterR can be anything that implements ParquetReader (with read, seek, length and try_clone capabilities) so you cannot assume all of them guarantees thread-safety.

@rdettai
Copy link
Contributor Author

rdettai commented Apr 27, 2020

I completely agree ! What I am saying is that the layer that makes the handle thread safe cannot be the FileSource which is in charge of tracking the reading position of one specific reading thread.

But this question of multi-threading seems to be a whole other concern :-) I've made the nit changes you've proposed and improved the tests.

@rdettai rdettai requested a review from sunchao April 27, 2020 08:54
@sunchao sunchao closed this in d094631 Apr 27, 2020
@sunchao
Copy link
Member

sunchao commented Apr 27, 2020

Merged. Thanks @rdettai !

@jorisvandenbossche
Copy link
Member

It seems that linting for rust was failing here? (and now on master)

@nevi-me
Copy link
Contributor

nevi-me commented Apr 28, 2020

It seems that linting for rust was failing here? (and now on master)

I'll fix it shortly @jorisvandenbossche

@sunchao
Copy link
Member

sunchao commented Apr 28, 2020

Oops. Sorry my bad. Thanks @nevi-me !

@rdettai
Copy link
Contributor Author

rdettai commented Apr 30, 2020

My bad, I thought it was an issue with the linting CI !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants