-
Notifications
You must be signed in to change notification settings - Fork 20
Fix file buffer reuse when getting page readers #49
Conversation
|
@sunchao Could you review this PR? I am interested in your thoughts on the issue and if you know any other way of fixing this - my approach requires updating start position for each read, I assume it is not particularly performant. |
|
Looks like issue with compiling thrift. Tests pass locally. |
|
@sunchao Meanwhile, when you have time, could you have look at this pull request? Thanks. |
|
Yes. Will take a look at this soon. Sorry for the delay. |
|
No worries, it is all good. Whenever you have time:) |
|
Thanks @sadikovi ! This is an very interesting finding. The new |
|
Hi! I thought about it, but was not sure if it is okay to have several File::open calls, seek vs having separate handles. I am curious how parquet-cop version works, and whether or not they have the same issue. We could employ the same technique. I can fix it either way. I tend to agree that FileHandle might be a better approach with minimum amount of changes. Otherwise, we would have to open file for every column we read. Again, I will look on how parquet-cpp handles it, and report back. |
|
Parquet-cpp use input stream with start and length tracking. My understanding is that what they do is similar to what this patch is proposing. @sunchao Could you help me to figure out the code below and whether it is the same as what we are doing in this code?:) Thanks! See (file_reader -> properties -> input stream): The only concern I have is whether or not this approach of resetting start position has any performance implications. If not - then it is okay. |
|
Thanks @sadikovi . Yes, I think our approach is very similar to what Parquet-cpp/Arrow does. One thing we should be careful is thread safety between multiple shared |
|
Yes, you are right. I will try updating the code similar to parquet-cpp/arrow for concurrent access. |
|
Will get back to it ASAP - needed for read API! |
|
@sunchao Could you review this PR again? I renamed the struct into I also added couple of tests, but I had to copy |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @sadikovi . This looks good - just a few minor comments.
src/util/io.rs
Outdated
| fn read(&mut self, buf: &mut [u8]) -> Result<usize> { | ||
| // We unwrap() the return value to assert that we are not expecting failures while | ||
| // holding the lock. | ||
| let mut reader = self.reader.lock().unwrap(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: I think this error is still possible to occur - should we convert to a IO Result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, you are right. We should turn it into Result.
src/file/reader.rs
Outdated
| } | ||
|
|
||
| #[test] | ||
| fn test_reuse_file_handle() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we rename this to test_reuse_file_chunk?
src/file/reader.rs
Outdated
|
|
||
| #[test] | ||
| fn test_reuse_file_handle() { | ||
| // This test covers case when initialising column readers and buffer them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: initialising -> initializing
|
@sunchao I addressed your comments. Can you have a look again when you have time? Thanks. |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM!
|
I've merged the change. Thanks @sadikovi ! |
|
@sunchao Thanks a lot for merging! |
This PR fixes an interesting issue with file handle reuse, and results in not being able to buffer values from column readers, unless reading entire column in a row group at a time.
Problem description
Discovered this problem when building a batch of column vectors. The basic idea is that when initialising column readers (page readers), we clone file handles for each page reader (line 212 in
src/file/reader.rs), so they can be traversed independently from each other.Unfortunately,
try_clonehas an interesting side-effect: The returned File is a reference to the same state that this object references. Both handles will read and write with the same cursor position (https://doc.rust-lang.org/std/fs/struct.File.html#method.try_clone). So when updating one, we update the other. More info here: issue 46578 (https://github.com/rust-lang/rust/issues).This bug is observed when reading more than one column. I added a simple test to reproduce the problem (fails on master, passes with this patch). The position will be overwritten and all column readers will be using handle of the last one essentially. This results in different weird bugs, one of them is
underlying Thrift error: end of file.Fix
It looks like standard library does not have any handles that maintain positions separately, unless you want to open file again, which I assume we do not want to. I also considered buffering each column into memory (essentially the whole row group) - copying bytes from a file (col_start, col_length) into some vector of bytes. This also does not seem like a way to go.
So I added a simple struct that takes file handle from
file.try_clone()and maintains start and end of the chunk of the file and updates them as we read it. Every time when we readFileChunk, we update file position on the current file handle position - also addedMutexfor concurrent access.This seems to work and passes unit tests. I also tested manually with different files - all works correctly.