ARROW-10387: [Rust][Parquet] Avoid call for file size metadata to read footer #8525
Conversation
@alamb has raised doubts in #8300 about the necessity for this, as the file size could be known from another source (a list operation or a metadata catalog). I still want to propose this for discussion as I believe it is more flexible than the current interface. There is one catch I encountered when implementing the S3 source, though: the HTTP Range header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) lets you read from the end, but not from the end with a given offset. So implementing the ChunkReader as it is specified here forces you to request more bytes than required (up until the end) and stop reading the stream once you have the data you need.
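To make the Range header limitation concrete, here is a minimal sketch (plain string formatting only, no HTTP client; the function names are illustrative, not from the PR). HTTP supports an absolute range and a suffix range, but there is no "from the end, with an offset" form, which is what reading a footer at a known distance from EOF would need when the file size is unknown:

```rust
/// Build a Range header value for an absolute byte range [start, start + len).
/// HTTP byte ranges are inclusive on both ends.
fn range_absolute(start: u64, len: u64) -> String {
    format!("bytes={}-{}", start, start + len - 1)
}

/// Build a suffix Range header value: the last `len` bytes of the resource.
fn range_suffix(len: u64) -> String {
    format!("bytes=-{}", len)
}

fn main() {
    // Reading 8 footer bytes when the 1000-byte file size is known:
    assert_eq!(range_absolute(992, 8), "bytes=992-999");
    // Without the size, only a suffix read is possible, so a reader must
    // over-fetch up to EOF and discard whatever it does not need:
    assert_eq!(range_suffix(8), "bytes=-8");
}
```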
alamb left a comment
I read the code carefully and it looks correct and like an improvement to me. 👍
```rust
let mut first_end_cursor = Cursor::new(first_len_end_buf);
let metadata_read: Box<dyn Read>;
if footer_metadata_len > file_size as usize {
if first_end_len < DEFAULT_FOOTER_READ_SIZE
```
to check my understanding -- this branch is checking if the total file was smaller than DEFAULT_FOOTER_READ_SIZE (b/c it was all read in a single chunk) and the metadata size claims to be larger than this chunk
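The check being discussed can be sketched in isolation (the constant values and variable names below are assumptions for illustration, not the PR's actual code): if the initial read already reached the start of the file, the file size equals the bytes read, so metadata claiming to be larger than that is invalid.

```rust
// 4-byte metadata length + 4-byte "PAR1" magic at the end of a Parquet file.
const FOOTER_SIZE: usize = 8;
// Assumed default chunk size for the speculative read from the end.
const DEFAULT_FOOTER_READ_SIZE: usize = 64 * 1024;

/// Returns true when the metadata cannot fit in the chunk that was read,
/// even though that chunk already covered the whole file.
fn metadata_claims_too_large(first_end_len: usize, metadata_len: usize) -> bool {
    let footer_metadata_len = FOOTER_SIZE + metadata_len;
    // first_end_len < DEFAULT_FOOTER_READ_SIZE means the initial read
    // reached the start of the file, so the file size is first_end_len.
    first_end_len < DEFAULT_FOOTER_READ_SIZE && footer_metadata_len > first_end_len
}

fn main() {
    // A 100-byte file whose footer claims 200 bytes of metadata is corrupt:
    assert!(metadata_claims_too_large(100, 200));
    // A 100-byte file with 50 bytes of metadata is consistent:
    assert!(!metadata_claims_too_large(100, 50));
}
```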
Exactly, I should add a comment to make this explicit.
```rust
use crate::column::reader::ColumnReaderImpl;

/// Parquet files must be read from end for the footer then from start for columns
pub enum ChunkMode {
```
One suggestion I have is to call this ReadFrom rather than ChunkMode, to make the name more specific.
Alternatively, given its similarity, I wonder if it would make sense to use std::io::SeekFrom (https://doc.rust-lang.org/std/io/enum.SeekFrom.html) here rather than a custom enum.
std::io::SeekFrom is similar, but it also has the SeekFrom::Current variant, which we are not interested in.
About the name, I agree that ChunkMode is not ideal, but I wanted it to explicitly relate to ChunkReader. What about the more verbose but probably also more explicit ReadChunkFrom?
I think ReadChunkFrom is reasonable. Or maybe ChunkFrom?
good idea
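The enum under discussion might look like the sketch below (name and shape are assumptions drawn from this thread, not the merged code). Unlike std::io::SeekFrom, it has no Current variant, since a ChunkReader has no notion of a current position:

```rust
use std::io::SeekFrom;

/// Where a chunk read starts: an offset from the start or from the end.
pub enum ReadChunkFrom {
    Start(u64),
    End(u64),
}

impl ReadChunkFrom {
    /// An offset from the end can always be turned into an offset from the
    /// start once the total length is known.
    pub fn into_seek(self, len: u64) -> SeekFrom {
        match self {
            ReadChunkFrom::Start(offset) => SeekFrom::Start(offset),
            ReadChunkFrom::End(offset) => SeekFrom::Start(len.saturating_sub(offset)),
        }
    }
}

fn main() {
    // Reading the 8-byte footer of a 1000-byte file starts at offset 992:
    assert!(matches!(
        ReadChunkFrom::End(8).into_seek(1000),
        SeekFrom::Start(992)
    ));
}
```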
```rust
}

impl ChunkMode {
    /// FromStart offset can always be computed if you know the length
```
Please update this to make it a proper doc comment for the method, e.g., what should the caller pass for len?
good idea
```rust
.get_read(chunk_reader.len() - default_end_len as u64, default_end_len)?;
let mut default_len_end_buf = vec![0; default_end_len];
default_end_reader.read_exact(&mut default_len_end_buf)?;
let mut first_len_end_buf = vec![0; first_end_len];
```
nit: perhaps pick a better name; first_len_end_buf is confusing.
```rust
metadata_read = Box::new(first_end_cursor);
} else {
    // the end of file read by default is not long enough, read missing bytes
    let complementary_end_len = FOOTER_SIZE + metadata_len as usize - first_end_len;
```
nit: reuse footer_metadata_len
good idea
```rust
file_size as i64 - footer_metadata_len as i64
    "Invalid Parquet file. Metadata size exceeds file size."
));
} else if footer_metadata_len < DEFAULT_FOOTER_READ_SIZE {
```
should this be <=?
```rust
let file_size = chunk_reader.len();
if file_size < (FOOTER_SIZE as u64) {
// read and cache up to DEFAULT_FOOTER_READ_SIZE bytes from the end and process the footer
let mut first_end_read = chunk_reader.get_read(
```
It's weird that we are handling the "chunked read" logic in the footer reader. Should this be inside the ChunkReader implementation? E.g., the reader itself should do lazy loading on the input stream based on how the application decides to seek and read the data.
That's a good point. I copied this logic from the C++ implementation, but it's not great in terms of separation of concerns. If I understand your point:
- here we should request a reader with get_read(ChunkMode::FromEnd(FOOTER_SIZE))
- it should be up to the ChunkReader implementation to buffer extra bytes if it finds that FOOTER_SIZE is too small and it is cheap for it to get more bytes (the S3 ChunkReader, for instance, will get 16 kB instead)
Yes, exactly. I feel the footer reader should not be aware of how the input stream is processed. Also, the logic can vary depending on the remote storage, so DEFAULT_FOOTER_READ_SIZE may not fit all cases.
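The separation of concerns proposed in this exchange could look roughly like the sketch below (the trait shape, names, and constants are assumptions for illustration, not the crate's actual API): the footer parser only asks for the last FOOTER_SIZE bytes, and each ChunkReader implementation decides how much to actually fetch.

```rust
use std::io::Read;

const FOOTER_SIZE: u64 = 8;

/// Hypothetical trait shape from the discussion: the caller states what it
/// needs, counted from the end, and the reader is free to over-fetch.
trait ChunkReader {
    fn get_read(&self, from_end: u64) -> Box<dyn Read + '_>;
}

/// An in-memory stand-in for a remote reader that, like the proposed S3
/// reader, buffers more bytes than requested because a larger fetch is cheap.
struct BufferedReader {
    data: Vec<u8>,
    min_fetch: u64, // e.g. 16 kB for S3
}

impl ChunkReader for BufferedReader {
    fn get_read(&self, from_end: u64) -> Box<dyn Read + '_> {
        // Fetch at least min_fetch bytes, but never more than the file holds.
        let fetch = from_end.max(self.min_fetch).min(self.data.len() as u64);
        let start = self.data.len() - fetch as usize;
        Box::new(&self.data[start..])
    }
}

fn main() {
    let reader = BufferedReader { data: vec![0u8; 100], min_fetch: 64 };
    let mut buf = Vec::new();
    let mut read = reader.get_read(FOOTER_SIZE);
    read.read_to_end(&mut buf).unwrap();
    // Only 8 bytes were requested, but the reader chose to return 64.
    assert_eq!(buf.len(), 64);
}
```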
```rust
{
    return Err(Error::new(ErrorKind::InvalidInput, "out of bound"));
/// Create a slice cursor backed by the same data as a current one.
/// If the slice length is larger than the remaining bytes in the source, the slice is clamped.
```
Can we replace "clamp" with something plainer, like "shorten"? As a non-native English speaker, I had to check a dictionary for the meaning of this word :)
I like the word but you are right 😄
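The behavior the doc comment describes can be sketched with a standalone cursor helper (the function name is hypothetical; the actual method lives elsewhere in the crate): the requested length is shortened when it exceeds the bytes remaining in the source.

```rust
use std::io::{Cursor, Read};

/// Create a sub-slice cursor over the same data, starting at the current
/// position. If `len` exceeds the remaining bytes, the slice is shortened.
fn slice_cursor(cursor: &Cursor<Vec<u8>>, len: u64) -> Cursor<&[u8]> {
    let pos = cursor.position() as usize;
    let data = cursor.get_ref();
    let end = (pos + len as usize).min(data.len()); // shorten if out of bounds
    Cursor::new(&data[pos..end])
}

fn main() {
    let cursor = Cursor::new(vec![1u8, 2, 3, 4, 5]);
    // Requesting 10 bytes from a 5-byte source yields only 5 bytes:
    let mut buf = Vec::new();
    slice_cursor(&cursor, 10).read_to_end(&mut buf).unwrap();
    assert_eq!(buf, vec![1, 2, 3, 4, 5]);
}
```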
Thanks for the review @sunchao, but before fixing these, I would like your point of view on the general value of this. Do you think it is really worth adding this extra complexity to make it possible to read a file whose length we don't know, or should we stick with a simpler …
@rdettai I don't know enough about object storage to comment on the necessity of this change, but yeah, for HDFS getting the file length is cheap. Also, it seems we are still going to read from the end of the file, so the length info is still required? How …
@sunchao With most object storage, you select the bytes you want to read with the HTTP Range header, which can read from the end. You can use this to implement … Getting the length is expensive as it adds an extra GET request, and I guess that with HDFS it also implies a network round trip, which is not free. But as @alamb mentioned earlier, you will often have the length around beforehand because you get it at the same time as you list your files/objects. I would be in favor of stalling this PR until at least one other person expresses interest in the ability to read without knowing the length of the file. Meanwhile, one of your comments on the PR made me think, and it's true that there is some buffering logic managed in the footer parser that should be left to the …
Closing because nobody showed interest in this feature.
PR for https://issues.apache.org/jira/browse/ARROW-10387