
Conversation

@rdettai (Contributor) commented Oct 26, 2020

PR for https://issues.apache.org/jira/browse/ARROW-10387

When reading a Parquet file, you first need to read its footer and metadata. If you only have a "read from start" capability, this means you need the size of the file in order to read relative to the end. On some storage systems, getting the size metadata can be expensive (e.g. an extra HTTP call on blob storage).
The proposal is to add a "read from end" capability to the ChunkReader trait, as most storage systems offer this feature (local file systems as well as blob storage).
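In rough terms, the idea is to let the caller position a chunk relative to either end of the file. A minimal sketch of what that could look like (the variant names follow the ChunkMode enum from this PR's diff, but this exact get_read signature is an assumption, not the PR's):

```rust
use std::io::Read;

/// Where a chunk read starts. FromStart/FromEnd are the variant names
/// used in this PR; see the review thread below.
pub enum ChunkMode {
    /// Byte offset from the beginning of the file.
    FromStart(u64),
    /// Byte offset counted backwards from the end of the file.
    FromEnd(u64),
}

/// Assumed shape of the extended trait: instead of always taking an
/// absolute start offset (which requires knowing the file size before
/// the footer can even be located), the caller positions the read
/// relative to either end of the file.
pub trait ChunkReader {
    type T: Read;
    fn get_read(&self, mode: ChunkMode, length: usize) -> std::io::Result<Self::T>;
}
```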

@rdettai (Contributor, Author) commented Oct 26, 2020

@alamb has raised doubts in #8300 about the necessity of this, as the file size could be known from another source (a list operation or a metadata catalog). I still want to propose this for discussion, as I believe it is more flexible than the current interface.

There is one catch I encountered when implementing the S3 source, though: the HTTP Range header (https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Range) allows you to read from the end, but not from the end with a given offset. So implementing the ChunkReader as it is specified here forces you to request more bytes than required (up until the end) and stop reading the stream once you have the data you need.
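For reference, the two Range forms HTTP actually supports can be illustrated like this (hypothetical helpers, not code from this PR):

```rust
/// Range for an explicit byte window measured from the start:
/// "bytes=start-end" (both ends inclusive).
fn range_from_start(start: u64, length: u64) -> String {
    format!("bytes={}-{}", start, start + length - 1)
}

/// Suffix range for the last `length` bytes: "bytes=-N".
/// There is no "last N bytes except the final M" form, so a read
/// positioned from the end with an inner offset has to over-fetch up
/// to the end of the object and discard the unwanted tail.
fn range_from_end(length: u64) -> String {
    format!("bytes=-{}", length)
}
```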

@alamb (Contributor) left a comment


I read the code carefully and it looks correct and like an improvement to me. 👍

let mut first_end_cursor = Cursor::new(first_len_end_buf);
let metadata_read: Box<dyn Read>;
if footer_metadata_len > file_size as usize {
if first_end_len < DEFAULT_FOOTER_READ_SIZE
Contributor:

To check my understanding -- this branch is checking whether the total file was smaller than DEFAULT_FOOTER_READ_SIZE (because it was all read in a single chunk) and the metadata size claims to be larger than this chunk?

Contributor Author:

Exactly. I should add a comment to make this explicit.
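For instance, the comment could spell the check out along these lines (a hypothetical helper based on the reading above, with the error message taken from this PR's diff; not the PR's final wording):

```rust
/// Hypothetical helper illustrating the check: when the first
/// read-from-end returned fewer bytes than DEFAULT_FOOTER_READ_SIZE,
/// the whole file was consumed in one chunk and its exact size is
/// known, so a footer claiming more metadata than the file holds is
/// necessarily corrupt.
fn check_footer_fits(footer_metadata_len: usize, file_size: u64) -> Result<(), String> {
    if footer_metadata_len > file_size as usize {
        // Error message as in this PR's diff.
        return Err("Invalid Parquet file. Metadata size exceeds file size.".to_string());
    }
    Ok(())
}
```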

use crate::column::reader::ColumnReaderImpl;

/// Parquet files must be read from end for the footer then from start for columns
pub enum ChunkMode {
Contributor:

One suggestion I have is to call this ReadFrom rather than ChunkMode, to make the name more specific.

Alternatively, given its similarity, I wonder if it would make sense to use std::io::SeekFrom here https://doc.rust-lang.org/std/io/enum.SeekFrom.html rather than a custom enum

Contributor Author:

std::io::SeekFrom is similar, but it also has the SeekFrom::Current variant, which we are not interested in.
About the name, I agree that ChunkMode is not ideal, but I wanted it to relate explicitly to ChunkReader. What about the more verbose but probably also more explicit ReadChunkFrom?

Contributor:

I think ReadChunkFrom is reasonable. Or maybe ChunkFrom?

Contributor Author:

good idea

@andygrove requested a review from sunchao October 29, 2020 17:13
}

impl ChunkMode {
/// FromStart offset can always be computed if you know the length
Member:

Please update this to be a proper doc comment for the method, e.g., what should the caller pass for len?

Contributor Author:

good idea
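As an illustration of what such a doc comment could spell out, here is a hedged sketch building on the ChunkMode sketch above (the method name start_offset is hypothetical; the PR's actual method may differ):

```rust
impl ChunkMode {
    /// Convert this position into an absolute offset from the start of
    /// the file.
    ///
    /// `len` must be the total length of the underlying file in bytes;
    /// once the length is known, a FromEnd offset can always be
    /// rewritten as a FromStart one.
    pub fn start_offset(&self, len: u64) -> u64 {
        match self {
            ChunkMode::FromStart(offset) => *offset,
            // saturating_sub guards against a FromEnd offset that is
            // larger than the file itself.
            ChunkMode::FromEnd(offset) => len.saturating_sub(*offset),
        }
    }
}
```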

.get_read(chunk_reader.len() - default_end_len as u64, default_end_len)?;
let mut default_len_end_buf = vec![0; default_end_len];
default_end_reader.read_exact(&mut default_len_end_buf)?;
let mut first_len_end_buf = vec![0; first_end_len];
Member:

nit: perhaps pick a better name; first_len_end_buf is confusing.

metadata_read = Box::new(first_end_cursor);
} else {
// the end of file read by default is not long enough, read missing bytes
let complementary_end_len = FOOTER_SIZE + metadata_len as usize - first_end_len;
Member:

nit: reuse footer_metadata_len

Contributor Author:

good idea

file_size as i64 - footer_metadata_len as i64
"Invalid Parquet file. Metadata size exceeds file size."
));
} else if footer_metadata_len < DEFAULT_FOOTER_READ_SIZE {
Member:

should this be <=?

let file_size = chunk_reader.len();
if file_size < (FOOTER_SIZE as u64) {
// read and cache up to DEFAULT_FOOTER_READ_SIZE bytes from the end and process the footer
let mut first_end_read = chunk_reader.get_read(
Member:

It's weird that we are handling the "chunked read" logic in the footer reader - should this be inside the ChunkReader implementation? E.g., the reader itself should do lazy loading on the input stream based on how the application decides to seek and read the data.

Contributor Author:

That's a good point. I copied this logic from the C++ implementation, but it's not great in terms of separation of concerns. If I understand your point:

  • here we should request a reader with get_read(ChunkMode::FromEnd(FOOTER_SIZE))
  • it should be up to the ChunkReader implementation to buffer extra bytes if it finds that FOOTER_SIZE is too small and it is cheap for it to get more bytes (for instance, the S3 ChunkReader would fetch 16kB instead) - see the sketch below
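A rough sketch of that separation of concerns, with the over-fetching hidden inside a hypothetical S3 reader (all names below are illustrative, not from this PR; the footer parser would simply call get_read_from_end(FOOTER_SIZE)):

```rust
use std::io::Cursor;

// How much an S3-backed reader might speculatively fetch per request:
// one slightly larger GET is cheaper than a second round trip when the
// footer turns out to need more than the requested 8 bytes.
const S3_MIN_FETCH: usize = 16 * 1024;

struct S3ChunkReader {/* bucket, key, client ... */}

impl S3ChunkReader {
    /// The footer parser only ever asks for the last `length` bytes;
    /// the reader decides on its own to fetch a larger suffix and
    /// serve the request from that buffer.
    fn get_read_from_end(&self, length: usize) -> std::io::Result<Cursor<Vec<u8>>> {
        let fetch_len = length.max(S3_MIN_FETCH);
        // One GET with a suffix range, "Range: bytes=-{fetch_len}".
        let buf = self.fetch_suffix(fetch_len)?;
        // Hand back only the tail the caller asked for; a real
        // implementation would keep the extra bytes around for the
        // metadata read that follows.
        let start = buf.len().saturating_sub(length);
        Ok(Cursor::new(buf[start..].to_vec()))
    }

    fn fetch_suffix(&self, _suffix_len: usize) -> std::io::Result<Vec<u8>> {
        unimplemented!("issue the ranged GET against S3 here")
    }
}
```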

Member:

Yes, exactly. I feel the footer reader should not be aware of how the input stream is processed, and the logic can vary depending on the remote storage, so DEFAULT_FOOTER_READ_SIZE may not fit all cases.

{
return Err(Error::new(ErrorKind::InvalidInput, "out of bound"));
/// Create a slice cursor backed by the same data as a current one.
/// If the slice length is larger than the remaining bytes in the source, the slice is clamped.
Member:

Can we replace "clamp" with something plainer, like "shorten"? As a non-native English speaker, I had to check a dictionary for the meaning of this word :)

Contributor Author:

I like the word but you are right 😄

@rdettai (Contributor, Author) commented Nov 1, 2020

Thanks for the review @sunchao, but before fixing these, I would like your point of view on the general value of this change. Do you think it is really worth adding this extra complexity to make it possible to read a file whose length we don't know, or should we stick with the simpler get_read(&self, start: u64, length: usize)? I think this interface is very important, so I really don't want to make it complicated for the sake of it 😄

@sunchao (Member) commented Nov 2, 2020

@rdettai I don't know enough about object storage to comment on the necessity of this change, but yes, for HDFS getting the file length is cheap. Also, it seems we are still going to read from the end of the file, so the length info is still required? How is ChunkMode going to be used without from_start?

@rdettai (Contributor, Author) commented Nov 3, 2020

@sunchao with most object storage, you select the bytes you want to read with the HTTP Range header, which can read from the end. You can use this to implement ChunkMode::FromEnd without knowing the length of the file.

Getting the length is expensive because it adds an extra GET request, and I guess that with HDFS it also implies a network round trip, which is not free. But as @alamb mentioned earlier, you will often have the length at hand beforehand, because you get it at the same time as you list your files/objects.

I would be in favor of stalling this PR until at least one other person expresses interest in the ability to read without knowing the length of the file.

Meanwhile, one of your comments on the PR made me think, and it's true that there is some buffering logic managed in the footer parser that should be left to the ChunkReader implementation. I'll see if I can find an interface that better separates these concerns.

@nevi-me added the needs-rebase label (A PR that needs to be rebased by the author) Nov 7, 2020
@rdettai (Contributor, Author) commented Dec 2, 2020

Closing because nobody showed interest in this feature

@rdettai closed this Dec 2, 2020

Labels

Component: Parquet · Component: Rust · needs-rebase (A PR that needs to be rebased by the author)