-
Notifications
You must be signed in to change notification settings - Fork 4k
ARROW-9280: [Rust] [Parquet] Calculate page and column statistics #7586
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
zeevm
commented
Jun 30, 2020
- Calculate page and column statistics
- Use pre-calculated statistics when available to speed-up when writing data from other formats like ORC.
|
Thanks for opening a pull request! Could you open an issue for this pull request on JIRA? Then could you also rename pull request title in the following format? See also: |
|
Hi @zeevm, may you please kindly rebase (to fix the Rust failures) and open a JIRA for this PR |
Use pre-calculated statistics when available
f6f96e6 to
45293d6
Compare
|
EDIT: Looks like the failures are from parquet. I had seen 4 failures when skimming through, and assumed it was the ones that I fixed, but this is not the case. I think the failures are related to your changes Thanks @zeevm, the remaining failure can be fixed by running |
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @zeevm - left some comments.
| values: &[T::T], | ||
| def_levels: Option<&[i16]>, | ||
| rep_levels: Option<&[i16]>, | ||
| min: &Option<T::T>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that among these 4 parameters, ppl only provide, say, nulls_count but leave the rest as None? will this result to partial stats and yield to issues when compute engines want to rely on them? If so do we want to enforce that either all of these 4 are None OR all of these are Some?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC the format specifies that the various stats are optional so it seems reasonable to allow the caller to specify only some of the values isn't it?
| min_page_value: None, | ||
| max_page_value: None, | ||
| num_page_nulls: 0, | ||
| page_distinct_count: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Seems this is not used at all?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These are used to track page level stats and write those stats when writing a page and to update the column level stats when writing the page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. But it is not updated nor used at all. Can you double check?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are you specifically referring to 'page_distinct_count', or all 4 page level vars?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All of them are used, page_distinct_count isn't being calculated in this PR though, probably a following PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They're used here:
flush_data_pages()
make_typed_statistics()
update_page_min_max()
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I meant page_distinct_count. It is fine to do this in a follow upo.
…y spec https://issues.apache.org/jira/browse/ARROW-7285 Closes apache#7544 from liyafan82/fly_0619_ipc Lead-authored-by: liyafan82 <[email protected]> Co-authored-by: Antoine Pitrou <[email protected]> Signed-off-by: Antoine Pitrou <[email protected]>
…7347) This commit moves all Netty specific calls into a few classes. This is the precursor to splitting the netty and unsafe allocators out to their own modules
sunchao
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
| min_page_value: None, | ||
| max_page_value: None, | ||
| num_page_nulls: 0, | ||
| page_distinct_count: None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes I meant page_distinct_count. It is fine to do this in a follow upo.
Use pre-calculated statistics when available
…rrow into write_parquet_statistics
|
@zeevm once approved, a committer will help merge this. Seems the PR now is a little messed up, can you clean it up so I can merge it? |