-
Notifications
You must be signed in to change notification settings - Fork 738
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use LevelHistogram
in PageIndex
#6135
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @etseidl I think this looks like an improvment
I had one comment about potentially not using LevelHistogram
in the ColumnIndexBuilder
but I think we could address that in a follow on PR (or never)
@@ -569,7 +569,7 @@ pub struct ColumnChunkMetaData { | |||
/// For example, `vec[0]` is the number of rows with level 0, `vec[1]` is the | |||
/// number of rows with level 1, and so on. | |||
/// | |||
#[derive(Debug, Clone, PartialEq)] | |||
#[derive(Debug, Clone, PartialEq, Eq, Hash, Default)] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
parquet/src/file/metadata/mod.rs
Outdated
@@ -620,6 +620,12 @@ impl LevelHistogram { | |||
} | |||
} | |||
|
|||
/// Appends values from the other histogram to this histogram |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found the "append" notion quite confusing which is why I didn't put it in this struct at first
Basically without append we have the invariant that histogram[i]
represents the counts of level i
Using append breaks that invariant. If we are going to keep it, maybe we can add some more comments like
/// Appends values from the other histogram to this histogram | |
/// Appends values from the other histogram to this histogram | |
/// | |
/// For example, given `[1,2,3]` and `[4,5]` calling append will result | |
/// in `[1,2,3,4,5]` (not `[5,7,3]`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You're right, appending to a histogram makes it no longer a histogram :P I'll revert this part.
parquet/src/file/metadata/mod.rs
Outdated
@@ -1169,9 +1175,9 @@ pub struct ColumnIndexBuilder { | |||
null_counts: Vec<i64>, | |||
boundary_order: BoundaryOrder, | |||
/// contains the concatenation of the histograms of all pages | |||
repetition_level_histograms: Option<Vec<i64>>, | |||
repetition_level_histograms: Option<LevelHistogram>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because ColumnIndexBuilder
is a builder for the format::metadata
structs (aka it is basically building up the serialized form of several LevelHistograms
, and needs append) I would find it clearer to use Vec<i64>
here
@@ -40,13 +41,13 @@ pub struct PageIndex<T> { | |||
/// | |||
/// `repetition_level_histogram[i]` is a count of how many values are at repetition level `i`. | |||
/// For example, `repetition_level_histogram[0]` indicates how many rows the page contains. | |||
pub repetition_level_histogram: Option<Vec<i64>>, | |||
pub repetition_level_histogram: Option<LevelHistogram>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 this is a good one -- I missed this
LevelHistogram
in PageIndex
and ColumnIndexBuilder
LevelHistogram
in PageIndex
Thanks @alamb, you are correct here. I was too fixated on which side of the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks again @etseidl
Which issue does this PR close?
Closes #6134.
Rationale for this change
Makes all histograms outside the
format
module share the same interface.What changes are included in this PR?
Changes remaining
Vec<i64>
histograms toLevelHistogram
.Are there any user-facing changes?
Yes, this contains API changes.