Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use LevelHistogram in PageIndex #6135

Merged
merged 2 commits into from
Jul 29, 2024
Merged

Conversation

etseidl
Copy link
Contributor

@etseidl etseidl commented Jul 26, 2024

Which issue does this PR close?

Closes #6134.

Rationale for this change

Makes all histograms outside the format module share the same interface.

What changes are included in this PR?

Changes remaining Vec<i64> histograms to LevelHistogram.

Are there any user-facing changes?

Yes, this contains API changes.

@github-actions github-actions bot added the parquet Changes to the parquet crate label Jul 26, 2024
Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @etseidl I think this looks like an improvment

I had one comment about potentially not using LevelHistogram in the ColumnIndexBuilder but I think we could address that in a follow on PR (or never)

@@ -569,7 +569,7 @@ pub struct ColumnChunkMetaData {
/// For example, `vec[0]` is the number of rows with level 0, `vec[1]` is the
/// number of rows with level 1, and so on.
///
#[derive(Debug, Clone, PartialEq)]
#[derive(Debug, Clone, PartialEq, Eq, Hash, Default)]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

@@ -620,6 +620,12 @@ impl LevelHistogram {
}
}

/// Appends values from the other histogram to this histogram
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I found the "append" notion quite confusing which is why I didn't put it in this struct at first

Basically without append we have the invariant that histogram[i] represents the counts of level i

Using append breaks that invariant. If we are going to keep it, maybe we can add some more comments like

Suggested change
/// Appends values from the other histogram to this histogram
/// Appends values from the other histogram to this histogram
///
/// For example, given `[1,2,3]` and `[4,5]` calling append will result
/// in `[1,2,3,4,5]` (not `[5,7,3]`)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right, appending to a histogram makes it no longer a histogram :P I'll revert this part.

@@ -1169,9 +1175,9 @@ pub struct ColumnIndexBuilder {
null_counts: Vec<i64>,
boundary_order: BoundaryOrder,
/// contains the concatenation of the histograms of all pages
repetition_level_histograms: Option<Vec<i64>>,
repetition_level_histograms: Option<LevelHistogram>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

because ColumnIndexBuilder is a builder for the format::metadata structs (aka it is basically building up the serialized form of several LevelHistograms, and needs append) I would find it clearer to use Vec<i64> here

@@ -40,13 +41,13 @@ pub struct PageIndex<T> {
///
/// `repetition_level_histogram[i]` is a count of how many values are at repetition level `i`.
/// For example, `repetition_level_histogram[0]` indicates how many rows the page contains.
pub repetition_level_histogram: Option<Vec<i64>>,
pub repetition_level_histogram: Option<LevelHistogram>,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 this is a good one -- I missed this

@etseidl etseidl changed the title Use LevelHistogram in PageIndex and ColumnIndexBuilder Use LevelHistogram in PageIndex Jul 27, 2024
@etseidl
Copy link
Contributor Author

etseidl commented Jul 27, 2024

I had one comment about potentially not using LevelHistogram in the ColumnIndexBuilder but I think we could address that in a follow on PR (or never)

Thanks @alamb, you are correct here. I was too fixated on which side of the format <--> metadata divide the vectors were on. I've reverted the changes to ColumnIndexBuilder.

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again @etseidl

@alamb alamb merged commit 80ed712 into apache:master Jul 29, 2024
18 checks passed
@etseidl etseidl deleted the page_index_histograms branch August 8, 2024 17:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
parquet Changes to the parquet crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Use LevelHistogram throughout Parquet metadata
2 participants