
Write null counts in parquet files when they are present #6257

Open
wants to merge 2 commits into master
Conversation

alamb
Contributor

@alamb alamb commented Aug 15, 2024

Which issue does this PR close?

Closes #6256

Note this PR contains some of the code that was originally in @Michael-J-Ward 's PR #6256

Rationale for this change

See #6256.

Current behavior:

  • parquet-rs writer always has the null count when writing statistics, but writes None to thrift when the null count is zero
  • parquet-rs reader treats a missing null count (None) as Some(0) (aka that it is known there are no nulls)
  • parquet-rs will write negative numbers if the null count or distinct count is greater than what fits in i64 (e.g. u64::MAX) -- this is likely a theoretical concern only

This is inconsistent with the parquet spec as well as with what parquet-java and parquet-cpp do.
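To make the semantic difference concrete, here is a small self-contained sketch of the two reader interpretations of a missing null count. The function names are illustrative, not the parquet-rs API; the `Option<i64>` stands in for the optional null_count field in the thrift statistics.

```rust
/// Old parquet-rs behavior: a missing null count was coerced to 0,
/// i.e. treated as "it is known there are no nulls".
fn old_reader(thrift_null_count: Option<i64>) -> u64 {
    thrift_null_count.unwrap_or(0) as u64
}

/// Spec-conforming behavior: a missing null count stays unknown.
fn new_reader(thrift_null_count: Option<i64>) -> Option<u64> {
    thrift_null_count.map(|v| v as u64)
}

fn main() {
    // A file whose writer omitted the null count entirely:
    assert_eq!(old_reader(None), 0); // old: claims "no nulls"
    assert_eq!(new_reader(None), None); // new: unknown
    // A file that explicitly recorded zero nulls:
    assert_eq!(new_reader(Some(0)), Some(0));
}
```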

What changes are included in this PR?

  • Update the parquet reader/writer to follow the spec
  • Add error checking for values too large to fit into i64
  • Document that older versions of parquet-rs wrote None
  • Add tests

Are there any user-facing changes?

Yes

Changes

  • parquet-rs writer always writes Some(..) to thrift
  • parquet-rs reader returns None (aka it is unknown whether there are nulls) if there is no null count in the thrift
  • parquet-rs writer writes None if the null count / distinct count is too large to fit in i64
  • Documented that older versions of parquet-rs wrote None
  • Changed the StatisticsConverter code to read statistics consistently with older versions of parquet-rs (treat missing null counts as known zero) and added a flag to alter the behavior

This change means the generated parquet files are slightly larger (as now they encode Some(0) for null counts) but the behavior is more correct and consistent.
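As a rough illustration of the user-facing change, downstream code now decides explicitly how to treat an unknown count. The struct below is a simplified stand-in for the parquet Statistics type; only the `null_count_opt` accessor mirrors the name used in this PR's diff.

```rust
// Simplified stand-in for the parquet Statistics type; only the
// Option-returning accessor mirrors the PR, the rest is illustrative.
struct Stats {
    null_count: Option<u64>,
}

impl Stats {
    fn null_count_opt(&self) -> Option<u64> {
        self.null_count
    }
}

fn main() {
    let stats = Stats { null_count: None };
    // Consumers who want the old semantics opt in explicitly:
    let as_zero = stats.null_count_opt().unwrap_or(0);
    assert_eq!(as_zero, 0);
    // Consumers who want the spec semantics keep the Option:
    assert!(stats.null_count_opt().is_none());
}
```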

@github-actions github-actions bot added the parquet Changes to the parquet crate label Aug 15, 2024
@@ -189,7 +189,7 @@ fn test_primitive() {
pages: (0..8)
.map(|_| Page {
rows: 250,
page_header_size: 36,
page_header_size: 38,
Contributor Author

The page headers (and files) are larger now because Some(0) takes more space than None
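A rough back-of-the-envelope sketch of why, assuming thrift's compact protocol: an absent optional field costs zero bytes, while a present i64 field costs a one-byte short-form field header plus a zigzag-varint value. The one-byte header is an assumption about the field layout; the sizes are illustrative, not measured thrift output.

```rust
// Length in bytes of a zigzag-encoded varint, as used by thrift's
// compact protocol for i64 fields.
fn zigzag_varint_len(v: i64) -> usize {
    let mut z = ((v << 1) ^ (v >> 63)) as u64; // zigzag encode
    let mut n = 1;
    while z >= 0x80 {
        z >>= 7;
        n += 1;
    }
    n
}

fn main() {
    // A null count of 0 encodes to a single varint byte; add a one-byte
    // short-form field header and Some(0) costs about 2 bytes more than
    // None, consistent with the 36 -> 38 page header size change here.
    assert_eq!(zigzag_varint_len(0), 1);
    assert_eq!(1 + zigzag_varint_len(0), 2);
}
```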


#[test]
fn test_count_encoding() {
statistics_count_test(None, None);
Contributor Author

This test fails like this without the changes in this PR:

assertion `left == right` failed
  left: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: Some(0), ...
 right: Boolean({min: Some(true), max: Some(false), distinct_count: None, null_count: None, ...

@@ -1195,6 +1197,23 @@ impl<'a> StatisticsConverter<'a> {
self.arrow_field
}

/// Set the statistics converter to treat missing null counts as missing
Contributor Author

By default reading null counts will work with files written with older versions of parquet-rs

Contributor

For a breaking / major release, is it acceptable to include an upgrade instruction of "add this config to maintain the old behavior"?

Contributor Author

Absolutely

It is important to note that the default behavior in this PR is the old behavior (in other words, there should be no changes needed in downstream consumers of this code)

Contributor

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

If "add this config to maintain old behavior" is acceptable for a breaking release, then I would expect the default to be the new behavior.

IOW, I'd expect what you said on the parquet mailing list

Applications that use parquet-rs to read parquet_files and interpret the
null_count will need to be changed after the upgrade to explicitly continue
the old behavior of "treat no null_count as 0" which is also documented
now.

Contributor Author

The default in this PR is missing_null_counts_as_zero = true, which maintains the old behavior, right?

Yes

then I would expect the default to be the new behavior.

My thinking was

Since there are two different apis:

Statistics::null_count would now return Option<..>, so users of the library will need to update their code anyway and can choose at that time which behavior they want

StatisticsConverter's API didn't change and thus it keeps the previous behavior. This is what I would personally want for a system -- no change for reading parquet files that were written with previous versions of the Rust writer.
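A toy model of that converter-side choice. The struct and method here are illustrative stand-ins, not the real StatisticsConverter API; only the flag name `missing_null_counts_as_zero` comes from this thread.

```rust
// Toy model: `missing_null_counts_as_zero` is the flag discussed in this
// thread; everything else is a stand-in for illustration.
struct Converter {
    missing_null_counts_as_zero: bool,
}

impl Converter {
    fn null_count(&self, thrift: Option<i64>) -> Option<u64> {
        match thrift {
            Some(v) => Some(v as u64),
            // Old parquet-rs behavior, kept as the default in this PR:
            None if self.missing_null_counts_as_zero => Some(0),
            // Spec-conforming reading: missing means unknown.
            None => None,
        }
    }
}

fn main() {
    let compat = Converter { missing_null_counts_as_zero: true };
    let strict = Converter { missing_null_counts_as_zero: false };
    assert_eq!(compat.null_count(None), Some(0));
    assert_eq!(strict.null_count(None), None);
    assert_eq!(strict.null_count(Some(3)), Some(3));
}
```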

// Number of nulls recorded, when it is not available, we just mark it as 0.
// TODO this should be `None` if there is no information about NULLS.
// see https://github.com/apache/arrow-rs/pull/6216/files
let null_count = stats.null_count.unwrap_or(0);
Contributor Author

the removal of this unwrap_or is what changes the semantics while reading

let null_count = stats
    .null_count_opt()
    .map(|value| value as i64)
    .filter(|&x| x > 0);
Contributor Author

The removal of this filter is what fixes the statistics while writing

Maybe it was intended to be x >= 0 originally 🤔
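A minimal demonstration of what that filter did on the write path; the free functions here are illustrative, not the writer's actual code structure.

```rust
// With the filter, a known null count of exactly zero was silently
// dropped, so the thrift statistics recorded "unknown" instead of
// "known zero".
fn with_filter(null_count: Option<u64>) -> Option<i64> {
    null_count.map(|v| v as i64).filter(|&x| x > 0)
}

// Without the filter (this PR), a known zero is preserved.
fn without_filter(null_count: Option<u64>) -> Option<i64> {
    null_count.map(|v| v as i64)
}

fn main() {
    assert_eq!(with_filter(Some(0)), None); // old: dropped
    assert_eq!(without_filter(Some(0)), Some(0)); // new: preserved
    // Nonzero counts were unaffected:
    assert_eq!(with_filter(Some(3)), Some(3));
}
```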

Contributor

I was intrigued, so went to do some code archaeology, but that filter was only introduced a few days ago in #6216, by you! 😄

For 6 years before it was:

null_count: if stats.has_nulls() {
    Some(stats.null_count() as i64)
} else {
    None
},

Your commit:
7d4e650

Prior code:
https://github.com/apache/arrow-rs/blame/25bfccca58ff219d9f59ba9f4d75550493238a4f/parquet/src/file/statistics.rs#L228-L242

@alamb alamb marked this pull request as ready for review August 15, 2024 18:40
@alamb
Contributor Author

alamb commented Aug 20, 2024

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

@etseidl
Contributor

etseidl commented Aug 20, 2024

@etseidl I wonder if you have any thoughts on this code / the writing of null counts

Sorry, I was away when this discussion started (interesting things always happen when I'm on vacation 😉). I think this PR is heading the right direction. Writing Some(0) when the null count is known is the right behavior IMO. On the read side I've always treated missing null counts as unknown rather than 0, so the changes here are welcome. I think it's fine for the read side to continue with the old behavior for some time.

@alamb
Contributor Author

alamb commented Aug 28, 2024

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

@etseidl
Contributor

etseidl commented Aug 28, 2024

Unless someone else has time to review this PR and thinks it should go into 53.0.0 (#6016) my personal preference would be to merge this PR early in the next major release cycle (e.g. 54.0.0) so it gets maximum bake / testing time before release

I've looked this PR over several times and haven't found any issues with it. The only thing giving me pause is this apache/parquet-format#449 (comment)

It seems java and cpp ignore the definition levels and write Some(0) regardless, so I'm fine with merging this and worrying about micro optimizations down the road (in step with the rest of the parquet community).

@alamb
Contributor Author

alamb commented Aug 31, 2024

Thank you @etseidl -- I think

  1. given the potential for this PR to cause unintended consequences
  2. we haven't actually had any bug reports related to this issue,
  3. the 53.0.0 release is imminent and has several people waiting on features

I am not going to merge this PR until after we have released 53.0.0

@alamb
Contributor Author

alamb commented Sep 18, 2024

I am depressed about the large review backlog in this crate. We are looking for more help from the community reviewing PRs -- see #6418 for more

Contributor

@etseidl etseidl left a comment

Took another look and found no nits. I'd say ship it.

Successfully merging this pull request may close these issues.

parquet writer does not encode null count = 0 correctly