Conversation

@kou
Member

@kou kou commented Aug 7, 2024

Rationale for this change

We don't need an "unknown" state. If they aren't set, we can treat them as not exact.

What changes are included in this PR?

Remove std::optional from arrow::ArrayStatistics::is_{min,max}_exact.
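
A minimal sketch of what this change amounts to, assuming simplified stand-in types. `StatsBefore`, `StatsAfter`, and `Convert` are hypothetical names for illustration, not the actual `arrow::ArrayStatistics` definition. It shows the tri-state flag collapsing into a plain `bool`, where "not set" and "not exact" become the same state:

```cpp
#include <cstdint>
#include <optional>

// Hypothetical "before" shape: exactness is tri-state (unset / false / true).
struct StatsBefore {
  std::optional<int64_t> min;
  std::optional<bool> is_min_exact;  // std::nullopt meant "unknown"
};

// Hypothetical "after" shape: exactness is a plain bool that defaults to false.
struct StatsAfter {
  std::optional<int64_t> min;
  bool is_min_exact = false;  // "not set" and "not exact" are the same state
};

// Collapse the old tri-state flag into the new two-state form.
inline StatsAfter Convert(const StatsBefore& before) {
  StatsAfter after;
  after.min = before.min;
  after.is_min_exact = before.is_min_exact.value_or(false);
  return after;
}
```

With a plain `bool`, callers can test the flag directly instead of first checking whether it holds a value.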

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@github-actions

github-actions bot commented Aug 7, 2024

⚠️ GitHub issue #43594 has been automatically assigned in GitHub to PR creator.

Member

This looks good to me, but IMO is_min_exact = false might still come with exact statistics; the reader just cannot guarantee that, since apache/parquet-format#216 is new in 2.10 :-(

Member Author

@kou kou Aug 10, 2024

Are the following correct?

  1. Parquet 2.9 or earlier data don't have is_min_value_exact/is_max_value_exact
  2. Parquet 2.9 or earlier data use only exact min/max
  3. Parquet 2.10 or later data use exact min/max or non-exact min/max
  4. Parquet 2.10 or later data may use exact min/max without is_min_value_exact/is_max_value_exact

You're focusing on case 2, right? Can our Parquet reader detect the Parquet version? If so, can we always set true to is_min_exact/is_max_exact for Parquet 2.9 or earlier?

Member

Sorry for the late reply.

Parquet 2.9 or earlier data don't have is_min_value_exact/is_max_value_exact

Yes

Parquet 2.9 or earlier data use only exact min/max

I guess not. I'll send a mail to the mailing list to make sure.

Parquet 2.10 or later data use exact min/max or non-exact min/max

Yes

Parquet 2.10 or later data may use exact min/max without is_min_value_exact/is_max_value_exact

right

Member

If so, can we always set true to is_min_exact/is_max_exact for Parquet 2.9 or earlier?

Hmm, I'll try to clarify this.

Member Author

Thanks!

Member

@mapleFU mapleFU Aug 16, 2024

Anyway, we can mention that "exact=false" can also mean the value is exact, lol

Or we can denote that the parquet-c++ output is exact.

Member Author

OK. Let's use exact=true for Apache Parquet C++ output.
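
One possible reading of the outcome of this thread, sketched under stated assumptions: `BoundInfo` and `ResolveIsExact` are hypothetical names, and the idea that the reader can tell whether a file was written by Apache Parquet C++ is an assumption, not something confirmed here. An explicit flag in the file wins; otherwise output from Apache Parquet C++ is treated as exact, and anything else falls back to "not exact", which only means exactness cannot be guaranteed:

```cpp
#include <optional>

// Hypothetical inputs a reader might have for one bound (min or max).
struct BoundInfo {
  // is_{min,max}_value_exact from the file, if the writer stored it.
  std::optional<bool> exact_flag_in_file;
  // Assumption: the reader can determine this from file metadata.
  bool written_by_parquet_cpp = false;
};

// Decide the two-state exactness flag for one bound.
inline bool ResolveIsExact(const BoundInfo& info) {
  // An explicit flag in the file always takes precedence.
  if (info.exact_flag_in_file.has_value()) return *info.exact_flag_in_file;
  // No flag: trust Apache Parquet C++ output to be exact.
  if (info.written_by_parquet_cpp) return true;
  // Otherwise the value may still be exact, but the reader cannot guarantee it.
  return false;
}
```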

…s::is_{min,max}_exact

We don't need "unknown" state. If they aren't set, we can process they
are not exact.
@kou kou force-pushed the cpp-array-statistics-exact branch from abcfb72 to 8331bf5 Compare August 10, 2024 05:53
@github-actions github-actions bot added the awaiting changes label and removed the awaiting committer review label Aug 10, 2024
@kou kou merged commit b80a51a into apache:main Aug 16, 2024
@kou kou removed the awaiting changes label Aug 16, 2024
@kou kou deleted the cpp-array-statistics-exact branch August 16, 2024 05:23
@github-actions github-actions bot added the awaiting changes label Aug 16, 2024
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 7 benchmarking runs that have been run so far on merge-commit b80a51a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 1 possible false positive for unstable benchmarks that are known to sometimes produce them.
