Add parquet `StatisticsConverter` for arrow reader #6046

efredine · 2024-07-11T22:06:19Z

Which issue does this PR close?

Closes #4328.

Rationale for this change

Ports the StatisticsConverter implementation and tests from DataFusion.

What changes are included in this PR?

The StatisticsConverter and all tests. It is functionally unchanged from the DataFusion implementation.

Changes:

Removed all log::debug statements, since it seemed to me these aren't used in the arrow crate.
Converted all errors to use the arrow_err macro.

For the tests, I only moved over the code actually used by the statistics tests.

Are there any user-facing changes?

Yes, exposes the StatisticsConverter.
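
For readers new to the idea, here is a minimal sketch of what "converting statistics into arrays" means, assuming a simplified model in plain Rust with no parquet or arrow dependency. `RowGroupStats`, `collect_mins`, and `collect_maxes` are hypothetical names made up for illustration and are not the API added by this PR; a `Vec<Option<i64>>` stands in for a nullable Arrow array.

```rust
/// Hypothetical stand-in for the statistics of one column in one row group.
#[derive(Debug, Clone, PartialEq)]
struct RowGroupStats {
    min: Option<i64>,
    max: Option<i64>,
}

/// Collect the minimum of every row group into one nullable column:
/// one slot per row group, `None` where statistics are missing.
fn collect_mins(row_groups: &[Option<RowGroupStats>]) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .map(|stats| stats.as_ref().and_then(|s| s.min))
        .collect()
}

/// Same as `collect_mins`, but for the maximum values.
fn collect_maxes(row_groups: &[Option<RowGroupStats>]) -> Vec<Option<i64>> {
    row_groups
        .iter()
        .map(|stats| stats.as_ref().and_then(|s| s.max))
        .collect()
}
```

The point of the shape is that a row group without statistics still occupies a slot, so the resulting column lines up with the row groups by position.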

efredine · 2024-07-11T22:39:38Z

There is a pending PR in DataFusion that will need to be ported: https://github.com/apache/datafusion/pull/11289/files#diff-7110f4709c105a18ef74a212396444d62052179a735d148fb62470a8b157fb40

We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

alamb · 2024-07-13T12:26:46Z

Amazing @efredine -- thank you. I am working through this but may not finish until tomorrow

alamb · 2024-07-13T12:28:26Z

> We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.

I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )

My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

Lordworms · 2024-07-13T15:55:36Z

> > We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.
>
> I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )
>
> My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.

I agree, should I port the struct related function now?

efredine · 2024-07-13T18:37:40Z

> > > We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.
> >
> > I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )
> > My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.
>
> I agree, should I port the struct related function now?

@Lordworms I don't think it should be part of this PR which is already huge. I think the easiest thing to do is to wait until this PR is merged (which should happen soon) then open a new PR in this repository with the struct changes.

efredine · 2024-07-13T18:40:04Z

parquet/src/arrow/arrow_reader/statistics.rs

+/// underlying statistics value (stored as a parquet value) into the
+/// corresponding Arrow  value. For example, Decimals are stored as binary in
+/// parquet files.
+///


As part of the port, I changed the visibility of parquet_column from pub(crate) to pub because the pub(crate) caused a failure in documentation tests. But I'm not sure this was the right way to resolve that issue.

I think having parquet_column pub is a good change and it will be useful for others.

However, since it is more applicable than just statistics, I think it should be moved to the main arrow.rs (I will do so shortly)
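
The doc comment quoted in this thread notes that Decimals are stored as binary in parquet files. As a rough illustration of that kind of conversion, the sketch below sign-extends a big-endian two's-complement byte slice into an `i128`, assuming the unscaled decimal value fits in 16 bytes. `be_bytes_to_i128` is a hypothetical helper written for this example and is not necessarily how the converter does it.

```rust
/// Hypothetical helper: interpret `bytes` as a big-endian two's-complement
/// integer (the unscaled decimal value) and widen it to an `i128`.
/// Returns `None` for empty input or input longer than 16 bytes.
fn be_bytes_to_i128(bytes: &[u8]) -> Option<i128> {
    if bytes.is_empty() || bytes.len() > 16 {
        return None;
    }
    // Sign extension: pad with 0xFF when the top bit is set (negative),
    // otherwise pad with 0x00.
    let fill: u8 = if bytes[0] & 0x80 != 0 { 0xFF } else { 0x00 };
    let mut buf = [fill; 16];
    // Right-align the input so the least significant byte lands last.
    buf[16 - bytes.len()..].copy_from_slice(bytes);
    Some(i128::from_be_bytes(buf))
}
```

Applying the decimal scale (for example, reading an unscaled 12345 with scale 2 as 123.45) is a separate step that this sketch leaves out.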

efredine · 2024-07-13T18:43:18Z

parquet/src/arrow/arrow_reader/statistics.rs

+        // in the parquet schema
+        return None;
+    }
+


How important is addressing the efficiency consideration here? For a table with many columns it would be a lot of linear searches.

I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on
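
For that follow-on, one possible direction is to build a name-to-index map once and reuse it, instead of scanning the column list on every lookup. The sketch below is a simplified model under that assumption: `LeafColumns`, `find_linear`, and `build_index` are hypothetical names, and the real `parquet_column` logic works against the arrow and parquet schemas rather than a plain list of names.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a flattened schema: leaf column names in order.
struct LeafColumns {
    names: Vec<String>,
}

impl LeafColumns {
    /// Linear scan: cost grows with the number of columns on every call.
    fn find_linear(&self, name: &str) -> Option<usize> {
        self.names.iter().position(|n| n == name)
    }

    /// Build a name -> index map once; later lookups are hash lookups.
    fn build_index(&self) -> HashMap<&str, usize> {
        let mut index = HashMap::with_capacity(self.names.len());
        for (i, name) in self.names.iter().enumerate() {
            // Keep the first occurrence so results match `find_linear`.
            index.entry(name.as_str()).or_insert(i);
        }
        index
    }
}
```

Whether the map is worth its construction cost depends on how many lookups are made per schema, which is something a benchmark in the follow-on ticket could settle.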

Lordworms · 2024-07-13T21:12:26Z

> > > > We could hold off merging this until that work is complete and I'll update this PR or I can do it as a separate PR.
> > >
> > > I recommend we merge this PR, and then port/fix up the struct array statistics directly in arrow-rs apache/datafusion#11289 (cc @Lordworms )
> > > My rationale is that we are more likely to find some struct array expertise in the arrow-rs repo than the datafusion repo.
> >
> > I agree, should I port the struct related function now?
>
> @Lordworms I don't think it should be part of this PR which is already huge. I think the easiest thing to do is to wait until this PR is merged (which should happen soon) then open a new PR in this repository with the struct changes.

Got it, I'll wait for it to be merged, thanks for your work.

alamb

First of all, thank you so much @efredine -- this is epic (even though I know most of it was just moving code). It was quite easy to read and I found nothing in need of changes.

I know @tustvold has concerns about ignoring / not handling the ColumnOrder statistics correctly: apache/datafusion#10586. Once we sort out what the practical implications are (which is not at all clear to me now), I will make a PR to update the documentation.

The one final thing I want to do before merging this PR is to make a draft PR in DataFusion to use it and verify that everything works. Doing so now

alamb · 2024-07-15T20:13:44Z

parquet/src/arrow/arrow_reader/statistics.rs

+        // in the parquet schema
+        return None;
+    }
+


I think we should file a follow on ticket to improve the situation. I think we have something functional and then we can always make it better as a follow on

alamb · 2024-07-15T20:22:52Z

I merged this branch up from master and moved where parquet_column went.

I plan to draft a "use upstream arrow version" PR in DataFusion, file a follow on PR for improving the performance of parquet_column and then I think this PR will be good to go

alamb

I also found that the benchmarks in https://github.com/apache/datafusion/blob/main/datafusion/core/benches/parquet_statistic.rs were not ported. I will do so now

alamb · 2024-07-15T20:34:41Z

FYI @marvinlanhenke

alamb · 2024-07-15T21:08:29Z

> Oh darn - yes - I missed those. If you run out of time I’m happy to port them.

No worries, I already pushed it to your branch (I had the code checked out anyways)

alamb · 2024-07-15T21:08:58Z

Updates

The PR to update DataFusion to use this code, "Use upstream StatisticsConverter from arrow-rs in DataFusion" (datafusion#11479), is looking good.
Doing that made me realize we haven't hooked up the DataPage statistics extraction yet, so I will make a PR now: "Update the parquet code prune_pages_in_one_row_group to use the StatisticsExtractor" (datafusion#11480), mostly to ensure that the API for extracting them is sufficient.

alamb · 2024-07-16T14:01:58Z

Ok, I pushed some commits to this branch:

5c8c1ba: adds an API I found I needed in "Update parquet page pruning code to use the StatisticsExtractor" (datafusion#11483)
f993b08: overly obsessive doc editing

alamb · 2024-07-16T14:02:20Z

I think this PR is ready to go -- I am just going to try to make sure I can get apache/datafusion#11479 working with this as one final test

alamb · 2024-07-16T14:23:11Z

Thanks again so much @efredine -- I plan to merge this later today unless anyone else would like time to review or comment

alamb · 2024-07-16T19:50:17Z

🚀 -- thanks again

Eric Fredine added 2 commits July 9, 2024 07:32

Adds arrow statistics converter for parquet stastistics.

b835f8d

Adds integration tests for arrow statsistics converter.

c2e1ce2

github-actions bot added the parquet Changes to the parquet crate label Jul 11, 2024

efredine marked this pull request as draft July 11, 2024 22:09

Fix linting, remove todo, re-use arrow code.

3103651

efredine marked this pull request as ready for review July 11, 2024 22:40

Remove commented out debug::log statements.

1a2a893

alamb mentioned this pull request Jul 12, 2024

DataFusion weekly project plan (Andrew Lamb) - July 8, 2024 apache/datafusion#11334

Closed

9 tasks

efredine commented Jul 13, 2024

This was referenced Jul 15, 2024

[EPIC] Continued correct and improved extracting Parquet statistics into ArrayRefs apache/datafusion#10922

Closed

DataFusion weekly project plan (Andrew Lamb) - July 15, 2024 apache/datafusion#11474

Closed

alamb added 3 commits July 15, 2024 15:00

Merge remote-tracking branch 'apache/master' into add-parquet-statistics-converter

5a902ff

Move parquet_column to lib.rs

fa5bd31

doc tweaks

1a0f23b

alamb mentioned this pull request Jul 15, 2024

Add function that converts from parquet statistics ParquetStatistics to arrow arrays ArrayRef #4328

Closed

alamb approved these changes Jul 15, 2024

alamb reviewed Jul 15, 2024

Add benchmark

3438746

alamb mentioned this pull request Jul 15, 2024

Use upstream StatisticsConverter from arrow-rs in DataFusion apache/datafusion#11479

Merged

alamb mentioned this pull request Jul 16, 2024

Update parquet page pruning code to use the StatisticsExtractor apache/datafusion#11483

Merged

alamb added 2 commits July 16, 2024 09:52

Add parquet_column_index and arrow_field accessors + test

5c8c1ba

Copy edit docs obsessively

f993b08

clippy

469fed9

alamb changed the title ~~Add parquet statistics converter for arrow reader~~ Add parquet StatisticsConverter for arrow reader Jul 16, 2024

alamb merged commit 66390ff into apache:master Jul 16, 2024
18 checks passed

This was referenced Jul 18, 2024

Extract parquet statistics for StructArray apache/datafusion#11289

Closed

DataFusion weekly project plan (Andrew Lamb) - July 22, 2024 apache/datafusion#11601

Closed

This was referenced Jul 31, 2024

Add support for StringView and BinaryView statistics in StatisticsConverter #6164

Closed

Remove test duplication in parquet statistics tets #6185

Closed
