
feat: API for collecting statistics/index for metadata of a parquet file + tests #10537

Merged (13 commits) May 20, 2024

Conversation

@NGA-TRAN (Contributor, author) commented May 15, 2024

Which issue does this PR close?

First PR of #10453. This PR does:

  1. Add the API
  2. Add good test coverage, with more tests to add in a follow-on PR to avoid too much code in one PR

Rationale for this change

The API will help us prune more files, and prune them more effectively.

What changes are included in this PR?

  1. Create a new API, RequestedStatistics
  2. Convert metadata of a parquet file into RequestedStatistics form
  3. Some tests

Are these changes tested?

Yes

Are there any user-facing changes?

New API

@NGA-TRAN NGA-TRAN marked this pull request as draft May 15, 2024 21:41
@github-actions bot added the core (Core DataFusion crate) label May 15, 2024
}

//////////////// WRITE STATISTICS ///////////////////////
let file_meta = writer.close().unwrap();
NGA-TRAN (Contributor, author):
To my surprise, the write and read statistics, even though they have the same content, are stored in different structures: writing uses parquet::format::Statistics while reading uses parquet::file::statistics::Statistics. Why?

Contributor:

One is the native parquet (thrift) encoding; the other is the Rust representation of it. In general you should avoid using format::Statistics directly.
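The split between the two layers can be pictured with a std-only sketch. The structs below are illustrative stand-ins, not the real parquet crate types: the wire form carries min/max as raw bytes whose meaning depends on the physical type, while the reader-facing form is already decoded into typed Rust values.

```rust
// Illustrative stand-ins for the two layers (NOT the real parquet types):
// the thrift/wire form stores statistics as raw little-endian bytes, while
// the reader-facing form is a typed Rust enum.
struct WireStatistics {
    min_value: Vec<u8>, // raw bytes; meaning depends on the physical type
}

#[derive(Debug, PartialEq)]
enum TypedStatistics {
    Int32 { min: i32 },
}

// Decoding the wire form into the typed form for an INT32 column.
fn decode_int32(wire: &WireStatistics) -> TypedStatistics {
    let bytes: [u8; 4] = wire.min_value[..4].try_into().unwrap();
    TypedStatistics::Int32 {
        min: i32::from_le_bytes(bytes),
    }
}

fn main() {
    let wire = WireStatistics {
        min_value: (-5i32).to_le_bytes().to_vec(),
    };
    assert_eq!(decode_int32(&wire), TypedStatistics::Int32 { min: -5 });
    println!("decoded min = -5");
}
```

This is only meant to show why the two representations exist side by side: one mirrors the on-disk thrift layout, the other is convenient to consume from Rust.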

Comment on lines 23 to 28
pub fn parquet_stats_to_arrow<'a>(
arrow_datatype: &DataType,
statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
) -> Result<ArrowStatistics> {
todo!() // MY TODO next
}
@tustvold commented May 16, 2024:

To emphasise the point I made when this API was originally proposed: you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetaData to get the column_order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) to even interpret what the statistics mean for a given column.

Additionally you need to actually have the parquet schema as the arrow datatype may not match what the parquet data is encoded as. The parquet schema is authoritative when reading parquet data, the arrow datatype is purely what the data should be coerced to once read from parquet.
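A hypothetical sketch of what a schema-aware signature could look like. All type names here are local stubs standing in for the parquet/arrow types; this is not the actual DataFusion API, just an illustration of threading the parquet schema through alongside the arrow type:

```rust
// Local stand-ins for the real types; in DataFusion these would come
// from the parquet and arrow crates.
struct SchemaDescriptor;   // stand-in for parquet's schema descriptor
struct DataType;           // stand-in for arrow::datatypes::DataType
struct ParquetStatistics;  // stand-in for parquet::file::statistics::Statistics
struct ArrowStatistics;

// The parquet schema is authoritative for how the data is encoded on
// disk; the arrow datatype only says what to coerce it to after reading.
fn parquet_stats_to_arrow<'a>(
    parquet_schema: &SchemaDescriptor,
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
) -> Result<ArrowStatistics, String> {
    let _ = (parquet_schema, arrow_datatype);
    let _ = statistics.into_iter().count(); // placeholder: consume the input
    Ok(ArrowStatistics)
}

fn main() {
    let ok = parquet_stats_to_arrow(
        &SchemaDescriptor,
        &DataType,
        [None::<&ParquetStatistics>],
    )
    .is_ok();
    assert!(ok);
    println!("ok");
}
```

The point of the extra parameter is purely the one made above: the statistics bytes cannot be interpreted from the arrow type alone.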

Contributor:

In terms of "column order" I think we initially should do what DataFusion currently does with ColumnOrder (which is ignore it) and file a ticket to handle it longer term

Including the parquet schema is a good idea. I think this will become more obvious as we begin writing these tests

NGA-TRAN (Contributor, author):

Yeah, it is very easy to get FileMetaData from the parquet reader. I agree the sort order is not needed (yet), but I will see what we need as we go and add it in.

Contributor:

Filed #10586

@alamb left a comment:

Thank you @NGA-TRAN

I also filed #10546 and will get that example in place so we can show how this API will be used (in addition to hooking it into the existing ListingTable and ParquetExec), which I think will help design the API.

As @tustvold mentions, the actual signature of parquet_stats_to_arrow will likely need to change, but I think that is ok

datafusion/core/tests/parquet/arrow_statistics.rs (review thread, outdated/resolved)
@NGA-TRAN NGA-TRAN marked this pull request as ready for review May 16, 2024 22:27
@NGA-TRAN NGA-TRAN changed the title test: some tests to write data to a parquet file and read its metadata feat: API for collecting statistics/index for metadata of a parquet file May 17, 2024
@NGA-TRAN (Contributor, author):

@alamb : The PR is ready for review. Some notes:

  1. The tests in datafusion/core/src/datasource/physical_plan/parquet/statistics.rs are tests for different data types. Instead of using exact test inputs that do not cover everything, I am reusing generally available test inputs. Since I have not added all data types (which we agreed to do in a follow-on PR to avoid too much code), I decided not to remove any tests from that file yet. I will remove them when I am happy with the test coverage.
  2. I have a feeling there will be a lot of tests if we want to cover all data types. I am looking forward to your feedback on the tests in this PR and on what we should not add.
  3. I found a bug (the same one in 3 tests for 3 different data types). I did a quick investigation but the root cause is not clear yet. I will continue next week.


Test {
reader,
// mins are [-5, -4, 0, 5]
Contributor:

This may be related to #9779 (something related to how parquet handles signed integers)
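One plausible shape of such a signed-integer pitfall, sketched std-only (this is illustrative, not the actual parquet crate code): Int8/Int16 column statistics are physically stored as INT32, and recovering the narrow value must preserve the sign rather than reading the low byte as an unsigned value.

```rust
// Parquet physically stores Int8/Int16 column statistics as INT32.
// Recovering the narrow value must keep the sign; reading the low
// byte as unsigned yields a wrong (large positive) minimum.
fn stat_as_i8(le_bytes: [u8; 4]) -> i8 {
    let v = i32::from_le_bytes(le_bytes);
    v as i8 // truncating cast keeps the low byte, including the sign
}

fn main() {
    let min = -5i32; // e.g. the "-5" from the mins above, stored as INT32
    assert_eq!(stat_as_i8(min.to_le_bytes()), -5i8);
    // Naively taking the low byte as unsigned loses the sign:
    assert_eq!(min.to_le_bytes()[0], 251u8);
    println!("{}", stat_as_i8(min.to_le_bytes())); // prints -5
}
```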

NGA-TRAN (Contributor, author):

This is a good hint for me to understand the issue better.

Contributor:

Filed #10585

@alamb left a comment:

Thank you very much @NGA-TRAN -- I think this PR is a significant step forward and shows well how some of the APIs can be used, as well as their limitations.

Here are my suggested next steps:

  1. I will file a ticket for incorrect statistics from Int8 / Int16
  2. I will file a ticket about ignoring ColumnOrder as mentioned by @tustvold in feat: API for collecting statistics/index for metadata of a parquet file + tests #10537 (comment)
  3. I will file a ticket about potentially incorrect date statistics being read from parquet files
  4. I'll try and make a PR that switches the existing code over to using this new API (to verify it basically fits)

Test {
reader,
// mins are [18262, 18565,]
expected_min: Arc::new(Int32Array::from(vec![18262, 18565])),
Contributor:

I would actually expect the returned type to be Date32Array as the underlying arrow type is Date32 -- I don't think we need to make this change as part of this PR


let reader = parquet_file_many_columns(Scenario::Dates, row_per_group).await;
Test {
reader,
expected_min: Arc::new(Int64Array::from(vec![18262, 18565])), // panic here because the actual data is Int32Array
Contributor:

Likewise, I would expect this to be Date64Array (not Int64Array).
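For context on why the raw values above are dates in disguise: Date32 values are days since the Unix epoch. A small std-only helper (Howard Hinnant's civil-from-days algorithm; not part of this PR) shows that the min 18262 decodes to 2020-01-01:

```rust
// Convert days-since-Unix-epoch to (year, month, day).
// Howard Hinnant's "civil_from_days" algorithm, valid far beyond
// any date parquet files are likely to contain.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = (if z >= 0 { z } else { z - 146_096 }) / 146_097;
    let doe = z - era * 146_097; // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year
    let mp = (5 * doy + 2) / 153; // month index starting from March
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    (if m <= 2 { y + 1 } else { y }, m, d)
}

fn main() {
    // The mins from the test above, stored as raw day counts:
    assert_eq!(civil_from_days(18262), (2020, 1, 1));
    assert_eq!(civil_from_days(18565), (2020, 10, 30));
    println!("18262 -> 2020-01-01, 18565 -> 2020-10-30");
}
```

So returning a Date32Array would carry the same i32 day counts, just with the arrow type that makes them interpretable as dates.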

expected_null_counts: UInt64Array::from(vec![2, 2]),
expected_row_counts: UInt64Array::from(vec![13, 7]),
}
.run_col_not_found("not_a_column");
Contributor:

👍

@alamb alamb changed the title feat: API for collecting statistics/index for metadata of a parquet file feat: API for collecting statistics/index for metadata of a parquet file + tests May 20, 2024
datafusion/core/tests/parquet/arrow_statistics.rs (review thread, outdated/resolved)
@alamb commented May 20, 2024:

I have filed the following tickets

I think this PR is now ready to go. I plan to merge it in when the CI passes

@alamb alamb merged commit b716c09 into apache:main May 20, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…ile + tests (apache#10537)

* test: some tests to write data to a parquet file and read its metadata

* feat: API to convert parquet stats to arrow stats

* Refine statistics extraction API and tests

* Implement null counts

* port test

* test: add more tests for the arrow statistics

* chore: fix format and test output

* chore: rename test helpers

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Apply suggestions from code review

* Apply suggestions from code review

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels: core (Core DataFusion crate)
3 participants