
feat: API for collecting statistics/index for metadata of a parquet file + tests #10537

Merged (13 commits) May 20, 2024

Conversation

@NGA-TRAN (Contributor, author) commented May 15, 2024

Which issue does this PR close?

First PR of #10453. This PR does:

  1. Add the API
  2. Add good test coverage, with more tests to add in a follow-on PR to avoid too much code in one PR

Rationale for this change

The API will help us prune more files, and prune them more effectively.

What changes are included in this PR?

  1. Create a new API, RequestedStatistics
  2. Convert metadata of a parquet file into RequestedStatistics form
  3. Some tests

Are these changes tested?

Yes

Are there any user-facing changes?

New API

@NGA-TRAN NGA-TRAN marked this pull request as draft May 15, 2024 21:41
@github-actions bot added the core (Core DataFusion crate) label May 15, 2024
}

//////////////// WRITE STATISTICS ///////////////////////
let file_meta = writer.close().unwrap();
NGA-TRAN (Contributor, author):
To my surprise, the write and read statistics, even though they have the same content, are stored in different structures: writing uses parquet::format::Statistics while reading uses parquet::file::statistics::Statistics. Why?

Contributor:

One is the native parquet (thrift) encoding; the other is the Rust representation of it. In general you should avoid using format::Statistics directly.
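The split between the two layers can be pictured with a std-only sketch. The structs below are illustrative stand-ins, not the real parquet crate types: the wire form carries min/max as raw bytes whose meaning depends on the physical type, while the reader-facing form is already decoded into typed Rust values.

```rust
// Illustrative stand-ins for the two layers (NOT the real parquet types):
// the thrift/wire form stores statistics as raw little-endian bytes, while
// the reader-facing form is a typed Rust enum.
struct WireStatistics {
    min_value: Vec<u8>, // raw bytes; meaning depends on the physical type
}

#[derive(Debug, PartialEq)]
enum TypedStatistics {
    Int32 { min: i32 },
}

// Decoding the wire form into the typed form for an INT32 column.
fn decode_int32(wire: &WireStatistics) -> TypedStatistics {
    let bytes: [u8; 4] = wire.min_value[..4].try_into().unwrap();
    TypedStatistics::Int32 {
        min: i32::from_le_bytes(bytes),
    }
}

fn main() {
    let wire = WireStatistics {
        min_value: (-5i32).to_le_bytes().to_vec(),
    };
    assert_eq!(decode_int32(&wire), TypedStatistics::Int32 { min: -5 });
    println!("decoded min = -5");
}
```

This is only meant to show why the two representations exist side by side: one mirrors the on-disk thrift layout, the other is convenient to consume from Rust.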

Comment on lines 23 to 28
pub fn parquet_stats_to_arrow<'a>(
arrow_datatype: &DataType,
statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
) -> Result<ArrowStatistics> {
todo!() // MY TODO next
}
@tustvold commented May 16, 2024:

To emphasise the point I made when this API was originally proposed: you need more than just the ParquetStatistics in order to correctly interpret the data. You need at least the FileMetaData to get the column_order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) to even interpret what the statistics mean for a given column.

Additionally you need to actually have the parquet schema as the arrow datatype may not match what the parquet data is encoded as. The parquet schema is authoritative when reading parquet data, the arrow datatype is purely what the data should be coerced to once read from parquet.
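A hypothetical sketch of what a schema-aware signature could look like. All type names here are local stubs standing in for the parquet/arrow types; this is not the actual DataFusion API, just an illustration of threading the parquet schema through alongside the arrow type:

```rust
// Local stand-ins for the real types; in DataFusion these would come
// from the parquet and arrow crates.
struct SchemaDescriptor;   // stand-in for parquet's schema descriptor
struct DataType;           // stand-in for arrow::datatypes::DataType
struct ParquetStatistics;  // stand-in for parquet::file::statistics::Statistics
struct ArrowStatistics;

// The parquet schema is authoritative for how the data is encoded on
// disk; the arrow datatype only says what to coerce it to after reading.
fn parquet_stats_to_arrow<'a>(
    parquet_schema: &SchemaDescriptor,
    arrow_datatype: &DataType,
    statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
) -> Result<ArrowStatistics, String> {
    let _ = (parquet_schema, arrow_datatype);
    let _ = statistics.into_iter().count(); // placeholder: consume the input
    Ok(ArrowStatistics)
}

fn main() {
    let ok = parquet_stats_to_arrow(
        &SchemaDescriptor,
        &DataType,
        [None::<&ParquetStatistics>],
    )
    .is_ok();
    assert!(ok);
    println!("ok");
}
```

The point of the extra parameter is purely the one made above: the statistics bytes cannot be interpreted from the arrow type alone.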

Contributor:

In terms of "column order" I think we initially should do what DataFusion currently does with ColumnOrder (which is ignore it) and file a ticket to handle it longer term

Including the parquet schema is a good idea. I think this will become more obvious as we begin writing these tests

NGA-TRAN (Contributor, author):

Yeah, it is very easy to get FileMetaData from the parquet reader. I agree the sort order is not needed (yet), but I will see what we need as we go and add it in.

Contributor:

Filed #10586

@alamb left a comment:

Thank you @NGA-TRAN

I also filed #10546 and will get that example in place so we can show how this API will be used (in addition to hooking it into the existing ListingTable and ParquetExec), which I think will help design the API.

As @tustvold mentions, the actual signature of parquet_stats_to_arrow will likely need to change, but I think that is ok

datafusion/core/tests/parquet/arrow_statistics.rs (review thread, outdated/resolved)
@NGA-TRAN NGA-TRAN marked this pull request as ready for review May 16, 2024 22:27
@NGA-TRAN NGA-TRAN changed the title test: some tests to write data to a parquet file and read its metadata feat: API for collecting statistics/index for metadata of a parquet file May 17, 2024
@NGA-TRAN (Contributor, author):

@alamb : The PR is ready for review. Some notes:

  1. The tests in datafusion/core/src/datasource/physical_plan/parquet/statistics.rs are tests for different data types. Instead of using exact test inputs that do not cover everything, I am reusing generally available test inputs. Since I have not added all data types (which we agreed to do in a follow-on PR to avoid too much code), I decided not to remove any tests from that file yet. I will remove them when I am happy with the test coverage.
  2. I have a feeling there will be a lot of tests if we want to cover all data types. I am looking forward to your feedback on the tests in this PR and on what we should not add.
  3. I found a bug (the same one in 3 tests for 3 different data types). I did a quick investigation but the root cause is not clear yet. I will continue next week.


Test {
reader,
// mins are [-5, -4, 0, 5]
Contributor:

This may be related to #9779 (something related to how parquet handles signed integers)
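One plausible shape of such a signed-integer pitfall, sketched std-only (this is illustrative, not the actual parquet crate code): Int8/Int16 column statistics are physically stored as INT32, and recovering the narrow value must preserve the sign rather than reading the low byte as an unsigned value.

```rust
// Parquet physically stores Int8/Int16 column statistics as INT32.
// Recovering the narrow value must keep the sign; reading the low
// byte as unsigned yields a wrong (large positive) minimum.
fn stat_as_i8(le_bytes: [u8; 4]) -> i8 {
    let v = i32::from_le_bytes(le_bytes);
    v as i8 // truncating cast keeps the low byte, including the sign
}

fn main() {
    let min = -5i32; // e.g. the "-5" from the mins above, stored as INT32
    assert_eq!(stat_as_i8(min.to_le_bytes()), -5i8);
    // Naively taking the low byte as unsigned loses the sign:
    assert_eq!(min.to_le_bytes()[0], 251u8);
    println!("{}", stat_as_i8(min.to_le_bytes())); // prints -5
}
```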

NGA-TRAN (Contributor, author):

This is a good hint for me to understand the issue better.

Contributor:

Filed #10585

@alamb left a comment:

Thank you very much @NGA-TRAN -- I think this PR is a significant step forward and shows well how some of the APIs can be used, as well as their limitations.

Here are my suggested next steps:

  1. I will file a ticket for incorrect statistics from Int8 / Int16
  2. I will file a ticket about ignoring ColumnOrder as mentioned by @tustvold in feat: API for collecting statistics/index for metadata of a parquet file + tests #10537 (comment)
  3. I will file a ticket about potentially incorrect date statistics being read from parquet files
  4. I'll try and make a PR that switches the existing code over to using this new API (to verify it basically fits)

Test {
reader,
// mins are [18262, 18565,]
expected_min: Arc::new(Int32Array::from(vec![18262, 18565])),
Contributor:

I would actually expect the returned type to be Date32Array as the underlying arrow type is Date32 -- I don't think we need to make this change as part of this PR


let reader = parquet_file_many_columns(Scenario::Dates, row_per_group).await;
Test {
reader,
expected_min: Arc::new(Int64Array::from(vec![18262, 18565])), // panic here because the actual data is Int32Array
Contributor:

Likewise, I would expect this to be Date64Array (not Int64Array).
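For context on why the raw values above are dates in disguise: Date32 values are days since the Unix epoch. A small std-only helper (Howard Hinnant's civil-from-days algorithm; not part of this PR) shows that the min 18262 decodes to 2020-01-01:

```rust
// Convert days-since-Unix-epoch to (year, month, day).
// Howard Hinnant's "civil_from_days" algorithm, valid far beyond
// any date parquet files are likely to contain.
fn civil_from_days(days: i64) -> (i64, u32, u32) {
    let z = days + 719_468;
    let era = (if z >= 0 { z } else { z - 146_096 }) / 146_097;
    let doe = z - era * 146_097; // day of era, [0, 146096]
    let yoe = (doe - doe / 1_460 + doe / 36_524 - doe / 146_096) / 365;
    let y = yoe + era * 400;
    let doy = doe - (365 * yoe + yoe / 4 - yoe / 100); // day of year
    let mp = (5 * doy + 2) / 153; // month index starting from March
    let d = (doy - (153 * mp + 2) / 5 + 1) as u32;
    let m = (if mp < 10 { mp + 3 } else { mp - 9 }) as u32;
    (if m <= 2 { y + 1 } else { y }, m, d)
}

fn main() {
    // The mins from the test above, stored as raw day counts:
    assert_eq!(civil_from_days(18262), (2020, 1, 1));
    assert_eq!(civil_from_days(18565), (2020, 10, 30));
    println!("18262 -> 2020-01-01, 18565 -> 2020-10-30");
}
```

So returning a Date32Array would carry the same i32 day counts, just with the arrow type that makes them interpretable as dates.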

expected_null_counts: UInt64Array::from(vec![2, 2]),
expected_row_counts: UInt64Array::from(vec![13, 7]),
}
.run_col_not_found("not_a_column");
Contributor:

👍

@alamb alamb changed the title feat: API for collecting statistics/index for metadata of a parquet file feat: API for collecting statistics/index for metadata of a parquet file + tests May 20, 2024
datafusion/core/tests/parquet/arrow_statistics.rs (review thread, outdated/resolved)
@alamb commented May 20, 2024:

I have filed the following tickets

I think this PR is now ready to go. I plan to merge it in when the CI passes

@alamb alamb merged commit b716c09 into apache:main May 20, 2024
23 checks passed
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
…ile + tests (apache#10537)

* test: some tests to write data to a parquet file and read its metadata

* feat: API to convert parquet stats to arrow stats

* Refine statistics extraction API and tests

* Implement null counts

* port test

* test: add more tests for the arrow statistics

* chore: fix format and test output

* chore: rename test helpers

* chore: Apply suggestions from code review

Co-authored-by: Andrew Lamb <[email protected]>

* Apply suggestions from code review

* Apply suggestions from code review

---------

Co-authored-by: Andrew Lamb <[email protected]>
Labels: core (Core DataFusion crate)
3 participants