feat: API for collecting statistics/index for metadata of a parquet file + tests #10537

Merged · 13 commits · May 20, 2024
@@ -0,0 +1,43 @@
use arrow_array::ArrayRef;
use arrow_schema::DataType;
use datafusion_common::Result;
use parquet::file::statistics::Statistics as ParquetStatistics;

/// Statistics extracted from the parquet `Statistics`, as Arrow `ArrayRef`s
///
/// # Note:
/// If the corresponding `Statistics` is not present, or has no information for
/// a column, a NULL is present in the corresponding array entry
pub struct ArrowStatistics {
/// min values
min: ArrayRef,
/// max values
max: ArrayRef,
/// Row counts (UInt64Array)
row_count: ArrayRef,
/// Null Counts (UInt64Array)
null_count: ArrayRef,
}

/// Extract `ArrowStatistics` from the parquet [`Statistics`]
pub fn parquet_stats_to_arrow<'a>(
arrow_datatype: &DataType,
statistics: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
) -> Result<ArrowStatistics> {
    todo!() // TODO: implement next
}
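
(Editorial sketch, not part of the diff.) Once implemented, the function above could be fed one Option<&Statistics> per row group for a single column, pulled from ParquetMetaData; the column index 0 and DataType::Int64 below are illustrative assumptions, not part of this PR:

use parquet::file::metadata::ParquetMetaData;

// Collect the per-row-group statistics for column 0 and convert them,
// reusing the DataType, Result, and ArrowStatistics items already in scope above
fn stats_for_first_column(metadata: &ParquetMetaData) -> Result<ArrowStatistics> {
    let stats = metadata
        .row_groups()
        .iter()
        .map(|rg| rg.column(0).statistics());
    parquet_stats_to_arrow(&DataType::Int64, stats)
}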
@tustvold (Contributor) commented on May 16, 2024:
To emphasise the point I made when this API was originally proposed: you need more than just the ParquetStatistics in order to correctly interpret the data. At a minimum you need the FileMetaData to get the column order (https://docs.rs/parquet/latest/parquet/file/metadata/struct.FileMetaData.html#method.column_order) in order to even interpret what the statistics mean for a given column.

Additionally, you need the parquet schema itself, as the arrow datatype may not match what the parquet data is encoded as. The parquet schema is authoritative when reading parquet data; the arrow datatype is purely what the data should be coerced to once read from parquet.
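
For context, a minimal sketch (not from the PR) of reading the ColumnOrder and the authoritative parquet schema out of FileMetaData with the standard parquet reader; the function name and taking an already-opened File are assumptions for illustration:

use parquet::file::reader::{FileReader, SerializedFileReader};
use std::fs::File;

fn inspect_column_orders(file: File) -> parquet::errors::Result<()> {
    let reader = SerializedFileReader::new(file)?;
    let file_meta = reader.metadata().file_metadata();
    // The parquet schema is authoritative for how the data is actually encoded
    let schema = file_meta.schema_descr();
    for i in 0..schema.num_columns() {
        // The ColumnOrder determines how the min/max statistics for this
        // column are to be interpreted
        println!("{}: {:?}", schema.column(i).path(), file_meta.column_order(i));
    }
    Ok(())
}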

Contributor commented:

In terms of "column order", I think we should initially do what DataFusion currently does with ColumnOrder (which is to ignore it) and file a ticket to handle it longer term.

Including the parquet schema is a good idea. I think this will become more obvious as we begin writing these tests.

Contributor (PR author) commented:

Yeah, it is very easy to get FileMetaData from the parquet reader. I agree the sort order is not needed (yet), but I will see what we need as we go and add it in.

Contributor commented:

Filed #10586


// TODO: Now that I have tests that read parquet metadata, I will use them to
// implement the struct below
// struct ParquetStatisticsExtractor {
// ...
// }

// // create an extractor that can extract statistics from parquet files
// let extractor = ParquetStatisticsExtractor::new(arrow_schema, parquet_schema);

// // get parquet statistics (one for each row group) somehow:
// let parquet_stats: Vec<&Statistics> = ...;

// // extract min/max values for columns "a" and "b":
// let col_a_stats = extractor.extract("a", parquet_stats.iter());
// let col_b_stats = extractor.extract("b", parquet_stats.iter());
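
(Editorial sketch.) A hypothetical, more concrete rendering of the commented-out design above; every name and signature here is illustrative, not the API the PR settled on:

use arrow_schema::SchemaRef;
use parquet::schema::types::SchemaDescriptor;
use std::sync::Arc;

/// Illustrative extractor pairing the arrow schema (what the data should be
/// coerced to) with the parquet schema (authoritative for what is encoded)
pub struct ParquetStatisticsExtractor {
    arrow_schema: SchemaRef,
    parquet_schema: Arc<SchemaDescriptor>,
}

impl ParquetStatisticsExtractor {
    pub fn new(arrow_schema: SchemaRef, parquet_schema: Arc<SchemaDescriptor>) -> Self {
        Self { arrow_schema, parquet_schema }
    }

    /// Extract statistics for the named column, given one `Option<&Statistics>`
    /// per row group; mirrors the shape of `parquet_stats_to_arrow` above
    pub fn extract<'a>(
        &self,
        column: &str,
        stats: impl IntoIterator<Item = Option<&'a ParquetStatistics>>,
    ) -> Result<ArrowStatistics> {
        todo!()
    }
}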
@@ -63,6 +63,7 @@ use parquet::file::{metadata::ParquetMetaData, properties::WriterProperties};
use parquet::schema::types::ColumnDescriptor;
use tokio::task::JoinSet;

mod arrow_statistics;
mod metrics;
mod page_filter;
mod row_filter;