Support extracting Int8, Int16, Int32 statistics from Parquet Data Pages #10928

Closed
Tracked by #10922
alamb opened this issue Jun 15, 2024 · 2 comments · Fixed by #10931

alamb commented Jun 15, 2024

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs, which will make it significantly easier to use this information for pruning and other tasks.

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for Int8, Int16, and Int32.

    /// Extracts the min statistics from an iterator
    /// of parquet page [`Index`]'es to an [`ArrayRef`]
    pub(crate) fn min_page_statistics<'a, I>(
        data_type: Option<&DataType>,
        iterator: I,
    ) -> Result<ArrayRef>
    where
        I: Iterator<Item = (usize, &'a Index)>,
    {
        get_data_page_statistics!(Min, data_type, iterator)
    }

    /// Extracts the max statistics from an iterator
    /// of parquet page [`Index`]'es to an [`ArrayRef`]
    pub(crate) fn max_page_statistics<'a, I>(
        data_type: Option<&DataType>,
        iterator: I,
    ) -> Result<ArrayRef>
    where
        I: Iterator<Item = (usize, &'a Index)>,
    {
        get_data_page_statistics!(Max, data_type, iterator)
    }

Describe alternatives you've considered

  1. Update the test for the listed data types following the model of test_int64 (note this API will change slightly in Minor: Improve arrow_statistics tests #10927)
  2. Add any required implementation in
    make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
    make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);

    macro_rules! get_data_page_statistics {
        ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
            paste! {
                match $data_type {
                    Some(DataType::Int64) => Ok(Arc::new(Int64Array::from_iter(
                        [<$stat_type_prefix Int64DataPageStatsIterator>]::new($iterator).flatten(),
                    ))),
                    _ => unimplemented!()
                }
            }
        }
    }
    (follow the model of the row counts, )
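
The narrowing this request asks for could look roughly like the sketch below. Parquet has no 8-bit or 16-bit physical type, so Int8 and Int16 columns are stored as INT32 and their page statistics need to be converted down. Everything here is a simplified stand-in so the snippet compiles on its own: `PageStats`, this local `Index` enum, and the `make_page_stats_fn!` macro are hypothetical and are not the DataFusion or parquet crate definitions. Mapping an out-of-range value to `None`, and returning an empty result when the index has a different type, are assumptions as well.

```rust
// Stand-in types (hypothetical). The real code works with the parquet
// crate's page `Index` enum and builds Arrow arrays instead of `Vec`s.

/// Min/max for a single data page; `None` means the statistic is missing.
#[derive(Debug, Clone, Copy)]
struct PageStats {
    min: Option<i32>,
    max: Option<i32>,
}

/// Simplified page index for one column chunk.
#[allow(non_camel_case_types, dead_code)]
enum Index {
    NONE,
    INT32(Vec<PageStats>),
}

/// Generates a function that reads the `$field` statistic of every page in an
/// INT32 index and narrows it to `$ty`.
macro_rules! make_page_stats_fn {
    ($name:ident, $field:ident, $ty:ty) => {
        #[allow(dead_code)]
        fn $name(index: &Index) -> Vec<Option<$ty>> {
            match index {
                Index::INT32(pages) => pages
                    .iter()
                    // Assumption: a value that does not fit in `$ty` is
                    // treated as unknown rather than wrapped or truncated.
                    .map(|p| {
                        p.$field
                            .and_then(|v| <$ty as std::convert::TryFrom<i32>>::try_from(v).ok())
                    })
                    .collect(),
                // Assumption: an index of another type yields no statistics.
                _ => Vec::new(),
            }
        }
    };
}

make_page_stats_fn!(min_int8_page_stats, min, i8);
make_page_stats_fn!(max_int8_page_stats, max, i8);
make_page_stats_fn!(min_int16_page_stats, min, i16);
make_page_stats_fn!(max_int16_page_stats, max, i16);
make_page_stats_fn!(min_int32_page_stats, min, i32);
make_page_stats_fn!(max_int32_page_stats, max, i32);
```

In the real macro, each new type would presumably get its own `make_data_page_stats_iterator!` invocation and a matching `Some(DataType::...)` arm in `get_data_page_statistics!`, mirroring the existing Int64 arm.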

Additional context

No response

@alamb alamb added the enhancement New feature or request label Jun 15, 2024
@alamb alamb changed the title Support extracting Int8, Int16, Int32 statistics from Parquet Datapages Support extracting Int8, Int16, Int32 statistics from Parquet Data Pages Jun 15, 2024

alamb commented Jun 15, 2024

FYI @marvinlanhenke

@Weijun-H

take
