Support String/LargeString and Binary/LargeBinary Parquet Data Page Statistics #11026

Closed
Tracked by #10922
alamb opened this issue Jun 20, 2024 · 5 comments
Labels
enhancement New feature or request good first issue Good for newcomers

Comments

@alamb
Contributor

alamb commented Jun 20, 2024

Is your feature request related to a problem or challenge?

Part of #10922

We are adding APIs to efficiently convert the data stored in Parquet's "PageIndex" into ArrayRefs -- which will make it significantly easier to use this information for pruning and other tasks.
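As an illustration of the pruning use case, here is a minimal sketch that uses only the standard library. It assumes the per-page min and max values of a string column have already been extracted, and decides which data pages could contain a given value. The function name `pages_to_keep` and the slice-based inputs are hypothetical stand-ins; real code would operate on Arrow arrays rather than slices.

```rust
/// Decide, per data page, whether the page may contain `target`.
///
/// A page can be skipped only when both statistics are present and
/// `target` falls outside `[min, max]`. Pages with missing statistics
/// are conservatively kept.
fn pages_to_keep(mins: &[Option<&str>], maxes: &[Option<&str>], target: &str) -> Vec<bool> {
    mins.iter()
        .zip(maxes)
        .map(|(min, max)| match (min, max) {
            // `str` comparison is byte-wise, like Parquet's ordering for UTF-8 columns
            (Some(min), Some(max)) => *min <= target && target <= *max,
            // no statistics for this page: it cannot be pruned
            _ => true,
        })
        .collect()
}
```

With mins `["a", "m", null]` and maxes `["f", "z", null]`, a lookup for `"c"` keeps the first and third pages and skips the second.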

Describe the solution you'd like

Add support to StatisticsConverter::min_page_statistics and StatisticsConverter::max_page_statistics for the types named in the title (String/LargeString and Binary/LargeBinary)

/// Extracts the min statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn min_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Min, data_type, iterator)
}

/// Extracts the max statistics from an iterator
/// of parquet page [`Index`]'es to an [`ArrayRef`]
pub(crate) fn max_page_statistics<'a, I>(
    data_type: Option<&DataType>,
    iterator: I,
) -> Result<ArrayRef>
where
    I: Iterator<Item = (usize, &'a Index)>,
{
    get_data_page_statistics!(Max, data_type, iterator)
}

Describe alternatives you've considered

You can follow the model from @Weijun-H in #10931

  1. Update the test for the listed data types (I think it is test_binary) following the model of test_int_64

    async fn test_int_64() {
        // This creates a parquet file of 4 columns named "i8", "i16", "i32", "i64"
        let reader = TestReader {
            scenario: Scenario::Int,
            row_per_group: 5,
        }
        .build()
        .await;
        // since each row has only one data page, the statistics are the same
        Test {
            reader: &reader,
            // mins are [-5, -4, 0, 5]
            expected_min: Arc::new(Int64Array::from(vec![-5, -4, 0, 5])),
            // maxes are [-1, 0, 4, 9]
            expected_max: Arc::new(Int64Array::from(vec![-1, 0, 4, 9])),
            // nulls are [0, 0, 0, 0]
            expected_null_counts: UInt64Array::from(vec![0, 0, 0, 0]),
            // row counts are [5, 5, 5, 5]
            expected_row_counts: UInt64Array::from(vec![5, 5, 5, 5]),
            column_name: "i64",
            check: Check::Both,
        }
        .run();
    }

  2. Add any required implementation in

    make_data_page_stats_iterator!(MinInt64DataPageStatsIterator, min, Index::INT64, i64);
    make_data_page_stats_iterator!(MaxInt64DataPageStatsIterator, max, Index::INT64, i64);

    macro_rules! get_data_page_statistics {
        ($stat_type_prefix: ident, $data_type: ident, $iterator: ident) => {
            paste! {
                match $data_type {
                    Some(DataType::Int64) => Ok(Arc::new(Int64Array::from_iter(
                        [<$stat_type_prefix Int64DataPageStatsIterator>]::new($iterator).flatten(),
                    ))),
                    _ => unimplemented!()
                }
            }
        }
    }
    (follow the model of the row counts)
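As a rough sketch of what step 2 amounts to, the example below adds string and binary arms to a match on the data type. It is not the DataFusion implementation: `DataType`, `Index`, and `Column` are simplified, hypothetical stand-ins for the Arrow and parquet types so that the example compiles with the standard library alone. Two choices here are assumptions rather than anything stated in this issue: invalid UTF-8 becomes a null entry, and unsupported types return an error instead of hitting `unimplemented!()`.

```rust
/// Simplified stand-in for Arrow's `DataType` (only the variants discussed here).
#[derive(Debug, Clone, Copy, PartialEq)]
enum DataType { Int64, Utf8, LargeUtf8, Binary, LargeBinary }

/// Simplified stand-in for a column's page index: one optional min per data page.
enum Index {
    Int64(Vec<Option<i64>>),
    ByteArray(Vec<Option<Vec<u8>>>),
}

/// Simplified stand-in for an Arrow `ArrayRef`.
#[derive(Debug, PartialEq)]
enum Column {
    Int64(Vec<Option<i64>>),
    Utf8(Vec<Option<String>>),
    Binary(Vec<Option<Vec<u8>>>),
}

/// Dispatch on the requested data type, mirroring the `match` in the macro above.
fn min_page_statistics(data_type: Option<&DataType>, index: &Index) -> Result<Column, String> {
    match (data_type, index) {
        (Some(DataType::Int64), Index::Int64(mins)) => Ok(Column::Int64(mins.clone())),
        // String columns are stored as byte arrays; decode each page's min as UTF-8.
        // Assumed policy: bytes that are not valid UTF-8 become a null entry.
        (Some(DataType::Utf8 | DataType::LargeUtf8), Index::ByteArray(mins)) => Ok(Column::Utf8(
            mins.iter()
                .map(|m| m.as_deref().and_then(|b| std::str::from_utf8(b).ok()).map(String::from))
                .collect(),
        )),
        // Binary columns pass the raw bytes through unchanged.
        (Some(DataType::Binary | DataType::LargeBinary), Index::ByteArray(mins)) => {
            Ok(Column::Binary(mins.clone()))
        }
        // Anything else (including a type/index mismatch) is reported as an error.
        (other, _) => Err(format!("unsupported data type for page statistics: {other:?}")),
    }
}
```

In this sketch the large and regular variants share one arm because the stand-in `Column` does not distinguish offset widths; with real Arrow arrays, each variant would build its own array type.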

Additional context

No response

@PsiACE
Member

PsiACE commented Jun 21, 2024

take

@alamb
Contributor Author

alamb commented Jun 24, 2024

Hi @PsiACE -- I was wondering how you were faring with this ticket?

@PsiACE
Member

PsiACE commented Jun 25, 2024

> Hi @PsiACE -- I was wondering how you were faring with this ticket?

Today I will submit a PR.

@alamb
Contributor Author

alamb commented Jun 25, 2024

BTW @tshauck added a PR apache/arrow-rs#5949 upstream in parquet-rs (not yet available in datafusion) that might make this easier

@alamb
Contributor Author

alamb commented Jun 30, 2024

I filed #11184 to track fixed size binary, so let's close this one

@alamb alamb closed this as completed Jun 30, 2024
@alamb alamb changed the title Support String/LargeString and Binary/LargeBinary and FixedSizeBinary Parquet Data Page Statistics Support String/LargeString and Binary/LargeBinaryParquet Data Page Statistics Jun 30, 2024
@alamb alamb changed the title Support String/LargeString and Binary/LargeBinaryParquet Data Page Statistics Support String/LargeString and Binary/LargeBinary Parquet Data Page Statistics Jun 30, 2024