Improve performance of extracting statistics from parquet files #10626

Closed
Tracked by #10453
alamb opened this issue May 22, 2024 · 1 comment · Fixed by #10711
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented May 22, 2024

Is your feature request related to a problem or challenge?

Part of #10453

@Lordworms added a benchmark for extracting statistics from parquet files in #10610

As this code can be used to extract statistics from parquet files, we would like to make sure it is efficient, especially if we are going to extract statistics for many files at once.

The idea here is to improve the speed of the statistics extraction.

Describe the solution you'd like

Make this benchmark go faster:

cargo bench --bench parquet_statistic

Describe alternatives you've considered

I did some brief profiling:

[Screenshot of the profiling results, 2024-05-22]

I think the key would be to change these loops so they build the required Arrow arrays directly from primitive values rather than from ScalarValue:

pub(crate) fn min_statistics<'a, I: Iterator<Item = Option<&'a ParquetStatistics>>>(
    data_type: &DataType,
    iterator: I,
) -> Result<ArrayRef> {
    // Each statistic is turned into a ScalarValue before the array is built
    let scalars = iterator
        .map(|x| x.and_then(|s| get_statistic!(s, min, min_bytes, Some(data_type))));
    collect_scalars(data_type, scalars)
}
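
To make the difference concrete, here is a minimal sketch using only the standard library. `Scalar` is a hypothetical stand-in for a dynamically typed value such as ScalarValue, and a `Vec<Option<i32>>` stands in for an Arrow array; neither is the real DataFusion or Arrow type. The sketch only shows the shape of the two code paths and is not a measurement.

```rust
/// Hypothetical stand-in for a dynamically typed scalar such as `ScalarValue`.
#[derive(Debug, Clone, PartialEq)]
pub enum Scalar {
    Int32(Option<i32>),
    Utf8(Option<String>),
}

/// Indirect path: wrap each value in `Scalar`, then unwrap it again while
/// building the column, which costs a type check per value.
pub fn column_via_scalars(mins: &[Option<i32>]) -> Result<Vec<Option<i32>>, String> {
    let scalars = mins.iter().map(|m| Scalar::Int32(*m));
    scalars
        .map(|s| match s {
            Scalar::Int32(v) => Ok(v),
            other => Err(format!("expected Int32, got {other:?}")),
        })
        .collect()
}

/// Direct path: the primitive values go straight into the column.
pub fn column_direct(mins: &[Option<i32>]) -> Vec<Option<i32>> {
    mins.iter().copied().collect()
}
```

Both functions produce the same column; the direct one skips the wrap-and-match step for every value and cannot fail on a type mismatch.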

Additional context

No response

@alamb
Contributor Author

alamb commented May 23, 2024

I was thinking about this last night. What I would suggest is to make a function like this for each type of ValueStatistics (https://docs.rs/parquet/latest/parquet/file/statistics/struct.ValueStatistics.html):

/// Returns an iterator over min values stored in `ValueStatistics<i32>`
fn extract_i32_mins<'a>(stats: impl IntoIterator<Item = &'a Statistics>) -> impl Iterator<Item = Option<i32>> {
    ...
}

And then with those iterators, we can make the arrays directly

something like

let int32_mins = Int32Array::from_iter(extract_i32_mins(stats));

I think that would be both simple and fast.
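
A rough, self-contained sketch of that idea follows. The `Statistics` enum here is a simplified, hypothetical stand-in (the real parquet type has more variants and wraps `ValueStatistics<T>`), the function takes a slice rather than `impl IntoIterator` to keep lifetimes simple, and collecting into `Vec<Option<i32>>` stands in for `Int32Array::from_iter`. `i32_min_column` is a helper name made up for this example.

```rust
/// Simplified stand-in for parquet's `Statistics` (hypothetical: the real
/// enum has more variants and wraps `ValueStatistics<T>`).
#[derive(Debug, Clone, PartialEq)]
pub enum Statistics {
    Int32 { min: Option<i32>, max: Option<i32> },
    Int64 { min: Option<i64>, max: Option<i64> },
}

/// Yields the min of each `Int32` entry; any other variant yields `None`.
pub fn extract_i32_mins<'a>(
    stats: &'a [Statistics],
) -> impl Iterator<Item = Option<i32>> + 'a {
    stats.iter().map(|s| match s {
        Statistics::Int32 { min, .. } => *min,
        _ => None,
    })
}

/// Stand-in for `Int32Array::from_iter`: collect the mins into a nullable
/// column with one slot per statistics entry.
pub fn i32_min_column(stats: &[Statistics]) -> Vec<Option<i32>> {
    extract_i32_mins(stats).collect()
}
```

Missing mins and statistics of another physical type both become null slots, so the output always has one entry per input.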

This issue was closed.