Add a benchmark for extracting parquet data page statistics #10934

Closed
Tracked by #10922
alamb opened this issue Jun 16, 2024 · 1 comment · Fixed by #10950
Labels: enhancement (New feature or request)

alamb (Contributor) commented Jun 16, 2024

Is your feature request related to a problem or challenge?

As we work to make extracting statistics from parquet data pages more correct and performant in #10922, it would be good to have benchmark coverage.

Describe the solution you'd like

Add a benchmark for extracting page statistics

Describe alternatives you've considered

Add a benchmark (source) for extracting data page statistics

These are run via

cargo bench --bench parquet_statistic

In order to create a reasonable number of data page statistics, it would be good to configure the parquet writer to limit the size of data pages:

let props = WriterProperties::builder().build();

And use https://docs.rs/parquet/latest/parquet/file/properties/struct.WriterProperties.html#method.data_page_row_count_limit to set the limit to 1, then write the data row by row as we did in the test:

if let Some(data_page_row_count_limit) = self.data_page_row_count_limit {
    builder = builder.set_data_page_row_count_limit(data_page_row_count_limit);
}
let props = builder.build();

let batches = vec![self.make_int64_batches_with_null()];
let schema = batches[0].schema();
let mut writer =
    ArrowWriter::try_new(&mut output_file, schema, Some(props)).unwrap();

// if we have a data page limit, send the batches in one at a time to give
// the writer a chance to split them into multiple pages
if self.data_page_row_count_limit.is_some() {
    for batch in batches {
        for i in 0..batch.num_rows() {
            writer.write(&batch.slice(i, 1)).expect("writing batch");
        }
    }
} else {
    for batch in batches {
        writer.write(&batch).expect("writing batch");
    }
}
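Putting those pieces together, the writer setup for the benchmark might look something like the following sketch. This assumes the `parquet` and `arrow` crate APIs (`WriterProperties`, `ArrowWriter`); the function name `write_one_row_pages` and the idea of taking the batches as a parameter are illustrative, not part of the issue:

```rust
use std::fs::File;

use arrow::record_batch::RecordBatch;
use parquet::arrow::ArrowWriter;
use parquet::file::properties::WriterProperties;

/// Write `batches` to `path`, limiting each data page to a single row
/// so the file contains many pages (and thus many page-level statistics
/// for the benchmark to extract).
fn write_one_row_pages(path: &str, batches: Vec<RecordBatch>) {
    let props = WriterProperties::builder()
        // Force a new data page after every row
        .set_data_page_row_count_limit(1)
        .build();

    let file = File::create(path).expect("creating file");
    let schema = batches[0].schema();
    let mut writer =
        ArrowWriter::try_new(file, schema, Some(props)).expect("creating writer");

    // Write one row at a time so the row-count limit actually
    // splits the data across separate pages.
    for batch in batches {
        for i in 0..batch.num_rows() {
            writer.write(&batch.slice(i, 1)).expect("writing batch");
        }
    }
    writer.close().expect("closing writer");
}
```

The row-at-a-time loop matters because the writer only checks the page row-count limit at write boundaries; handing it one large batch in a single call may not produce one page per row.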

Additional context

The need for a benchmark also came up in #10932

@alamb alamb added the enhancement New feature or request label Jun 16, 2024
@alamb alamb changed the title Add a benchmark for extracting data page statistics Add a benchmark for extracting parquet data page statistics Jun 16, 2024
marvinlanhenke (Contributor) commented:

take

This issue was closed.