Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add bench for data page statistics parquet extraction #10950

Merged
merged 4 commits into from
Jul 3, 2024

Conversation

marvinlanhenke
Copy link
Contributor

Which issue does this PR close?

Closes #10934.

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Jun 17, 2024
@marvinlanhenke
Copy link
Contributor Author

marvinlanhenke commented Jun 17, 2024

@alamb
...since not all DataTypes are supported yet, some types panic in the bench.
Should we simply wait until the missing data types are supported and then merge this PR?

@alamb
Copy link
Contributor

alamb commented Jun 17, 2024

Should we simply wait until the missing data types are supported and then merge this PR?

I think that makes sense to me

The other thing we could do is somehow ignore those columns until the support has been added

Copy link
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me -- thank you @marvinlanhenke

datafusion/core/benches/parquet_statistic.rs Show resolved Hide resolved
datafusion/core/benches/parquet_statistic.rs Outdated Show resolved Hide resolved
datafusion/core/benches/parquet_statistic.rs Outdated Show resolved Hide resolved
@marvinlanhenke
Copy link
Contributor Author

The other thing we could do is somehow ignore those columns until the support has been added

yes, but this would require another PR to revert back those changes. I think adding the bench is not so urgent; so merging once other dataypes are ready might be the easiest thing to do.

I've also addressed your other comments; should be fine now. Thanks for the review.

@alamb alamb marked this pull request as draft June 17, 2024 20:24
@alamb
Copy link
Contributor

alamb commented Jun 17, 2024

Thanks again @marvinlanhenke -- this PR looks good to me. Per your suggestion, let's wait until the required type support has been added

Copy link
Contributor

@efredine efredine left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked this out and merged main and it runs without errors now, so it should be safe to merge.

@alamb alamb marked this pull request as ready for review July 3, 2024 21:06
@alamb
Copy link
Contributor

alamb commented Jul 3, 2024

Thanks for checking this out @efredine -- I merged up from main and updated this PR to get it moving. Once it passes CI I think it is good to go

Thanks again @marvinlanhenke

Like @efredine I verified the benchmark now works without error:

cargo bench --bench parquet_statistic
   Compiling bigdecimal v0.4.1
   Compiling datafusion v39.0.0 (/Users/andrewlamb/Software/datafusion2/datafusion/core)
    Finished `bench` profile [optimized] target(s) in 1m 34s
     Running benches/parquet_statistic.rs (target/release/deps/parquet_statistic-c6fce472dea5abe8)
Gnuplot not found, using plotters backend
Extract row group statistics for Int64/extract_statistics/Int64
                        time:   [594.98 ns 596.23 ns 597.66 ns]
Found 10 outliers among 100 measurements (10.00%)
  4 (4.00%) high mild
  6 (6.00%) high severe

Extract data page statistics for Int64/extract_statistics/Int64
                        time:   [6.5665 µs 6.5848 µs 6.6047 µs]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Extract row group statistics for UInt64/extract_statistics/UInt64
                        time:   [576.78 ns 578.78 ns 581.09 ns]
Found 3 outliers among 100 measurements (3.00%)
  2 (2.00%) high mild
  1 (1.00%) high severe

Extract data page statistics for UInt64/extract_statistics/UInt64
                        time:   [6.8120 µs 6.8332 µs 6.8559 µs]
Found 7 outliers among 100 measurements (7.00%)
  6 (6.00%) high mild
  1 (1.00%) high severe

Extract row group statistics for F64/extract_statistics/F64
                        time:   [588.96 ns 592.68 ns 596.62 ns]

Extract data page statistics for F64/extract_statistics/F64
                        time:   [7.5959 µs 7.6334 µs 7.6650 µs]

Extract row group statistics for String/extract_statistics/String
                        time:   [897.07 ns 901.70 ns 907.19 ns]
Found 5 outliers among 100 measurements (5.00%)
  3 (3.00%) high mild
  2 (2.00%) high severe

Extract data page statistics for String/extract_statistics/String
                        time:   [25.507 µs 25.555 µs 25.609 µs]
Found 2 outliers among 100 measurements (2.00%)
  1 (1.00%) high mild
  1 (1.00%) high severe

Benchmarking Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...: Collecting 100 samples in estimated 5.00
Extract row group statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...
                        time:   [947.78 ns 954.30 ns 960.82 ns]
Found 8 outliers among 100 measurements (8.00%)
  1 (1.00%) low severe
  7 (7.00%) low mild

Benchmarking Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...: Collecting 100 samples in estimated 5.04
Extract data page statistics for Dictionary(Int32, String)/extract_statistics/Dictionary(Int32, Stri...
                        time:   [25.602 µs 25.812 µs 26.109 µs]
Found 8 outliers among 100 measurements (8.00%)
  7 (7.00%) high mild
  1 (1.00%) high severe

@alamb alamb merged commit b4afa18 into apache:main Jul 3, 2024
23 checks passed
comphead pushed a commit to comphead/arrow-datafusion that referenced this pull request Jul 8, 2024
* feat: add data page bench

* chore: add comment

* fix: row_groups + shorten row_group_indices

---------

Co-authored-by: Andrew Lamb <[email protected]>
findepi pushed a commit to findepi/datafusion that referenced this pull request Jul 16, 2024
* feat: add data page bench

* chore: add comment

* fix: row_groups + shorten row_group_indices

---------

Co-authored-by: Andrew Lamb <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a benchmark for extracting parquet data page statistics
3 participants