Looking at some of the Parquet read benchmarks we have (https://conbench.ursa.dev/c-benchmarks/file-read), there are two small regressions visible over the last few months.
For example, zooming in on that period for nyctaxi_2010 with snappy compression (https://conbench.ursa.dev/benchmark-results/0653775157e07b6980005bce6e59a2ad/):
And for fanniemae (https://conbench.ursa.dev/benchmark-results/0653774cc429733d8000b2db91fa63fa/); here only the first bump is noticeable:
There are two small bumps, which I think are caused by the following two PRs (listed in the order the bumps appear):
- GH-32863: [C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer #14341
  This PR added a new encoding option, so my expectation is that it shouldn't slow down the default write path (the new encoding isn't used by default)? (cc @rok) See the write sketch after this list.
- GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files #37854
  While this change is mostly useful for cloud filesystems and could be detrimental for fast local disk, the general assumption was that the effect on local disk would not be significant, so we could simply enable it for all filesystems. But based on this benchmark, the impact might actually not be fully insignificant. Should we consider making the default for `pre_buffer` depend on the type of filesystem? See the read sketch after this list.
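
For reference, a minimal write sketch (file names are just placeholders) illustrating that the DELTA_BYTE_ARRAY encoder is opt-in: the default write path keeps using dictionary encoding and should never exercise the new encoder.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["aaa", "aab", "aac"]})

# Default write path: dictionary encoding is used and the DELTA_BYTE_ARRAY
# encoder is never exercised.
pq.write_table(table, "default.parquet")

# The new encoder only kicks in when requested explicitly; dictionary
# encoding has to be disabled for the column for this to take effect.
pq.write_table(
    table,
    "delta.parquet",
    use_dictionary=False,
    column_encoding={"s": "DELTA_BYTE_ARRAY"},
)
```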
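And a read sketch of what a filesystem-dependent `pre_buffer` setting would roughly amount to, set explicitly per call (paths, bucket, and region below are hypothetical):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Fast local disk: coalescing reads via pre_buffer may not help and can add
# overhead, so it can be switched off explicitly.
local_table = pq.read_table("data/nyctaxi_2010.parquet", pre_buffer=False)

# High-latency filesystem such as S3: pre_buffer coalesces column-chunk reads
# into fewer, larger requests, which is where it pays off.
s3 = fs.S3FileSystem(region="us-east-1")
remote_table = pq.read_table(
    "some-bucket/nyctaxi_2010.parquet", filesystem=s3, pre_buffer=True
)

# The same knob at the dataset level, via the Parquet fragment scan options.
fmt = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=False)
)
dataset = ds.dataset("data/taxi/", format=fmt)
```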