Looking at some of the Parquet read benchmarks we have (https://conbench.ursa.dev/c-benchmarks/file-read), there are two small regressions visible over the last few months.
For example, zooming in on that period for nyctaxi_2010 with snappy compression (https://conbench.ursa.dev/benchmark-results/0653775157e07b6980005bce6e59a2ad/):
And for fanniemae (https://conbench.ursa.dev/benchmark-results/0653774cc429733d8000b2db91fa63fa/); here only the first bump is noticeable:
There are two small bumps, which I think are caused by the following two PRs (listed in the order the bumps appear):
- GH-32863: [C++][Parquet] Add DELTA_BYTE_ARRAY encoder to Parquet writer #14341
  This PR added a new encoding option, so my expectation is that it shouldn't slow down the default write path (the new encoding isn't used by default)? (cc @rok) See the write sketch after this list.
- GH-36765: [Python][Dataset] Change default of pre_buffer to True for reading Parquet files #37854
  While this change is mostly useful for cloud filesystems and could be detrimental for fast local disk, the general assumption was that the effect on local disk would not be significant, so we could simply enable it for all filesystems. But based on this benchmark, the impact might actually not be fully insignificant. Should we consider making the default for `pre_buffer` depend on the type of filesystem? See the read sketch after this list.
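
For reference, a minimal write sketch (file names are just placeholders) illustrating that the DELTA_BYTE_ARRAY encoder is opt-in: the default write path keeps using dictionary encoding and should never exercise the new encoder.

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"s": ["aaa", "aab", "aac"]})

# Default write path: dictionary encoding is used and the DELTA_BYTE_ARRAY
# encoder is never exercised.
pq.write_table(table, "default.parquet")

# The new encoder only kicks in when requested explicitly; dictionary
# encoding has to be disabled for the column for this to take effect.
pq.write_table(
    table,
    "delta.parquet",
    use_dictionary=False,
    column_encoding={"s": "DELTA_BYTE_ARRAY"},
)
```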
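And a read sketch of what a filesystem-dependent `pre_buffer` setting would roughly amount to, set explicitly per call (paths, bucket, and region below are hypothetical):

```python
import pyarrow.fs as fs
import pyarrow.parquet as pq
import pyarrow.dataset as ds

# Fast local disk: coalescing reads via pre_buffer may not help and can add
# overhead, so it can be switched off explicitly.
local_table = pq.read_table("data/nyctaxi_2010.parquet", pre_buffer=False)

# High-latency filesystem such as S3: pre_buffer coalesces column-chunk reads
# into fewer, larger requests, which is where it pays off.
s3 = fs.S3FileSystem(region="us-east-1")
remote_table = pq.read_table(
    "some-bucket/nyctaxi_2010.parquet", filesystem=s3, pre_buffer=True
)

# The same knob at the dataset level, via the Parquet fragment scan options.
fmt = ds.ParquetFileFormat(
    default_fragment_scan_options=ds.ParquetFragmentScanOptions(pre_buffer=False)
)
dataset = ds.dataset("data/taxi/", format=fmt)
```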