Enable datafusion.execution.parquet.schema_force_string_view by default #11682

Open
Tracked by #11752 ...
alamb opened this issue Jul 27, 2024 · 3 comments · May be fixed by #12092
Labels
enhancement New feature or request

Comments

@alamb
Contributor

alamb commented Jul 27, 2024

Part of #11752

Is your feature request related to a problem or challenge?

As part of #10918, @XiangpengHao has threaded the use of StringView through parquet, arrow-rs, and then into DataFusion.

When the datafusion.execution.parquet.schema_force_string_view option is enabled, the DataFusion Parquet reader reads all Utf8 columns as StringView instead, which results in significantly faster performance (details TBD, but we will write them down in #11603).

However, as initially merged in #11667, this setting is off by default.

This ticket tracks what it would take to turn the setting on by default.

Describe the solution you'd like

Change the default value of datafusion.execution.parquet.schema_force_string_view to true.

Describe alternatives you've considered

We should enable the flag by default and then run benchmarks to ensure that performance does not regress significantly.

Additional context

No response

@XiangpengHao
Contributor

XiangpengHao commented Aug 7, 2024

Want to share my numbers here:

┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃   Baseline ┃ StringView┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.40ms │    0.41ms │     no change │
│ QQuery 1     │    46.60ms │   42.69ms │ +1.09x faster │
│ QQuery 2     │    76.16ms │   78.07ms │     no change │
│ QQuery 3     │    87.25ms │   85.44ms │     no change │
│ QQuery 4     │   774.81ms │  770.28ms │     no change │
│ QQuery 5     │   888.38ms │  916.04ms │     no change │
│ QQuery 6     │    41.07ms │   40.60ms │     no change │
│ QQuery 7     │    44.55ms │   44.30ms │     no change │
│ QQuery 8     │  1229.92ms │ 1220.85ms │     no change │
│ QQuery 9     │   891.96ms │  873.84ms │     no change │
│ QQuery 10    │   490.90ms │  220.19ms │ +2.23x faster │
│ QQuery 11    │   513.23ms │  241.88ms │ +2.12x faster │
│ QQuery 12    │  1130.10ms │  950.93ms │ +1.19x faster │
│ QQuery 13    │  2371.24ms │ 2204.60ms │ +1.08x faster │
│ QQuery 14    │  1499.27ms │ 1377.36ms │ +1.09x faster │
│ QQuery 15    │   888.89ms │  878.98ms │     no change │
│ QQuery 16    │  2602.96ms │ 2638.78ms │     no change │
│ QQuery 17    │  2515.57ms │ 2580.58ms │     no change │
│ QQuery 18    │  5577.86ms │ 5814.67ms │     no change │
│ QQuery 19    │    76.79ms │   77.22ms │     no change │
│ QQuery 20    │  1133.65ms │  850.76ms │ +1.33x faster │
│ QQuery 21    │  1532.25ms │ 1049.88ms │ +1.46x faster │
│ QQuery 22    │  3490.42ms │ 2880.90ms │ +1.21x faster │
│ QQuery 23    │ 10056.49ms │ 9152.26ms │ +1.10x faster │
│ QQuery 24    │   649.17ms │  494.05ms │ +1.31x faster │
│ QQuery 25    │   567.48ms │  449.79ms │ +1.26x faster │
│ QQuery 26    │   690.33ms │  555.21ms │ +1.24x faster │
│ QQuery 27    │  1771.53ms │ 1526.66ms │ +1.16x faster │
│ QQuery 28    │  9406.74ms │ 8802.03ms │ +1.07x faster │
│ QQuery 29    │   353.43ms │  362.44ms │     no change │
│ QQuery 30    │  1186.41ms │ 1067.44ms │ +1.11x faster │
│ QQuery 31    │  1617.60ms │ 1515.93ms │ +1.07x faster │
│ QQuery 32    │  7992.19ms │ 7823.69ms │     no change │
│ QQuery 33    │  4809.44ms │ 3374.01ms │ +1.43x faster │
│ QQuery 34    │  4779.28ms │ 3405.84ms │ +1.40x faster │
│ QQuery 35    │  1504.81ms │ 1505.37ms │     no change │
│ QQuery 36    │   150.45ms │  145.05ms │     no change │
│ QQuery 37    │   112.40ms │   97.54ms │ +1.15x faster │
│ QQuery 38    │   101.89ms │   95.53ms │ +1.07x faster │
│ QQuery 39    │   515.90ms │  455.99ms │ +1.13x faster │
│ QQuery 40    │    52.28ms │   49.23ms │ +1.06x faster │
│ QQuery 41    │    46.79ms │   46.00ms │     no change │
│ QQuery 42    │    52.62ms │   53.40ms │     no change │
└──────────────┴────────────┴───────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary         ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (Baseline)     │ 74321.46ms │
│ Total Time (StringView)   │ 66816.69ms │
│ Average Time (Baseline)   │  1728.41ms │
│ Average Time (StringView) │  1553.88ms │
│ Queries Faster            │         23 │
│ Queries Slower            │          0 │
│ Queries with No Change    │         20 │
└───────────────────────────┴────────────┘
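For anyone who wants to sanity-check the Change column, here is a minimal sketch of the arithmetic. It assumes a speedup is reported as baseline time divided by StringView time, and that ratios within 5% count as "no change". That threshold is inferred from the rows above, not taken from the benchmark tooling, and the `Change` enum and `classify` function are hypothetical names used only for illustration.

```rust
/// Classification of one query's timing change, mirroring the table labels.
#[derive(Debug, PartialEq)]
enum Change {
    Faster(f64),
    Slower(f64),
    NoChange,
}

/// Assumed noise threshold (5%), inferred from the table rather than from
/// the benchmark scripts themselves.
const NOISE_THRESHOLD: f64 = 0.05;

/// Compare a baseline timing against a candidate timing (both in ms).
fn classify(baseline_ms: f64, candidate_ms: f64) -> Change {
    if baseline_ms >= candidate_ms {
        let ratio = baseline_ms / candidate_ms;
        if ratio - 1.0 > NOISE_THRESHOLD {
            Change::Faster(ratio)
        } else {
            Change::NoChange
        }
    } else {
        let ratio = candidate_ms / baseline_ms;
        if ratio - 1.0 > NOISE_THRESHOLD {
            Change::Slower(ratio)
        } else {
            Change::NoChange
        }
    }
}

fn main() {
    // (query, baseline ms, StringView ms), copied from the table above.
    let rows = [
        (1, 46.60, 42.69),
        (5, 888.38, 916.04),
        (10, 490.90, 220.19),
        (18, 5577.86, 5814.67),
    ];
    for (query, baseline, string_view) in rows {
        match classify(baseline, string_view) {
            Change::Faster(r) => println!("Query {}: +{:.2}x faster", query, r),
            Change::Slower(r) => println!("Query {}: {:.2}x slower", query, r),
            Change::NoChange => println!("Query {}: no change", query),
        }
    }
    // Overall speedup from the two totals in the summary.
    println!("Total: {:.2}x faster", 74321.46 / 66816.69);
}
```

Under these assumptions the four sample rows come out the same way as in the table, and the totals work out to roughly a 1.11x overall speedup.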

@XiangpengHao
Contributor

take

@alamb
Contributor Author

alamb commented Aug 7, 2024

> Want to share my numbers here:

[image]
