Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Enable reading StringViewArray by default from Parquet #12092

Draft
wants to merge 7 commits into
base: main
Choose a base branch
from

Conversation

alamb
Copy link
Contributor

@alamb alamb commented Aug 20, 2024

Draft as it builds on:

  • Debug tests
  • File issues found
  • File / fix slowdown with querying hits_partitioned

Which issue does this PR close?

Closes #11682

Rationale for this change

Reading data as StringViewArray is significantly faster than StringArray. We have been testing this behind a feature flag but it is now stable enough to enable by default.

See blog post #11603:

Benchmark Results

--------------------
Benchmark clickbench_1.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_enable_string_view_by_def… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     0.66ms │                           0.67ms │     no change │
│ QQuery 1     │    72.07ms │                          70.10ms │     no change │
│ QQuery 2     │   119.92ms │                         124.84ms │     no change │
│ QQuery 3     │   128.29ms │                         134.03ms │     no change │
│ QQuery 4     │   964.03ms │                         974.14ms │     no change │
│ QQuery 5     │  1088.23ms │                        1060.44ms │     no change │
│ QQuery 6     │    66.53ms │                          64.39ms │     no change │
│ QQuery 7     │    83.84ms │                          80.44ms │     no change │
│ QQuery 8     │  1484.01ms │                        1466.81ms │     no change │
│ QQuery 9     │  1366.17ms │                        1378.33ms │     no change │
│ QQuery 10    │   465.55ms │                         362.07ms │ +1.29x faster │
│ QQuery 11    │   514.19ms │                         398.94ms │ +1.29x faster │
│ QQuery 12    │  1195.47ms │                        1098.46ms │ +1.09x faster │
│ QQuery 13    │  2183.91ms │                        2083.70ms │     no change │
│ QQuery 14    │  1641.36ms │                        1551.03ms │ +1.06x faster │
│ QQuery 15    │  1126.56ms │                        1132.80ms │     no change │
│ QQuery 16    │  3050.64ms │                        3088.31ms │     no change │
│ QQuery 17    │  2746.65ms │                        2791.27ms │     no change │
│ QQuery 18    │  5845.46ms │                        6064.71ms │     no change │
│ QQuery 19    │   121.17ms │                         123.79ms │     no change │
│ QQuery 20    │  1628.64ms │                        1387.74ms │ +1.17x faster │
│ QQuery 21    │  2048.94ms │                        1738.74ms │ +1.18x faster │
│ QQuery 22    │  4994.39ms │                        4252.83ms │ +1.17x faster │
│ QQuery 23    │ 11893.20ms │                        9996.52ms │ +1.19x faster │
│ QQuery 24    │   775.43ms │                         669.76ms │ +1.16x faster │
│ QQuery 25    │   684.66ms │                         601.04ms │ +1.14x faster │
│ QQuery 26    │   873.53ms │                         738.14ms │ +1.18x faster │
│ QQuery 27    │  2576.02ms │                        2098.18ms │ +1.23x faster │
│ QQuery 28    │ 15637.06ms │                       14166.05ms │ +1.10x faster │
│ QQuery 29    │   565.58ms │                         551.64ms │     no change │
│ QQuery 30    │  1305.23ms │                        1209.25ms │ +1.08x faster │
│ QQuery 31    │  1389.60ms │                        1265.74ms │ +1.10x faster │
│ QQuery 32    │  4741.08ms │                        4853.34ms │     no change │
│ QQuery 33    │  5306.91ms │                        4193.22ms │ +1.27x faster │
│ QQuery 34    │  5167.11ms │                        4222.33ms │ +1.22x faster │
│ QQuery 35    │  1863.80ms │                        1859.61ms │     no change │
│ QQuery 36    │   317.80ms │                         291.40ms │ +1.09x faster │
│ QQuery 37    │   216.31ms │                         184.00ms │ +1.18x faster │
│ QQuery 38    │   200.83ms │                         182.21ms │ +1.10x faster │
│ QQuery 39    │  1038.73ms │                         851.30ms │ +1.22x faster │
│ QQuery 40    │    86.40ms │                          88.35ms │     no change │
│ QQuery 41    │    80.99ms │                          78.34ms │     no change │
│ QQuery 42    │    93.00ms │                          97.02ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                             │ 87749.93ms │
│ Total Time (alamb_enable_string_view_by_default)   │ 79626.03ms │
│ Average Time (main_base)                           │  2040.70ms │
│ Average Time (alamb_enable_string_view_by_default) │  1851.77ms │
│ Queries Faster                                     │         21 │
│ Queries Slower                                     │          0 │
│ Queries with No Change                             │         22 │
└────────────────────────────────────────────────────┴────────────┘
--------------------
Benchmark clickbench_partitioned.json
--------------------
┏━━━━━━━━━━━━━━┳━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Query        ┃  main_base ┃ alamb_enable_string_view_by_def… ┃        Change ┃
┡━━━━━━━━━━━━━━╇━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ QQuery 0     │     2.37ms │                           1.97ms │ +1.21x faster │
│ QQuery 1     │    38.68ms │                          36.78ms │     no change │
│ QQuery 2     │    95.07ms │                          94.11ms │     no change │
│ QQuery 3     │    98.75ms │                         100.86ms │     no change │
│ QQuery 4     │   927.07ms │                         931.67ms │     no change │
│ QQuery 5     │   964.71ms │                        1043.29ms │  1.08x slower │
│ QQuery 6     │    36.40ms │                          33.21ms │ +1.10x faster │
│ QQuery 7     │    42.34ms │                          40.83ms │     no change │
│ QQuery 8     │  1440.57ms │                        1434.71ms │     no change │
│ QQuery 9     │  1343.13ms │                        1345.13ms │     no change │
│ QQuery 10    │   357.26ms │                         432.73ms │  1.21x slower │
│ QQuery 11    │   404.40ms │                         478.21ms │  1.18x slower │
│ QQuery 12    │  1094.88ms │                        1057.96ms │     no change │
│ QQuery 13    │  1829.88ms │                        2013.06ms │  1.10x slower │
│ QQuery 14    │  1519.03ms │                        1516.29ms │     no change │
│ QQuery 15    │  1085.44ms │                        1068.30ms │     no change │
│ QQuery 16    │  2905.66ms │                        3086.20ms │  1.06x slower │
│ QQuery 17    │  2744.69ms │                        2810.55ms │     no change │
│ QQuery 18    │  5845.13ms │                        6043.80ms │     no change │
│ QQuery 19    │    95.15ms │                          89.50ms │ +1.06x faster │
│ QQuery 20    │  1751.84ms │                        2165.42ms │  1.24x slower │
│ QQuery 21    │  2019.88ms │                        2379.20ms │  1.18x slower │
│ QQuery 22    │  5167.42ms │                        5642.45ms │  1.09x slower │
│ QQuery 23    │ 10400.41ms │                       14124.07ms │  1.36x slower │
│ QQuery 24    │   584.30ms │                         656.67ms │  1.12x slower │
│ QQuery 25    │   498.02ms │                         519.50ms │     no change │
│ QQuery 26    │   651.84ms │                         719.41ms │  1.10x slower │
│ QQuery 27    │  2525.75ms │                        2694.77ms │  1.07x slower │
│ QQuery 28    │ 14905.83ms │                       26662.02ms │  1.79x slower │
│ QQuery 29    │   529.85ms │                         524.36ms │     no change │
│ QQuery 30    │  1086.70ms │                        1212.89ms │  1.12x slower │
│ QQuery 31    │  1163.36ms │                        1291.25ms │  1.11x slower │
│ QQuery 32    │  4764.64ms │                        4782.92ms │     no change │
│ QQuery 33    │  5149.48ms │                        4480.69ms │ +1.15x faster │
│ QQuery 34    │  5162.88ms │                        4481.70ms │ +1.15x faster │
│ QQuery 35    │  1811.72ms │                        1812.12ms │     no change │
│ QQuery 36    │   282.31ms │                         268.12ms │ +1.05x faster │
│ QQuery 37    │   121.31ms │                         104.96ms │ +1.16x faster │
│ QQuery 38    │   147.57ms │                         176.74ms │  1.20x slower │
│ QQuery 39    │   975.45ms │                         848.76ms │ +1.15x faster │
│ QQuery 40    │    61.35ms │                          57.33ms │ +1.07x faster │
│ QQuery 41    │    50.15ms │                          47.45ms │ +1.06x faster │
│ QQuery 42    │    63.67ms │                          62.35ms │     no change │
└──────────────┴────────────┴──────────────────────────────────┴───────────────┘
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━┓
┃ Benchmark Summary                                  ┃            ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━┩
│ Total Time (main_base)                             │ 82746.33ms │
│ Total Time (alamb_enable_string_view_by_default)   │ 99374.32ms │
│ Average Time (main_base)                           │  1924.33ms │
│ Average Time (alamb_enable_string_view_by_default) │  2311.03ms │
│ Queries Faster                                     │         10 │
│ Queries Slower                                     │         16 │
│ Queries with No Change                             │         17 │
└────────────────────────────────────────────────────┴────────────┘

What changes are included in this PR?

  1. Set schema_force_view_types to true

Are these changes tested?

Yes, by CI tests

Are there any user-facing changes?

  1. Faster reading of data from Parquet files

If you see an error related to StringView use, you can disable this feature using the schema_force_string_view option

> set datafusion.execution.parquet.schema_force_view_types = false;
0 row(s) fetched.
Elapsed 0.000 seconds.

Context

@XiangpengHao debugged these tests previously using #11862

@alamb alamb changed the title Enable reading string view by default from Parquet Enable reading StringViewArray by default from Parquet Aug 20, 2024
@github-actions github-actions bot added sql SQL Planner physical-expr Physical Expressions core Core DataFusion crate sqllogictest SQL Logic Tests (.slt) substrait common Related to common crate proto Related to proto crate functions labels Aug 20, 2024
Copy link
Contributor

@comphead comphead left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Arrow-rs skips now the interval parts with 0? so interval tests are broken now

@alamb
Copy link
Contributor Author

alamb commented Aug 22, 2024

Arrow-rs skips now the interval parts with 0? so interval tests are broken now

That is due to the arrow upgrade for sure -- you can see the changes needed here (in their own PR): #12032

@alamb alamb force-pushed the alamb/enable_string_view_by_default branch 2 times, most recently from bc5d7f7 to 27f7d1e Compare August 22, 2024 19:43
@alamb
Copy link
Contributor Author

alamb commented Aug 22, 2024

I think the last remaining failure is due to #12123

@alamb alamb force-pushed the alamb/enable_string_view_by_default branch from 27f7d1e to ce5470d Compare September 12, 2024 19:44
@github-actions github-actions bot removed sql SQL Planner physical-expr Physical Expressions core Core DataFusion crate proto Related to proto crate functions labels Sep 12, 2024
@github-actions github-actions bot added the core Core DataFusion crate label Sep 12, 2024
@@ -264,8 +264,12 @@ impl PruningStatistics for BloomFilterStatistics {
.iter()
.map(|value| {
match value {
ScalarValue::Utf8(Some(v)) => sbbf.check(&v.as_str()),
ScalarValue::Binary(Some(v)) => sbbf.check(v),
ScalarValue::Utf8(Some(v)) | ScalarValue::Utf8View(Some(v)) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this needs its own ticket / test I think. Will do it shortly

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #12499

@github-actions github-actions bot added the documentation Improvements or additions to documentation label Sep 12, 2024
@alamb
Copy link
Contributor Author

alamb commented Sep 12, 2024

Still working on benchmarks, but the code is looking good

@alamb alamb force-pushed the alamb/enable_string_view_by_default branch from 2275216 to e8f7384 Compare September 13, 2024 14:17
@alamb

This comment was marked as outdated.

@alamb
Copy link
Contributor Author

alamb commented Sep 13, 2024

I ran benchmarks and posted results, -- clickbench looks great 🚀 . However, interesting clickbench_partitioned got slower. I am pretty sure I know why (because the schema is resolved to BinaryView rather than Binary). Will review that later

@alamb
Copy link
Contributor Author

alamb commented Sep 16, 2024

I figured out what is going on (different than I thought). I believe StringView::slice() is quite a bit slower than StringArray::slice due to the fact it has a buffers field

The query is

SELECT REGEXP_REPLACE("Referer", '^https?://(?:www\\.)?([^/]+)/.*$', '\\1') AS k, AVG(length("Referer")) AS l, COUNT(*) AS c, MIN("Referer")
FROM hits_partitioned
WHERE "Referer" <> '' GROUP BY k HAVING COUNT(*) > 100000 ORDER BY l DESC LIMIT 25;

Some flamegraphs:
flamegraph-main
flamegraph-string-view

Screenshot 2024-09-16 at 4 44 50 PM

I will think about the best way to proceed here

@XiangpengHao
Copy link
Contributor

I plan to take a closer look at this on Saturday if no one has beat me to it, cc @alamb

@@ -1034,14 +1034,22 @@ fn binary_to_string_coercion(
) -> Option<DataType> {
use arrow::datatypes::DataType::*;
match (lhs_type, rhs_type) {
// Note: added rules to coerce from BinaryView --> Utf8View
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #12500

@@ -264,8 +264,12 @@ impl PruningStatistics for BloomFilterStatistics {
.iter()
.map(|value| {
match value {
ScalarValue::Utf8(Some(v)) => sbbf.check(&v.as_str()),
ScalarValue::Binary(Some(v)) => sbbf.check(v),
ScalarValue::Utf8(Some(v)) | ScalarValue::Utf8View(Some(v)) => {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Filed #12499

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

I plan to take a closer look at this on Saturday if no one has beat me to it, cc @alamb

Thanks @XiangpengHao -- I

@alamb
Copy link
Contributor Author

alamb commented Sep 17, 2024

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
common Related to common crate core Core DataFusion crate documentation Improvements or additions to documentation sqllogictest SQL Logic Tests (.slt)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Enable datafusion.execution.parquet.schema_force_string_view by default
3 participants