
Specialize ASCII case for substr() #12444

Merged: 2 commits into apache:main on Sep 17, 2024

Conversation

@2010YOUY01 2010YOUY01 commented Sep 12, 2024

Which issue does this PR close?

Part of #12306

Rationale for this change

See the issue for the background.

The function arguments `start` and `count` in `substr(s, start, count)` are character-based: if the string is UTF-8 encoded, the function has to decode `start + count` characters to find the byte boundaries. However, if the given strings are all ASCII, the function can compute the byte indices in constant time.
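To make the two paths concrete, here is a minimal sketch (hypothetical helper functions, not DataFusion's actual implementation) of character-based decoding versus the ASCII byte-index shortcut, with `start` 1-based and character-based as in SQL:

```rust
/// General path: decode UTF-8 to locate character boundaries.
/// Costs O(start + count) decoding work per row.
fn substr_chars(s: &str, start: usize, count: usize) -> &str {
    if count == 0 {
        return "";
    }
    let mut offsets = s
        .char_indices()
        .map(|(i, _)| i)
        .skip(start.saturating_sub(1));
    let begin = match offsets.next() {
        Some(i) => i,
        None => return "", // start is past the end of the string
    };
    // Byte offset of the character `count` positions after `begin`,
    // or the end of the string if the substring runs off the end.
    let end = offsets.nth(count - 1).unwrap_or(s.len());
    &s[begin..end]
}

/// ASCII fast path: every character is one byte, so character
/// indices ARE byte indices. O(1) per row.
fn substr_ascii(s: &str, start: usize, count: usize) -> &str {
    debug_assert!(s.is_ascii());
    let begin = start.saturating_sub(1).min(s.len());
    let end = begin.saturating_add(count).min(s.len());
    &s[begin..end]
}

fn main() {
    assert_eq!(substr_chars("héllo", 2, 3), "éll"); // UTF-8: must decode
    assert_eq!(substr_ascii("hello", 2, 3), "ell"); // ASCII: pure byte math
    // The two paths agree on ASCII input.
    assert_eq!(substr_chars("hello", 2, 3), substr_ascii("hello", 2, 3));
    println!("ok");
}
```

The fast path only slices; it never inspects the bytes, which is why it must be guarded by an up-front ASCII check on the whole array.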

One tricky case for this function is taking a small prefix of a long string (e.g. `substr(long_str_with_1k_chars, 1, 20)`). In this case the ASCII-validation overhead can be greater than decoding a small number of characters. As a result, in some micro-benchmarks (taking the first 6 bytes of a 128-byte string), this PR introduced a ~5% slowdown.
To avoid a big regression for similar patterns, the implementation checks the approximate string length and skips ASCII validation if the strings are too long (see the code comment for more detail).
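A sketch of that guard might look like the following (the constant names, threshold value, and sampling scheme here are made up for illustration; they are not the merged code): estimate the average string length from a small sample, and only pay for whole-array ASCII validation when strings are short enough that byte indexing is likely to win back the validation cost.

```rust
const ASCII_FAST_PATH_MAX_AVG_LEN: f64 = 256.0; // assumed threshold
const N_SAMPLE: usize = 10;

fn should_use_ascii_fast_path(strings: &[&str]) -> bool {
    if strings.is_empty() {
        return false;
    }
    // Approximate the average length from the first few strings only,
    // so the heuristic itself stays cheap.
    let n = strings.len().min(N_SAMPLE);
    let avg_len =
        strings[..n].iter().map(|s| s.len()).sum::<usize>() as f64 / n as f64;
    if avg_len > ASCII_FAST_PATH_MAX_AVG_LEN {
        // Long strings: validating them may cost more than decoding
        // the few characters the caller actually asked for.
        return false;
    }
    strings.iter().all(|s| s.is_ascii())
}

fn main() {
    assert!(should_use_ascii_fast_path(&["short", "ascii", "rows"]));
    assert!(!should_use_ascii_fast_path(&["héllo"])); // non-ASCII input
    let long = "x".repeat(1000);
    assert!(!should_use_ascii_fast_path(&[&long])); // too long: skip check
    println!("ok");
}
```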

The micro-benchmark result:
substr_baseline - no optimization
substr_before - StringView optimization (take the substring by only modifying views, avoiding copying the whole string), introduced by #12044
substr_after - StringView optimization + ASCII fast path (this PR)
[benchmark results chart]

What changes are included in this PR?

If the input string is ASCII-only, use the function arguments directly as byte indices to compute the substring.

Are these changes tested?

Existing sqllogictests provide enough coverage of ASCII, non-ASCII, and mixed test cases for the `substr()` function.

Are there any user-facing changes?

@goldmedal (Contributor) left a comment:

Thanks @2010YOUY01, this PR makes sense to me. 👍

Comment on lines 224 to 227
// A common pattern to call `substr()` is taking a small prefix of a long
// string, such as `substr(long_str_with_1k_chars, 1, 32)`.
// In such case the overhead of ASCII-validation may not be worth it, so
// skip the validation for long strings for now.
@findepi (Member) commented:
Why not check only the requested string prefix for being ASCII?
Could a string_view_array.is_ascii variant validate string prefixes of a given length while still being vectorized?
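Hypothetically, prefix-only validation could look like the sketch below (`prefixes_are_ascii` is a made-up helper, not an arrow-rs API). Checking only the first `prefix_len` bytes of each string touches far less data than scanning whole strings, while remaining a tight loop over byte slices that the compiler can vectorize:

```rust
fn prefixes_are_ascii(strings: &[&str], prefix_len: usize) -> bool {
    strings.iter().all(|s| {
        // Clamp to the string length; slicing bytes never splits a
        // UTF-8 sequence because we only *inspect* bytes here.
        let end = prefix_len.min(s.len());
        s.as_bytes()[..end].is_ascii()
    })
}

fn main() {
    // Only the requested prefix matters: "wörld" fails at byte 1.
    assert!(prefixes_are_ascii(&["hello", "world"], 3));
    assert!(!prefixes_are_ascii(&["wörld"], 3));
    // A 1-byte prefix of "wörld" is just 'w', which is ASCII.
    assert!(prefixes_are_ascii(&["wörld"], 1));
    println!("ok");
}
```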

@alamb (Contributor) commented:
I am not quite sure if it is the same question that @findepi is asking, but I wonder if we could get back the performance loss by also using the information on the number of bytes we are requesting: if the prefix length is less than 32, say, don't bother checking for ASCII. 🤔

I think short prefixes are likely common (looking for http:// as a URL prefix, for example). 🤔

@2010YOUY01 (Contributor, Author) replied:

Why not check only the requested string prefix for being ASCII? Could a string_view_array.is_ascii variant validate string prefixes of a given length while still being vectorized?

I think it's a good idea for the current situation.
However, in the long term we might use an alternative approach: do the validation when reading arrays from storage into memory, and cache this is_ascii property within the Arrow array (as suggested by @alamb in #12444 (review)).

@alamb (Contributor) left a comment:

Thank you @2010YOUY01 and @goldmedal

I am in general somewhat lukewarm on adding optimizations that make some queries faster and some slower (as it then becomes a tradeoff, and different users might have different tradeoffs).

It would be great to figure out how to avoid this tradeoff (I left one suggestion)

The other thing I keep thinking about is how we can avoid this is_ascii check at runtime (so things get faster regardless). Maybe it is time to start propagating the is_ascii flag on the arrays themselves.

The Parquet reader, for example, knows when it has only ASCII data.
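As a rough illustration of that suggestion, the flag could be cached alongside the string data so kernels never re-validate. None of the names below come from arrow-rs or DataFusion; a real version would live in the array metadata, and a Parquet reader that already knows its data is ASCII could set the flag up front:

```rust
struct AsciiTaggedStrings {
    values: Vec<String>,
    is_ascii: Option<bool>, // None = not yet computed
}

impl AsciiTaggedStrings {
    fn new(values: Vec<String>) -> Self {
        Self { values, is_ascii: None }
    }

    /// Validate at most once; every later call is a cached lookup.
    fn is_ascii(&mut self) -> bool {
        match self.is_ascii {
            Some(v) => v,
            None => {
                let v = self.values.iter().all(|s| s.is_ascii());
                self.is_ascii = Some(v);
                v
            }
        }
    }
}

fn main() {
    let mut arr = AsciiTaggedStrings::new(vec!["abc".into(), "def".into()]);
    assert!(arr.is_ascii()); // computed on first use
    assert!(arr.is_ascii()); // cached thereafter
    let mut arr2 = AsciiTaggedStrings::new(vec!["héllo".into()]);
    assert!(!arr2.is_ascii());
    println!("ok");
}
```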

datafusion/functions/src/unicode/substr.rs (outdated review comment, resolved)

@2010YOUY01 (Contributor, Author) replied:

I am in general somewhat lukewarm on adding optimizations that make some queries faster and some slower (as it then becomes a tradeoff, and different users might have different tradeoffs).

It would be great to figure out how to avoid this tradeoff (I left one suggestion)

I think this regression is fixable in the long term (by making the ASCII check more efficient; the StringView ASCII check in particular is currently not implemented in the most efficient way), but it's a good idea to be more conservative and skip ASCII validation for small prefixes for now.
I applied this suggestion and benchmarked again; I see no noticeable ASCII-check overhead:

Result:
substr_before is current main, which already has the StringView optimization to avoid copying
substr_after is this PR with the additional ASCII fast path

group                                                                              substr_after                           substr_before
-----                                                                              ------------                           -------------
LONGER THAN 12/substr_large_string [size=1024, count=64, strlen=128]               1.00     74.1±1.13µs        ? ?/sec    2.65    196.4±1.32µs        ? ?/sec
LONGER THAN 12/substr_large_string [size=4096, count=64, strlen=128]               1.00    290.6±1.16µs        ? ?/sec    2.68   779.1±17.07µs        ? ?/sec
LONGER THAN 12/substr_string [size=1024, count=64, strlen=128]                     1.00     72.9±0.25µs        ? ?/sec    2.91   212.2±13.48µs        ? ?/sec
LONGER THAN 12/substr_string [size=4096, count=64, strlen=128]                     1.00    285.0±1.72µs        ? ?/sec    2.99   852.6±67.06µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=1024, count=64, strlen=128]                1.00     29.7±0.17µs        ? ?/sec    5.61   166.5±24.98µs        ? ?/sec
LONGER THAN 12/substr_string_view [size=4096, count=64, strlen=128]                1.00    117.8±0.92µs        ? ?/sec    5.29   623.4±29.53µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=1024, strlen=12]                         1.00     59.0±0.67µs        ? ?/sec    1.15     67.8±1.30µs        ? ?/sec
SHORTER THAN 12/substr_large_string [size=4096, strlen=12]                         1.00    228.5±2.10µs        ? ?/sec    1.26   289.0±25.86µs        ? ?/sec
SHORTER THAN 12/substr_string [size=1024, strlen=12]                               1.00     55.3±0.46µs        ? ?/sec    1.06     58.5±3.18µs        ? ?/sec
SHORTER THAN 12/substr_string [size=4096, strlen=12]                               1.00    214.8±1.59µs        ? ?/sec    1.04    222.4±4.55µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=1024, strlen=12]                          1.00     18.2±0.09µs        ? ?/sec    1.27     23.0±0.49µs        ? ?/sec
SHORTER THAN 12/substr_string_view [size=4096, strlen=12]                          1.00     73.5±1.79µs        ? ?/sec    1.44   105.8±11.82µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=1024, count=6, strlen=128]    1.00     75.9±0.40µs        ? ?/sec    1.04     78.8±3.79µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_large_string [size=4096, count=6, strlen=128]    1.00    297.4±2.70µs        ? ?/sec    1.01    299.3±8.54µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=1024, count=6, strlen=128]          1.00     77.8±0.24µs        ? ?/sec    1.07    83.4±10.36µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string [size=4096, count=6, strlen=128]          1.04    300.9±1.48µs        ? ?/sec    1.00    289.1±3.56µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=1024, count=6, strlen=128]     1.06     33.3±0.63µs        ? ?/sec    1.00     31.5±0.15µs        ? ?/sec
SRC_LEN > 12, SUB_LEN < 12/substr_string_view [size=4096, count=6, strlen=128]     1.00    129.8±2.23µs        ? ?/sec    1.01   130.8±13.20µs        ? ?/sec

The other thing I keep thinking about is how we can avoid this is_ascii check at runtime (so things get faster regardless). Maybe it is time to start propagating the is_ascii flag on the arrays themselves.

The Parquet reader, for example, knows when it has only ASCII data.

I think it's a good idea.
I'm curious (and also to justify the extra complexity): is your (InfluxDB) real workload dominated by string data? I saw somewhere that Databricks and Tableau said their production workloads are more than 50% string data, much of it used as a substitute for UDTs, plus uncleaned raw data; in such cases it should be worth the effort.

@alamb (Contributor) left a comment:

Thank you @2010YOUY01 -- I think this PR and code are now looking quite good 👌

Thank you @goldmedal and @findepi for the review

// However, checking if a string is ASCII-only is relatively cheap.
// If strings are ASCII only, use byte-based indices instead.
//
// A common pattern to call `substr()` is taking a small prefix of a long
👍

let short_prefix_threshold = 32.0;
let n_sample = 10;

// HACK: can be simplified if function has specialized

It's a good point that this could be faster if it had a specialization for ScalarValue.

Any chance you can file a ticket for this?

@alamb alamb merged commit 55707dc into apache:main Sep 17, 2024
24 checks passed
This pull request was closed.