improvements to (i)starts_with and (i)ends_with performance #6118
Thank you @samuelcolvin
I also ran the benchmarks and here is what I got -- it seems a bit mixed where some go faster and some go slower. On the whole it seems an improvement to me.
I am rerunning the numbers to see how consistent it is from run to run
++ critcmp master starts_with-ends_with-improvements
group master starts_with-ends_with-improvements
----- ------ ----------------------------------
ilike_utf8 scalar complex 1.00 302.5±1.19µs ? ?/sec 1.00 302.1±1.54µs ? ?/sec
ilike_utf8 scalar contains 1.00 1558.3±6.61µs ? ?/sec 1.03 1599.0±5.82µs ? ?/sec
ilike_utf8 scalar ends with 1.00 218.7±0.63µs ? ?/sec 1.13 246.2±0.45µs ? ?/sec
ilike_utf8 scalar equals 1.16 248.7±0.63µs ? ?/sec 1.00 214.6±0.53µs ? ?/sec
ilike_utf8 scalar starts with 1.00 279.5±0.57µs ? ?/sec 1.07 298.3±0.55µs ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.1±0.14µs ? ?/sec 1.00 88.1±0.24µs ? ?/sec
like_utf8 scalar complex 1.00 283.4±0.99µs ? ?/sec 1.02 287.8±4.98µs ? ?/sec
like_utf8 scalar contains 1.01 349.8±0.65µs ? ?/sec 1.00 348.1±0.41µs ? ?/sec
like_utf8 scalar ends with 1.43 221.4±0.44µs ? ?/sec 1.00 154.7±0.23µs ? ?/sec
like_utf8 scalar equals 1.01 219.5±0.44µs ? ?/sec 1.00 217.7±0.65µs ? ?/sec
like_utf8 scalar starts with 1.40 242.7±0.89µs ? ?/sec 1.00 173.5±0.54µs ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.2±0.21µs ? ?/sec 1.00 88.1±0.16µs ? ?/sec
like_utf8view scalar complex 1.00 533.0±1.24ms ? ?/sec 1.01 538.1±9.86ms ? ?/sec
like_utf8view scalar contains 1.00 379.1±0.34ms ? ?/sec 1.00 380.7±0.48ms ? ?/sec
like_utf8view scalar ends with 1.14 60.0±0.27ms ? ?/sec 1.00 52.7±0.21ms ? ?/sec
like_utf8view scalar equals 1.00 37.1±0.12ms ? ?/sec 1.00 37.0±0.11ms ? ?/sec
like_utf8view scalar starts with 1.06 60.4±0.36ms ? ?/sec 1.00 56.8±0.23ms ? ?/sec
nilike_utf8 scalar complex 1.00 302.9±1.46µs ? ?/sec 1.00 302.3±1.97µs ? ?/sec
nilike_utf8 scalar contains 1.00 1556.6±6.92µs ? ?/sec 1.03 1601.2±7.82µs ? ?/sec
nilike_utf8 scalar ends with 1.00 218.7±0.54µs ? ?/sec 1.12 246.0±0.72µs ? ?/sec
nilike_utf8 scalar equals 1.16 249.4±3.30µs ? ?/sec 1.00 214.9±0.98µs ? ?/sec
nilike_utf8 scalar starts with 1.00 279.7±0.72µs ? ?/sec 1.07 298.3±0.58µs ? ?/sec
nlike_utf8 scalar complex 1.00 283.4±1.60µs ? ?/sec 1.00 283.6±2.73µs ? ?/sec
nlike_utf8 scalar contains 1.00 350.1±0.66µs ? ?/sec 1.00 348.4±2.19µs ? ?/sec
nlike_utf8 scalar ends with 1.43 221.4±0.70µs ? ?/sec 1.00 155.0±0.47µs ? ?/sec
nlike_utf8 scalar equals 1.01 219.6±1.12µs ? ?/sec 1.00 217.7±1.46µs ? ?/sec
nlike_utf8 scalar starts with 1.35 234.9±0.65µs ? ?/sec 1.00 173.7±0.67µs ? ?/sec
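For reading the critcmp columns above: in each row the fastest baseline is shown as 1.00 and the other baseline as its time divided by the fastest time. A minimal Rust sketch of that arithmetic follows. `relative_to_fastest` is a hypothetical helper name, and the inputs are the rounded values printed in the table (critcmp works from the unrounded estimates, so the last digit can differ slightly).

```rust
/// Hypothetical helper: express each timing as a multiple of the fastest one,
/// which is how the ratio columns in the table above can be read.
pub fn relative_to_fastest(times: &[f64]) -> Vec<f64> {
    // The smallest time is the fastest run; it becomes the 1.00 reference.
    let fastest = times.iter().copied().fold(f64::INFINITY, f64::min);
    times.iter().map(|t| t / fastest).collect()
}
```

For the "like_utf8 scalar ends with" row, `relative_to_fastest(&[221.4, 154.7])` gives roughly 1.43 for master and exactly 1.0 for the branch, i.e. master is about 43% slower there.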
Here is my next run
BTW I am running this on a c2-standard-8 GCP instance
Script
pushd ~/arrow-rs
#git remote add samuelcolvin https://github.com/samuelcolvin/arrow-rs.git
git fetch -p samuelcolvin
BENCH_COMMAND="cargo bench -p arrow --bench comparison_kernels -F test_utils"
BENCH_FILTER="like"
REPO_NAME="samuelcolvin"
BRANCH_NAME="starts_with-ends_with-improvements"
# remove old test runs
rm -rf target/criterion/
git checkout $BRANCH_NAME
git reset --hard "$REPO_NAME/$BRANCH_NAME"
# Run on test branch
$BENCH_COMMAND -- --save-baseline ${BRANCH_NAME} ${BENCH_FILTER}
# Run on master
MERGE_BASE=$(git merge-base HEAD apache/master)
echo "** Comparing to ${MERGE_BASE}"
git checkout ${MERGE_BASE}
$BENCH_COMMAND -- --save-baseline master ${BENCH_FILTER}
critcmp master ${BRANCH_NAME}
popd
My reading of this is that (contrary to what I found) my "istarts_with" and "iends_with" are slower than what was here before.
Very easy to revert, but we should first check that the benchmarks are really representative of the most common queries.
Samuel Colvin
On Sat, 27 Jul 2024, 12:55 Andrew Lamb wrote:
Here is my next run
++ critcmp master starts_with-ends_with-improvements
group master starts_with-ends_with-improvements
----- ------ ----------------------------------
ilike_utf8 scalar complex 1.00 301.7±1.10µs ? ?/sec 1.00 301.5±1.04µs ? ?/sec
ilike_utf8 scalar contains 1.00 1554.5±6.20µs ? ?/sec 1.03 1595.8±4.49µs ? ?/sec
ilike_utf8 scalar ends with 1.00 218.5±0.39µs ? ?/sec 1.13 247.1±4.53µs ? ?/sec
ilike_utf8 scalar equals 1.16 248.8±0.40µs ? ?/sec 1.00 214.8±0.66µs ? ?/sec
ilike_utf8 scalar starts with 1.00 279.4±0.36µs ? ?/sec 1.07 298.5±0.46µs ? ?/sec
ilike_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.1±0.14µs ? ?/sec 1.00 88.2±0.18µs ? ?/sec
like_utf8 scalar complex 1.01 283.4±0.82µs ? ?/sec 1.00 282.0±0.96µs ? ?/sec
like_utf8 scalar contains 1.00 347.6±0.52µs ? ?/sec 1.00 347.4±0.76µs ? ?/sec
like_utf8 scalar ends with 1.41 219.4±0.55µs ? ?/sec 1.00 155.3±2.99µs ? ?/sec
like_utf8 scalar equals 1.00 217.6±0.53µs ? ?/sec 1.00 217.9±0.73µs ? ?/sec
like_utf8 scalar starts with 1.34 232.8±0.29µs ? ?/sec 1.00 173.5±0.30µs ? ?/sec
like_utf8_scalar_dyn dictionary[10] string[4]) 1.00 88.1±0.17µs ? ?/sec 1.00 88.1±0.14µs ? ?/sec
like_utf8view scalar complex 1.00 531.3±2.04ms ? ?/sec 1.00 531.0±2.43ms ? ?/sec
like_utf8view scalar contains 1.00 378.6±0.36ms ? ?/sec 1.01 380.6±1.38ms ? ?/sec
like_utf8view scalar ends with 1.13 59.6±0.24ms ? ?/sec 1.00 52.7±0.23ms ? ?/sec
like_utf8view scalar equals 1.00 37.0±0.47ms ? ?/sec 1.00 36.9±0.09ms ? ?/sec
like_utf8view scalar starts with 1.06 60.0±0.48ms ? ?/sec 1.00 56.8±0.23ms ? ?/sec
nilike_utf8 scalar complex 1.00 301.9±1.19µs ? ?/sec 1.00 300.4±2.48µs ? ?/sec
nilike_utf8 scalar contains 1.00 1551.3±4.82µs ? ?/sec 1.03 1594.3±4.71µs ? ?/sec
nilike_utf8 scalar ends with 1.00 219.7±3.41µs ? ?/sec 1.12 246.3±0.86µs ? ?/sec
nilike_utf8 scalar equals 1.16 248.8±0.53µs ? ?/sec 1.00 214.5±0.33µs ? ?/sec
nilike_utf8 scalar starts with 1.00 279.6±1.02µs ? ?/sec 1.07 298.4±0.46µs ? ?/sec
nlike_utf8 scalar complex 1.01 283.2±1.05µs ? ?/sec 1.00 281.1±1.19µs ? ?/sec
nlike_utf8 scalar contains 1.00 347.5±0.63µs ? ?/sec 1.00 347.2±0.49µs ? ?/sec
nlike_utf8 scalar ends with 1.42 219.7±1.25µs ? ?/sec 1.00 154.8±0.24µs ? ?/sec
nlike_utf8 scalar equals 1.00 217.7±0.45µs ? ?/sec 1.00 217.7±0.47µs ? ?/sec
nlike_utf8 scalar starts with 1.34 233.0±0.85µs ? ?/sec 1.00 173.6±0.48µs ? ?/sec
BTW I am running this on a c2-standard-8 GCP instance
I think the behavior here is again related to the unrealistically short (4 character) haystack used in benchmarks, as I explained in #6128 (comment). Running
#!/usr/bin/env bash
set -ex
git checkout master
rm -rf target/criterion
cargo bench -p arrow --bench comparison_kernels -F test_utils -- --save-baseline master 'like.*(starts|ends) with'
BRANCH_NAME=starts_with-ends_with-improvements
git checkout $BRANCH_NAME
cargo bench -p arrow --bench comparison_kernels -F test_utils -- --save-baseline $BRANCH_NAME 'like.*(starts|ends) with'
critcmp master $BRANCH_NAME
With haystack length=4 (default)
With random haystack length=0..400
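To illustrate the difference between the two setups, here is a minimal Rust sketch of generating haystacks whose lengths vary over 0..400 instead of being fixed at 4. It is not the benchmark's actual data generator: `random_haystacks`, the lowercase ASCII alphabet, and the small built-in LCG (used only to avoid external crates) are all assumptions made for illustration.

```rust
/// Hypothetical generator: `count` lowercase ASCII strings whose lengths fall
/// in the half-open range 0..max_len. Deterministic for a given `seed`.
pub fn random_haystacks(count: usize, max_len: usize, seed: u64) -> Vec<String> {
    assert!(max_len > 0, "max_len must be positive");
    let mut state = seed;
    // Simple linear congruential generator; good enough for a sketch.
    let mut next = move || {
        state = state
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        (state >> 33) as usize
    };
    let mut out = Vec::with_capacity(count);
    for _ in 0..count {
        let len = next() % max_len; // length in 0..max_len
        let mut s = String::with_capacity(len);
        for _ in 0..len {
            s.push((b'a' + (next() % 26) as u8) as char);
        }
        out.push(s);
    }
    out
}
```

With longer and more varied haystacks, the cost of the comparison itself matters more relative to fixed per-row overhead, which is the point being made about the 4-character default.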
(I also rebased to latest master, which I guess could have had an effect, although I doubt it)
Benchmarks after merging master and fixing conflicts:
🚀
Thanks again @samuelcolvin
Which issue does this PR close?
Related to (but not closing) #6107.
Rationale for this change
Lots of context in #6107. This makes LIKE and ILIKE queries which are simply "starts with" and "ends with" significantly faster.
Running
Gives notably:
Full output
What changes are included in this PR?
starts_with_ignore_ascii_case and ends_with_ignore_ascii_case, these showed significant improvements (~20%) over the previous implementations
crate::predicate::starts_with and crate::predicate::ends_with that show a 2 or 3x improvement over str.starts_with and str.ends_with
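For readers who want a feel for what ASCII case-insensitive prefix and suffix checks look like, here is a minimal sketch. It reuses the function names from the list above but is not the code in this PR: it simply compares byte slices with the standard library's `eq_ignore_ascii_case`, and it assumes only ASCII letters need case folding (non-ASCII bytes must match exactly).

```rust
/// Sketch only: true if `haystack` starts with `needle`, ignoring ASCII case.
pub fn starts_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    // A needle longer than the haystack can never be a prefix.
    h.len() >= n.len() && h[..n.len()].eq_ignore_ascii_case(n)
}

/// Sketch only: true if `haystack` ends with `needle`, ignoring ASCII case.
pub fn ends_with_ignore_ascii_case(haystack: &str, needle: &str) -> bool {
    let (h, n) = (haystack.as_bytes(), needle.as_bytes());
    // Compare only the trailing `n.len()` bytes of the haystack.
    h.len() >= n.len() && h[h.len() - n.len()..].eq_ignore_ascii_case(n)
}
```

Working on bytes avoids allocating lowercased copies of either string, which is one plausible reason this style of check can beat an approach that normalizes case first.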
Are there any user-facing changes?
Shouldn't be. I fuzzed all the implementations against the default methods here
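The kind of check described above can be sketched as a small differential test: generate many haystack/needle pairs and count disagreements between a candidate implementation and `str::starts_with` / `str::ends_with`. This is an illustration, not the fuzzing setup actually used for the PR: `fast_starts_with`, `fast_ends_with`, `Lcg`, and `count_mismatches` are hypothetical names, and a hand-rolled generator stands in for a real fuzzer.

```rust
/// Hypothetical byte-wise candidates standing in for optimized implementations.
fn fast_starts_with(haystack: &str, needle: &str) -> bool {
    haystack.as_bytes().starts_with(needle.as_bytes())
}

fn fast_ends_with(haystack: &str, needle: &str) -> bool {
    haystack.as_bytes().ends_with(needle.as_bytes())
}

/// Tiny deterministic generator so the sketch needs no external crates.
struct Lcg(u64);

impl Lcg {
    fn next_u64(&mut self) -> u64 {
        self.0 = self
            .0
            .wrapping_mul(6364136223846793005)
            .wrapping_add(1442695040888963407);
        self.0 >> 33
    }

    /// Random string over a tiny alphabet (including a multi-byte char),
    /// so that prefix and suffix matches occur often.
    fn string(&mut self, max_len: u64) -> String {
        let len = self.next_u64() % (max_len + 1);
        (0..len)
            .map(|_| ['a', 'b', 'é'][(self.next_u64() % 3) as usize])
            .collect()
    }
}

/// Number of cases where a candidate disagrees with the standard methods.
pub fn count_mismatches(cases: usize, seed: u64) -> usize {
    let mut rng = Lcg(seed);
    let mut mismatches = 0;
    for _ in 0..cases {
        let haystack = rng.string(8);
        let needle = rng.string(3);
        if fast_starts_with(&haystack, &needle) != haystack.starts_with(needle.as_str()) {
            mismatches += 1;
        }
        if fast_ends_with(&haystack, &needle) != haystack.ends_with(needle.as_str()) {
            mismatches += 1;
        }
    }
    mismatches
}
```

A result of zero from `count_mismatches` over many cases is evidence (not proof) that the candidate behaves like the default methods on the generated inputs.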