Improve performance of LIKE matchers by martint · Pull Request #15999 · trinodb/trino

martint · 2023-02-07T00:33:01Z

Improve performance of LIKE matchers.

Improves performance when compiling the NFA and DFA for the LIKE matcher.
Introduces an NFA-based matcher that is cheaper to construct, for use when matching dynamic patterns.

Before this change, compilation times for the DFA vs Joni:

compileDFA    %abc%def%    1484095.662 ± 13372.922  ns/op
compileJoni   %abc%def%        767.662 ±     6.783  ns/op

Performance after this change (note that compilation for the DFA is now lazy, so it is paid for on the first match and amortized over multiple matches):

compileDFA     %abc%def%       128.121 ± 1.557  ns/op
compileNFA     %abc%def%       212.063 ± 3.552  ns/op
compileJoni    %abc%def%       779.040 ± 9.597  ns/op
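The lazy compilation mentioned above can be sketched roughly as follows. This is only an illustration of deferring an expensive build step to the first match, not the code in this PR: the names `LazyMatcherSketch` and `Compiled` are hypothetical, and the sketch is not thread-safe.

```java
import java.util.function.Supplier;

// Sketch only: defers an expensive compilation step until the first match.
final class LazyMatcherSketch {
    // Hypothetical stand-in for a compiled automaton (for example, a dense DFA).
    interface Compiled {
        boolean matches(String input);
    }

    private final Supplier<Compiled> compiler;
    private Compiled compiled; // stays null until the first call to matches()

    LazyMatcherSketch(Supplier<Compiled> compiler) {
        // Construction only records how to compile, so it is cheap.
        this.compiler = compiler;
    }

    boolean matches(String input) {
        if (compiled == null) {
            // The compilation cost is paid here, once, on the first match.
            compiled = compiler.get();
        }
        return compiled.matches(input);
    }

    public static void main(String[] args) {
        int[] compilations = {0};
        LazyMatcherSketch matcher = new LazyMatcherSketch(() -> {
            compilations[0]++;
            // Placeholder "automaton": a plain substring check.
            return input -> input.contains("abc");
        });
        matcher.matches("xxabcxx");
        matcher.matches("nothing");
        System.out.println("compilations: " + compilations[0]);
    }
}
```

With this shape, a benchmark that only constructs the matcher measures almost nothing, which is consistent with the small compileDFA figure above; the real cost shows up in the first call to match.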

Additionally, this is the relative amortized performance for calls to match in each of the implementations:

matchDFA      %abc%def%        146.179 ± 1.182  ns/op
matchNFA      %abc%def%        263.210 ± 5.038  ns/op
matchJoni     %abc%def%        450.610 ± 4.898  ns/op

Finally, this is the relative performance for a "dynamic pattern" scenario, i.e., when the matcher is constructed, used once and thrown away. Note that this includes some optimizations to the algorithm for compiling the DFA, hence the difference with the original numbers above.

dfa    249740.218 ± 2133.271  ns/op
nfa       570.003 ±    3.615  ns/op
joni     1268.622 ±   12.690  ns/op

Release notes

TBD

gaurav8297

Some minor comments

core/trino-main/src/main/java/io/trino/likematcher/DFA.java

core/trino-main/src/main/java/io/trino/likematcher/DenseDfaMatcher.java

core/trino-main/src/main/java/io/trino/likematcher/NFA.java

core/trino-main/src/main/java/io/trino/likematcher/NfaMatcher.java

gaurav8297

Can we fix the CI?

phd3 · 2023-02-14T23:25:38Z

these commits lgtm

Benchmark LIKE pattern compilation
Simplify construction of LIKE matcher NFA
Encapsulate translation of pattern into DFA
Use ordered list instead of map for transitions
Convert NFA to class

phd3 · 2023-02-14T23:29:39Z

for ci, I think the issue is that %$_% resolves to [ literal "_", Any "%"] and misses initial '%' because of special branching for escape

Instead of using epsilon transitions for %, as in:

                 e
        ┌─────────────────┐
        ▼                 │
   ┌─────┐               ┌┴────┐
   │     │     <any>     │     │
──►│  0  ├──────────────►│  1  ├──►
   │     │               │     │
   └───┬─┘               └─────┘
       │                  ▲
       └──────────────────┘
                 e

it can be modeled as a state with a loopback match on any input:

     <any>
     ┌───┐
     │   │
     ▼   │
    ┌────┴┐
    │     │
───►│  0  ├──►
    │     │
    └─────┘

This removes the need to calculate the transitive closure of the states during the transformation to a DFA, and results in a minor performance improvement for pattern compilation.

(pattern)     Before                          After
%                  26.327 ±     0.239 ns/op       26.216 ±    0.353 ns/op
_%             198676.805 ±  3001.159 ns/op   137534.589 ±  937.491 ns/op
%_             233316.578 ±  3901.336 ns/op   148844.829 ± 1654.260 ns/op
abc%              144.492 ±     4.819 ns/op      131.837 ±    3.707 ns/op
%abc              101.722 ±     1.923 ns/op      124.846 ±    1.603 ns/op
_____         1088049.595 ±  9539.347 ns/op   682908.928 ± 8803.749 ns/op
abc%def%ghi    502509.362 ±  5676.648 ns/op   273538.092 ± 1928.285 ns/op
%abc%def%     1460704.116 ± 24356.287 ns/op   756419.174 ± 9816.760 ns/op
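The loopback modeling described above can be illustrated with a small simulation. This is a minimal sketch under simplifying assumptions, not the NFA class from this PR: it works on Java chars rather than UTF-8 bytes, ignores escape characters, and the names `LoopbackNfaSketch` and `ANY` are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: '%' becomes a loopback edge on the current state, so the
// simulation never needs an epsilon closure.
final class LoopbackNfaSketch {
    private static final int ANY = -1; // hypothetical marker: edge matches any character

    // label.get(i) is the condition on the edge from state i to state i + 1
    private final List<Integer> label = new ArrayList<>();
    // loops.get(i) is true when state i has a loopback edge on any input
    private final List<Boolean> loops = new ArrayList<>();

    LoopbackNfaSketch(String pattern) {
        loops.add(false); // start state 0
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (c == '%') {
                // No new state: mark the current state as looping on any input.
                loops.set(loops.size() - 1, true);
            } else {
                label.add(c == '_' ? ANY : (int) c);
                loops.add(false);
            }
        }
    }

    boolean matches(String input) {
        int states = loops.size();
        boolean[] active = new boolean[states];
        active[0] = true;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean[] next = new boolean[states];
            boolean anyActive = false;
            for (int s = 0; s < states; s++) {
                if (!active[s]) {
                    continue;
                }
                if (loops.get(s)) {
                    next[s] = true; // loopback edge
                    anyActive = true;
                }
                if (s + 1 < states) {
                    int l = label.get(s);
                    if (l == ANY || l == c) {
                        next[s + 1] = true; // forward edge
                        anyActive = true;
                    }
                }
            }
            if (!anyActive) {
                return false; // no live states left
            }
            active = next;
        }
        return active[states - 1]; // the last state is the accepting state
    }

    public static void main(String[] args) {
        LoopbackNfaSketch matcher = new LoopbackNfaSketch("%abc%def%");
        System.out.println(matcher.matches("xxabcyydefzz"));
        System.out.println(matcher.matches("abdef"));
    }
}
```

Because every state has at most one forward edge and an optional loopback, the set of live states after each character is computed directly from the previous set, with no closure step in between.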

This is in preparation for introducing alternative matching strategies that are not so expensive to compile.

To have more control over how instances are created.

Use fastutil sets and primitive ints instead of HashSet.

              Before                        After
%                 26.274 ±    0.176 ns/op       26.496 ±    0.081 ns/op
_%            129592.079 ±  505.848 ns/op    47977.429 ±  181.518 ns/op
%_            128077.197 ±  483.910 ns/op    44113.168 ±  270.582 ns/op
abc%             129.875 ±    0.854 ns/op      142.563 ±    0.719 ns/op
%abc              93.918 ±    0.826 ns/op      124.741 ±    0.545 ns/op
_____         597602.869 ± 2663.150 ns/op   195144.440 ±  733.982 ns/op
abc%def%ghi   258900.167 ± 1012.373 ns/op    94865.609 ± 5586.094 ns/op
%abc%def%     675161.396 ± 3570.877 ns/op   271948.762 ± 1989.130 ns/op

Instead, use a hard-coded state id to represent the failed state. This helps reduce the complexity of the NFA and DFA by removing unnecessary transitions.
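One way to picture the hard-coded failed state is a dense transition table in which missing transitions hold a sentinel value, so no dead state (and none of its outgoing transitions) has to be materialized. The sketch below is an assumption-laden illustration, not the DFA or DenseDfaMatcher code from this PR: the value `-1`, the class name `FailStateSketch` and the hand-built automaton for `ab%` are all hypothetical, and "any" is treated as any single byte.

```java
import java.util.Arrays;

// Sketch only: a dense byte-indexed transition table with a sentinel fail id.
final class FailStateSketch {
    static final int FAIL = -1; // hypothetical sentinel; it has no row in the table

    private final int[] table;         // table[state * 256 + byte] -> next state or FAIL
    private final boolean[] accepting; // accepting[state]

    FailStateSketch(int stateCount) {
        table = new int[stateCount * 256];
        Arrays.fill(table, FAIL); // every transition fails unless added explicitly
        accepting = new boolean[stateCount];
    }

    void addTransition(int from, int value, int to) {
        table[from * 256 + value] = to;
    }

    void addTransitionOnAnyByte(int from, int to) {
        Arrays.fill(table, from * 256, (from + 1) * 256, to);
    }

    void markAccepting(int state) {
        accepting[state] = true;
    }

    boolean matches(byte[] input) {
        int state = 0;
        for (byte b : input) {
            state = table[state * 256 + (b & 0xFF)];
            if (state == FAIL) {
                return false; // stop early instead of spinning in a dead state
            }
        }
        return accepting[state];
    }

    // Hand-built automaton for the pattern "ab%": 0 -a-> 1 -b-> 2, with 2 looping on any byte.
    static FailStateSketch forPatternAbPercent() {
        FailStateSketch dfa = new FailStateSketch(3);
        dfa.addTransition(0, 'a', 1);
        dfa.addTransition(1, 'b', 2);
        dfa.addTransitionOnAnyByte(2, 2);
        dfa.markAccepting(2);
        return dfa;
    }
}
```

A side effect of the sentinel is that a failed match returns as soon as it hits a missing transition, rather than consuming the rest of the input.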

The NFA-based matcher is much cheaper to construct as it doesn't require translating the NFA to a DFA.

Benchmark including compilation and matching times for the "%abc%def%" pattern:

dfa    257710.213 ± 22629.891  ns/op
nfa       571.737 ±     5.254  ns/op
joni     1146.428 ±    14.036  ns/op

Benchmark just for the matching portion:

dfa    148.657 ± 0.772  ns/op
nfa    259.458 ± 4.031  ns/op
joni   447.760 ± 2.812  ns/op

The parser now produces an optimized pattern consisting of a sequence of literals and compacted "any" elements. Previously, the sequence would contain one "any" element for each _ and %, which would later be coalesced by the optimize() method.
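A single-pass parser of the kind described above might look roughly like this. It is a sketch under stated assumptions, not the parser in this PR: the element types `Literal` and `Any`, their fields, and the convention of passing a negative value for "no escape" are hypothetical, and the sketch accepts any character after the escape character.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: parses a LIKE pattern directly into literals and compacted "any" runs.
final class LikePatternSketch {
    interface Element {}

    static final class Literal implements Element {
        final String value;
        Literal(String value) { this.value = value; }
        @Override public String toString() { return "Literal(" + value + ")"; }
    }

    static final class Any implements Element {
        final int min;           // number of '_' in the run
        final boolean unbounded; // true if the run contained at least one '%'
        Any(int min, boolean unbounded) { this.min = min; this.unbounded = unbounded; }
        @Override public String toString() { return "Any(min=" + min + ", unbounded=" + unbounded + ")"; }
    }

    // escape < 0 means "no escape character" (an assumption of this sketch)
    static List<Element> parse(String pattern, int escape) {
        List<Element> result = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        int anyMin = 0;
        boolean anyUnbounded = false;
        boolean inAny = false;
        boolean escaped = false;

        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            if (!escaped && c == escape) {
                escaped = true; // the next character is taken literally
                continue;
            }
            if (!escaped && (c == '%' || c == '_')) {
                if (literal.length() > 0) {
                    result.add(new Literal(literal.toString()));
                    literal.setLength(0);
                }
                // Fold consecutive wildcards into one pending "any" run.
                inAny = true;
                if (c == '%') {
                    anyUnbounded = true;
                } else {
                    anyMin++;
                }
            } else {
                if (inAny) {
                    result.add(new Any(anyMin, anyUnbounded));
                    anyMin = 0;
                    anyUnbounded = false;
                    inAny = false;
                }
                literal.append(c);
            }
            escaped = false;
        }
        if (escaped) {
            throw new IllegalArgumentException("dangling escape character");
        }
        if (literal.length() > 0) {
            result.add(new Literal(literal.toString()));
        }
        if (inAny) {
            result.add(new Any(anyMin, anyUnbounded));
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("a_%_b", -1));
        System.out.println(parse("%$_%", '$'));
    }
}
```

In this sketch, a pending wildcard run is flushed whenever a literal character arrives, including an escaped one, so the `%$_%` case with `$` as the escape character yields a leading "any", the literal `_`, and a trailing "any".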

phd3

These commits lgtm:

Build NFA/DFA with more efficient data structures
Do not create a failed state
Simplify parsing and optimization of pattern

core/trino-main/src/main/java/io/trino/likematcher/DFA.java

cla-bot bot added the cla-signed label Feb 7, 2023

martint changed the title ~~Simplify construction of LIKE matcher NFA~~ Improve performance of LIKE matchers Feb 7, 2023

martint mentioned this pull request Feb 7, 2023

Fallback to regex for dynamic like pattern #15665

Closed

martint force-pushed the like-optimize-nfa branch 5 times, most recently from 9f78aff to 649d6f9 Compare February 9, 2023 18:44

martint requested review from dain and phd3 February 9, 2023 18:44

martint force-pushed the like-optimize-nfa branch from 649d6f9 to 4e59477 Compare February 9, 2023 20:09

gaurav8297 reviewed Feb 9, 2023

martint force-pushed the like-optimize-nfa branch 2 times, most recently from 198753b to 78bfd48 Compare February 10, 2023 23:50

gaurav8297 approved these changes Feb 14, 2023

martint force-pushed the like-optimize-nfa branch from 78bfd48 to b2933c2 Compare February 15, 2023 06:30

martint added 11 commits February 15, 2023 08:32

Benchmark LIKE pattern compilation

3e2001e

Encapsulate translation of pattern into DFA

1fb8e1d

This is in preparation for introducing alternative matching strategies that are not so expensive to compile.

Use ordered list instead of map for transitions

cf48b71

Convert NFA to class

abe6a62

To have more control over how instances are created.

Do not create a failed state

eb3c15c

Instead, use a hard-coded state id to represent the failed state. This helps reduce the complexity of the NFA and DFA by removing unnecessary transitions.

Construct dense DFA lazily

5509719

Avoid creating intermediate pattern list when constructing matcher

021441e

martint force-pushed the like-optimize-nfa branch from b2933c2 to 021441e Compare February 15, 2023 16:32

phd3 reviewed Feb 17, 2023

core/trino-main/src/main/java/io/trino/likematcher/DFA.java

This was referenced Feb 18, 2023

Improve performance for LIKE patterns involving % #16167

Merged

Version 394 LIKE expression slowness issue #15778

Closed

phd3 approved these changes Feb 23, 2023

martint merged commit 2c492e1 into trinodb:master Feb 23, 2023

martint deleted the like-optimize-nfa branch February 23, 2023 23:40

github-actions bot added this to the 409 milestone Feb 24, 2023

colebow mentioned this pull request Mar 1, 2023

Add Trino 409 release notes #16335

Merged

phd3 mentioned this pull request Mar 9, 2023

Huge slow perf if LIKE in the JOIN condition with pattern %xxx%, v407 #16456

Closed

jaystarshot mentioned this pull request Sep 6, 2023

[WIP] Improve like performance prestodb/presto#20790

Closed

