Improve performance of LIKE matchers #15999
Merged
martint merged 11 commits into trinodb:master on Feb 23, 2023
gaurav8297 reviewed on Feb 9, 2023
Member
|
these commits lgtm |
Member
|
for ci, I think the issue is that |
Instead of using epsilon transitions for %, as in:

                 e
        ┌─────────────────┐
        ▼                 │
   ┌─────┐               ┌┴────┐
   │     │     <any>     │     │
──►│  0  ├──────────────►│  1  ├──►
   │     │               │     │
   └───┬─┘               └─────┘
       │                  ▲
       └──────────────────┘
                 e

it can be modeled as a state with a loopback match on any input:

     <any>
     ┌───┐
     │   │
     ▼   │
    ┌────┴┐
    │     │
───►│  0  ├──►
    │     │
    └─────┘

This removes the need to calculate the transitive closure of the
states during the transformation to a DFA, and results in a minor
performance improvement for pattern compilation.
(pattern)     Before                          After
%             26.327 ± 0.239 ns/op            26.216 ± 0.353 ns/op
_%            198676.805 ± 3001.159 ns/op     137534.589 ± 937.491 ns/op
%_            233316.578 ± 3901.336 ns/op     148844.829 ± 1654.260 ns/op
abc%          144.492 ± 4.819 ns/op           131.837 ± 3.707 ns/op
%abc          101.722 ± 1.923 ns/op           124.846 ± 1.603 ns/op
_____         1088049.595 ± 9539.347 ns/op    682908.928 ± 8803.749 ns/op
abc%def%ghi   502509.362 ± 5676.648 ns/op     273538.092 ± 1928.285 ns/op
%abc%def%     1460704.116 ± 24356.287 ns/op   756419.174 ± 9816.760 ns/op
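The loopback modeling of % described in this commit message can be sketched roughly as follows. This is an illustration only, not the code from the PR: the class name `LoopbackNfa`, its fields, and the choice to work on Java `char`s with no escape character are all assumptions made for the example. Each `%` marks the current state as looping on any input instead of adding a state with epsilon edges, so simulating the automaton never needs an epsilon closure.

```java
// Hypothetical sketch: an NFA for LIKE patterns where '%' is a self-loop on a state.
// Assumptions: operates on chars (not UTF-8 bytes) and has no escape handling.
final class LoopbackNfa {
    private static final int ANY = -1;

    private final int[] advanceOn;   // advanceOn[s]: input that moves state s to s + 1 (ANY for '_')
    private final boolean[] loops;   // loops[s]: state s stays active on any input (came from '%')
    private final int acceptState;

    LoopbackNfa(String pattern) {
        int states = 1;
        for (char c : pattern.toCharArray()) {
            if (c != '%') {
                states++;            // only literals and '_' create a new state
            }
        }
        advanceOn = new int[states - 1];
        loops = new boolean[states];
        int state = 0;
        for (char c : pattern.toCharArray()) {
            if (c == '%') {
                loops[state] = true; // loopback on the current state, no epsilon edges
            }
            else {
                advanceOn[state] = (c == '_') ? ANY : c;
                state++;
            }
        }
        acceptState = state;
    }

    boolean matches(String input) {
        boolean[] current = new boolean[acceptState + 1];
        current[0] = true;
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            boolean[] next = new boolean[acceptState + 1];
            for (int s = 0; s <= acceptState; s++) {
                if (!current[s]) {
                    continue;
                }
                if (loops[s]) {
                    next[s] = true;
                }
                if (s < acceptState && (advanceOn[s] == ANY || advanceOn[s] == c)) {
                    next[s + 1] = true;
                }
            }
            current = next;
        }
        return current[acceptState];
    }
}
```

For example, `new LoopbackNfa("%abc%def%").matches("xxabcyydefzz")` walks the input once while tracking the set of active states directly.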
This is in preparation for introducing alternative matching strategies that are not so expensive to compile.
To have more control over how instances are created.
Use fastutil sets and primitive ints instead of HashSet.
(pattern)     Before                          After
%             26.274 ± 0.176 ns/op            26.496 ± 0.081 ns/op
_%            129592.079 ± 505.848 ns/op      47977.429 ± 181.518 ns/op
%_            128077.197 ± 483.910 ns/op      44113.168 ± 270.582 ns/op
abc%          129.875 ± 0.854 ns/op           142.563 ± 0.719 ns/op
%abc          93.918 ± 0.826 ns/op            124.741 ± 0.545 ns/op
_____         597602.869 ± 2663.150 ns/op     195144.440 ± 733.982 ns/op
abc%def%ghi   258900.167 ± 1012.373 ns/op     94865.609 ± 5586.094 ns/op
%abc%def%     675161.396 ± 3570.877 ns/op     271948.762 ± 1989.130 ns/op
Instead, use a hard-coded state id to represent the failed state. This helps reduce the complexity of the NFA and DFA by removing unnecessary transitions.
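As a rough illustration of a hard-coded failed state, the sketch below builds a dense transition table in which the sentinel id `FAIL = -1` has no row of its own. The class name `DenseDfaSketch`, the sentinel value, and the restriction to ASCII patterns made only of literals and `_` are assumptions for the example; the actual DenseDfaMatcher in the PR may differ.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical sketch: dense DFA table where failure is a sentinel id, not a real state.
// Assumptions: ASCII input, pattern contains only literal bytes and '_' (no '%', no escapes).
final class DenseDfaSketch {
    static final int FAIL = -1;             // hard-coded id; the table has no row for it
    private static final int ALPHABET = 256;

    private final int[] transitions;        // transitions[state * ALPHABET + byte] = next state
    private final boolean[] accepting;

    DenseDfaSketch(String pattern) {
        byte[] p = pattern.getBytes(StandardCharsets.US_ASCII);
        int states = p.length + 1;
        transitions = new int[states * ALPHABET];
        Arrays.fill(transitions, FAIL);     // every undefined transition fails
        accepting = new boolean[states];
        accepting[p.length] = true;
        for (int s = 0; s < p.length; s++) {
            if (p[s] == '_') {
                // '_' accepts any single byte
                Arrays.fill(transitions, s * ALPHABET, (s + 1) * ALPHABET, s + 1);
            }
            else {
                transitions[s * ALPHABET + (p[s] & 0xFF)] = s + 1;
            }
        }
    }

    boolean matches(String input) {
        int state = 0;
        for (byte b : input.getBytes(StandardCharsets.US_ASCII)) {
            state = transitions[state * ALPHABET + (b & 0xFF)];
            if (state == FAIL) {
                return false;               // exit early; no transitions out of FAIL are stored
            }
        }
        return accepting[state];
    }
}
```

Because the failed state is never materialized, the table holds only rows for live states, and the match loop can stop as soon as it reads the sentinel.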
The matcher is much cheaper to construct as it doesn't require
translating the NFA to a DFA.
Benchmark including compilation and matching times for the "%abc%def%" pattern:
dfa 257710.213 ± 22629.891 ns/op
nfa 571.737 ± 5.254 ns/op
joni 1146.428 ± 14.036 ns/op
Benchmark just for the matching portion:
dfa 148.657 ± 0.772 ns/op
nfa 259.458 ± 4.031 ns/op
joni 447.760 ± 2.812 ns/op
The parser now produces an optimized pattern consisting of a sequence of literals and compacted "any" elements. Previously, the sequence would contain one "any" element for each _ and %, which would later be coalesced by the optimize() method.
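A parser that emits compacted elements directly could look something like the sketch below. The names `LikePatternSketch` and `Element`, and the representation of a wildcard run as a minimum length plus an "unbounded" flag, are assumptions for illustration; escape characters are deliberately left out, which the real parser would need to handle.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: parse a LIKE pattern straight into literals and compacted wildcard runs.
// Assumption: no escape character support.
final class LikePatternSketch {
    static final class Element {
        final String literal;     // non-null for a literal run, null for a wildcard run
        final int minLength;      // number of '_' in the wildcard run
        final boolean unbounded;  // true if the wildcard run contained at least one '%'

        private Element(String literal, int minLength, boolean unbounded) {
            this.literal = literal;
            this.minLength = minLength;
            this.unbounded = unbounded;
        }

        @Override
        public String toString() {
            return literal != null
                    ? "Literal(" + literal + ")"
                    : "Any(min=" + minLength + ", unbounded=" + unbounded + ")";
        }
    }

    static List<Element> parse(String pattern) {
        List<Element> result = new ArrayList<>();
        StringBuilder literal = new StringBuilder();
        int min = 0;
        boolean unbounded = false;
        boolean inAny = false;
        for (char c : pattern.toCharArray()) {
            if (c == '_' || c == '%') {
                if (literal.length() > 0) {
                    result.add(new Element(literal.toString(), 0, false));
                    literal.setLength(0);
                }
                inAny = true;          // keep accumulating into the same wildcard run
                if (c == '_') {
                    min++;
                }
                else {
                    unbounded = true;
                }
            }
            else {
                if (inAny) {
                    result.add(new Element(null, min, unbounded));
                    min = 0;
                    unbounded = false;
                    inAny = false;
                }
                literal.append(c);
            }
        }
        if (literal.length() > 0) {
            result.add(new Element(literal.toString(), 0, false));
        }
        else if (inAny) {
            result.add(new Element(null, min, unbounded));
        }
        return result;
    }
}
```

With this shape, a pattern such as `a_%_b%%` yields two literals and two wildcard runs, so no separate coalescing pass is required afterwards.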
phd3 reviewed on Feb 17, 2023
These commits lgtm:
Build NFA/DFA with more efficient data structures
Do not create a failed state
Simplify parsing and optimization of pattern
This was referenced Feb 18, 2023
phd3 approved these changes on Feb 23, 2023
Improve performance of LIKE matchers.
Before this change, compilation times for the DFA vs Joni:
Performance after this change (note that compilation for the DFA is now lazy, so its cost is paid on the first match and amortized over multiple matches):
Additionally, this is the relative amortized performance for calls to match in each of the implementations:
Finally, this is the relative performance for a "dynamic pattern" scenario, i.e., when the matcher is constructed, used once and thrown away. Note that this includes some optimizations to the algorithm for compiling the DFA, hence the difference with the original numbers above.
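The lazy compilation mentioned in the description can be pictured with a sketch like the one below. It is not the PR's implementation: `LazyMatcherSketch`, the `compilations` counter, and the stand-in "compile" step (which only supports patterns of the form `prefix%`) are all hypothetical, and the class is not thread-safe.

```java
// Hypothetical sketch: defer an expensive compile step until the first call to match.
// Assumptions: single-threaded use; the "compiler" is a stand-in that only handles "prefix%".
final class LazyMatcherSketch {
    interface Matcher {
        boolean match(String input);
    }

    private final String pattern;
    private Matcher compiled;   // null until the first match
    int compilations;           // exposed only to show that compilation happens once

    LazyMatcherSketch(String pattern) {
        String prefix = pattern.isEmpty() ? "" : pattern.substring(0, pattern.length() - 1);
        if (!pattern.endsWith("%") || prefix.indexOf('%') >= 0 || prefix.indexOf('_') >= 0) {
            throw new IllegalArgumentException("sketch supports only patterns of the form prefix%");
        }
        this.pattern = pattern;
    }

    boolean match(String input) {
        if (compiled == null) {
            compiled = compile();   // cost is paid here, on the first match only
        }
        return compiled.match(input);
    }

    private Matcher compile() {
        compilations++;
        // Stand-in for building the DFA: a prefix check
        String prefix = pattern.substring(0, pattern.length() - 1);
        return input -> input.startsWith(prefix);
    }
}
```

Constructing the matcher is then cheap, and a matcher that is used many times amortizes the one-time compile cost across all of its calls.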
Release notes
TBD