Build complex automatons more efficiently #66724
This change substantially reduces the CPU and Heap usage of
StringMatcher when processing large complex patterns.
The improvement is achieved by switching the order in which we
perform concatenation and union for common styles of wildcard patterns.
Given a set of wildcard strings:
- "*-logs-*"
- "*-metrics-*"
- "web-*-prod-*"
- "web-*-staging-*"
The old implementation would perform steps roughly like:
    minimize {
        union {
            concatenate { MATCH_ANY, "-logs-", MATCH_ANY }
            concatenate { MATCH_ANY, "-metrics-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "-prod-", MATCH_ANY }
            concatenate { "web-", MATCH_ANY, "-staging-", MATCH_ANY }
        }
    }
The outer minimize would require determinizing the automaton, which
was highly inefficient.
The new implementation is:
    minimize {
        union {
            concatenate {
                MATCH_ANY,
                minimize {
                    union { "-logs-", "-metrics-" }
                },
                MATCH_ANY
            }
            concatenate {
                minimize {
                    union {
                        concatenate { "web-", MATCH_ANY, "-prod-" }
                        concatenate { "web-", MATCH_ANY, "-staging-" }
                    }
                },
                MATCH_ANY
            }
        }
    }
By performing a union of the inner strings before concatenating the
MATCH_ANY ("*") the time & heap space spent on determinizing the
automaton is greatly reduced.
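The reordering is safe because a union of concatenations that share the same leading and trailing MATCH_ANY accepts the same strings as one concatenation around a union of the inner strings. The sketch below checks that equivalence on a few inputs. It uses java.util.regex as a stand-in, not the Lucene automaton classes that StringMatcher is built on, so it illustrates only that the two shapes match the same inputs and says nothing about the CPU or heap cost. The class name and sample inputs are made up for illustration.

```java
import java.util.List;
import java.util.regex.Pattern;

// Illustration only: java.util.regex stands in for the automaton operations.
final class UnionOrderSketch {
    // Old shape: each pattern is fully concatenated, then the results are unioned.
    static final Pattern UNION_OF_CONCATS = Pattern.compile(".*-logs-.*|.*-metrics-.*");
    // New shape: the inner strings are unioned first, then wrapped in MATCH_ANY once.
    static final Pattern CONCAT_OF_UNION = Pattern.compile(".*(?:-logs-|-metrics-).*");

    // True when both shapes agree on whether the input matches.
    static boolean sameResult(String input) {
        return UNION_OF_CONCATS.matcher(input).matches()
            == CONCAT_OF_UNION.matcher(input).matches();
    }

    public static void main(String[] args) {
        for (String s : List.of("app-logs-2021", "app-metrics-2021", "app-traces-2021", "logs")) {
            System.out.println(s + " matches=" + CONCAT_OF_UNION.matcher(s).matches()
                + " same=" + sameResult(s));
        }
    }
}
```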
Pinging @elastic/es-security (Team:Security)
To give an indication of the improvement... If a user has this role: Then in current versions (7.11 snapshot), even with After this PR, that role will work correctly, using the defaults (
ywangd
left a comment
LGTM with a few minor comments.
The efficiency improvement is outstanding. Great job!
    final char first = p.charAt(0);
    final char last = p.charAt(p.length() - 1);
Should we guard these with a check for p.length() == 0?
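For context, String.charAt(0) throws StringIndexOutOfBoundsException on an empty string, so an unguarded read of the first and last characters would fail on an empty pattern. A minimal sketch of such a guard follows. The helper name and the choice to return null for an empty pattern are assumptions for illustration, not what the PR does.

```java
// Hypothetical helper, not taken from the PR.
final class EmptyPatternGuard {
    // Returns {first, last} for a non-empty pattern, or null when the pattern is empty.
    static char[] firstAndLast(String p) {
        if (p.length() == 0) {
            return null; // nothing to inspect; the caller decides how to treat empty patterns
        }
        return new char[] { p.charAt(0), p.charAt(p.length() - 1) };
    }

    public static void main(String[] args) {
        System.out.println(firstAndLast("") == null);
        System.out.println(firstAndLast("*-logs-*")[0]);
    }
}
```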
    } else if (first == '*') {
        if (last == '*') {
            // *something*
            infix.add(p.substring(1, p.length() - 1));
It is possible that the pattern is **. As such the infix is an empty string and can be skipped. But as discussed, it is probably better handled as part of compacting all consecutive *, which can be done in a separate PR.
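The compaction mentioned here could look roughly like the sketch below, which collapses each run of consecutive * into a single *. The class and method names are hypothetical, and the sketch assumes the pattern contains no escaped wildcards (for example a backslash-escaped star), which a real implementation would need to handle.

```java
// Hypothetical helper; ignores escape sequences such as "\\*".
final class WildcardCompaction {
    // Collapses every run of consecutive '*' characters into a single '*'.
    static String compactStars(String pattern) {
        StringBuilder sb = new StringBuilder(pattern.length());
        boolean previousWasStar = false;
        for (int i = 0; i < pattern.length(); i++) {
            char c = pattern.charAt(i);
            boolean isStar = c == '*';
            if (isStar && previousWasStar) {
                continue; // skip the repeated '*'
            }
            sb.append(c);
            previousWasStar = isStar;
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(compactStars("**"));
        System.out.println(compactStars("web-**-prod-***"));
    }
}
```

With this applied first, a pattern such as ** would reach the later steps as a plain *, so no empty infix would be produced.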
        // But, that's not true if the string has an embedded '*' in it - in that case, we should handle them in this special way.
        prefix.add(p.substring(0, p.length() - 1));
    } else {
        // some*thing / some?thing / etc
I think a pattern like something* should also reach here, and the comment should mention it.
    // some*thing*
    // For simple prefix patterns ("something*") it's more efficient to do a single pass
    // Lucene handles the shared trailing '*' on an accept state well,
    // and performing 2 minimizes (on for the union of strings, then on again after concatenating MATCH_ANY) is slower.
This sentence reads weird to me. Should it be something like ... (one for the union of strings, and another one after ...)
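The diff fragments quoted in this review suggest that patterns are sorted into buckets before any automaton is built. The sketch below is a rough reconstruction of that bucketing from the visible fragments only. The class name, the hasWildcard helper, the misc bucket, and the exact branch order are assumptions, and the real method may differ (for instance in how it treats regular expressions or very short patterns).

```java
import java.util.ArrayList;
import java.util.List;

// Rough reconstruction for illustration; not the code from the PR.
final class PatternBuckets {
    final List<String> infix = new ArrayList<>();  // "*something*" with both '*' stripped
    final List<String> prefix = new ArrayList<>(); // "some*thing*" with the trailing '*' stripped
    final List<String> misc = new ArrayList<>();   // everything else, kept whole

    static boolean hasWildcard(String s) {
        return s.indexOf('*') >= 0 || s.indexOf('?') >= 0;
    }

    static PatternBuckets classify(List<String> patterns) {
        PatternBuckets b = new PatternBuckets();
        for (String p : patterns) {
            if (p.length() < 2) {
                b.misc.add(p); // empty or single-character patterns have no middle to inspect
                continue;
            }
            char first = p.charAt(0);
            char last = p.charAt(p.length() - 1);
            String middle = p.substring(1, p.length() - 1);
            if (hasWildcard(middle)) {
                if (last == '*') {
                    b.prefix.add(p.substring(0, p.length() - 1)); // some*thing*
                } else {
                    b.misc.add(p); // some*thing / some?thing
                }
            } else if (first == '*' && last == '*') {
                b.infix.add(middle); // *something* (note: "**" yields an empty infix)
            } else {
                b.misc.add(p); // simple prefix ("something*"), suffix, or exact string
            }
        }
        return b;
    }

    public static void main(String[] args) {
        PatternBuckets b = classify(List.of(
            "*-logs-*", "*-metrics-*", "web-*-prod-*", "web-*-staging-*", "something*"));
        System.out.println("infix=" + b.infix + " prefix=" + b.prefix + " misc=" + b.misc);
    }
}
```

In this sketch each bucket would then be unioned and minimized on its own before the shared MATCH_ANY is concatenated, which is the ordering the PR description outlines.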
        return minimize(pattern(patterns.iterator().next()));
    }

    final Function<Collection<String>, Automaton> build = strings -> {
Nit: I'd prefer to name this variable with a noun, something like buildFunc.
I am with Yang on the comment on line 122 ^^
Backport of: elastic#66724
Backport of: #66724
Resolves: #36062