-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-19555][SQL] Improve the performance of StringUtils.escapeLikeRegex method #16893
Conversation
|
Test build #72733 has started for PR 16893 at commit |
|
jenkins retest this please |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since this PR is about improving performance, would recommend trying out with while loop and a counter instead of for(). I ran your test snippet on my box:
with for():
function took 5287 ms
function took 5234 ms
function took 5149 ms
with while loop and counter:
function took 4762 ms
function took 4216 ms
function took 4212 ms
Here is my version with while loop and counter:
def escapeLikeRegex(input: String): String = {
val builder = new StringBuilder("(?s)")
val length = input.length
var i = 0
var previousChar = ' '
while (i < length) {
val currentChar = input.charAt(i)
if (currentChar != '\\') {
val out = if (previousChar == '\\') {
currentChar match {
case '_' => "_"
case '%' => "%"
case _ => Pattern.quote("\\" + currentChar)
}
} else {
currentChar match {
case '_' => "."
case '%' => ".*"
case _ => Pattern.quote(Character.toString(currentChar))
}
}
builder.append(out)
}
previousChar = currentChar
i += 1
}
builder.toString()
}
|
|
||
| // replace the _ with .{1} exactly match 1 time of any character | ||
| // replace the % with .*, match 0 or more times with any character | ||
| def escapeLikeRegex(v: String): String = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: while you are at it, can you also make the var names better ?
v -> input
c -> currentChar
prev -> previousChar
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we could further simplify this by replacing previousChar with a boolean, nextCharacterIsEscaped (or inEscape).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, I'm wrong: we can't quite do that simplification because the old code has a subtle bug related to backslash-escaping. Due to the complexity and terseness the old implementation, it's a little non-obvious to spot that the case (prev, '\\') => "" has the effect of always ignoring backslash characters, so this method is incapable of producing a backslash in its output. This is a problem if the user wants to write a LIKE pattern to match backslashes then this is impossible with the current code.
It turns out that this is covered by #15398, which also implements performance improvements for this code, so I guess this PR and JIRA is redundant :(
I thought #15398 had been merged / fixed by now, but I guess not.
|
Test build #72788 has finished for PR 16893 at commit
|
|
Hi all, where are we on this? Is this still active? |
|
@HyukjinKwon, we can close this because its optimizations were incorporated into SPARK-17647 (I ran the benchmarks to verify this, too). I'm going to resolve this JIRA as fixed by that one as well. |
## What changes were proposed in this pull request? This PR proposes to close PRs ... - inactive to the review comments more than a month - WIP and inactive more than a month - with Jenkins build failure but inactive more than a month - suggested to be closed and no comment against that - obviously looking inappropriate (e.g., Branch 0.5) To make sure, I left a comment for each PR about a week ago and I could not have a response back from the author in these PRs below: Closes apache#11129 Closes apache#12085 Closes apache#12162 Closes apache#12419 Closes apache#12420 Closes apache#12491 Closes apache#13762 Closes apache#13837 Closes apache#13851 Closes apache#13881 Closes apache#13891 Closes apache#13959 Closes apache#14091 Closes apache#14481 Closes apache#14547 Closes apache#14557 Closes apache#14686 Closes apache#15594 Closes apache#15652 Closes apache#15850 Closes apache#15914 Closes apache#15918 Closes apache#16285 Closes apache#16389 Closes apache#16652 Closes apache#16743 Closes apache#16893 Closes apache#16975 Closes apache#17001 Closes apache#17088 Closes apache#17119 Closes apache#17272 Closes apache#17971 Added: Closes apache#17778 Closes apache#17303 Closes apache#17872 ## How was this patch tested? N/A Author: hyukjinkwon <[email protected]> Closes apache#18017 from HyukjinKwon/close-inactive-prs.
What changes were proposed in this pull request?
Copied from SPARK-19555 JIRA
Spark's
StringUtils.escapeLikeRegex()method is written inefficiently, performing tons of object allocations due to the usezip(),flatMap(), andmkString. Instead, I think method should be rewritten in an imperative style using a Java string builder.This method can become a performance bottleneck in cases where regex expressions are used with non-constant-foldable expressions (e.g. the regex expression comes from the data rather than being part of the query).
How was this patch tested?
Existing tests.
Performance Comparison
Perf testing code:
./bin/spark-shell --master "local-cluster[2,1,10240]"