-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-30245][SQL] Add cache for Like and RLike when pattern is not static #26875
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| s""" | ||
| String $rightStr = $eval2.toString(); | ||
| $patternClass $pattern = $patternClass.compile($escapeFunc($rightStr, '$newEscapeChar')); | ||
| if ($lastRightStr != $rightStr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this optimization is a corner case and highly depends on input data.... cc: @dongjoon-hyun
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, and here I have an another idea.
Now BroadcastNestedLoopJoinExec always loop streamed firstly then nest loop broadcast and the pattern is always at broadcast side.
In inner like join, we can add another optimization that loop broadcast first and then nest loop streamed. So we can use the feature of this pr.
|
Test build #115276 has finished for PR 26875 at commit
|
|
Test build #115285 has finished for PR 26875 at commit
|
|
Do you have time to take a look? @cloud-fan @HyukjinKwon @viirya |
|
This seems not very useful. We should recommend users to use string literal as pattern. |
|
Yes, it just has a few scene and it is disencouraging. But when it happens, spark will run very slowly. I meet a scene that check two table which storage some file locations, then use join and like to compare location. There is only one way spark can choose is E.G. table t1 and table t2 both have 1 million rows, and t1 has 200 partitions, t2 is broadcasted, then spark will compile pattern 1 million * 1 million count. But with cache, just compile 1 million * 200 count. |
| s""" | ||
| String $rightStr = $eval2.toString(); | ||
| $patternClass $pattern = $patternClass.compile($rightStr); | ||
| if ($rightStr != $lastRightStr) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean to use equals()?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, forget this is not scala code.
viirya
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems to be an uncommon use case. Do you have benchmark number?
|
Sorry, I mistakenly closed. Yeah, it doesn't look worthwhile enough. and yeah, do you have perf numbers? |
|
I test this code in my scene, about 10 times increase. Test sql like this t1 has 200,000 rows without duplicate. |
|
It seems this sql is not in TPCDS and TPCH. |
|
Test build #115731 has finished for PR 26875 at commit
|
|
The failed ut is not related with this pr. |
|
@ulysses-you can you show the exact steps and perf numbers in the PR description? |
Sure I will add it later. |
|
cc @HyukjinKwon @viirya @cloud-fan I add the perf numbers and update pr description. |
|
Do you have time to take a look? Thanks in advance. |
|
retest this please |
|
Test build #116209 has finished for PR 26875 at commit
|
|
@cloud-fan Do you have time to take a look again? Thanks in advance. |
|
retest this please |
|
Test build #116582 has finished for PR 26875 at commit
|
|
Merged to master. |
|
Gentle ping @ulysses-you, would you mind if I ask to do a benchmark, for instance, with a very long strings like |
|
@ulysses-you, we might have to revert this commit due to the lack of benchmark and performance investigation. Can we clarify this? |
Yes, It seems to consider more. |
|
@beliefer, seems the author is inactive. Can you make some perf numbers? If that's going to take a while, I am going to revert this PR before Spark 3.0 RC. |
|
@HyukjinKwon You can revert this PR first. If I'm ready, I will pick this PR. |
|
I am going to revert this at #27514 |
|
@HyukjinKwon After many tests, I found that this modification did not bring a clear performance improvement. I test the sql show below: Spark3.1.0
Spark3.1.0 apply this PR
So I think we should not make this change and I will close #27497 |
|
Yea, closing #27497 looks fine to me. |
|
Thank you so much @beliefer. |
|
@HyukjinKwon sorry, I missed many things. And thanks @beliefer do the benchmark, but it seems something wrong, the right part should not be a foldable value. The PR aims to improve perf of the dynamic like. e.g select count(*) from t where c1 like c2. Not affect the static like. I will do a new benchmark with a very long strings what @HyukjinKwon suggested. |
|
@ulysses-you, thank you so much. Feel free to reopen the PR. |
|
Env: centos 7, 40cores, 4GB ---- test 1 ---- ---- test2 ---- About 10x time performance improvement. Seems equals is more quickly than compile pattern. And longer strings would make performance improvement better. |
| s""" | ||
| String $rightStr = $eval2.toString(); | ||
| $patternClass $pattern = $patternClass.compile($rightStr); | ||
| if (!$rightStr.equals($lastRightStr)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The positive cases are good enough. The concern I heard was actually here we add some overhead for string comparison, and it could be worse when the strings are very long.
Can we identify the worst cases? It's okay to show the trade-off explicitly. I tend to agree with compiling the pattern once is better in general. Feel free to reopen the PR once we're clear on the trade-off.
cc @rednaxelafx as well FYI.
|
---- test3 ---- The worst case is that do compare and compile each row. And it seems only little regression. |
|
Seems merged PR cann't reopen. Is there any way ? If not I will send an another pr for this. |
|
You can create new one. Can I have some feedback from @rednaxelafx before we go ahead? @rednaxelafx, does it looks good enough to you or do you have any guidance on the testing for him? |
|
@kiszk too FYI |
|
@ulysses-you I'm sorry! I lost some thing. ---- test3 ---- Why is test1 and test3 so different in time? |
|
---- test2 ---- You use |
|
The positive test case (test1, test2) is to confirm this pr has much better performance in dynamic like scene. The test1 and test2 aim to check the different string length. The negative test case (test3) is to confirm this pr has little performance regression.
Since it's a join opt, the best way is using the column which is at right of like as the |
|
test1 looks the same as test3. |
|
It's different. test1 condition use As I said above, make column that need to compile at join |
|
@ulysses-you OK. I understand it. But it seems that the performance is just good at scenarios that left join right. |
|
yeah, but it also has little performance regression with normal case seen as test3. I think it's a reason to do this. |
|
I think you should to supplement a lot of benchmark and performance investigation. Consider a variety of scenarios. |
|
Maybe I miss some point, test3 is the worst case so that other scenarios is always better than it, and test1 and test2 is the best positive case. The result is that add the cache can make the performance better within [0, 10]x times. But if you have any special scenarios, I'm ready to do the benchmark. |
What changes were proposed in this pull request?
Add cache for Like and RLike when pattern is not static
Why are the changes needed?
When pattern is not static, we should avoid compile pattern every time if some pattern is same.
Here is perf numbers, include 3 test groups and use
rangeto make it easy.Does this PR introduce any user-facing change?
No.
How was this patch tested?