-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-27839][SQL] Change UTF8String.replace() to operate on UTF8 bytes #24707
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #105779 has finished for PR 24707 at commit
|
| // Use reference equality to cheaply detect whether the replacement had no effect, | ||
| // in which case we can simply return the original UTF8String and save some copying. | ||
| if (before == after) { | ||
| return this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One consideration here: do we need to make a defensive copy? If so, we can't do this optimization.
Why might we need to copy a UTF8String? The UTF8String instance itself is effectively immutable, but the underlying storage might be a region of potentially-not-exclusively-owned memory (either direct/off-heap memory or a region of a long[] array), so we might need to perform a copy in case we're going to buffer / otherwise hold onto the UTF8String past a point where the underlying underlying storage memory could be mutated.
I think the most common case to worry about would be a UTF8String which is backed by memory that is part of a larger UnsafeRow. If we're doing row-at-a-time processing and aren't holding onto this UTF8String across rows then I think we're ok since changes to rows' memory during single-row processing would impact many parts of Spark and would probably be detected. In the few places where we do hold references across evaluations / rows then we need to copy, but I suspect most places already do this: for example, see the regexp.clone() in the RegExpReplace expression.
My intuition is that we probably don't need to make a defensive copy here because I doubt we have parts of the code which specifically assume that replace() will copy (i.e. which are abusing replace() as a slow clone() mechanism). Put differently, I suspect that any code which would fail due to lack of copying in replace() is also vulnerable to this problem from other sources (including simply reading a string from a row without further modification), so I don't think we need to add extra copying here.
I'd love to get additional sets of eyes on this, though, and I'd ultimately be ok with changing return this to return this.clone() (and updating the other return this uses in UTF8String) if we conclude that this isn't safe (or are uncertain and want to err on the side of caution).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your intuition is right here.
common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java
Outdated
Show resolved
Hide resolved
| // Use reference equality to cheaply detect whether the replacement had no effect, | ||
| // in which case we can simply return the original UTF8String and save some copying. | ||
| if (before == after) { | ||
| return this; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your intuition is right here.
|
Test build #105803 has finished for PR 24707 at commit
|
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@JoshRosen Would it make sense to implement replace directly on UTF8String without converting it to String? For example, using the lib or writing similar code: https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/tuple/ByteArrayUtil.html#replace-byte:A-byte:A-byte:A-
Also the implementation of the replace() method of commons-lang3 does not look so complex. It uses indexOf which we have in UTF8String + StringBuilder. The last one for UTF8String could be useful in another places. WDYT?
| // At least one match was found. Estimate space needed for result. | ||
| int increase = replace.numBytes - search.numBytes; | ||
| increase = increase < 0 ? 0 : increase; | ||
| increase *= 16; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Modeled after https://github.com/apache/commons-lang/blob/999030a23c214a1fdcfc2f1464183e0c752777f5/src/main/java/org/apache/commons/lang3/StringUtils.java#L5617, which (somewhat arbitrarily?) allocates enough buffer space to handle 16 replacements without growth in case max is not set.
|
@MaxGekk, that's a great idea: I've gone ahead and re-implemented Just pushed my changes and updated the description. I'm going to loop back later to give this a second pass of self-review, but wanted to get this new version posted up now for initial feedback. |
|
Test build #105847 has finished for PR 24707 at commit
|
|
Test build #106331 has finished for PR 24707 at commit
|
|
retest this please |
|
Test build #106336 has finished for PR 24707 at commit
|
|
retest this please |
MaxGekk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
|
Test build #4801 has finished for PR 24707 at commit
|
common/unsafe/src/test/java/org/apache/spark/unsafe/types/UTF8StringSuite.java
Show resolved
Hide resolved
|
Thanks to everyone who reviewed. I took another self-review and this still looks good to me, so I'm going to merge into master for Spark 3.x. |
What changes were proposed in this pull request?
This PR significantly improves the performance of
UTF8String.replace()by performing direct replacement over UTF8 bytes instead of decoding those bytes into Java Strings.In cases where the search string is not found (i.e. no replacements are performed, a case which I expect to be common) this new implementation performs no object allocation or memory copying.
My implementation is modeled after
commons-lang3'sStringUtils.replace()method. As part of my implementation, I needed a StringBuilder / resizable buffer, so I movedUTF8StringBuilderfrom thecatalystpackage tounsafe.How was this patch tested?
Copied tests from
StringExpressionSuitetoUTF8StringSuiteand added a couple of new cases.To evaluate performance, I did some quick local benchmarking by running the following code in
spark-shell(with Java 1.8.0_191):On my laptop this took ~54 / ~40 seconds seconds before this patch's changes and ~6.5 / ~3.8 seconds afterwards.