[SPARK-28066][CORE] Optimize UTF8String.trim() for common case of no whitespace #24884
MaxGekk
left a comment
Can you optimize trimLeft() and trimRight() too?
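A minimal sketch of what the same fast path could look like for the one-sided variants. The class `TrimEndsSketch` is a hypothetical stand-in that only wraps a byte array, not Spark's `UTF8String`, and treating only the ASCII space (0x20) as whitespace is an assumption made for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a UTF-8 string type; not Spark's UTF8String. */
final class TrimEndsSketch {
    private final byte[] bytes;

    TrimEndsSketch(byte[] bytes) {
        this.bytes = bytes;
    }

    static TrimEndsSketch of(String s) {
        return new TrimEndsSketch(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Removes leading ASCII spaces; returns this when there are none. */
    TrimEndsSketch trimLeft() {
        int s = 0;
        while (s < bytes.length && bytes[s] == ' ') {
            s++;
        }
        if (s == 0) {
            return this; // nothing to trim, so no allocation
        }
        return new TrimEndsSketch(Arrays.copyOfRange(bytes, s, bytes.length));
    }

    /** Removes trailing ASCII spaces; returns this when there are none. */
    TrimEndsSketch trimRight() {
        int e = bytes.length; // exclusive end
        while (e > 0 && bytes[e - 1] == ' ') {
            e--;
        }
        if (e == bytes.length) {
            return this; // nothing to trim, so no allocation
        }
        return new TrimEndsSketch(Arrays.copyOfRange(bytes, 0, e));
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

In this sketch a caller can detect the fast path with a reference comparison, for example `x.trimLeft() == x`.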
    if (s == 0 && e == numBytes - 1) {
      return this;
    }
    return copyUTF8String(s, e);
Can you avoid copying by changing offset if the string contains spaces at the left side, and changing numBytes for spaces at the right side?
That looks unsafe if the string is shared by others.
I mean creating a new instance of UTF8String by passing a new value of offset or numBytes, and the same reference to base. Does UTF8String modify the underlying base object in place anywhere?
I see. I interpreted "avoid copying" as updating the fields of this object. Your approach works functionally.
Such an optimization was done in String.substring in previous JDKs.
Pros: Avoids copying characters in trim().
Cons: The data of the original UTF8String cannot be freed, even when that UTF8String is otherwise dead, as long as a trimmed UTF8String that references it is live.
I think that the current implementation is preferable since it does not increase the number of live UTF8String objects.
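To make the trade-off in this thread concrete, here is a rough sketch contrasting the two approaches. `ByteSlice` and its methods are hypothetical names for illustration; only the fields `base`, `offset`, and `numBytes` echo the names used in the comments above, and the sketch assumes an on-heap byte array and ASCII spaces only.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical immutable slice over a byte array; not Spark's UTF8String. */
final class ByteSlice {
    final byte[] base;
    final int offset;
    final int numBytes;

    ByteSlice(byte[] base, int offset, int numBytes) {
        this.base = base;
        this.offset = offset;
        this.numBytes = numBytes;
    }

    /** Zero-copy trim: a new instance with adjusted bounds and the same base. */
    ByteSlice trimAsView() {
        int s = offset;
        int end = offset + numBytes; // exclusive
        while (s < end && base[s] == ' ') {
            s++;
        }
        while (end > s && base[end - 1] == ' ') {
            end--;
        }
        return new ByteSlice(base, s, end - s);
    }

    /** Copying trim: the result owns a fresh array sized to the trimmed content. */
    ByteSlice trimAsCopy() {
        ByteSlice v = trimAsView();
        byte[] copy = Arrays.copyOfRange(base, v.offset, v.offset + v.numBytes);
        return new ByteSlice(copy, 0, copy.length);
    }

    /** Size of the array this slice keeps reachable. */
    int retainedBytes() {
        return base.length;
    }

    @Override
    public String toString() {
        return new String(base, offset, numBytes, StandardCharsets.UTF_8);
    }
}
```

With a 1024-byte array holding one letter followed by spaces, `trimAsView()` yields a 1-byte slice that still keeps all 1024 bytes reachable, while `trimAsCopy()` retains a single byte. That is the retention cost described in the "Cons" line above.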
What you're describing is kind of how the JVM implements "compressed strings". Interesting notes here on further optimizations in Java 9: https://www.baeldung.com/java-9-compact-string
One day we might revisit whether UTF8String is improving over String!
Test build #106557 has finished for PR 24884 at commit
LGTM
Test build #106558 has finished for PR 24884 at commit
MaxGekk
left a comment
LGTM
Could you update the function descriptions for this special case? Currently, it only describes the copy case.
trim: "It returns a new string ..."
trimLeft: ", returns the new string."
trimRight: ", returns the new string."
Test build #106566 has finished for PR 24884 at commit
Thank you for updating, @srowen. The sentence looks correct since it describes the content of the return value. However, can we explicitly describe both
Do we want to describe the implementation in the docs? The caller doesn't necessarily care or need to know about that implementation detail. Granted, this is an internal class anyway, but developers looking at the 'internal' javadoc are already looking at the source. Here I said "string" rather than
Got it, @srowen ~
dongjoon-hyun
left a comment
What changes were proposed in this pull request?
UTF8String.trim() allocates a new object even if the string has no whitespace, when it can just return itself. A simple check for this case makes the method about 3x faster in the common case.
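The idea can be sketched roughly as follows. This is not the patch itself: `Utf8Sketch` is a hypothetical class that wraps a plain byte array, and it assumes that only the ASCII space (0x20) counts as whitespace. The check `s == 0 && e == bytes.length - 1` mirrors the condition quoted from the diff in the review thread.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical stand-in for a UTF-8 string type; not Spark's UTF8String. */
final class Utf8Sketch {
    private final byte[] bytes;

    Utf8Sketch(byte[] bytes) {
        this.bytes = bytes;
    }

    static Utf8Sketch of(String s) {
        return new Utf8Sketch(s.getBytes(StandardCharsets.UTF_8));
    }

    /** Trims ASCII spaces from both ends; returns this if nothing is trimmed. */
    Utf8Sketch trim() {
        int s = 0;
        while (s < bytes.length && bytes[s] == ' ') {
            s++;
        }
        if (s == bytes.length) {
            // Empty input or spaces only: the result is an empty string.
            return new Utf8Sketch(new byte[0]);
        }
        int e = bytes.length - 1; // inclusive index of the last kept byte
        while (e > s && bytes[e] == ' ') {
            e--;
        }
        if (s == 0 && e == bytes.length - 1) {
            return this; // common case: no whitespace at either end, no copy
        }
        return new Utf8Sketch(Arrays.copyOfRange(bytes, s, e + 1));
    }

    @Override
    public String toString() {
        return new String(bytes, StandardCharsets.UTF_8);
    }
}
```

Returning `this` is safe in the sketch only because the wrapper never mutates its array; the same reasoning would have to hold for any real string type that adopts this shortcut.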
How was this patch tested?
Existing tests.
A rough benchmark of 90% strings without whitespace (at ends), and 10% that do have whitespace, suggests the average runtime goes from 20 ns to 6 ns.