Skip to content

Conversation

@srowen
Copy link
Member

@srowen srowen commented Jun 16, 2019

What changes were proposed in this pull request?

UTF8String.trim() allocates a new object even if the string has no whitespace, when it can just return itself. A simple check for this case makes the method about 3x faster in the common case.

How was this patch tested?

Existing tests.

A rough benchmark of 90% strings without whitespace (at ends), and 10% that do have whitespace, suggests the average runtime goes from 20 ns to 6 ns.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you optimize trimLeft() and trimRight too?

if (s == 0 && e == numBytes - 1) {
return this;
}
return copyUTF8String(s, e);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you avoid copying by changing offset if the string contains spaces at the left side, and changing numBytes for spaces at the right side?

Copy link
Member

@kiszk kiszk Jun 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks non-safe if the string is shared by others.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean creating new instance of UTF8String by passing new value of offset or numBytes, and the same reference to base. Does UTF8String modify underlying base object somewhere in place?

Copy link
Member

@kiszk kiszk Jun 16, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. My interpretation of avoid copying was to update fieild in this object. It functionally works well.

Such an optimization was done in String.substring in previous JDKs.
Pros: Avoid copying characters when trim()
Cons: Cannot free a UTF8String object object even if the UTF8String object, which was referenced by the trimed UTF8String, is dead under the case that trimed UTF8String is live.

I think that the current implementation is preferable since it does not increase # of live UTF8String objects.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What you're describing is kind of how the JVM implements "compressed strings". Interesting notes here on further optimizations in Java 9: https://www.baeldung.com/java-9-compact-string
One day we might revisit whether UTF8String is improving over String!

@SparkQA
Copy link

SparkQA commented Jun 16, 2019

Test build #106557 has finished for PR 24884 at commit 392b7b3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Member

kiszk commented Jun 16, 2019

LGTM

@SparkQA
Copy link

SparkQA commented Jun 16, 2019

Test build #106558 has finished for PR 24884 at commit 89f7f45.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you update the function descriptions for this special case? Currently, it only describes the copy case.

  • trim

It returns a new string ...

  • trimLeft

, returns the new string.

  • trimRight

, returns the new string.

@SparkQA
Copy link

SparkQA commented Jun 17, 2019

Test build #106566 has finished for PR 24884 at commit 81edf56.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Copy link
Member

Thank you for updating, @srowen . The sentence looks correct since it describes the content of the return value. However, can we explicitly describe both return this and the memory copy cases separately? @return this string seems to give an impression to reuse the underlying memory space. But, we do copy for the 10% of the input strings.

@srowen
Copy link
Member Author

srowen commented Jun 17, 2019

Do we want to describe the implementation in the docs? the caller doesn't necessarily care or need to know about that implementation detail. Granted this is an internal class anyway, but developers looking at the 'internal' javadoc are already looking at the source. Here I said "string" rather than UTF8String to indicate it's the logical contents of this string with modifications. BTW the "10%" just referred to my crude benchmark. There's no way to know how many strings will be passed that need trimming, though probably few do.

@dongjoon-hyun
Copy link
Member

Got it, @srowen ~

Copy link
Member

@dongjoon-hyun dongjoon-hyun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

+1, LGTM. Merged to master.
Thank you, @srowen , @MaxGekk , @kiszk .

@srowen srowen deleted the SPARK-28066 branch June 19, 2019 13:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants