Closed
Changes from 2 commits
@@ -534,12 +534,16 @@ public UTF8String trim() {
     // skip all of the space (0x20) in the left side
     while (s < this.numBytes && getByte(s) == 0x20) s++;
     if (s == this.numBytes) {
-      // empty string
+      // Everything trimmed
       return EMPTY_UTF8;
     }
     // skip all of the space (0x20) in the right side
     int e = this.numBytes - 1;
     while (e > s && getByte(e) == 0x20) e--;
+    if (s == 0 && e == numBytes - 1) {
+      // Nothing trimmed
+      return this;
+    }
     return copyUTF8String(s, e);
Member

Can you avoid the copy by changing offset when the string has spaces on the left side, and changing numBytes when it has spaces on the right side?

Member

@kiszk, Jun 16, 2019

That looks unsafe if the string is shared by others.

Member

I mean creating a new instance of UTF8String by passing the new value of offset or numBytes and the same reference to base. Does UTF8String modify the underlying base object in place anywhere?
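
To illustrate the suggestion above, here is a minimal sketch of a trim that shares `base` and only narrows the window. `ByteSlice` and `trimView` are hypothetical names, and the sketch assumes a plain `byte[]` base with an `int` offset, which is a simplification and not Spark's actual `UTF8String` layout.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for UTF8String: a window of numBytes bytes
// starting at offset inside base. Not Spark's real class.
final class ByteSlice {
  final byte[] base;
  final int offset;
  final int numBytes;

  ByteSlice(byte[] base, int offset, int numBytes) {
    this.base = base;
    this.offset = offset;
    this.numBytes = numBytes;
  }

  static ByteSlice of(String s) {
    byte[] b = s.getBytes(StandardCharsets.UTF_8);
    return new ByteSlice(b, 0, b.length);
  }

  // Trims ASCII spaces (0x20) on both sides by narrowing the window.
  // No bytes are copied and base is never written to.
  ByteSlice trimView() {
    int s = 0;
    while (s < numBytes && base[offset + s] == 0x20) s++;
    if (s == numBytes) {
      // Everything trimmed: empty window over the same base
      return new ByteSlice(base, offset, 0);
    }
    int e = numBytes - 1;
    while (e > s && base[offset + e] == 0x20) e--;
    if (s == 0 && e == numBytes - 1) {
      // Nothing trimmed
      return this;
    }
    // Same base reference, shifted offset, shorter length
    return new ByteSlice(base, offset + s, e - s + 1);
  }

  @Override
  public String toString() {
    return new String(base, offset, numBytes, StandardCharsets.UTF_8);
  }
}
```

The sketch is only safe if nothing mutates `base` after construction, which is exactly the question raised in the comment.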

Member

@kiszk, Jun 16, 2019

I see. My interpretation of "avoid copying" was updating the fields of this object in place. What you describe works functionally.

Such an optimization was done in String.substring in older JDKs.
Pros: Avoids copying characters in trim().
Cons: The original UTF8String's underlying data cannot be freed while the trimmed UTF8String that references it is live, even if the original UTF8String itself is dead.

I think the current implementation is preferable, since it does not increase the number of live UTF8String objects.
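
The trade-off described above can be made concrete with a small sketch. `RetentionSketch`, `View`, `trimAsView`, and `trimAsCopy` are hypothetical names for illustration only, operating on plain `byte[]` values rather than on Spark's `UTF8String`.

```java
import java.util.Arrays;

final class RetentionSketch {
  // View-style result: holds a reference to the entire original array.
  static final class View {
    final byte[] base;
    final int offset;
    final int numBytes;

    View(byte[] base, int offset, int numBytes) {
      this.base = base;
      this.offset = offset;
      this.numBytes = numBytes;
    }
  }

  // Returns {start, endExclusive} of the content once 0x20 is stripped
  // from both sides.
  static int[] bounds(byte[] bytes) {
    int s = 0;
    while (s < bytes.length && bytes[s] == 0x20) s++;
    int e = bytes.length;
    while (e > s && bytes[e - 1] == 0x20) e--;
    return new int[] {s, e};
  }

  // No copy, but the whole input array stays reachable through the view.
  static View trimAsView(byte[] bytes) {
    int[] b = bounds(bytes);
    return new View(bytes, b[0], b[1] - b[0]);
  }

  // Copies only the surviving bytes; the input can be collected afterwards.
  static byte[] trimAsCopy(byte[] bytes) {
    int[] b = bounds(bytes);
    return Arrays.copyOfRange(bytes, b[0], b[1]);
  }

  public static void main(String[] args) {
    byte[] big = new byte[1 << 20];
    Arrays.fill(big, (byte) 0x20);
    big[big.length / 2] = 'x';

    View view = trimAsView(big);
    byte[] copy = trimAsCopy(big);
    System.out.println("view keeps reachable: " + view.base.length + " bytes");
    System.out.println("copy keeps reachable: " + copy.length + " bytes");
  }
}
```

Here the view exposes a single byte of content yet still references the full 1 MiB array, while the copy holds one byte and lets the original be reclaimed.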

Member Author

What you're describing is kind of how the JVM implements "compressed strings". Interesting notes here on further optimizations in Java 9: https://www.baeldung.com/java-9-compact-string
One day we might revisit whether UTF8String is improving over String!

   }

@@ -562,12 +566,15 @@ public UTF8String trimLeft() {
     int s = 0;
     // skip all of the space (0x20) in the left side
     while (s < this.numBytes && getByte(s) == 0x20) s++;
+    if (s == 0) {
+      // Nothing trimmed
+      return this;
+    }
     if (s == this.numBytes) {
-      // empty string
+      // Everything trimmed
       return EMPTY_UTF8;
-    } else {
-      return copyUTF8String(s, this.numBytes - 1);
     }
+    return copyUTF8String(s, this.numBytes - 1);
   }

   /**
@@ -597,26 +604,30 @@ public UTF8String trimLeft(UTF8String trimString) {
       }
       srchIdx += searchCharBytes;
     }
-
+    if (srchIdx == 0) {
+      // Nothing trimmed
+      return this;
+    }
     if (trimIdx >= numBytes) {
-      // empty string
+      // Everything trimmed
       return EMPTY_UTF8;
-    } else {
-      return copyUTF8String(trimIdx, numBytes - 1);
     }
+    return copyUTF8String(trimIdx, numBytes - 1);
   }

   public UTF8String trimRight() {
     int e = numBytes - 1;
     // skip all of the space (0x20) in the right side
     while (e >= 0 && getByte(e) == 0x20) e--;
-
+    if (e == numBytes - 1) {
+      // Nothing trimmed
+      return this;
+    }
     if (e < 0) {
-      // empty string
+      // Everything trimmed
       return EMPTY_UTF8;
-    } else {
-      return copyUTF8String(0, e);
     }
+    return copyUTF8String(0, e);
   }

   /**
@@ -658,12 +669,15 @@ public UTF8String trimRight(UTF8String trimString) {
       numChars --;
     }

+    if (trimEnd == numBytes - 1) {
+      // Nothing trimmed
+      return this;
+    }
     if (trimEnd < 0) {
-      // empty string
+      // Everything trimmed
       return EMPTY_UTF8;
-    } else {
-      return copyUTF8String(0, trimEnd);
     }
+    return copyUTF8String(0, trimEnd);
   }

   public UTF8String reverse() {
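
Taken together, the hunks above apply one pattern: return the receiver when nothing is trimmed, return a shared empty instance when everything is trimmed, and copy only otherwise. A minimal sketch of that pattern on plain `byte[]` values follows; `TrimFastPathSketch` and its `EMPTY` constant are hypothetical stand-ins, not the PR's code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

final class TrimFastPathSketch {
  // Shared empty result, playing the role of EMPTY_UTF8 in the diff.
  static final byte[] EMPTY = new byte[0];

  static byte[] trim(byte[] bytes) {
    int s = 0;
    // Skip spaces (0x20) on the left side
    while (s < bytes.length && bytes[s] == 0x20) s++;
    if (s == bytes.length) {
      // Everything trimmed: no allocation
      return EMPTY;
    }
    int e = bytes.length - 1;
    // Skip spaces (0x20) on the right side
    while (e > s && bytes[e] == 0x20) e--;
    if (s == 0 && e == bytes.length - 1) {
      // Nothing trimmed: hand back the input, no allocation
      return bytes;
    }
    // Only this path allocates and copies
    return Arrays.copyOfRange(bytes, s, e + 1);
  }

  public static void main(String[] args) {
    byte[] clean = "abc".getBytes(StandardCharsets.UTF_8);
    byte[] padded = "  abc ".getBytes(StandardCharsets.UTF_8);
    byte[] blank = "   ".getBytes(StandardCharsets.UTF_8);

    System.out.println(trim(clean) == clean);
    System.out.println(trim(blank) == EMPTY);
    System.out.println(new String(trim(padded), StandardCharsets.UTF_8));
  }
}
```

Returning the input itself is only safe when callers treat the value as immutable, which this sketch assumes.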