[SPARK-27839][SQL] Change UTF8String.replace() to operate on UTF8 bytes #24707

JoshRosen · 2019-05-25T07:18:46Z

What changes were proposed in this pull request?

This PR significantly improves the performance of UTF8String.replace() by performing direct replacement over UTF8 bytes instead of decoding those bytes into Java Strings.

In cases where the search string is not found (i.e. no replacements are performed, a case which I expect to be common) this new implementation performs no object allocation or memory copying.

My implementation is modeled after commons-lang3's StringUtils.replace() method. As part of my implementation, I needed a StringBuilder / resizable buffer, so I moved UTF8StringBuilder from the catalyst package to unsafe.
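For readers who want a concrete picture of the byte-level approach described above, here is a minimal sketch. It is not the code from this PR: it works on plain `byte[]` arrays rather than UTF8String, uses `ByteArrayOutputStream` in place of UTF8StringBuilder, and the class and method names (`ByteReplaceSketch`, `indexOf`, `replace`) are hypothetical. Matching raw bytes is safe because UTF-8 is self-synchronizing, so a valid UTF-8 search string can only match at character boundaries of valid UTF-8 text.

```java
import java.io.ByteArrayOutputStream;

// Illustrative sketch only; names are hypothetical and not part of Spark.
final class ByteReplaceSketch {

  /** Returns the index of the first occurrence of needle at or after from, or -1. */
  static int indexOf(byte[] haystack, byte[] needle, int from) {
    for (int i = from; i + needle.length <= haystack.length; i++) {
      int j = 0;
      while (j < needle.length && haystack[i + j] == needle[j]) {
        j++;
      }
      if (j == needle.length) {
        return i;
      }
    }
    return -1;
  }

  /** Replaces every occurrence of search; returns the input array itself if nothing matches. */
  static byte[] replace(byte[] text, byte[] search, byte[] replacement) {
    if (text.length == 0 || search.length == 0) {
      return text;
    }
    int end = indexOf(text, search, 0);
    if (end == -1) {
      return text; // no match: no allocation and no copying
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream(text.length);
    int start = 0;
    while (end != -1) {
      out.write(text, start, end - start);           // bytes before the match
      out.write(replacement, 0, replacement.length); // the replacement itself
      start = end + search.length;
      end = indexOf(text, search, start);
    }
    out.write(text, start, text.length - start);     // trailing bytes after the last match
    return out.toByteArray();
  }
}
```

In the no-match case the method returns the very same array reference, which mirrors the "no object allocation or memory copying" property claimed in the description.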

How was this patch tested?

Copied tests from StringExpressionSuite to UTF8StringSuite and added a couple of new cases.

To evaluate performance, I did some quick local benchmarking by running the following code in spark-shell (with Java 1.8.0_191):

import org.apache.spark.unsafe.types.UTF8String

def benchmark(text: String, search: String, replace: String): Unit = {
  val utf8Text = UTF8String.fromString(text)
  val utf8Search = UTF8String.fromString(search)
  val utf8Replace = UTF8String.fromString(replace)

  val start = System.currentTimeMillis
  var i = 0
  while (i < 1000 * 1000 * 100) {
    utf8Text.replace(utf8Search, utf8Replace)
    i += 1
  }
  val end = System.currentTimeMillis

  println(end - start)
}

benchmark("ABCDEFGH", "DEF", "ZZZZ")  // replacement occurs
benchmark("ABCDEFGH", "Z", "")  // no replacement occurs

On my laptop this took ~54 / ~40 seconds (replacement / no-replacement case, respectively) before this patch's changes and ~6.5 / ~3.8 seconds afterwards.

SparkQA · 2019-05-25T09:33:10Z

Test build #105779 has finished for PR 24707 at commit b06d917.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-05-25T15:55:48Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+    // Use reference equality to cheaply detect whether the replacement had no effect,
+    // in which case we can simply return the original UTF8String and save some copying.
+    if (before == after) {
+      return this;


One consideration here: do we need to make a defensive copy? If so, we can't do this optimization.

Why might we need to copy a UTF8String? The UTF8String instance itself is effectively immutable, but the underlying storage might be a region of potentially-not-exclusively-owned memory (either direct/off-heap memory or a region of a long[] array), so we might need to perform a copy if we're going to buffer or otherwise hold onto the UTF8String past a point where the underlying storage could be mutated.

I think the most common case to worry about would be a UTF8String which is backed by memory that is part of a larger UnsafeRow. If we're doing row-at-a-time processing and aren't holding onto this UTF8String across rows then I think we're ok, since changes to rows' memory during single-row processing would impact many parts of Spark and would probably be detected. In the few places where we do hold references across evaluations / rows we need to copy, but I suspect most places already do this: for example, see the regexp.clone() in the RegExpReplace expression.

My intuition is that we probably don't need to make a defensive copy here because I doubt we have parts of the code which specifically assume that replace() will copy (i.e. which are abusing replace() as a slow clone() mechanism). Put differently, I suspect that any code which would fail due to lack of copying in replace() is also vulnerable to this problem from other sources (including simply reading a string from a row without further modification), so I don't think we need to add extra copying here.

I'd love to get additional sets of eyes on this, though, and I'd ultimately be ok with changing return this to return this.clone() (and updating the other return this uses in UTF8String) if we conclude that this isn't safe (or are uncertain and want to err on the side of caution).
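To make the aliasing concern above concrete, the following sketch models a string view over storage it does not own. Everything here is hypothetical (`SharedBufferSketch`, `Utf8View`, `replaceNoMatch`, `copy`, `demo`); it only imitates the situation described in this comment and is not how UTF8String is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Illustrative sketch only; names are hypothetical and not part of Spark.
final class SharedBufferSketch {

  /** An immutable-looking view over a byte range that it does not own. */
  static final class Utf8View {
    private final byte[] base;
    private final int offset;
    private final int length;

    Utf8View(byte[] base, int offset, int length) {
      this.base = base;
      this.offset = offset;
      this.length = length;
    }

    /** Stand-in for a replace() call that found no match: returns this, no copy. */
    Utf8View replaceNoMatch() {
      return this;
    }

    /** Defensive copy into privately owned storage. */
    Utf8View copy() {
      return new Utf8View(Arrays.copyOfRange(base, offset, offset + length), 0, length);
    }

    @Override
    public String toString() {
      return new String(base, offset, length, StandardCharsets.UTF_8);
    }
  }

  /** Returns what the un-copied and the copied view read after the buffer is reused. */
  static String[] demo() {
    byte[] rowBuffer = "row-1".getBytes(StandardCharsets.UTF_8);
    Utf8View view = new Utf8View(rowBuffer, 0, rowBuffer.length);
    Utf8View held = view.replaceNoMatch(); // still aliases rowBuffer
    Utf8View safe = view.copy();           // owns its own bytes
    rowBuffer[4] = (byte) '2';             // buffer is overwritten for the next row
    return new String[] {held.toString(), safe.toString()};
  }
}
```

The view returned without copying observes the later write to the shared buffer, while the copied one does not. As the comment argues, the same hazard exists for a string read straight out of a row, so it is not specific to replace().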

srowen · 2019-05-26T18:45:31Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+    // Use reference equality to cheaply detect whether the replacement had no effect,
+    // in which case we can simply return the original UTF8String and save some copying.
+    if (before == after) {
+      return this;


I think your intuition is right here.

SparkQA · 2019-05-26T22:26:59Z

Test build #105803 has finished for PR 24707 at commit b51035b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk

@JoshRosen Would it make sense to implement replace directly on UTF8String without converting it to String? For example, using the lib or writing similar code: https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/tuple/ByteArrayUtil.html#replace-byte:A-byte:A-byte:A-

Also, the implementation of the replace() method in commons-lang3 does not look that complex. It uses indexOf, which we already have in UTF8String, plus a StringBuilder. A StringBuilder equivalent for UTF8String could be useful in other places too. WDYT?

JoshRosen · 2019-05-27T20:03:22Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

+    // At least one match was found. Estimate space needed for result.
+    int increase = replace.numBytes - search.numBytes;
+    increase = increase < 0 ? 0 : increase;
+    increase *= 16;


Modeled after https://github.com/apache/commons-lang/blob/999030a23c214a1fdcfc2f1464183e0c752777f5/src/main/java/org/apache/commons/lang3/StringUtils.java#L5617, which (somewhat arbitrarily?) allocates enough buffer space to handle 16 replacements without growth in case max is not set.
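As a worked example of the sizing heuristic in the hunk above, the sketch below reproduces the same arithmetic. The names (`ReplaceCapacitySketch`, `estimateIncrease`, `initialCapacity`) are hypothetical, and adding the estimate to the input length to get the initial buffer capacity is an assumption based on how commons-lang3 sizes its StringBuilder, not something shown in the quoted diff.

```java
// Illustrative sketch only; names are hypothetical and not part of Spark.
final class ReplaceCapacitySketch {

  /** Extra bytes reserved: growth per replacement, clamped at zero, times 16 replacements. */
  static int estimateIncrease(int searchBytes, int replaceBytes) {
    return Math.max(0, replaceBytes - searchBytes) * 16;
  }

  /** Assumed initial capacity: input length plus the estimated increase. */
  static int initialCapacity(int textBytes, int searchBytes, int replaceBytes) {
    return textBytes + estimateIncrease(searchBytes, replaceBytes);
  }
}
```

For the benchmark input from the description ("ABCDEFGH", 8 bytes, replacing the 3-byte "DEF" with the 4-byte "ZZZZ"), the estimate is (4 - 3) * 16 = 16 extra bytes, so the assumed initial capacity would be 24 bytes for a 9-byte result. When the replacement is not longer than the search string, the estimate is 0 and the buffer starts at the input length.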

JoshRosen · 2019-05-27T20:04:58Z

@MaxGekk, that's a great idea: I've gone ahead and re-implemented UTF8String.replace() to operate over the raw bytes, yielding a huge speedup (~10x in some cases).

Just pushed my changes and updated the description. I'm going to loop back later to give this a second pass of self-review, but wanted to get this new version posted up now for initial feedback.

SparkQA · 2019-05-27T22:01:23Z

Test build #105847 has finished for PR 24707 at commit 5c74048.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public static class IntArrays
public class FetchShuffleBlocks extends BlockTransferMessage
trait BaseEvalPython extends UnaryNode
case class BatchEvalPython(
case class ArrowEvalPython(
case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
abstract class EvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)

SparkQA · 2019-06-10T00:04:34Z

Test build #106331 has finished for PR 24707 at commit 6188dcd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2019-06-10T02:43:31Z

retest this please

SparkQA · 2019-06-10T04:46:21Z

Test build #106336 has finished for PR 24707 at commit 6188dcd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-06-19T02:11:34Z

@srowen @MaxGekk, I think this UTF8String.replace() optimization PR is now ready for review: I've incorporated @MaxGekk's suggestion of implementing this via direct operation on the UTF8 bytes.

JoshRosen · 2019-06-19T04:03:59Z

retest this please

MaxGekk

LGTM

SparkQA · 2019-06-19T16:31:51Z

Test build #4801 has finished for PR 24707 at commit 6188dcd.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

JoshRosen · 2019-06-19T22:00:45Z

Thanks to everyone who reviewed. I did another self-review pass and this still looks good to me, so I'm going to merge into master for Spark 3.x.

JoshRosen added 2 commits May 24, 2019 22:50

Optimize UTF8String.replace() / StringReplace expression.

f3082d3

Can't interpolate as literals due to Scala string escaping issue.

b06d917

JoshRosen commented May 25, 2019

srowen reviewed May 26, 2019

Correct referenceObj variable naming typo

b51035b

MaxGekk reviewed May 27, 2019

JoshRosen added 3 commits May 27, 2019 12:03

Implement direct replace() over UTF8String bytes.

6fd7714

Roll back codegen changes.

ec423f1

Merge remote-tracking branch 'origin/master' into faster-string-replace

5c74048

JoshRosen commented May 27, 2019

JoshRosen changed the title from "[SPARK-27839][SQL] Improve UTF8String.replace() / StringReplace performance" to "[SPARK-27839][SQL] Change UTF8String.replace() to operate on UTF8 bytes" May 27, 2019

JoshRosen added 2 commits June 9, 2019 15:36

Fix comment typo

8123e42

Remove ternary operator

6188dcd

dongjoon-hyun added the SQL label Jun 14, 2019

JoshRosen requested a review from srowen June 19, 2019 02:10

srowen approved these changes Jun 19, 2019

MaxGekk approved these changes Jun 19, 2019

JoshRosen commented Jun 19, 2019

JoshRosen closed this in fc65e0f Jun 19, 2019

MaxGekk mentioned this pull request Jun 23, 2020

[MINOR][SQL] Simplify DateTimeUtils.cleanLegacyTimestampStr #28892

Closed
