Contributor

@JoshRosen JoshRosen commented May 25, 2019

What changes were proposed in this pull request?

This PR significantly improves the performance of UTF8String.replace() by performing direct replacement over UTF8 bytes instead of decoding those bytes into Java Strings.

In cases where the search string is not found (i.e. no replacements are performed, a case which I expect to be common) this new implementation performs no object allocation or memory copying.

My implementation is modeled after commons-lang3's StringUtils.replace() method. As part of my implementation, I needed a StringBuilder / resizable buffer, so I moved UTF8StringBuilder from the catalyst package to unsafe.
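For readers following along, the approach described above can be sketched roughly as follows. This is a minimal illustration over plain byte[] arrays, not the code in this PR: the class ByteReplaceSketch and its helpers are hypothetical names, and ByteArrayOutputStream stands in for the resizable UTF8StringBuilder. Byte-wise matching is safe on valid UTF-8 because the encoding is self-synchronizing, so an encoded search string cannot match starting in the middle of another character.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; not the UTF8String implementation from this PR.
final class ByteReplaceSketch {

  /** Index of the first occurrence of needle in haystack at or after from, or -1. */
  static int indexOf(byte[] haystack, byte[] needle, int from) {
    for (int i = from; i + needle.length <= haystack.length; i++) {
      int j = 0;
      while (j < needle.length && haystack[i + j] == needle[j]) {
        j++;
      }
      if (j == needle.length) {
        return i;
      }
    }
    return -1;
  }

  /** Replaces every occurrence of search in text, working only on encoded bytes. */
  static byte[] replace(byte[] text, byte[] search, byte[] replacement) {
    if (text.length == 0 || search.length == 0) {
      return text;
    }
    int end = indexOf(text, search, 0);
    if (end == -1) {
      // No match: hand back the input array itself, with no allocation or copying.
      return text;
    }
    ByteArrayOutputStream out = new ByteArrayOutputStream(text.length);
    int start = 0;
    while (end != -1) {
      out.write(text, start, end - start);            // bytes before the match
      out.write(replacement, 0, replacement.length);  // the replacement itself
      start = end + search.length;
      end = indexOf(text, search, start);
    }
    out.write(text, start, text.length - start);      // trailing bytes after the last match
    return out.toByteArray();
  }

  public static void main(String[] args) {
    byte[] text = "ABCDEFGH".getBytes(StandardCharsets.UTF_8);
    byte[] replaced = replace(text,
        "DEF".getBytes(StandardCharsets.UTF_8),
        "ZZZZ".getBytes(StandardCharsets.UTF_8));
    System.out.println(new String(replaced, StandardCharsets.UTF_8)); // ABCZZZZGH

    byte[] untouched = replace(text, "Z".getBytes(StandardCharsets.UTF_8), new byte[0]);
    System.out.println(untouched == text); // true
  }
}
```

The second call shows the no-match fast path: the caller gets the same array back, so a reference comparison is enough to tell that nothing changed.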

How was this patch tested?

Copied tests from StringExpressionsSuite to UTF8StringSuite and added a couple of new cases.

To evaluate performance, I did some quick local benchmarking by running the following code in spark-shell (with Java 1.8.0_191):

import org.apache.spark.unsafe.types.UTF8String

def benchmark(text: String, search: String, replace: String): Unit = {
  val utf8Text = UTF8String.fromString(text)
  val utf8Search = UTF8String.fromString(search)
  val utf8Replace = UTF8String.fromString(replace)

  val start = System.currentTimeMillis
  var i = 0
  while (i < 1000 * 1000 * 100) {
    utf8Text.replace(utf8Search, utf8Replace)
    i += 1
  }
  val end = System.currentTimeMillis

  println(end - start)
}

benchmark("ABCDEFGH", "DEF", "ZZZZ")  // replacement occurs
benchmark("ABCDEFGH", "Z", "")  // no replacement occurs

On my laptop this took ~54 / ~40 seconds (replacement / no-replacement case) before this patch's changes and ~6.5 / ~3.8 seconds afterwards.

SparkQA commented May 25, 2019

Test build #105779 has finished for PR 24707 at commit b06d917.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Use reference equality to cheaply detect whether the replacement had no effect,
// in which case we can simply return the original UTF8String and save some copying.
if (before == after) {
return this;
Contributor Author

One consideration here: do we need to make a defensive copy? If so, we can't do this optimization.

Why might we need to copy a UTF8String? The UTF8String instance itself is effectively immutable, but the underlying storage might be a region of potentially-not-exclusively-owned memory (either direct/off-heap memory or a region of a long[] array), so we might need to perform a copy if we're going to buffer or otherwise hold onto the UTF8String past the point where the underlying storage could be mutated.

I think the most common case to worry about would be a UTF8String which is backed by memory that is part of a larger UnsafeRow. If we're doing row-at-a-time processing and aren't holding onto this UTF8String across rows, then I think we're ok, since changes to a row's memory during single-row processing would impact many parts of Spark and would probably be detected. In the few places where we do hold references across evaluations / rows, we need to copy, but I suspect most places already do this: for example, see the regexp.clone() in the RegExpReplace expression.

My intuition is that we probably don't need to make a defensive copy here because I doubt we have parts of the code which specifically assume that replace() will copy (i.e. which are abusing replace() as a slow clone() mechanism). Put differently, I suspect that any code which would fail due to lack of copying in replace() is also vulnerable to this problem from other sources (including simply reading a string from a row without further modification), so I don't think we need to add extra copying here.

I'd love to get additional sets of eyes on this, though, and I'd ultimately be ok with changing return this to return this.clone() (and updating the other return this uses in UTF8String) if we conclude that this isn't safe (or are uncertain and want to err on the side of caution).
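To make the aliasing concern concrete, here is a toy sketch. Utf8View is a hypothetical stand-in for a string backed by memory it does not own; it is not Spark's UTF8String, and a plain byte[] plays the role of a row buffer that gets reused for the next row.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical illustration of the defensive-copy question; not Spark code.
final class SharedBufferSketch {

  /** A string-like view over a slice of a buffer that someone else may overwrite. */
  static final class Utf8View {
    final byte[] base;
    final int offset;
    final int length;

    Utf8View(byte[] base, int offset, int length) {
      this.base = base;
      this.offset = offset;
      this.length = length;
    }

    /** Defensive copy: the result owns its bytes and no longer aliases the buffer. */
    Utf8View copy() {
      return new Utf8View(Arrays.copyOfRange(base, offset, offset + length), 0, length);
    }

    @Override
    public String toString() {
      return new String(base, offset, length, StandardCharsets.UTF_8);
    }
  }

  /** Simulates the buffer being reused for the next row. */
  static void overwrite(byte[] buffer, String nextRow) {
    byte[] bytes = nextRow.getBytes(StandardCharsets.UTF_8);
    System.arraycopy(bytes, 0, buffer, 0, bytes.length);
  }

  public static void main(String[] args) {
    byte[] rowBuffer = "hello".getBytes(StandardCharsets.UTF_8);
    Utf8View view = new Utf8View(rowBuffer, 0, rowBuffer.length);
    Utf8View held = view.copy();   // what a caller holding the value across rows would do

    overwrite(rowBuffer, "world"); // the next row lands in the same memory

    System.out.println(view); // world
    System.out.println(held); // hello
  }
}
```

A replace() that returns this on the no-match path behaves like view here: it is only a problem for callers that keep the result across rows without copying, which is the same exposure they would already have when reading the string straight from the row.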

Member

I think your intuition is right here.

SparkQA commented May 26, 2019

Test build #105803 has finished for PR 24707 at commit b51035b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

@MaxGekk MaxGekk left a comment

@JoshRosen Would it make sense to implement replace directly on UTF8String without converting it to String? For example, using the lib or writing similar code: https://apple.github.io/foundationdb/javadoc/com/apple/foundationdb/tuple/ByteArrayUtil.html#replace-byte:A-byte:A-byte:A-

Also, the implementation of the replace() method in commons-lang3 does not look very complex. It uses indexOf, which we already have in UTF8String, plus a StringBuilder. A StringBuilder equivalent for UTF8String could be useful in other places too. WDYT?

// At least one match was found. Estimate space needed for result.
int increase = replace.numBytes - search.numBytes;
increase = increase < 0 ? 0 : increase;
increase *= 16;
Contributor Author

Modeled after https://github.com/apache/commons-lang/blob/999030a23c214a1fdcfc2f1464183e0c752777f5/src/main/java/org/apache/commons/lang3/StringUtils.java#L5617, which (somewhat arbitrarily?) allocates enough buffer space to handle 16 replacements without growth in case max is not set.
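As a rough worked example of that sizing heuristic (a sketch only: the helper name initialCapacity is hypothetical, and it assumes the buffer starts at the input length plus the estimated increase, which is how commons-lang3 sizes its StringBuilder):

```java
// Hypothetical helper illustrating the estimate; not code from this PR.
final class BufferEstimateSketch {

  /** Initial capacity that fits up to 16 replacements without growing the buffer. */
  static int initialCapacity(int textBytes, int searchBytes, int replaceBytes) {
    // Each replacement grows the result by (replace - search) bytes, never less than zero.
    int increase = Math.max(replaceBytes - searchBytes, 0);
    increase *= 16;
    return textBytes + increase;
  }

  public static void main(String[] args) {
    // "ABCDEFGH" (8 bytes), "DEF" -> "ZZZZ": one extra byte per replacement.
    System.out.println(initialCapacity(8, 3, 4)); // 24
    // "ABCDEFGH" (8 bytes), "Z" -> "": replacements only shrink the result.
    System.out.println(initialCapacity(8, 1, 0)); // 8
  }
}
```

The factor only sets the starting capacity. Inputs with more than 16 growing replacements still work; they just trigger a buffer resize along the way.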

@JoshRosen JoshRosen changed the title from "[SPARK-27839][SQL] Improve UTF8String.replace() / StringReplace performance" to "[SPARK-27839][SQL] Change UTF8String.replace() to operate on UTF8 bytes" on May 27, 2019
@JoshRosen
Contributor Author

@MaxGekk, that's a great idea: I've gone ahead and re-implemented UTF8String.replace() to operate over the raw bytes, yielding a huge speedup (~10x in some cases).

Just pushed my changes and updated the description. I'm going to loop back later to give this a second pass of self-review, but wanted to get this new version posted up now for initial feedback.

SparkQA commented May 27, 2019

Test build #105847 has finished for PR 24707 at commit 5c74048.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
      • public static class IntArrays
      • public class FetchShuffleBlocks extends BlockTransferMessage
      • trait BaseEvalPython extends UnaryNode
      • case class BatchEvalPython(
      • case class ArrowEvalPython(
      • case class ArrowEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
      • case class BatchEvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)
      • abstract class EvalPythonExec(udfs: Seq[PythonUDF], resultAttrs: Seq[Attribute], child: SparkPlan)

SparkQA commented Jun 10, 2019

Test build #106331 has finished for PR 24707 at commit 6188dcd.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

kiszk commented Jun 10, 2019

retest this please

SparkQA commented Jun 10, 2019

Test build #106336 has finished for PR 24707 at commit 6188dcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen JoshRosen requested a review from srowen June 19, 2019 02:10
@JoshRosen
Contributor Author

@srowen @MaxGekk, I think this UTF8String.replace() optimization PR is now ready for review: I've incorporated @MaxGekk's suggestion of implementing this via direct operation on the UTF8 bytes.

@JoshRosen
Contributor Author

retest this please

Member

@MaxGekk MaxGekk left a comment

LGTM

SparkQA commented Jun 19, 2019

Test build #4801 has finished for PR 24707 at commit 6188dcd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@JoshRosen
Contributor Author

Thanks to everyone who reviewed. I did another self-review pass and this still looks good to me, so I'm going to merge into master for Spark 3.x.
