[SPARK-23713][SQL] Cleanup UnsafeWriter and BufferHolder classes #20850

kiszk · 2018-03-17T20:16:53Z

What changes were proposed in this pull request?

This PR implemented the following cleanups related to UnsafeWriter class:

Remove code duplication between UnsafeRowWriter and UnsafeArrayWriter
Make BufferHolder class internal by delegating its accessor methods to UnsafeWriter
Replace UnsafeRow.setTotalSize(...) with UnsafeRowWriter.setTotalSize()

How was this patch tested?

Tested by existing UTs

SparkQA · 2018-03-17T20:59:34Z

Test build #88342 has finished for PR 20850 at commit 0379f7c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
public final class BufferHolder

SparkQA · 2018-03-17T21:10:02Z

Test build #88343 has finished for PR 20850 at commit 06e7435.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-03-18T02:12:29Z

I think the failure of org.apache.spark.ml.r.RWrapperUtilsSuite.avoid libsvm data column name conflicting may not be related to this PR. However, I am investigating the reason for this failure.

kiszk · 2018-03-18T02:12:44Z

cc @hvanhovell

kiszk · 2018-03-18T06:57:33Z

retest this please

SparkQA · 2018-03-18T07:47:45Z

Test build #88347 has finished for PR 20850 at commit 06e7435.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-03-19T08:46:09Z

retest this please

SparkQA · 2018-03-19T09:28:20Z

Test build #88373 has finished for PR 20850 at commit 06e7435.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-03-19T10:08:01Z

This test fails consistently. Although I did not change the file reader, I am investigating the cause in my environment.

hvanhovell · 2018-03-19T10:46:18Z

.../main/scala/org/apache/spark/sql/catalyst/expressions/codegen/GenerateUnsafeProjection.scala

-              final int $tmpCursor = $bufferHolder.cursor;
-              ${writeStructToBuffer(ctx, input.value, t.map(_.dataType), bufferHolder)}
-              $rowWriter.setOffsetAndSize($index, $tmpCursor, $bufferHolder.cursor - $tmpCursor);
+              final int $tmpCursor = $rowWriter.cursor();


It seems a bit weird that we are storing state that is internal to the UnsafeWriter/BufferHolder here. It would be very nice if we could internalize this code into the UnsafeWriter.

Yeah, I agree with you. I will internalize this frequently used code.

hvanhovell · 2018-03-19T10:48:18Z

@kiszk this is a good start! This is very performance critical code, can you please extend/update/run the existing UnsafeProjectionBenchmark?

hvanhovell · 2018-03-19T10:49:49Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

+
+  int getCursor() { return cursor; }
+
+  void addCursor(int val) { cursor += val; }


incrementCursor?

hvanhovell · 2018-03-19T10:50:54Z

...alyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java

+    super(writer.getBufferHolder());
+  }
+
+  public void initialize(int numElements, int elementSize) {


Should we move elementSize into the constructor? I don't think there are cases where we are reusing UnsafeArrayWriters.

hvanhovell · 2018-03-19T10:51:42Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

 * if the fields of row are all fixed-length, as the size of result row is also fixed.
 */
-public class BufferHolder {
+public final class BufferHolder {


Why is this still public since you are making everything package private?

maropu · 2018-03-19T08:58:44Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

  }

-  public void reset() {
+  byte[] buffer() { return buffer; }


nit: add line breaks so the style matches other code?

spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/UTF8String.java

Line 197 in 4de638c

public int numBytes() {

byte[] buffer() { return buffer; }

maropu · 2018-03-19T09:35:36Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

+
+  int getCursor() { return cursor; }
+
+  void addCursor(int val) { cursor += val; }


advanceCursor?

maropu · 2018-03-19T09:37:52Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeWriter.java

+    return holder;
+  }
+
+  public final byte[] buffer() { return holder.buffer(); }


Do we need these delegator methods? How about making holder protected, the same as in WritableColumnVector?

spark/sql/core/src/main/java/org/apache/spark/sql/execution/vectorized/WritableColumnVector.java

Line 130 in 4de638c

protected Dictionary dictionary;

Even if we make holder package-private (default) in the org.apache.spark.sql.catalyst.expressions.codegen package, it is inaccessible from the org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection class.
We also do not want to expose the BufferHolder class outside the Unsafe*Row classes.
WDYT?

No performance impact?

maropu · 2018-03-19T11:04:54Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

+    this.startingOffset = cursor();
+  }
+
+  public void setTotalSize() {


How about renaming this to flip, in line with Java's ByteBuffer? If we call row.setTotalSize(totalSize) and reset the BufferHolder positions inside flip, can we remove the UnsafeWriter.reset call at the head?

It could be. That said, the current approach using reset and setTotalSize() makes the generated code easy to read.
The beginning and end of the region are clear. If it is critical to remove the UnsafeWriter.reset method, I agree with renaming to flip.
WDYT?

If we'd like to make generated code blocks easy to read, I think we should depend on generated comments instead of API names. Anyway, this decision depends on other devs' thoughts.
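For readers unfamiliar with the `flip` naming referred to above, here is a minimal sketch of what `java.nio.ByteBuffer.flip()` does: it ends the write phase by setting the limit to the current position and rewinding the position to zero. `FlipSketch` is a hypothetical class used only for illustration; it says nothing about how `UnsafeRowWriter` would actually implement such a method.

```java
// Illustration of java.nio.ByteBuffer.flip(); FlipSketch is a hypothetical name.
final class FlipSketch {
  // Writes 12 bytes (one long, one int) and then flips the buffer for reading.
  static java.nio.ByteBuffer writeAndFlip() {
    java.nio.ByteBuffer buf = java.nio.ByteBuffer.allocate(32);
    buf.putLong(42L);  // position moves from 0 to 8
    buf.putInt(7);     // position moves from 8 to 12
    buf.flip();        // limit = 12 (bytes written), position = 0
    return buf;
  }

  public static void main(String[] args) {
    java.nio.ByteBuffer buf = writeAndFlip();
    System.out.println("limit=" + buf.limit() + " position=" + buf.position());
  }
}
```

In this analogy, the limit after `flip()` plays the role of the row's total size, and a later `clear()` plays the role of `reset`.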

hvanhovell · 2018-03-19T11:44:58Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java


-  public UnsafeRowWriter(BufferHolder holder, int numFields) {
-    this.holder = holder;
+  public UnsafeRowWriter(UnsafeRow row, int initialBufferSize) {


Do we really need two UnsafeRow constructors?

For the top-level row writer I also think it might be nice to create the row internally, and just have a constructor that takes a numFields and (optionally) a size argument.

hvanhovell · 2018-03-19T11:45:54Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeWriter.java

+    addCursor(16);
+  }
+
+  protected final void _write(long offset, boolean value) {


Why the _write names? Just call them writeBoolean etc.

SparkQA · 2018-03-19T17:55:47Z

Test build #88378 has finished for PR 20850 at commit b696b7c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-19T21:02:33Z

Test build #88380 has finished for PR 20850 at commit c342f0d.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

hvanhovell · 2018-03-19T22:40:52Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

+    cursor += val;
+  }
+
+  int pushCursor() {


Can we make this a little bit less complex? I think just storing the cursor in the UnsafeWriter is enough.

Since one BufferHolder is shared by multiple UnsafeWriters, it seems to be simple to store cursors into BufferHolders.
WDYT?

I also feel the current code is a bit complicated. Can't we avoid the sharing?

Yeah, it is complicated. pushCursor will be dropped.

maropu · 2018-03-19T22:47:59Z

@hvanhovell btw (this is not related to this PR, though), most of the code in UTF8StringBuffer and BufferHolder overlaps. So, could we clean up there, too? master...maropu:CleanupBufferImpl

kiszk · 2018-03-20T05:18:16Z

...t/src/main/scala/org/apache/spark/sql/catalyst/expressions/InterpretedUnsafeProjection.scala

-          val tmpCursor = bufferHolder.cursor
-          writeArray(bufferHolder, arrayWriter, elementWriter, v.getArray(i), elementSize)
-          writer.setOffsetAndSize(i, tmpCursor, bufferHolder.cursor - tmpCursor)
+          arrayWriter.markCursor()


From a performance point of view, this abstraction may have more impact, since we move a temporary value from the local frame to the Java stack:

    arrayWriter.markCursor()
    writeArray(arrayWriter, elementWriter, v.getArray(i))
    writer.setOffsetAndSizeFromMark(i)

Does this implementation strike a good enough balance between performance and abstraction? Or is it better to do it like this?

    val mark = arrayWriter.cursor()
    writeArray(arrayWriter, elementWriter, v.getArray(i))
    writer.setOffsetAndSizeFromMark(i, mark)

@maropu @hvanhovell WDYT?
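The second variant (the caller keeps the mark in a local variable) can be sketched in plain Java as follows. `CursorSketch` is a simplified, hypothetical model rather than Spark's `UnsafeWriter`: the buffer has a fixed size, and the fixed-length region is stood in for by a `long[]`.

```java
// Hypothetical model of the "explicit mark" variant; not Spark's UnsafeWriter.
final class CursorSketch {
  private final byte[] buffer = new byte[64];  // never grows in this sketch
  private final long[] slots = new long[4];    // stands in for the fixed-length region
  private int cursor = 0;

  int cursor() { return cursor; }

  // Appends variable-length data at the cursor and advances it.
  void writeBytes(byte[] data) {
    System.arraycopy(data, 0, buffer, cursor, data.length);
    cursor += data.length;
  }

  // Records (offset, size) of everything written since `mark` into slot `ordinal`.
  void setOffsetAndSizeFromMark(int ordinal, int mark) {
    long size = cursor - mark;
    slots[ordinal] = ((long) mark << 32) | size;
  }

  long slot(int ordinal) { return slots[ordinal]; }

  public static void main(String[] args) {
    CursorSketch w = new CursorSketch();
    w.writeBytes(new byte[3]);
    int mark = w.cursor();               // the mark lives in a local variable
    w.writeBytes(new byte[5]);
    w.setOffsetAndSizeFromMark(1, mark);
    System.out.println(Long.toHexString(w.slot(1)));
  }
}
```

With this shape the writer holds no extra mark state, so nested writers sharing one buffer do not need a stack of saved cursors.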

SparkQA · 2018-03-20T11:34:48Z

Test build #88411 has finished for PR 20850 at commit 3637a5c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2018-03-30T02:41:02Z

Test build #88728 has finished for PR 20850 at commit a94d470.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-03-30T17:54:41Z

retest this please

SparkQA · 2018-03-30T20:52:38Z

Test build #88760 has finished for PR 20850 at commit a94d470.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-03-31T02:46:58Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

+    return cursor;
+  }
+
+  void incrementCursor(int val) {


typo, should be increaseCursor

cloud-fan · 2018-03-31T02:51:05Z

...alyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java

-    final long offsetAndSize = (relativeOffset << 32) | (long)size;
-
-    write(ordinal, offsetAndSize);
+    _setOffsetAndSizeFromPreviousCursor(ordinal, mark);


_setOffsetAndSizeFromPreviousCursor calls setOffsetAndSize, which calls write and then calls assertIndexIsValid.

So we don't need _setOffsetAndSizeFromPreviousCursor

Ah, good catch
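For context, the removed lines in the diff above pack a relative offset and a size into a single 8-byte word. A minimal sketch of that packing and its inverse, assuming non-negative sizes that fit in 32 bits; `OffsetAndSize` is a hypothetical helper name, not a class in Spark.

```java
// Hypothetical helper illustrating the (relativeOffset << 32) | size encoding.
final class OffsetAndSize {
  // Upper 32 bits hold the offset, lower 32 bits hold the size (assumed >= 0).
  static long pack(long relativeOffset, int size) {
    return (relativeOffset << 32) | (long) size;
  }

  // Recovers the offset from the upper 32 bits.
  static int offset(long packed) { return (int) (packed >> 32); }

  // Recovers the size from the lower 32 bits.
  static int size(long packed) { return (int) packed; }

  public static void main(String[] args) {
    long packed = pack(16, 5);
    System.out.println(offset(packed) + " " + size(packed));
  }
}
```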

SparkQA · 2018-04-01T02:18:58Z

Test build #88787 has finished for PR 20850 at commit 6caf11c.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2018-04-01T07:04:41Z

retest this please

SparkQA · 2018-04-01T10:40:01Z

Test build #88789 has finished for PR 20850 at commit 6caf11c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-04-01T15:02:54Z

sql/catalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/BufferHolder.java

  }

-  public void reset() {
+  byte[] buffer() {


nit: getBuffer is more java-style

cloud-fan · 2018-04-01T15:13:09Z

...alyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeArrayWriter.java

+    BitSetMethods.set(buffer(), startingOffset + 8, ordinal);
  }

  public void setNull1Bytes(int ordinal) {


seems now we only need a single setNullAt method.

IIUC, UnsafeRowWriter needs a single setNullAt method for 8-byte-wide fields.
On the other hand, UnsafeArrayWriter needs multiple setNull?Bytes() methods for different element sizes. Generated code also uses setNull?Bytes for array elements.

Could you elaborate on your thought?

We still need the various setNull* methods because of arrays.

nvm, I thought we can pattern match the elementSize, but that may hurt performance a lot for the codegen version.

cloud-fan · 2018-04-01T15:16:01Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

    final long offset = getFieldOffset(ordinal);
-    Platform.putLong(holder.buffer, offset, 0L);
-    Platform.putBoolean(holder.buffer, offset, value);
+    Platform.putLong(buffer(), offset, 0L);


writeLong(0)?

cloud-fan · 2018-04-01T15:16:39Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

    final long offset = getFieldOffset(ordinal);
-    Platform.putLong(holder.buffer, offset, 0L);
-    Platform.putBoolean(holder.buffer, offset, value);
+    Platform.putLong(buffer(), offset, 0L);


writeLong(offset, 0L)?

cloud-fan · 2018-04-01T15:18:13Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

  public void write(int ordinal, Decimal input, int precision, int scale) {
    if (precision <= Decimal.MAX_LONG_DIGITS()) {
      // make sure Decimal object has the same scale as DecimalType
      if (input.changePrecision(precision, scale)) {


do we need input != null here like https://github.com/apache/spark/pull/20850/files#diff-85658ffc242280699a331c90530f54baR149

Good point. I will add input != null.
I am also curious about the differences between these two methods.

hvanhovell · 2018-04-01T21:41:26Z

sql/core/src/main/scala/org/apache/spark/sql/execution/columnar/GenerateColumnAccessor.scala

-          unsafeRow.setTotalSize(bufferHolder.totalSize());
-          return unsafeRow;
+          rowWriter.setTotalSize();
+          return rowWriter.getRow();


Is there any place where we call getRow() without calling setTotalSize() before that? If there aren't then I'd combine the two.

I think this is the only place where we call getRow() without calling setTotalSize(), when numVarLenFields == 0.

Should we call reset() and setTotalSize() as the interpreted version does?

SparkQA · 2018-04-02T06:13:53Z

Test build #88801 has finished for PR 20850 at commit 9dc36b7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2018-04-02T06:48:50Z

...atalyst/src/main/java/org/apache/spark/sql/catalyst/expressions/codegen/UnsafeRowWriter.java

+ *
+ * Generally we should call `UnsafeRowWriter.setTotalSize` to update the size of the result row,
+ * after writing a record to the buffer. However, we can skip this step if the fields of row are
+ * all fixed-length, as the size of result row is also fixed.


Not sure if this optimization is really necessary. Maybe we can always update total size in getRow.

Got it. We will merge setTotalSize and getRow into getRow.
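A rough sketch of the agreed direction, where `getRow()` always updates the total size so callers cannot forget to do it. `RowWriterSketch` and its nested `Row` are hypothetical stand-ins; the fixed region is simplified to 8 bytes per field and the null bitset is ignored.

```java
// Hypothetical model of folding setTotalSize() into getRow(); not Spark's classes.
final class RowWriterSketch {
  static final class Row { int totalSize; }

  private final Row row = new Row();
  private final int fixedRegionSize;
  private int cursor;

  RowWriterSketch(int numFields) {
    this.fixedRegionSize = 8 * numFields;  // simplification: no null bitset
    this.cursor = fixedRegionSize;
  }

  // Starts a new record: the cursor points just past the fixed-length region.
  void reset() { cursor = fixedRegionSize; }

  // Models writing numBytes of variable-length data.
  void appendVariableLength(int numBytes) { cursor += numBytes; }

  // Always refreshes the row size, even when all fields are fixed-length.
  Row getRow() {
    row.totalSize = cursor;
    return row;
  }

  public static void main(String[] args) {
    RowWriterSketch w = new RowWriterSketch(2);
    w.reset();
    w.appendVariableLength(24);
    System.out.println(w.getRow().totalSize);
  }
}
```

The extra assignment for all-fixed-length rows is the cost being traded for a simpler contract.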

cloud-fan · 2018-04-02T06:54:09Z

LGTM

hvanhovell · 2018-04-02T10:08:14Z

@kiszk can you rerun the UnsafeProjectionBenchmark to make sure we didn't regress anywhere?

Otherwise LGTM.

kiszk · 2018-04-02T17:46:59Z

@hvanhovell Here are the results of UnsafeProjectionBenchmark. I have not seen any regression.

With master 529f847105fa8d98a5dc4d20955e4870df6bc1c5

OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
unsafe projection:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
single long                                   1638 / 1638        163.9           6.1       1.0X
single nullable long                          2375 / 2568        113.0           8.8       0.7X
7 primitive types                             5108 / 5234         52.6          19.0       0.3X
7 nullable primitive types                    7809 / 7909         34.4          29.1       0.2X

With SPARK-23713

OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
unsafe projection:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
single long                                   1630 / 1630        164.7           6.1       1.0X
single nullable long                          2354 / 2403        114.0           8.8       0.7X
7 primitive types                             5107 / 5174         52.6          19.0       0.3X
7 nullable primitive types                    7867 / 7938         34.1          29.3       0.2X

hvanhovell · 2018-04-02T19:41:22Z

Merging to master. Thanks for your hard work and patience.

## What changes were proposed in this pull request?

This PR implemented the following cleanups related to `UnsafeWriter` class:
- Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter`
- Make `BufferHolder` class internal by delegating its accessor methods to `UnsafeWriter`
- Replace `UnsafeRow.setTotalSize(...)` with `UnsafeRowWriter.setTotalSize()`

## How was this patch tested?

Tested by existing UTs

Author: Kazuaki Ishizaki
Closes apache#20850 from kiszk/SPARK-23713.

## What changes were proposed in this pull request?

In #20850, when writing non-null decimals, instead of zeroing all 16 allocated bytes, we zero out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, the bytes between 8 and 16 are not zeroed.

I see 2 solutions here:
- we can zero out all the bytes in advance, as was done before #20850 (safer solution IMHO);
- we can allocate only the needed bytes (may be a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option).

Hence I propose the first solution here in order to fix the correctness issue. We can eventually switch to the second later if we think it is more efficient.

## How was this patch tested?

Running the test attached in the JIRA + added UT

Closes #22602 from mgaido91/SPARK-25582.
Authored-by: Marco Gaido
Signed-off-by: Dongjoon Hyun
(cherry picked from commit d7ae36a)
Signed-off-by: Dongjoon Hyun
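The problem described in this follow-up can be reproduced in miniature. The sketch below assumes that "padding bytes" means the bytes up to the next 8-byte boundary after the value, and uses a hypothetical `DecimalSlotSketch` class with plain `byte[]` buffers rather than Spark's `Platform` calls.

```java
// Hypothetical illustration of stale bytes in a reused 16-byte decimal slot.
final class DecimalSlotSketch {
  static final int SLOT = 16;  // bytes always reserved per decimal

  // Zeroes the whole slot first, then copies the value bytes.
  static void writeZeroAll(byte[] buf, int offset, byte[] value) {
    java.util.Arrays.fill(buf, offset, offset + SLOT, (byte) 0);
    System.arraycopy(value, 0, buf, offset, value.length);
  }

  // Assumed model of the regression: zeroes only up to the next 8-byte boundary.
  static void writeZeroPaddingOnly(byte[] buf, int offset, byte[] value) {
    int rounded = (value.length + 7) / 8 * 8;
    java.util.Arrays.fill(buf, offset + value.length, offset + rounded, (byte) 0);
    System.arraycopy(value, 0, buf, offset, value.length);
  }

  public static void main(String[] args) {
    byte[] reused = new byte[SLOT];
    java.util.Arrays.fill(reused, (byte) 0x7f);  // stale data from a previous row
    writeZeroPaddingOnly(reused, 0, new byte[] {1, 2, 3});
    System.out.println(java.util.Arrays.toString(reused));
  }
}
```

With a value shorter than 9 bytes, the padding-only variant leaves bytes 8 to 15 holding whatever the previous row wrote, so two rows holding the same decimal value can differ byte for byte.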



kiszk added 2 commits March 17, 2018 21:10

initial commit

0379f7c

update comment

06e7435

kiszk added 3 commits March 19, 2018 15:30

fix test failure

b696b7c

address review comments

760d08b

address review comment

c342f0d

refinements

3637a5c

address review comments

6caf11c

address review comment

9dc36b7

address review comment

209da24

asfgit closed this in a7c19d9 Apr 2, 2018

mgaido91 mentioned this pull request Oct 1, 2018

[SPARK-25538][SQL] Zero-out all bytes when writing decimal #22602

Closed

