Member

@kiszk kiszk commented Mar 17, 2018

What changes were proposed in this pull request?

This PR implemented the following cleanups related to UnsafeWriter class:

  • Remove code duplication between UnsafeRowWriter and UnsafeArrayWriter
  • Make BufferHolder class internal by delegating its accessor methods to UnsafeWriter
  • Replace UnsafeRow.setTotalSize(...) with UnsafeRowWriter.setTotalSize()

How was this patch tested?

Tested by existing UTs

SparkQA commented Mar 17, 2018

Test build #88342 has finished for PR 20850 at commit 0379f7c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public final class BufferHolder

SparkQA commented Mar 17, 2018

Test build #88343 has finished for PR 20850 at commit 06e7435.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Mar 18, 2018

I think the failure of org.apache.spark.ml.r.RWrapperUtilsSuite.avoid libsvm data column name conflicting may not be related to this PR. However, I am investigating the reason for this failure.

Member Author

kiszk commented Mar 18, 2018

cc @hvanhovell

Member Author

kiszk commented Mar 18, 2018

retest this please

SparkQA commented Mar 18, 2018

Test build #88347 has finished for PR 20850 at commit 06e7435.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member

maropu commented Mar 19, 2018

retest this please

SparkQA commented Mar 19, 2018

Test build #88373 has finished for PR 20850 at commit 06e7435.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Mar 19, 2018

This test fails consistently. Although I did not change the file reader, I am investigating the cause in my environment.

final int $tmpCursor = $bufferHolder.cursor;
${writeStructToBuffer(ctx, input.value, t.map(_.dataType), bufferHolder)}
$rowWriter.setOffsetAndSize($index, $tmpCursor, $bufferHolder.cursor - $tmpCursor);
final int $tmpCursor = $rowWriter.cursor();
Contributor

It seems a bit weird that we are storing state internal to the UnsafeWriter/BufferHolder here. It would be very nice if we could internalize this code into the UnsafeWriter.

Member Author

Yeah, I agree with you. I will internalize this frequently used code.

@hvanhovell
Contributor

@kiszk this is a good start! This is very performance critical code, can you please extend/update/run the existing UnsafeProjectionBenchmark?


int getCursor() { return cursor; }

void addCursor(int val) { cursor += val; }
Contributor

incrementCursor?

super(writer.getBufferHolder());
}

public void initialize(int numElements, int elementSize) {
Contributor

Should we move elementSize into the constructor? I don't think there are any cases where we reuse UnsafeArrayWriters.

* if the fields of row are all fixed-length, as the size of result row is also fixed.
*/
public class BufferHolder {
public final class BufferHolder {
Contributor

Why is this still public since you are making everything package private?

}

public void reset() {
byte[] buffer() { return buffer; }
Member

nit: add line breaks so the style matches the other code?

byte[] buffer() {
  return buffer;
}


int getCursor() { return cursor; }

void addCursor(int val) { cursor += val; }
Member

advanceCursor?

return holder;
}

public final byte[] buffer() { return holder.buffer(); }
Member

Do we need these delegator methods? How about making holder protected, the same as in WritableColumnVector?

Member Author

Even if we give holder default (package-private) access in the org.apache.spark.sql.catalyst.expressions.codegen package, it is inaccessible from the org.apache.spark.sql.catalyst.expressions.InterpretedUnsafeProjection class.
We also do not want to expose the BufferHolder class outside the Unsafe*Row classes.
WDYT?

Member

No performance impact?

this.startingOffset = cursor();
}

public void setTotalSize() {
Member

How about renaming this to flip, following Java's ByteBuffer? If we call row.setTotalSize(totalSize) and reset the BufferHolder positions inside flip, can we remove the UnsafeWriter.reset call at the head?

Member Author

It could be. That said, the current approach using reset and setTotalSize() makes the generated code easy to read.
The beginning and end of the region are clear. If it is critical to remove the UnsafeWriter.reset method, I agree with renaming to flip.
WDYT?

Member

If we'd like to make generated code blocks easy-to-read, we should depend on generated comments instead of api names, I think. Anyway, this decision depends on other dev's thoughts.
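
For readers less familiar with the `ByteBuffer` convention referred to in this thread, here is a minimal sketch of what `flip` does, using only `java.nio.ByteBuffer`. The class name `FlipSketch` and its helper are illustrative and are not part of Spark or of this patch.

```java
import java.nio.ByteBuffer;

final class FlipSketch {
    // Writes one long, flips the buffer, and reports {position, limit} afterwards.
    static int[] writeThenFlip() {
        ByteBuffer buf = ByteBuffer.allocate(16);
        buf.putLong(42L); // position advances to 8; limit stays at the capacity (16)
        buf.flip();       // limit becomes the old position (8); position resets to 0
        return new int[] { buf.position(), buf.limit() };
    }

    public static void main(String[] args) {
        int[] state = writeThenFlip();
        System.out.println("position=" + state[0] + ", limit=" + state[1]);
    }
}
```

In other words, `flip` marks the end of the written region and rewinds in a single call, which is the behavior the rename suggestion alludes to.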


public UnsafeRowWriter(BufferHolder holder, int numFields) {
this.holder = holder;
public UnsafeRowWriter(UnsafeRow row, int initialBufferSize) {
Contributor

Do we really need two UnsafeRow constructors?

For the top-level row writer I also think it might be nice to create the row internally, and just have a constructor that takes a numFields and (optionally) a size argument.

addCursor(16);
}

protected final void _write(long offset, boolean value) {
Contributor

Why the _write names? Just call them writeBoolean etc.

SparkQA commented Mar 19, 2018

Test build #88378 has finished for PR 20850 at commit b696b7c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 19, 2018

Test build #88380 has finished for PR 20850 at commit c342f0d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

cursor += val;
}

int pushCursor() {
Contributor

Can we make this a little bit less complex? I think just storing the cursor in the UnsafeWriter is enough.

Member Author

Since one BufferHolder is shared by multiple UnsafeWriters, it seems simpler to store the cursors in the BufferHolder.
WDYT?

Member

I also feel the current code is a bit complicated. Can't we avoid the sharing?

Member Author

Yeah, it is complicated. pushCursor will be dropped.

Member

maropu commented Mar 19, 2018

@hvanhovell btw (this is not related to this PR, though...), most of the code in UTF8StringBuffer and BufferHolder overlaps. So could we clean that up, too? master...maropu:CleanupBufferImpl

val tmpCursor = bufferHolder.cursor
writeArray(bufferHolder, arrayWriter, elementWriter, v.getArray(i), elementSize)
writer.setOffsetAndSize(i, tmpCursor, bufferHolder.cursor - tmpCursor)
arrayWriter.markCursor()
Member Author

From a performance point of view, this abstraction may have a larger impact, since we move a temporary value from the local frame to the Java stack.

arrayWriter.markCursor()
writeArray(arrayWriter, elementWriter, v.getArray(i))
writer.setOffsetAndSizeFromMark(i)

Does this implementation strike a good enough balance between performance and abstraction? Or would it be better to do it like this?

val mark = arrayWriter.cursor()
writeArray(arrayWriter, elementWriter, v.getArray(i))
writer.setOffsetAndSizeFromMark(i, mark)

@maropu @hvanhovell WDYT?
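
To make the two alternatives above concrete, here is a small self-contained sketch. `ToyWriter` is a hypothetical stand-in with a single writer and no shared `BufferHolder`; the method names follow the snippets in this comment, but the bodies are assumptions and not the code in this patch.

```java
final class MarkCursorSketch {
    /** Toy writer over an imaginary buffer; a stand-in, not Spark's UnsafeWriter. */
    static final class ToyWriter {
        private int cursor;   // next free byte in the imaginary buffer
        private int mark;     // used only by alternative 1
        final long[] slots = new long[2];

        int cursor() { return cursor; }
        void append(int numBytes) { cursor += numBytes; }

        // Alternative 1: the writer remembers the starting cursor in a field.
        void markCursor() { mark = cursor; }
        void setOffsetAndSizeFromMark(int ordinal) {
            setOffsetAndSize(ordinal, mark, cursor - mark);
        }

        // Alternative 2: the caller keeps the starting cursor in a local variable.
        void setOffsetAndSizeFromMark(int ordinal, int previousCursor) {
            setOffsetAndSize(ordinal, previousCursor, cursor - previousCursor);
        }

        private void setOffsetAndSize(int ordinal, int offset, int size) {
            slots[ordinal] = ((long) offset << 32) | (long) size;
        }
    }

    // Runs both alternatives on identical writes and returns the two slot values.
    static long[] runBoth() {
        ToyWriter a = new ToyWriter();
        a.append(16);
        a.markCursor();
        a.append(24);
        a.setOffsetAndSizeFromMark(0);

        ToyWriter b = new ToyWriter();
        b.append(16);
        int mark = b.cursor();
        b.append(24);
        b.setOffsetAndSizeFromMark(0, mark);

        return new long[] { a.slots[0], b.slots[0] };
    }

    public static void main(String[] args) {
        long[] r = runBoth();
        System.out.println(r[0] == r[1]); // both record offset 16 and size 24
    }
}
```

Both variants record the same offset and size; the difference is only where the starting cursor lives. In this toy, the field-based variant can hold just one pending mark at a time, while the local-variable variant has no such limit.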

SparkQA commented Mar 20, 2018

Test build #88411 has finished for PR 20850 at commit 3637a5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Mar 30, 2018

Test build #88728 has finished for PR 20850 at commit a94d470.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Mar 30, 2018

retest this please

SparkQA commented Mar 30, 2018

Test build #88760 has finished for PR 20850 at commit a94d470.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

return cursor;
}

void incrementCursor(int val) {
Contributor

typo, should be increaseCursor

final long offsetAndSize = (relativeOffset << 32) | (long)size;

write(ordinal, offsetAndSize);
_setOffsetAndSizeFromPreviousCursor(ordinal, mark);
Contributor

_setOffsetAndSizeFromPreviousCursor calls setOffsetAndSize, which calls write and then calls assertIndexIsValid.

So we don't need _setOffsetAndSizeFromPreviousCursor

Member Author

Ah, good catch
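
As a side note on the `offsetAndSize` expression quoted in this thread, the following sketch shows how an offset and a size share one 8-byte word. The `pack` body mirrors the quoted line; the class name and the two unpacking helpers are hypothetical additions for illustration only.

```java
final class OffsetAndSizeSketch {
    // Upper 32 bits hold the relative offset, lower 32 bits hold the size.
    // Assumes size is non-negative; a negative int would sign-extend into the offset bits.
    static long pack(long relativeOffset, int size) {
        return (relativeOffset << 32) | (long) size;
    }

    // Hypothetical helpers that undo pack().
    static int offsetOf(long offsetAndSize) {
        return (int) (offsetAndSize >>> 32);
    }

    static int sizeOf(long offsetAndSize) {
        return (int) offsetAndSize;
    }

    public static void main(String[] args) {
        long packed = pack(24, 10);
        System.out.println(Long.toHexString(packed)); // 180000000a
        System.out.println(offsetOf(packed) + " " + sizeOf(packed)); // 24 10
    }
}
```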

SparkQA commented Apr 1, 2018

Test build #88787 has finished for PR 20850 at commit 6caf11c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Member Author

kiszk commented Apr 1, 2018

retest this please

SparkQA commented Apr 1, 2018

Test build #88789 has finished for PR 20850 at commit 6caf11c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}

public void reset() {
byte[] buffer() {
Contributor

nit: getBuffer is more java-style

BitSetMethods.set(buffer(), startingOffset + 8, ordinal);
}

public void setNull1Bytes(int ordinal) {
Contributor

seems now we only need a single setNullAt method.

Member Author

@kiszk kiszk Apr 1, 2018

IIUC, UnsafeRowWriter needs a single setNullAt method for 8-byte-wide fields.
On the other hand, UnsafeArrayWriter needs multiple setNull?Bytes() methods for different element sizes. Generated code also uses setNull?Bytes for array elements.

Could you elaborate on your thought?

Contributor

We still need the various setNull* methods because of arrays.

Contributor

@cloud-fan cloud-fan Apr 1, 2018

nvm, I thought we could pattern match on the elementSize, but that may hurt performance a lot for the codegen version.

final long offset = getFieldOffset(ordinal);
Platform.putLong(holder.buffer, offset, 0L);
Platform.putBoolean(holder.buffer, offset, value);
Platform.putLong(buffer(), offset, 0L);
Contributor

writeLong(0)?

final long offset = getFieldOffset(ordinal);
Platform.putLong(holder.buffer, offset, 0L);
Platform.putBoolean(holder.buffer, offset, value);
Platform.putLong(buffer(), offset, 0L);
Contributor

writeLong(offset, 0L)?

public void write(int ordinal, Decimal input, int precision, int scale) {
if (precision <= Decimal.MAX_LONG_DIGITS()) {
// make sure Decimal object has the same scale as DecimalType
if (input.changePrecision(precision, scale)) {
Contributor

Member Author

Good point. I will add input != null.
I am also curious about the differences between these two methods.

unsafeRow.setTotalSize(bufferHolder.totalSize());
return unsafeRow;
rowWriter.setTotalSize();
return rowWriter.getRow();
Contributor

Is there any place where we call getRow() without calling setTotalSize() before that? If there aren't then I'd combine the two.

Member Author

I think this is the only place where we call getRow() without calling setTotalSize(), which happens when numVarLenFields == 0.

Member Author

Should we call reset() and setTotalSize() as the interpreted version does?

SparkQA commented Apr 2, 2018

Test build #88801 has finished for PR 20850 at commit 9dc36b7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*
* Generally we should call `UnsafeRowWriter.setTotalSize` to update the size of the result row,
* after writing a record to the buffer. However, we can skip this step if the fields of row are
* all fixed-length, as the size of result row is also fixed.
Contributor

@cloud-fan cloud-fan Apr 2, 2018

Not sure if this optimization is really necessary. Maybe we can always update total size in getRow.

Member Author

Got it. We will merge setTotalSize and getRow into getRow.
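
A rough sketch of what merging the two calls could look like, assuming a toy writer that only tracks a byte count. `ToyRow` and `ToyRowWriter` are hypothetical stand-ins and do not reflect the actual `UnsafeRow` or `UnsafeRowWriter` code.

```java
final class GetRowSketch {
    /** Toy row that only records its size; a stand-in, not Spark's UnsafeRow. */
    static final class ToyRow {
        int sizeInBytes;
    }

    /** Toy writer whose getRow() always refreshes the row size. */
    static final class ToyRowWriter {
        private final ToyRow row = new ToyRow();
        private int bytesWritten;

        void reset() { bytesWritten = 0; }
        void append(int numBytes) { bytesWritten += numBytes; }

        // Merged accessor: the caller can no longer forget the size update.
        ToyRow getRow() {
            row.sizeInBytes = bytesWritten;
            return row;
        }
    }

    // Writes the given chunks into a fresh writer and returns the reported row size.
    static int sizeAfterWriting(int... chunks) {
        ToyRowWriter writer = new ToyRowWriter();
        writer.reset();
        for (int chunk : chunks) {
            writer.append(chunk);
        }
        return writer.getRow().sizeInBytes;
    }

    public static void main(String[] args) {
        System.out.println(sizeAfterWriting(8, 8, 24)); // 40
    }
}
```

The trade-off discussed above is that the size is updated on every `getRow()` call, even for rows whose fields are all fixed-length.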

@cloud-fan
Contributor

LGTM

@hvanhovell
Contributor

@kiszk can you rerun the UnsafeProjectionBenchmark to make sure we didn't regress anywhere?

Otherwise LGTM.

Member Author

kiszk commented Apr 2, 2018

@hvanhovell Here are the results of UnsafeProjectionBenchmark. I have not seen any regression.

With master 529f847105fa8d98a5dc4d20955e4870df6bc1c5

OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
unsafe projection:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
single long                                   1638 / 1638        163.9           6.1       1.0X
single nullable long                          2375 / 2568        113.0           8.8       0.7X
7 primitive types                             5108 / 5234         52.6          19.0       0.3X
7 nullable primitive types                    7809 / 7909         34.4          29.1       0.2X

With SPARK-23713

OpenJDK 64-Bit Server VM 1.8.0_151-8u151-b12-0ubuntu0.16.04.2-b12 on Linux 4.4.0-66-generic
Intel(R) Xeon(R) CPU E5-2667 v3 @ 3.20GHz
unsafe projection:                       Best/Avg Time(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------
single long                                   1630 / 1630        164.7           6.1       1.0X
single nullable long                          2354 / 2403        114.0           8.8       0.7X
7 primitive types                             5107 / 5174         52.6          19.0       0.3X
7 nullable primitive types                    7867 / 7938         34.1          29.3       0.2X

@hvanhovell
Contributor

Merging to master. Thanks for your hard work and patience.

@asfgit asfgit closed this in a7c19d9 Apr 2, 2018
robert3005 pushed a commit to palantir/spark that referenced this pull request Apr 4, 2018
## What changes were proposed in this pull request?

This PR implemented the following cleanups related to  `UnsafeWriter` class:
- Remove code duplication between `UnsafeRowWriter` and `UnsafeArrayWriter`
- Make `BufferHolder` class internal by delegating its accessor methods to `UnsafeWriter`
- Replace `UnsafeRow.setTotalSize(...)` with `UnsafeRowWriter.setTotalSize()`

## How was this patch tested?

Tested by existing UTs

Author: Kazuaki Ishizaki <[email protected]>

Closes apache#20850 from kiszk/SPARK-23713.
mshtelma pushed a commit to mshtelma/spark that referenced this pull request Apr 5, 2018
asfgit pushed a commit that referenced this pull request Oct 3, 2018
## What changes were proposed in this pull request?

In #20850 when writing non-null decimals, instead of zero-ing all the 16 allocated bytes, we zero-out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, then this means that the bytes between 8 and 16 are not zero-ed.

I see 2 solutions here:
 - we can zero-out all the bytes in advance as it was done before #20850 (safer solution IMHO);
 - we can allocate only the needed bytes (may be a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option).

Hence I propose the first solution here in order to fix the correctness issue. We can eventually switch to the second if we later think it is more efficient.

## How was this patch tested?

Running the test attached in the JIRA + added UT

Closes #22602 from mgaido91/SPARK-25582.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d7ae36a)
Signed-off-by: Dongjoon Hyun <[email protected]>
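
The follow-up commit message above describes stale bytes surviving in a 16-byte decimal slot. The sketch below reproduces that effect on a plain `byte[]`, based only on that description; the class, the helper names, and the 0x7f filler are assumptions for illustration, not Spark's implementation.

```java
import java.math.BigInteger;
import java.util.Arrays;

final class DecimalSlotSketch {
    static final int SLOT_BYTES = 16;

    // Zeroes only the padding up to the next 8-byte boundary after the payload.
    static byte[] writeZeroingPaddingOnly(byte[] reused, byte[] payload) {
        int roundedUp = (payload.length + 7) / 8 * 8;
        System.arraycopy(payload, 0, reused, 0, payload.length);
        Arrays.fill(reused, payload.length, roundedUp, (byte) 0);
        return reused;
    }

    // Zeroes the whole 16-byte slot first, then writes the payload.
    static byte[] writeZeroingWholeSlot(byte[] reused, byte[] payload) {
        Arrays.fill(reused, 0, SLOT_BYTES, (byte) 0);
        System.arraycopy(payload, 0, reused, 0, payload.length);
        return reused;
    }

    // Simulates a reused buffer that still holds bytes from an earlier row.
    static byte[] staleSlot() {
        byte[] slot = new byte[SLOT_BYTES];
        Arrays.fill(slot, (byte) 0x7f);
        return slot;
    }

    public static void main(String[] args) {
        byte[] payload = new BigInteger("12345").toByteArray(); // 2 bytes, fewer than 9
        byte[] partial = writeZeroingPaddingOnly(staleSlot(), payload);
        byte[] full = writeZeroingWholeSlot(staleSlot(), payload);
        System.out.println("padding-only, byte 8: " + partial[8]); // stale 127
        System.out.println("whole-slot,   byte 8: " + full[8]);    // 0
    }
}
```

With a payload shorter than 9 bytes, the padding-only variant never touches bytes 8 to 15, so two rows holding the same decimal can differ byte for byte. Zeroing the whole slot first, as in the first proposed solution, avoids that.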
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
otterc pushed a commit to linkedin/spark that referenced this pull request Mar 22, 2023
Ref: LIHADOOP-48531