
Conversation

@mgaido91
Contributor

@mgaido91 mgaido91 commented Oct 1, 2018

What changes were proposed in this pull request?

In #20850, when writing non-null decimals, instead of zeroing all 16 allocated bytes, we zero out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, the bytes between 8 and 16 are not zeroed.

I see 2 solutions here:

 - we can zero out all the bytes in advance, as was done before #20850 (the safer solution IMHO);
 - we can allocate only the needed bytes (maybe a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option).

Hence I propose the first solution here, in order to fix the correctness issue. We can eventually switch to the second one later if we think it is more efficient.
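For illustration, the first option amounts to something like the sketch below (a hypothetical helper, not the actual patch; UnsafeRowWriter's real cursor handling may differ). The idea is to unconditionally clear both 8-byte words of the 16-byte slot before copying the decimal's unscaled bytes, so leftovers from a previously written, larger decimal can never leak:

import org.apache.spark.unsafe.Platform

object ZeroOutSketch {
  // `cursor` is an absolute offset into `buffer`, i.e. it already includes
  // Platform.BYTE_ARRAY_OFFSET, as the row writer's cursor does.
  def zeroDecimalSlot(buffer: Array[Byte], cursor: Long): Unit = {
    Platform.putLong(buffer, cursor, 0L)      // bytes 0-7 of the slot
    Platform.putLong(buffer, cursor + 8L, 0L) // bytes 8-15, the ones at risk
  }
}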

How was this patch tested?

Running the test attached in the JIRA + added UT

@mgaido91
Contributor Author

mgaido91 commented Oct 1, 2018

cc @cloud-fan @kiszk

@mgaido91
Contributor Author

mgaido91 commented Oct 1, 2018

Sorry, I only realized now that I linked the wrong JIRA. This is for SPARK-25538. Unfortunately I am not in front of my laptop right now, so I cannot update the title. I'll do it ASAP. Sorry for the mistake. Thanks for understanding.

Member

@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM for SPARK-25538 correctness issue.

@dongjoon-hyun
Member

cc @gatorsmile

@mgaido91
Contributor Author

mgaido91 commented Oct 1, 2018

Thanks for the review @dongjoon-hyun

@mgaido91 mgaido91 changed the title from [SPARK-25582][SQL] Zero-out all bytes when writing decimal to [SPARK-25538][SQL] Zero-out all bytes when writing decimal Oct 1, 2018
@gatorsmile
Member

@mgaido91 Could you change the title to [WIP] before you add the test case?

Also cc @hvanhovell @kiszk, who are the best people to review this code.

@SparkQA

SparkQA commented Oct 1, 2018

Test build #96822 has finished for PR 22602 at commit 851d723.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Member

kiszk commented Oct 1, 2018

Thank you. The first option looks good. Let me think about a good UT, too.

// grow the global buffer before writing data.
holder.grow(16);

// zero-out the bytes
Member


nit: Can we refine the comment as follows, to avoid this problem in the future?

// always zero-out the 16-byte buffer

Contributor


+1

@cloud-fan
Contributor

good catch! LGTM, waiting for the UT.

@cloud-fan
Contributor

I think we can create an UnsafeWriterSuite to do some low-level checking. We can leave out the end-to-end test if it's too hard to write.

Contributor

@HeartSaVioR HeartSaVioR left a comment


+1 Nice finding.

@mgaido91
Contributor Author

mgaido91 commented Oct 2, 2018

Thank you all for the reviews! I added the UT according to @cloud-fan's suggestion, as I was unable to set up a reasonable end-to-end UT. Thanks.

val res = unsafeRowWriter.getRow
assert(res.getDecimal(0, decimal1.precision, decimal1.scale) == decimal1)
// Check that the bytes which are not used by decimal1 (but are allocated) are zero-ed out
assert(res.getBytes()(25) == 0x00)
Contributor


can we add a comment about how we get the 25?
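For context, the 25 falls out of the row layout. A sketch of the arithmetic, assuming the standard UnsafeRow layout for a single-column row (bytes 0-7: null-tracking bit set; bytes 8-15: the fixed-length slot with the decimal's offset and size; bytes 16-31: the 16-byte buffer holding the decimal):

val decimalBufferStart = 8 + 8 // null bit set word + fixed-length slot
val decimal1Bytes = 8          // decimal1's unscaled value fits in one word
// Byte 25 lies inside the 16-byte decimal buffer but past decimal1's 8 data
// bytes, i.e. in the padding that the fix must leave zeroed.
assert(25 >= decimalBufferStart + decimal1Bytes) // 25 >= 24
assert(25 < decimalBufferStart + 16)             // 25 < 32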

unsafeRowWriter.reset()
unsafeRowWriter.write(0, decimal1, decimal1.precision, decimal1.scale)
val res = unsafeRowWriter.getRow
assert(res.getDecimal(0, decimal1.precision, decimal1.scale) == decimal1)
Contributor


Actually I think this assert is strong enough, we don't need to check the zero bytes below.

Contributor Author


This would also pass before the fix, as only the first 8 bytes are read. I added it as an additional safety check.

Contributor


Then this doesn't demonstrate how the bug could affect correctness.

A better idea for this test: we create 2 row writers, one writing the decimal as you did here, and the other one writing decimal 0.431 directly. Then we compare the 2 result rows and make sure they are equal.

Contributor Author


thanks for the suggestion @cloud-fan, I like this approach.
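Pieced together from the snippets in this thread, the two-writer test could look roughly like the sketch below (the suite and test names are hypothetical, and the initial resetRowWriter() calls are an assumption):

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.catalyst.expressions.codegen.UnsafeRowWriter
import org.apache.spark.sql.types.Decimal

class UnsafeWriterSketchSuite extends SparkFunSuite {
  test("writing a small decimal after a large one leaves no stale bytes") {
    // decimal1 needs at most 8 bytes; decimal2 needs more than 8.
    val decimal1 = Decimal(0.431)
    decimal1.changePrecision(38, 18)
    val decimal2 = Decimal(123456789.1232456789)
    decimal2.changePrecision(38, 18)

    // Writer 1 writes the large decimal, then resets and writes the small
    // one: without the fix, the slot's upper bytes keep decimal2's leftovers.
    val writer1 = new UnsafeRowWriter(1)
    writer1.resetRowWriter()
    writer1.write(0, decimal2, decimal2.precision, decimal2.scale)
    writer1.reset()
    writer1.write(0, decimal1, decimal1.precision, decimal1.scale)

    // Writer 2 writes the small decimal directly.
    val writer2 = new UnsafeRowWriter(1)
    writer2.resetRowWriter()
    writer2.write(0, decimal1, decimal1.precision, decimal1.scale)

    // The two rows must be byte-for-byte equal.
    assert(writer1.getRow == writer2.getRow)
  }
}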

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96852 has finished for PR 22602 at commit 6b84b41.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

LGTM, pending jenkins

@kiszk
Member

kiszk commented Oct 2, 2018

LGTM, pending jenkins

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96857 has finished for PR 22602 at commit 64f4ed0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

import org.apache.spark.SparkFunSuite
import org.apache.spark.sql.types.Decimal

class UnsafeWriterSuite extends SparkFunSuite {
Member


UnsafeWriterSuite -> UnsafeRowWriterSuite? Also, renaming the file?

Contributor Author


I don't think it is necessary, as we may also want to include tests for other UnsafeWriters here in the future.

Member

@dongjoon-hyun dongjoon-hyun Oct 2, 2018


I don't think so. We had better have both UnsafeRowWriterSuite and UnsafeWriterSuite in the future if needed.

@SparkQA

SparkQA commented Oct 2, 2018

Test build #96859 has finished for PR 22602 at commit 72b7c5c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

unsafeRowWriter1.reset()
unsafeRowWriter1.write(0, decimal1, decimal1.precision, decimal1.scale)
val res1 = unsafeRowWriter1.getRow
// On a second UnsafeRowWriter we write directly decimal2
Member


... we write directly decimal1.

val decimal1 = Decimal(0.431)
decimal1.changePrecision(38, 18)
// This decimal holds 11 bytes
val decimal2 = Decimal(123456789.1232456789)
Member


Shall we verify the number of bytes?

Contributor Author


thanks, I added a check for that.
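The check can be as simple as measuring the two's-complement representation of the unscaled value (a sketch; the helper name is hypothetical):

import org.apache.spark.sql.types.Decimal

// A decimal's variable-length storage is the byte array of its unscaled
// BigInteger value, so its length is the number of bytes the decimal needs.
def checkDecimalSizeInBytes(decimal: Decimal, numBytes: Int): Unit = {
  assert(decimal.toJavaBigDecimal.unscaledValue().toByteArray.length == numBytes)
}

val decimal1 = Decimal(0.431)
decimal1.changePrecision(38, 18)
checkDecimalSizeInBytes(decimal1, 8)  // fits in a single 8-byte word

val decimal2 = Decimal(123456789.1232456789)
decimal2.changePrecision(38, 18)
checkDecimalSizeInBytes(decimal2, 11) // spills past the first word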

@viirya
Member

viirya commented Oct 3, 2018

two minor comments. LGTM and good catch!

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96892 has finished for PR 22602 at commit d7d17d8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class UnsafeRowWriterSuite extends SparkFunSuite

@SparkQA

SparkQA commented Oct 3, 2018

Test build #96893 has finished for PR 22602 at commit be38c4c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun
Member

Thank you, @mgaido91 and all.

According to all the review comments, I'll merge this to master and branch-2.4.

asfgit pushed a commit that referenced this pull request Oct 3, 2018

Closes #22602 from mgaido91/SPARK-25582.

Authored-by: Marco Gaido <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
(cherry picked from commit d7ae36a)
Signed-off-by: Dongjoon Hyun <[email protected]>
@asfgit asfgit closed this in d7ae36a Oct 3, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
