[SPARK-25538][SQL] Zero-out all bytes when writing decimal #22602
cc @cloud-fan @kiszk
Sorry, I realize only now that I linked the wrong JIRA. This is for SPARK-25538. Unfortunately I am not in front of my laptop right now, so I cannot update the title. I'll do it ASAP. Sorry for the mistake, and thanks for understanding.
dongjoon-hyun left a comment
+1, LGTM for SPARK-25538 correctness issue.
cc @gatorsmile
Thanks for the review @dongjoon-hyun
@mgaido91 Could you change the title to [WIP] before you add the test case? Also cc @hvanhovell @kiszk, who are the best people to review this code.
Test build #96822 has finished for PR 22602 at commit
Thank you. The first option looks good. Let me think about a good UT, too.
    // grow the global buffer before writing data.
    holder.grow(16);

    // zero-out the bytes
nit: Can we refine a comment like the following to avoid this problem in the future?
// always zero-out the 16-byte buffer
+1
good catch! LGTM, waiting for the UT.
I think we can create a
HeartSaVioR left a comment
+1 Nice finding.
Thank you all for the reviews! I added the UT according to @cloud-fan's suggestion, as I was unable to set up a reasonable end-to-end UT. Thanks.
    val res = unsafeRowWriter.getRow
    assert(res.getDecimal(0, decimal1.precision, decimal1.scale) == decimal1)
    // Check that the bytes which are not used by decimal1 (but are allocated) are zero-ed out
    assert(res.getBytes()(25) == 0x00)
can we add a comment about how we get the 25?
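One way the 25 could be explained, sketched under two assumptions that this thread does not state: the row has a single field, and it follows the usual UnsafeRow layout (one 8-byte null-bitset word per 64 fields, then one 8-byte slot per field, then the variable-length region). The class and method names are hypothetical helpers, not Spark APIs.

```java
// Hypothetical helper for illustration only; it mirrors the assumed UnsafeRow layout.
class RowOffsetSketch {
    // Null bitset: one 8-byte word for every 64 fields (assumed layout).
    static int bitSetWidthInBytes(int numFields) {
        return ((numFields + 63) / 64) * 8;
    }

    // First byte of the variable-length region, where the 16-byte decimal slot would start.
    static int variableRegionStart(int numFields) {
        return bitSetWidthInBytes(numFields) + 8 * numFields;
    }

    public static void main(String[] args) {
        int start = variableRegionStart(1);   // 8 bytes of bitset + 8 bytes for the one field
        int indexInSlot = 25 - start;         // position of byte 25 inside the 16-byte slot
        System.out.println("slot starts at " + start + ", byte 25 is slot byte " + indexInSlot);
    }
}
```

Under these assumptions the decimal slot covers row bytes 16 to 31, so byte 25 is slot byte 9, which lies beyond the first 8 bytes of the slot.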
    unsafeRowWriter.reset()
    unsafeRowWriter.write(0, decimal1, decimal1.precision, decimal1.scale)
    val res = unsafeRowWriter.getRow
    assert(res.getDecimal(0, decimal1.precision, decimal1.scale) == decimal1)
Actually I think this assert is strong enough, we don't need to check the zero bytes below.
this would pass also before the fix, as only the first 8 bytes are read. I added it as an additional safety fix.
Then this doesn't demonstrate how this bug could affect correctness.
One better idea for this test: we create two row writers. One writes the decimal as you did here, and the other writes decimal 0.431 directly. Then we compare the two result rows and make sure they are equal.
thanks for the suggestion @cloud-fan, I like this approach.
Test build #96852 has finished for PR 22602 at commit
LGTM, pending jenkins
Test build #96857 has finished for PR 22602 at commit
    import org.apache.spark.SparkFunSuite
    import org.apache.spark.sql.types.Decimal

    class UnsafeWriterSuite extends SparkFunSuite {
UnsafeWriterSuite -> UnsafeRowWriterSuite? Also, renaming the file?
I don't think it is necessary, as we may want to include here also tests for other UnsafeWriter in the future.
I don't think so. We had better have both UnsafeRowWriterSuite and UnsafeWriterSuite in the future if needed.
Test build #96859 has finished for PR 22602 at commit
    unsafeRowWriter1.reset()
    unsafeRowWriter1.write(0, decimal1, decimal1.precision, decimal1.scale)
    val res1 = unsafeRowWriter1.getRow
    // On a second UnsafeRowWriter we write directly decimal2
... we write directly decimal1.
    val decimal1 = Decimal(0.431)
    decimal1.changePrecision(38, 18)
    // This decimal holds 11 bytes
    val decimal2 = Decimal(123456789.1232456789)
Shall we verify the number of bytes?
thanks, I added a check for that.
two minor comments. LGTM and good catch!
Test build #96892 has finished for PR 22602 at commit
Test build #96893 has finished for PR 22602 at commit
Thank you, @mgaido91 and all. According to all the review comments, I'll merge this to
## What changes were proposed in this pull request?

In #20850 when writing non-null decimals, instead of zero-ing all the 16 allocated bytes, we zero-out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, then this means that the bytes between 8 and 16 are not zero-ed.

I see 2 solutions here:
- we can zero-out all the bytes in advance, as was done before #20850 (safer solution IMHO);
- we can allocate only the needed bytes (may be a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option).

Hence I propose the first solution here in order to fix the correctness issue. We can eventually switch to the second if we later find it more efficient.

## How was this patch tested?

Running the test attached in the JIRA + added UT

Closes #22602 from mgaido91/SPARK-25582.

Authored-by: Marco Gaido
Signed-off-by: Dongjoon Hyun
(cherry picked from commit d7ae36a)
Signed-off-by: Dongjoon Hyun
## What changes were proposed in this pull request?
In #20850 when writing non-null decimals, instead of zero-ing all the 16 allocated bytes, we zero-out only the padding bytes. Since we always allocate 16 bytes, if the number of bytes needed for a decimal is lower than 9, then this means that the bytes between 8 and 16 are not zero-ed.
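I see 2 solutions here:
- we can zero-out all the bytes in advance, as was done before #20850 (safer solution IMHO);
- we can allocate only the needed bytes (may be a bit more efficient in terms of memory used, but I have not investigated the feasibility of this option).
Hence I propose the first solution here in order to fix the correctness issue. We can eventually switch to the second if we later find it more efficient.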
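The effect described above can be reproduced with a small stand-alone model. This is only a sketch: it uses a plain `byte[]` in place of the writer's buffer, the class and method names are hypothetical, and the padding-only variant is an assumed reading of "we zero-out only the padding bytes" (it clears just the 8-byte word that holds the tail of the value). It is not Spark's `UnsafeRowWriter` code.

```java
import java.math.BigDecimal;
import java.math.BigInteger;
import java.util.Arrays;

// Simplified model of a reused 16-byte decimal slot; not Spark's UnsafeRowWriter.
class DecimalSlotSketch {
    static final int SLOT_SIZE = 16;

    // Unscaled value of a decimal literal at scale 18.
    static BigInteger unscaled(String literal) {
        return new BigDecimal(literal).setScale(18).unscaledValue();
    }

    // Assumed padding-only behavior: clear only the 8-byte word that holds
    // the tail of the value, then copy the value bytes to the start of the slot.
    static void writeZeroingPaddingOnly(byte[] slot, BigInteger value) {
        byte[] bytes = value.toByteArray();
        if (bytes.length > SLOT_SIZE) throw new IllegalArgumentException("value too large");
        if ((bytes.length & 0x07) != 0) {
            int wordStart = (bytes.length >> 3) << 3;
            Arrays.fill(slot, wordStart, wordStart + 8, (byte) 0);
        }
        System.arraycopy(bytes, 0, slot, 0, bytes.length);
    }

    // First proposed solution: always clear all 16 bytes before copying.
    static void writeZeroingAll(byte[] slot, BigInteger value) {
        byte[] bytes = value.toByteArray();
        if (bytes.length > SLOT_SIZE) throw new IllegalArgumentException("value too large");
        Arrays.fill(slot, (byte) 0);
        System.arraycopy(bytes, 0, slot, 0, bytes.length);
    }

    public static void main(String[] args) {
        BigInteger big = unscaled("123456789.1232456789"); // needs more than 8 bytes
        BigInteger small = unscaled("0.431");               // fits in 8 bytes

        byte[] reused = new byte[SLOT_SIZE];
        byte[] fresh = new byte[SLOT_SIZE];
        writeZeroingPaddingOnly(reused, big);
        writeZeroingPaddingOnly(reused, small); // bytes 8..15 keep leftovers from big
        writeZeroingPaddingOnly(fresh, small);
        System.out.println("padding-only, slots equal: " + Arrays.equals(reused, fresh));

        writeZeroingAll(reused, big);
        writeZeroingAll(reused, small);
        writeZeroingAll(fresh, small);
        System.out.println("zero-all, slots equal: " + Arrays.equals(reused, fresh));
    }
}
```

In this model the padding-only variant leaves leftovers from the larger decimal in the upper half of the reused slot, so two slots holding the same value can differ byte for byte. Zeroing all 16 bytes first makes them identical, at the cost of one extra 8-byte clear per write.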
## How was this patch tested?
Running the test attached in the JIRA + added UT