
Conversation

@openinx (Member) commented Jul 30, 2020

This PR addresses the bug in #1269; it mainly fixes two sub-issues:

  1. When writing a decimal (precision <= 18) into a Hive ORC file, the ORC writer scales the decimal down. For example, given the value 10.100 for type Decimal(10, 3), Hive ORC removes the trailing zeros and stores it as 101*10^(-1), i.e. with precision 3 and scale 1. The scale of the decimal read back from the Hive ORC file is therefore not strictly equal to 3, so both the Spark ORC reader and the generic ORC reader need to transform the value back to the given scale of 3. Otherwise, the unit tests break.

  2. The long value of a zoned timestamp can be negative, but the Spark ORC reader/writer did not consider this case and just used / and % for the arithmetic, when it should use Math.floorDiv and Math.floorMod (see the sketch after this list).
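As a minimal illustration of the second point (a sketch, not the PR's actual conversion code), here is the difference between truncating and floor-based arithmetic for a timestamp stored as microseconds before the epoch:

```java
// Minimal sketch (not the PR's code): splitting a possibly negative epoch-micros
// value into seconds and a non-negative sub-second remainder.
public class FloorDivExample {
  private static final long MICROS_PER_SECOND = 1_000_000L;

  public static void main(String[] args) {
    long micros = -1_500_000L; // 1.5 seconds before the epoch

    // Truncating arithmetic rounds toward zero and leaves a negative remainder.
    long truncSeconds = micros / MICROS_PER_SECOND;          // -1
    long truncRemainder = micros % MICROS_PER_SECOND;        // -500000

    // Floor-based arithmetic keeps the remainder in [0, MICROS_PER_SECOND).
    long seconds = Math.floorDiv(micros, MICROS_PER_SECOND);   // -2
    long subMicros = Math.floorMod(micros, MICROS_PER_SECOND); // 500000

    System.out.printf("truncating: %d s, %d us%n", truncSeconds, truncRemainder);
    System.out.printf("floor:      %d s, %d us%n", seconds, subMicros);
  }
}
```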

Comment on lines 198 to 199:

    return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
    BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
@shardulm94 (Contributor) commented Aug 4, 2020

value.serialize64() takes an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale to value.serialize64() instead of value.scale(), and to pass the expected precision and scale to Decimal.set.

So this would look like return new Decimal().set(value.serialize64(scale), precision, scale);

@openinx (Member Author) replied:

Sounds great. The essential purpose here is to construct a Decimal with the correct precision and scale, instead of value.precision() and value.scale().

@openinx (Member Author) replied:

Oh, it seems this is still incorrect, because value.serialize64(scale) is still encoded with value.precision() and value.scale(). If we use the given precision and scale to parse that long value, it will be messed up. Note that value.precision() is not equal to precision, and likewise for scale.

The correct way should be:

    Decimal decimal = new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
    decimal.changePrecision(precision, scale);

@shardulm94 (Contributor) commented Aug 4, 2020

I believe value.serialize64 returns the raw long value adjusted for the requested scale (and since precision <= 18, it always fits in a long); I don't think it is tied to any precision. That said, I am not very familiar with using decimals, so maybe I am missing something. Can you give an example of the case you are referring to?

@openinx (Member Author) replied:

I checked this again: I had wrongly used return new Decimal().set(value.serialize64(value.scale()), precision, scale) to construct the decimal before, which broke the unit tests. You are right, the long value is not tied to any precision. Sorry for the noise.
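To illustrate the trailing-zero behavior discussed in this thread, here is a plain java.math.BigDecimal analogy (a sketch only; it is not the Hive or Spark API, and serialize64's exact semantics are as described in the comments above):

```java
import java.math.BigDecimal;

// Analogy with plain BigDecimal (not HiveDecimalWritable): stripping trailing
// zeros changes the stored scale, but rescaling to the expected scale yields
// the same unscaled long regardless of how the value was stored.
public class DecimalScaleExample {
  public static void main(String[] args) {
    BigDecimal written = new BigDecimal("10.100");      // written as decimal(10, 3)
    BigDecimal stored = written.stripTrailingZeros();   // 10.1 -> precision 3, scale 1

    // Reading back at the expected scale of 3 recovers unscaled value 10100,
    // independent of the precision/scale the file happened to store.
    BigDecimal read = stored.setScale(3);

    System.out.println(stored.scale());        // 1
    System.out.println(read.unscaledValue());  // 10100
  }
}
```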

@openinx requested a review from @shardulm94 on August 4, 2020 07:22
@openinx (Member Author) commented Aug 5, 2020

Ping @shardulm94 @rdsr @rdblue, any other concerns? Thanks.

    // as 101*10^(-1), its scale will adjust from 3 to 1. So here we could not assert that value.scale() == scale.
    // we also need to convert the hive orc decimal to a decimal with expected precision and scale.
    Preconditions.checkArgument(value.precision() <= precision,
        "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
Contributor commented:

I'm not sure we need to check the precision either. If we read a value, then we should return it, right?

@openinx (Member Author) replied:

It is necessary to do this check: we need to make sure there is no bug when a decimal is written into ORC. For example, if for the decimal(3, 0) data type we encounter a Hive decimal 10000 (whose precision is 5), something must have gone wrong. Throwing an exception is the correct behavior in that case.
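A small hedged sketch of this guard (Guava Preconditions with plain BigDecimal; checkPrecision is a made-up helper, not the PR's exact code):

```java
import java.math.BigDecimal;

import com.google.common.base.Preconditions;

// Sketch of the precision guard: a value whose precision exceeds the declared
// decimal(precision, scale) type indicates a writer bug, so fail fast.
public class PrecisionCheckExample {
  static BigDecimal checkPrecision(BigDecimal value, int precision, int scale) {
    Preconditions.checkArgument(value.precision() <= precision,
        "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
    return value.setScale(scale);
  }

  public static void main(String[] args) {
    System.out.println(checkPrecision(new BigDecimal("123"), 3, 0));  // ok: 123
    checkPrecision(new BigDecimal("10000"), 3, 0);                    // throws IllegalArgumentException
  }
}
```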


    @Override
    protected void writeAndValidate(Schema schema) throws IOException {
      List<Record> records = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);
Contributor commented:

Validation should be done against this data, not against data that has been read from a file. That way the test won't be broken by a problem with the reader or writer that produces the expected rows. To validate against these, use the GenericsHelpers.assertEqualsUnsafe methods.
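A rough sketch of the shape being asked for (generic Java, not Iceberg test code; writeThenRead is a placeholder for the actual ORC round trip, and the real test would use GenericsHelpers.assertEqualsUnsafe for the comparison):

```java
import java.util.Arrays;
import java.util.List;

// Illustration only: the expected side of the assertion is the original in-memory
// data, so a bug shared by the reader and writer cannot cancel out on both sides.
public class ValidateAgainstSourceData {
  public static void main(String[] args) {
    List<String> expected = Arrays.asList("10.100", "-0.001"); // generated in memory
    List<String> actual = writeThenRead(expected);             // round-tripped through the file

    if (!actual.equals(expected)) {
      throw new AssertionError("round-trip mismatch: " + actual + " vs " + expected);
    }
    System.out.println("round trip preserved the source data");
  }

  // Placeholder for writing the records to a file and reading them back.
  private static List<String> writeThenRead(List<String> records) {
    return records;
  }
}
```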

@openinx (Member Author) replied:

That makes sense.

@rdblue (Contributor) commented Aug 5, 2020

This looks good to me, except that we need to update the test to validate against the original in-memory records, not against a set that was read from a file. It would also be good to have a test that specifically exercises the decimal path, or to increase the number of random records until we are confident that at least one decimal will have one or more trailing zeros.
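A hedged sketch of a schema that would exercise the decimal path (field ids, names, and the record count are made up; RandomGenericData.generate is the generator already used in the test shown above):

```java
import java.util.List;

import org.apache.iceberg.Schema;
import org.apache.iceberg.data.RandomGenericData;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types;

// Sketch only: a schema with a decimal(10, 3) column, plus enough random rows
// that some generated decimals are very likely to carry trailing zeros.
public class DecimalPathSchemaSketch {
  public static void main(String[] args) {
    Schema schema = new Schema(
        Types.NestedField.required(1, "id", Types.LongType.get()),
        Types.NestedField.optional(2, "dec_10_3", Types.DecimalType.of(10, 3)));

    List<Record> records = RandomGenericData.generate(schema, 1000, 1992L);
    System.out.println("generated " + records.size() + " records");
  }
}
```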

@rdblue (Contributor) commented Aug 7, 2020

@openinx, I'm ready to merge this. Thanks for updating the tests! The only blocker is that the conflicts need to be fixed. Thank you!

@openinx (Member Author) commented Aug 7, 2020

OK, let me resolve the conflicts.

@openinx (Member Author) commented Aug 7, 2020

Rebased the patch; let's wait for the Travis test results.


This pull request may close the following issue: The records are not aligned between spark orc reader/writer and generic orc reader/writer.