Align the records written by GenericOrcWriter and SparkOrcWriter #1271
return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
value.serialize64() takes an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale into value.serialize64() instead of value.scale(), and to pass the expected precision and scale to Decimal.set.
So this would look like: return new Decimal().set(value.serialize64(scale), precision, scale);
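For illustration, a minimal sketch of that suggested reader shape (the class and field names are assumptions for the example, not the PR's actual code, and the imports assume Iceberg's bundled nohive ORC packages):

```java
import org.apache.orc.storage.ql.exec.vector.ColumnVector;
import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
import org.apache.spark.sql.types.Decimal;

// Hypothetical reader for decimals with precision <= 18 (fits in a long).
class Decimal18Reader {
  private final int precision;
  private final int scale;

  Decimal18Reader(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }

  Decimal nonNullRead(ColumnVector vector, int row) {
    HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
    // serialize64(scale) rescales the stored value to the requested scale and
    // returns the unscaled long, so the Spark Decimal can be built directly
    // with the reader's expected precision and scale.
    return new Decimal().set(value.serialize64(scale), precision, scale);
  }
}
```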
Sounds great. The essential purpose here is to construct a Decimal with the correct precision and scale, instead of value.precision() and value.scale().
Oh, it seems this is still incorrect, because the long returned by value.serialize64(scale) is still encoded with value.precision() and value.scale(). If we use the given precision and scale to parse that long value, it will be messed up. Note that value.precision() is not necessarily equal to precision, and similarly for scale.
The correct way should be:
Decimal decimal = new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
decimal.changePrecision(precision, scale);
I believe value.serialize64 returns the raw long value adjusted for the requested scale (and since precision <= 18, it always fits in a long), so I don't think it is tied to any precision. That said, I am not very familiar with using decimals, so maybe I am missing something. Can you give an example of the case you are referring to?
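For what it's worth, a small standalone check of that behaviour (assuming Hive's storage-api HiveDecimalWritable; the literal values are only an example):

```java
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;

public class Serialize64Demo {
  public static void main(String[] args) {
    // ORC strips trailing zeros, so a decimal(10, 3) value 10.100 is stored
    // with precision 3 and scale 1.
    HiveDecimalWritable value = new HiveDecimalWritable("10.1");

    // Rescaling to the reader's expected scale (3) yields the unscaled long
    // 10100, independent of the precision/scale the writer happened to store.
    System.out.println(value.serialize64(3));                     // 10100
    System.out.println(value.precision() + "," + value.scale());  // 3,1
  }
}
```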
Checked this again: I had wrongly used return new Decimal().set(value.serialize64(value.scale()), precision, scale) to construct the decimal before, which broke the unit tests. You are right, the long value is not tied to any precision. Sorry for the noise.
Ping @shardulm94 @rdsr @rdblue, any other concerns? Thanks.
// as 101*10^(-1), its scale will adjust from 3 to 1. So here we could not assert that value.scale() == scale.
// we also need to convert the hive orc decimal to a decimal with expected precision and scale.
Preconditions.checkArgument(value.precision() <= precision,
    "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
I'm not sure we need to check the precision either. If we read a value, then we should return it, right?
It is necessary to do this check: we need to make sure there was no bug when the decimal was written into ORC. For example, if for the decimal(3, 0) data type we encounter a Hive decimal 10000 (whose precision is 5), something must have gone wrong. Throwing an exception is the correct behavior in that case.
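A minimal sketch of the failure case being described (the class name and the plain Guava Preconditions import are placeholders for the example; the project may use its own relocated Guava):

```java
import com.google.common.base.Preconditions;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;

public class PrecisionCheckDemo {
  public static void main(String[] args) {
    int precision = 3;
    int scale = 0;

    // A stored value like 10000 needs precision 5, which can never fit in
    // decimal(3, 0); it could only appear if the writer had a bug.
    HiveDecimalWritable value = new HiveDecimalWritable("10000");

    // Fail fast instead of silently returning a wrong or truncated value.
    Preconditions.checkArgument(value.precision() <= precision,
        "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
  }
}
```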
@Override
protected void writeAndValidate(Schema schema) throws IOException {
  List<Record> records = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);
Validation should be done against this data, not data that has been read from a file. That way the test won't be broken by a problem with the reader or writer that produces the expected rows. To validate against these, use the GenericsHelpers.assertEqualsUnsafe methods.
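A rough sketch of what that validation could look like, assuming GenericsHelpers exposes an assertEqualsUnsafe(StructType, Record, InternalRow) overload (the wrapper class and method name below are otherwise illustrative):

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.spark.data.GenericsHelpers;
import org.apache.spark.sql.catalyst.InternalRow;
import org.junit.Assert;

public class ValidationSketch {
  // Compare rows read back from the ORC file against the original in-memory
  // records, so a matching bug in both the reader and the writer cannot
  // cancel itself out.
  static void assertMatches(Schema schema, List<Record> expected, List<InternalRow> actual) {
    Assert.assertEquals("Row count should match", expected.size(), actual.size());
    for (int i = 0; i < expected.size(); i += 1) {
      GenericsHelpers.assertEqualsUnsafe(schema.asStruct(), expected.get(i), actual.get(i));
    }
  }
}
```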
It makes sense.
This looks good to me, except that we need to update the test to validate against the original in-memory records, not against a set that was read from a file. It would also be good to have a test that specifically exercises the decimal path, or to increase the number of random records until we are confident that at least one decimal will have one or more trailing zeros.
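One possible shape for such a decimal-focused test (the class name and the writeAndValidateRecords hook are hypothetical; only the Iceberg types and the RandomGenericData call come from the project):

```java
import java.io.IOException;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.RandomGenericData;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types;
import org.junit.Test;

import static org.apache.iceberg.types.Types.NestedField.required;

public abstract class TestDecimalRoundTrip {
  // Hypothetical hook: the concrete test class would write the records with
  // SparkOrcWriter (or GenericOrcWriter), read them back, and compare against
  // the original in-memory records.
  protected abstract void writeAndValidateRecords(Schema schema, List<Record> expected)
      throws IOException;

  @Test
  public void testDecimalWithTrailingZeros() throws IOException {
    // decimal(10, 3) values such as 10.100 lose their trailing zeros in ORC's
    // internal representation, which is exactly the path this PR fixes.
    Schema schema = new Schema(
        required(1, "id", Types.LongType.get()),
        required(2, "d", Types.DecimalType.of(10, 3)));

    // Enough random rows to make at least one trailing-zero decimal very likely.
    List<Record> records = RandomGenericData.generate(schema, 1000, 3051L);
    writeAndValidateRecords(schema, records);
  }
}
```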
@openinx, I'm ready to merge this. Thanks for updating the tests! The only blocker is that the conflicts need to be fixed. Thank you!
OK, let me resolve the conflicts.
Rebased the patch; let's wait for the Travis test results.
This PR addresses the bug in #1269. It mainly fixes two sub-issues:

1. When writing a decimal (precision <= 18) into a Hive ORC file, the ORC writer scales the decimal down. For example, for the value 10.100 of type decimal(10, 3), Hive ORC removes the trailing zeros and stores it as 101*10^(-1), meaning the stored precision is 3 and the scale is 1. The scale of the decimal read back from the ORC file is therefore not strictly equal to 3, so both the Spark ORC reader and the generic ORC reader need to transform it to the given scale of 3; otherwise the unit tests break.
2. The long value of a timestamp with time zone can be negative, but the Spark ORC reader/writer did not consider this case and just used / and % for the arithmetic, while it should actually use Math.floorDiv and Math.floorMod.
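To illustrate the second point, a small self-contained example of why floor-based arithmetic matters for pre-epoch timestamps (the constant and sample value are illustrative, not taken from the Iceberg code):

```java
public class NegativeTimestampDemo {
  private static final long MICROS_PER_SECOND = 1_000_000L;

  public static void main(String[] args) {
    // A timestamp before 1970-01-01 is represented as negative microseconds.
    long micros = -1_500_001L;  // 1.500001 seconds before the epoch

    // Truncating division rounds toward zero, giving a (seconds, nanos) pair
    // with a negative nanosecond component.
    long badSeconds = micros / MICROS_PER_SECOND;            // -1
    long badNanos = (micros % MICROS_PER_SECOND) * 1000L;    // -500001000

    // Floor-based arithmetic rounds seconds down and keeps nanos in [0, 1e9).
    long seconds = Math.floorDiv(micros, MICROS_PER_SECOND);         // -2
    long nanos = Math.floorMod(micros, MICROS_PER_SECOND) * 1000L;   // 499999000

    System.out.printf("truncating: %ds %dns, floor: %ds %dns%n",
        badSeconds, badNanos, seconds, nanos);
  }
}
```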