Align the records written by GenericOrcWriter and SparkOrcWriter #1271
return new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
BigDecimal decimal = new BigDecimal(BigInteger.valueOf(value.serialize64(value.scale())), value.scale());
value.serialize64() takes an expected scale as a parameter, so I think the only change required to the original code is to pass our expected reader scale into value.serialize64() instead of value.scale(), and to pass the expected precision and scale to Decimal.set.
So this would look like: return new Decimal().set(value.serialize64(scale), precision, scale);
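For illustration, a minimal sketch of that suggested reader shape (the class and field names are assumptions for the example, not the PR's actual code, and the imports assume Iceberg's bundled nohive ORC packages):

```java
import org.apache.orc.storage.ql.exec.vector.ColumnVector;
import org.apache.orc.storage.ql.exec.vector.DecimalColumnVector;
import org.apache.orc.storage.serde2.io.HiveDecimalWritable;
import org.apache.spark.sql.types.Decimal;

// Hypothetical reader for decimals with precision <= 18 (fits in a long).
class Decimal18Reader {
  private final int precision;
  private final int scale;

  Decimal18Reader(int precision, int scale) {
    this.precision = precision;
    this.scale = scale;
  }

  Decimal nonNullRead(ColumnVector vector, int row) {
    HiveDecimalWritable value = ((DecimalColumnVector) vector).vector[row];
    // serialize64(scale) rescales the stored value to the requested scale and
    // returns the unscaled long, so the Spark Decimal can be built directly
    // with the reader's expected precision and scale.
    return new Decimal().set(value.serialize64(scale), precision, scale);
  }
}
```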
Sounds great. The essential purpose here is to construct a Decimal with the correct precision and scale, instead of value.precision() and value.scale().
Oh, it seems this is still incorrect, because the long returned by value.serialize64(scale) is still encoded with value.precision() and value.scale(). If we use the given precision and scale to parse that long value, it will be messed up. Note that value.precision() is not necessarily equal to precision, and similarly for scale.
The correct way should be:
Decimal decimal = new Decimal().set(value.serialize64(value.scale()), value.precision(), value.scale());
decimal.changePrecision(precision, scale);
I believe value.serialize64 returns the raw long value adjusted for the requested scale (and since precision <= 18, it always fits in a long), so I don't think it is tied to any precision. That said, I am not very familiar with using decimals, so maybe I am missing something. Can you give an example of the case you are referring to?
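For what it's worth, a small standalone check of that behaviour (assuming Hive's storage-api HiveDecimalWritable; the literal values are only an example):

```java
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;

public class Serialize64Demo {
  public static void main(String[] args) {
    // ORC strips trailing zeros, so a decimal(10, 3) value 10.100 is stored
    // with precision 3 and scale 1.
    HiveDecimalWritable value = new HiveDecimalWritable("10.1");

    // Rescaling to the reader's expected scale (3) yields the unscaled long
    // 10100, independent of the precision/scale the writer happened to store.
    System.out.println(value.serialize64(3));                     // 10100
    System.out.println(value.precision() + "," + value.scale());  // 3,1
  }
}
```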
Checked this again: I had wrongly used return new Decimal().set(value.serialize64(value.scale()), precision, scale) to construct the decimal before, which broke the unit tests. You are right, the long value is not tied to any precision. Sorry for the noise.
Ping @shardulm94 @rdsr @rdblue, any other concerns? Thanks.
// as 101*10^(-1), its scale will adjust from 3 to 1. So here we could not assert that value.scale() == scale.
// we also need to convert the hive orc decimal to a decimal with expected precision and scale.
Preconditions.checkArgument(value.precision() <= precision,
    "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
I'm not sure we need to check the precision either. If we read a value, then we should return it, right?
It is necessary to do this check: we need to make sure there was no bug when the decimal was written into ORC. For example, if for the decimal(3, 0) data type we encounter a Hive decimal 10000 (whose precision is 5), something must have gone wrong. Throwing an exception is the correct behavior in that case.
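A minimal sketch of the failure case being described (the class name and the plain Guava Preconditions import are placeholders for the example; the project may use its own relocated Guava):

```java
import com.google.common.base.Preconditions;
import org.apache.hadoop.hive.serde2.io.HiveDecimalWritable;

public class PrecisionCheckDemo {
  public static void main(String[] args) {
    int precision = 3;
    int scale = 0;

    // A stored value like 10000 needs precision 5, which can never fit in
    // decimal(3, 0); it could only appear if the writer had a bug.
    HiveDecimalWritable value = new HiveDecimalWritable("10000");

    // Fail fast instead of silently returning a wrong or truncated value.
    Preconditions.checkArgument(value.precision() <= precision,
        "Cannot read value as decimal(%s,%s), too large: %s", precision, scale, value);
  }
}
```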
@Override
protected void writeAndValidate(Schema schema) throws IOException {
  List<Record> records = RandomGenericData.generate(schema, NUM_RECORDS, 1992L);
Validation should be done against this data, not data that has been read from a file. That way the test won't be broken by a problem with the reader or writer that produces the expected rows. To validate against these, use the GenericsHelpers.assertEqualsUnsafe methods.
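A rough sketch of what that validation could look like, assuming GenericsHelpers exposes an assertEqualsUnsafe(StructType, Record, InternalRow) overload (the wrapper class and method name below are otherwise illustrative):

```java
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.spark.data.GenericsHelpers;
import org.apache.spark.sql.catalyst.InternalRow;
import org.junit.Assert;

public class ValidationSketch {
  // Compare rows read back from the ORC file against the original in-memory
  // records, so a matching bug in both the reader and the writer cannot
  // cancel itself out.
  static void assertMatches(Schema schema, List<Record> expected, List<InternalRow> actual) {
    Assert.assertEquals("Row count should match", expected.size(), actual.size());
    for (int i = 0; i < expected.size(); i += 1) {
      GenericsHelpers.assertEqualsUnsafe(schema.asStruct(), expected.get(i), actual.get(i));
    }
  }
}
```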
It makes sense.
This looks good to me, except that we need to update the test to validate against the original in-memory records, not against a set that was read from a file. It would also be good to have a test that specifically exercises the decimal path, or to increase the number of random records until we are confident that at least one decimal will have one or more trailing zeros.
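One possible shape for such a decimal-focused test (the class name and the writeAndValidateRecords hook are hypothetical; only the Iceberg types and the RandomGenericData call come from the project):

```java
import java.io.IOException;
import java.util.List;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.RandomGenericData;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.types.Types;
import org.junit.Test;

import static org.apache.iceberg.types.Types.NestedField.required;

public abstract class TestDecimalRoundTrip {
  // Hypothetical hook: the concrete test class would write the records with
  // SparkOrcWriter (or GenericOrcWriter), read them back, and compare against
  // the original in-memory records.
  protected abstract void writeAndValidateRecords(Schema schema, List<Record> expected)
      throws IOException;

  @Test
  public void testDecimalWithTrailingZeros() throws IOException {
    // decimal(10, 3) values such as 10.100 lose their trailing zeros in ORC's
    // internal representation, which is exactly the path this PR fixes.
    Schema schema = new Schema(
        required(1, "id", Types.LongType.get()),
        required(2, "d", Types.DecimalType.of(10, 3)));

    // Enough random rows to make at least one trailing-zero decimal very likely.
    List<Record> records = RandomGenericData.generate(schema, 1000, 3051L);
    writeAndValidateRecords(schema, records);
  }
}
```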
@openinx, I'm ready to merge this. Thanks for updating the tests! The only blocker is that the conflicts need to be fixed. Thank you!
OK, let me resolve the conflicts.
Rebased the patch; let's wait for the Travis test results.
This PR addresses the bug in #1269. It mainly fixes two sub-issues:

1. When writing a decimal (precision <= 18) into a Hive ORC file, the ORC writer scales the decimal down. For example, for the value 10.100 of type decimal(10, 3), Hive ORC removes the trailing zeros and stores it as 101*10^(-1), meaning the stored precision is 3 and the scale is 1. The scale of the decimal read back from the ORC file is therefore not strictly equal to 3, so both the Spark ORC reader and the generic ORC reader need to transform it to the given scale of 3; otherwise the unit tests break.
2. The long value of a timestamp with time zone can be negative, but the Spark ORC reader/writer did not consider this case and just used / and % for the arithmetic, while it should actually use Math.floorDiv and Math.floorMod.
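To illustrate the second point, a small self-contained example of why floor-based arithmetic matters for pre-epoch timestamps (the constant and sample value are illustrative, not taken from the Iceberg code):

```java
public class NegativeTimestampDemo {
  private static final long MICROS_PER_SECOND = 1_000_000L;

  public static void main(String[] args) {
    // A timestamp before 1970-01-01 is represented as negative microseconds.
    long micros = -1_500_001L;  // 1.500001 seconds before the epoch

    // Truncating division rounds toward zero, giving a (seconds, nanos) pair
    // with a negative nanosecond component.
    long badSeconds = micros / MICROS_PER_SECOND;            // -1
    long badNanos = (micros % MICROS_PER_SECOND) * 1000L;    // -500001000

    // Floor-based arithmetic rounds seconds down and keeps nanos in [0, 1e9).
    long seconds = Math.floorDiv(micros, MICROS_PER_SECOND);         // -2
    long nanos = Math.floorMod(micros, MICROS_PER_SECOND) * 1000L;   // 499999000

    System.out.printf("truncating: %ds %dns, floor: %ds %dns%n",
        badSeconds, badNanos, seconds, nanos);
  }
}
```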