
Conversation

@kingeasternsun (Contributor) commented Jan 26, 2022

In Spark we used the `add_files` procedure to migrate a Hive table to an Iceberg table, but when we use Flink to read the target table, it fails with these errors:

java.lang.UnsupportedOperationException: Unsupported type: optional int96 wafer_start_time = 4

  at org.apache.iceberg.flink.data.FlinkParquetReaders$ReadBuilder.primitive(FlinkParquetReaders.java:268)
  at org.apache.iceberg.flink.data.FlinkParquetReaders$ReadBuilder.primitive(FlinkParquetReaders.java:73)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:52)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitField(TypeWithSchemaVisitor.java:155)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitFields(TypeWithSchemaVisitor.java:169)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:47)
  at org.apache.iceberg.flink.data.FlinkParquetReaders.buildReader(FlinkParquetReaders.java:68)
  at org.apache.iceberg.flink.source.RowDataFileScanTaskReader.lambda$newParquetIterable$1(RowDataFileScanTaskReader.java:138)
  at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:118)
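
For reference, the migration path that produces such files looks roughly like the sketch below. It uses the Spark `add_files` procedure with hypothetical catalog and table names; the original Hive/Spark writers store the timestamp column as physical INT96, which is what the Flink reader rejects above.

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch of the migration described above (catalog and table names are
// hypothetical). add_files registers the existing Hive data files into the
// Iceberg table without rewriting them, so their INT96 timestamp columns are
// carried over as-is.
public class AddFilesMigration {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hive-to-iceberg-add-files")
        .getOrCreate();

    spark.sql(
        "CALL my_catalog.system.add_files("
            + "table => 'db.iceberg_target', "
            + "source_table => 'db.hive_source')");
  }
}
```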

The solution was inspired by #1184

github-actions bot added the flink label Jan 26, 2022
@smallx (Contributor) commented Jan 28, 2022

I like this 🐛

@kbendick (Contributor) left a comment

Thanks for the contribution @kingeasternsun!

In the systems I'm familiar with, reading INT96 as a timestamp is usually a configurable option. I'm not sure whether it's configurable in Flink, but either way it's something we might consider.

github-actions bot added the build label Feb 10, 2022
List<RowData> readDataRows = rowDatasFromFile(parquetInputFile, schema);
Assert.assertEquals(rows.size(), readDataRows.size());
for (int i = 0; i < rows.size(); i += 1) {
  Assert.assertEquals(rows.get(i).getLong(0), readDataRows.get(i).getLong(0));
Review comment from a Contributor:

I thought that Flink used millisecond precision for timestamps? Spark uses microsecond. Should these match?
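
For context on the precision question, here is a rough sketch (not the code in this PR) of how a Parquet INT96 value decodes to an epoch timestamp and how it could be carried into Flink's TimestampData without dropping microseconds:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.flink.table.data.TimestampData;

// INT96 carries nanosecond-of-day plus a Julian day, so the reader has to
// decide how much of that precision survives in Flink's TimestampData.
public class Int96Decode {
  private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L; // 1970-01-01
  private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

  static long toEpochMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();   // first 8 bytes
    long julianDay = buf.getInt();     // last 4 bytes
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }

  static TimestampData toTimestampData(byte[] int96) {
    long micros = toEpochMicros(int96);
    // Keep microsecond precision: milliseconds plus nanos-of-millisecond.
    long millis = Math.floorDiv(micros, 1_000L);
    int nanosOfMilli = (int) Math.floorMod(micros, 1_000L) * 1_000;
    return TimestampData.fromEpochMillis(millis, nanosOfMilli);
  }
}
```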

Reply from the author:

I'll reconsider it

github-actions bot added the core label Feb 21, 2022
exclude group: 'org.apache.hive', module: 'hive-storage-api'
}

testImplementation "org.apache.spark:spark-sql_2.12:3.2.0"
Review comment from a Member:

It seems this dependency was added to write the timestamp as INT96 in the unit test, but Apache Flink's ParquetRowDataWriter already supports writing a timestamp_with_local_time_zone as INT96. So I would suggest using the Flink Parquet writer rather than the Spark Parquet writer. (It seems strange to me to introduce a Spark module in the Flink module.)

Reply from a Contributor:

I actually prefer using the Spark module, unless Flink natively supports writing INT96 timestamps to Parquet. The benefit of using the Spark module is that support has been around for a long time and is relatively trusted to produce correct INT96 timestamp values.
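
For illustration, producing the INT96 test data with Spark only requires the standard output-timestamp-type setting. A minimal sketch, with a hypothetical output path and column name:

```java
import java.sql.Timestamp;
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of writing an INT96-timestamp Parquet fixture with Spark
// (path and column name are hypothetical, not the PR's test code).
public class WriteInt96Fixture {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("int96-fixture")
        .config("spark.sql.parquet.outputTimestampType", "INT96")
        .getOrCreate();

    Dataset<Row> df = spark.createDataFrame(
        Collections.singletonList(new Event(Timestamp.valueOf("2022-01-26 00:00:00"))),
        Event.class);

    // The resulting file stores the timestamp column as physical INT96,
    // which is the case FlinkParquetReaders previously rejected.
    df.write().parquet("/tmp/int96-fixture");
  }

  // Simple bean so Spark can infer the schema.
  public static class Event implements java.io.Serializable {
    private Timestamp ts;
    public Event() {}
    public Event(Timestamp ts) { this.ts = ts; }
    public Timestamp getTs() { return ts; }
    public void setTs(Timestamp ts) { this.ts = ts; }
  }
}
```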

}

@Test
public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOException {
Review comment from a Member:

I would suggest writing a few rows with the Flink native writers, and then using the following readers to assert their results (a rough sketch follows this list):

  • the Flink native Parquet reader
  • the Iceberg generic Parquet reader
  • the Iceberg Flink Parquet reader
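
As a rough illustration of that suggestion (not the PR's test code), the same file could be read back through the Iceberg generic reader and the Iceberg Flink reader and the rows compared; the Flink native reader would be a third, analogous path. The file and schema are assumed to come from the test setup:

```java
import java.io.File;

import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.flink.data.FlinkParquetReaders;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.parquet.Parquet;

// Sketch of reading one Parquet file back through two of the readers above.
public class ReadBackSketch {
  static CloseableIterable<Record> genericRows(File file, Schema schema) {
    return Parquet.read(Files.localInput(file))
        .project(schema)
        .createReaderFunc(fileSchema -> GenericParquetReaders.buildReader(schema, fileSchema))
        .build();
  }

  static CloseableIterable<RowData> flinkRows(File file, Schema schema) {
    return Parquet.read(Files.localInput(file))
        .project(schema)
        .createReaderFunc(fileSchema -> FlinkParquetReaders.buildReader(schema, fileSchema))
        .build();
  }
}
```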

@kingeasternsun (Contributor, Author) commented Jan 16, 2023

Hi @rdblue, @openinx, @kbendick, thanks for your advice. The conflict has been resolved; could this PR be merged?

github-actions bot commented Aug 7, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions bot added the stale label Aug 7, 2024
github-actions bot commented:
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Aug 15, 2024