
Conversation

@kingeasternsun (Contributor) commented Jan 26, 2022

In Spark we used the `add_files` procedure to migrate a Hive table to an Iceberg table, but when we use Flink to read the target table, it fails with these errors:

java.lang.UnsupportedOperationException: Unsupported type: optional int96 wafer_start_time = 4

  at org.apache.iceberg.flink.data.FlinkParquetReaders$ReadBuilder.primitive(FlinkParquetReaders.java:268)
  at org.apache.iceberg.flink.data.FlinkParquetReaders$ReadBuilder.primitive(FlinkParquetReaders.java:73)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:52)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitField(TypeWithSchemaVisitor.java:155)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visitFields(TypeWithSchemaVisitor.java:169)
  at org.apache.iceberg.parquet.TypeWithSchemaVisitor.visit(TypeWithSchemaVisitor.java:47)
  at org.apache.iceberg.flink.data.FlinkParquetReaders.buildReader(FlinkParquetReaders.java:68)
  at org.apache.iceberg.flink.source.RowDataFileScanTaskReader.lambda$newParquetIterable$1(RowDataFileScanTaskReader.java:138)
  at org.apache.iceberg.parquet.ReadConf.<init>(ReadConf.java:118)
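
For reference, the migration path that produces such files looks roughly like the sketch below. It uses the Spark `add_files` procedure with hypothetical catalog and table names; the original Hive/Spark writers store the timestamp column as physical INT96, which is what the Flink reader rejects above.

```java
import org.apache.spark.sql.SparkSession;

// Minimal sketch of the migration described above (catalog and table names are
// hypothetical). add_files registers the existing Hive data files into the
// Iceberg table without rewriting them, so their INT96 timestamp columns are
// carried over as-is.
public class AddFilesMigration {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("hive-to-iceberg-add-files")
        .getOrCreate();

    spark.sql(
        "CALL my_catalog.system.add_files("
            + "table => 'db.iceberg_target', "
            + "source_table => 'db.hive_source')");
  }
}
```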

The solution was inspired by #1184

github-actions bot added the flink label Jan 26, 2022
@smallx (Contributor) commented Jan 28, 2022

I like this 🐛

@kbendick (Contributor) left a comment

Thanks for the contribution @kingeasternsun!

In the systems I'm familiar with, reading INT96 as a timestamp is usually a configurable option. I'm not sure whether it's configurable in Flink, but either way it's something we might consider.

github-actions bot added the build label Feb 10, 2022
List<RowData> readDataRows = rowDatasFromFile(parquetInputFile, schema);
Assert.assertEquals(rows.size(), readDataRows.size());
for (int i = 0; i < rows.size(); i += 1) {
  Assert.assertEquals(rows.get(i).getLong(0), readDataRows.get(i).getLong(0));
Review comment from a Contributor:

I thought that Flink used millisecond precision for timestamps? Spark uses microsecond. Should these match?
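
For context on the precision question, here is a rough sketch (not the code in this PR) of how a Parquet INT96 value decodes to an epoch timestamp and how it could be carried into Flink's TimestampData without dropping microseconds:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

import org.apache.flink.table.data.TimestampData;

// INT96 carries nanosecond-of-day plus a Julian day, so the reader has to
// decide how much of that precision survives in Flink's TimestampData.
public class Int96Decode {
  private static final long UNIX_EPOCH_JULIAN_DAY = 2_440_588L; // 1970-01-01
  private static final long MICROS_PER_DAY = 86_400L * 1_000_000L;

  static long toEpochMicros(byte[] int96) {
    ByteBuffer buf = ByteBuffer.wrap(int96).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong();   // first 8 bytes
    long julianDay = buf.getInt();     // last 4 bytes
    return (julianDay - UNIX_EPOCH_JULIAN_DAY) * MICROS_PER_DAY + nanosOfDay / 1_000L;
  }

  static TimestampData toTimestampData(byte[] int96) {
    long micros = toEpochMicros(int96);
    // Keep microsecond precision: milliseconds plus nanos-of-millisecond.
    long millis = Math.floorDiv(micros, 1_000L);
    int nanosOfMilli = (int) Math.floorMod(micros, 1_000L) * 1_000;
    return TimestampData.fromEpochMillis(millis, nanosOfMilli);
  }
}
```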

Reply from the author:

I'll reconsider it

github-actions bot added the core label Feb 21, 2022
exclude group: 'org.apache.hive', module: 'hive-storage-api'
}

testImplementation "org.apache.spark:spark-sql_2.12:3.2.0"
Review comment from a Member:

It seems this dependency was added to write the timestamp as INT96 in the unit test, but Apache Flink's ParquetRowDataWriter already supports writing a timestamp_with_local_time_zone as INT96. So I would suggest using the Flink Parquet writer rather than the Spark Parquet writer. (It seems strange to me to introduce a Spark module in the Flink module.)

Reply from a Contributor:

I actually prefer using the Spark module, unless Flink natively supports writing INT96 timestamps to Parquet. The benefit of using the Spark module is that support has been around for a long time and is relatively trusted to produce correct INT96 timestamp values.
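
For illustration, producing the INT96 test data with Spark only requires the standard output-timestamp-type setting. A minimal sketch, with a hypothetical output path and column name:

```java
import java.sql.Timestamp;
import java.util.Collections;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

// Sketch of writing an INT96-timestamp Parquet fixture with Spark
// (path and column name are hypothetical, not the PR's test code).
public class WriteInt96Fixture {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .master("local[1]")
        .appName("int96-fixture")
        .config("spark.sql.parquet.outputTimestampType", "INT96")
        .getOrCreate();

    Dataset<Row> df = spark.createDataFrame(
        Collections.singletonList(new Event(Timestamp.valueOf("2022-01-26 00:00:00"))),
        Event.class);

    // The resulting file stores the timestamp column as physical INT96,
    // which is the case FlinkParquetReaders previously rejected.
    df.write().parquet("/tmp/int96-fixture");
  }

  // Simple bean so Spark can infer the schema.
  public static class Event implements java.io.Serializable {
    private Timestamp ts;
    public Event() {}
    public Event(Timestamp ts) { this.ts = ts; }
    public Timestamp getTs() { return ts; }
    public void setTs(Timestamp ts) { this.ts = ts; }
  }
}
```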

}

@Test
public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOException {
Review comment from a Member:

I would suggest writing a few rows with the Flink native writers, and then using the following readers to assert their results (a rough sketch follows this list):

  • the Flink native Parquet reader
  • the Iceberg generic Parquet reader
  • the Iceberg Flink Parquet reader
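
As a rough illustration of that suggestion (not the PR's test code), the same file could be read back through the Iceberg generic reader and the Iceberg Flink reader and the rows compared; the Flink native reader would be a third, analogous path. The file and schema are assumed to come from the test setup:

```java
import java.io.File;

import org.apache.flink.table.data.RowData;
import org.apache.iceberg.Files;
import org.apache.iceberg.Schema;
import org.apache.iceberg.data.Record;
import org.apache.iceberg.data.parquet.GenericParquetReaders;
import org.apache.iceberg.flink.data.FlinkParquetReaders;
import org.apache.iceberg.io.CloseableIterable;
import org.apache.iceberg.parquet.Parquet;

// Sketch of reading one Parquet file back through two of the readers above.
public class ReadBackSketch {
  static CloseableIterable<Record> genericRows(File file, Schema schema) {
    return Parquet.read(Files.localInput(file))
        .project(schema)
        .createReaderFunc(fileSchema -> GenericParquetReaders.buildReader(schema, fileSchema))
        .build();
  }

  static CloseableIterable<RowData> flinkRows(File file, Schema schema) {
    return Parquet.read(Files.localInput(file))
        .project(schema)
        .createReaderFunc(fileSchema -> FlinkParquetReaders.buildReader(schema, fileSchema))
        .build();
  }
}
```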

@kingeasternsun (Contributor, Author) commented Jan 16, 2023

Hi @rdblue, @openinx, @kbendick, thanks for your advice. The conflict has been resolved; could this PR be merged?

github-actions bot commented Aug 7, 2024

This pull request has been marked as stale due to 30 days of inactivity. It will be closed in 1 week if no further activity occurs. If you think that’s incorrect or this pull request requires a review, please simply write any comment. If closed, you can revive the PR at any time and @mention a reviewer or discuss it on the [email protected] list. Thank you for your contributions.

github-actions bot added the stale label Aug 7, 2024
github-actions bot commented:
This pull request has been closed due to lack of activity. This is not a judgement on the merit of the PR in any way. It is just a way of keeping the PR queue manageable. If you think that is incorrect, or the pull request requires review, you can revive the PR at any time.

github-actions bot closed this Aug 15, 2024