Conversation

@gustavoatt
Contributor

Summary

Add read support for Parquet INT96 timestamps (fixes #1138). This is needed so that Parquet files written by Spark with INT96 timestamps can be read by Iceberg without rewriting those files, which is especially useful for migrations.

apache/parquet-format#49 has more information about how Parquet INT96 timestamps are stored. Note that I only implemented read support, since this representation has many issues (as is visible in the conversation on that parquet-format PR).
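
For context: an INT96 timestamp is 12 bytes, an 8-byte little-endian nanoseconds-of-day value followed by a 4-byte little-endian Julian day number. Below is a minimal sketch of the decoding, using the standard epoch constants; the class and method names are illustrative, not the PR's code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class Int96Decoding {
  // Julian day number of the Unix epoch, 1970-01-01.
  private static final long JULIAN_DAY_OF_EPOCH = 2_440_588L;
  private static final long MICROS_PER_DAY = 24L * 60 * 60 * 1_000_000;

  // Converts a 12-byte INT96 value into microseconds since the Unix epoch.
  static long int96ToMicros(byte[] int96Bytes) {
    ByteBuffer buf = ByteBuffer.wrap(int96Bytes).order(ByteOrder.LITTLE_ENDIAN);
    long nanosOfDay = buf.getLong(); // first 8 bytes: time within the day
    int julianDay = buf.getInt();    // last 4 bytes: Julian day number
    return (julianDay - JULIAN_DAY_OF_EPOCH) * MICROS_PER_DAY + nanosOfDay / 1_000;
  }
}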

Testing

  • Added a unit test for the Spark readers
  • I'm unsure where best to add unit tests for the non-Spark Parquet readers; I would gladly add one.

@gustavoatt gustavoatt changed the title Gustavoatt parquet read int96 timestamps Read support for parquet int96 timestamps Jul 8, 2020
@gustavoatt
Contributor Author

cc: @rdblue @edgarRd

SparkSession.builder()
.master("local[2]")
.config("spark.sql.parquet.int96AsTimestamp", "false")
.getOrCreate();
Contributor

Is it possible to avoid creating a Spark session just to write a timestamp? What about calling Spark's FileFormat to write directly instead?

We wrap Spark's FileFormat in our DSv2 table implementation: https://github.com/Netflix/iceberg/blob/netflix-spark-2.4/metacat/src/main/java/com/netflix/iceberg/batch/BatchPatternWrite.java#L90

This test would run much faster by using that to create a file instead of creating a Spark context.

Contributor Author

Yes, I would much rather avoid creating a SparkSession here if possible. However, looking into ParquetFileFormat, it seems we would still need to pass a SparkSession to create the writer.

I can look at ParquetOutputWriter, but I might need to match its configuration with what Spark uses to write INT96.

Contributor Author

Another approach would be to check in a Parquet file written by Spark and have the test just read it.

A drawback of that approach is that updating the file would be brittle, but I could check in the code that writes it as an ignored test. That would keep us from creating a Spark session during unit tests. What do you think @rdblue?

Contributor

At one point, we supported writing to Parquet using Spark's built-in WriteSupport. I think we can probably get that working again to create the files.

Contributor Author

Yes, looking at one of the tests, we do support writing Parquet files using Spark's WriteSupport.

To be able to use a FileAppender, I had to add a TimestampAsInt96 type (which can only be written using Spark's built-in WriteSupport) so that schema conversion within Iceberg's ParquetWriteSupport knows that these timestamps should be encoded as INT96 in the Parquet schema.


final String parquetPath = temp.getRoot().getAbsolutePath() + "/parquet_int96";
final java.sql.Timestamp ts = java.sql.Timestamp.valueOf("2014-01-01 23:00:01");
spark.createDataset(ImmutableList.of(ts), Encoders.TIMESTAMP()).write().parquet(parquetPath);
Contributor

Using Spark's FileFormat would also make this test easier. You'd be able to pass in a value in micros and validate that you get the same value back, unmodified. You'd also not need to locate the Parquet file using find.

@rdblue
Contributor

rdblue commented Jul 14, 2020

Mostly looks good, but I'd like to fix up the test to avoid creating a SparkSession for just one case. Thanks @gustavoatt!

Contributor Author

@gustavoatt gustavoatt left a comment

Simplified the test by removing the SparkSession while still using Spark's ParquetWriteSupport. PTAL @rdblue

/**
* @return Timestamp type (with timezone) represented as INT96. This is only added for compatibility reasons
* and can only be written using Spark's ParquetWriteSupport. Writing this type should be avoided.
*/
Contributor

I don't think we should change the type system to support this. INT96 may be something that we can read, but Iceberg cannot write it, per the spec.

Was this needed to build the tests?

Contributor Author

Agreed. I found a way to run the tests without adding a new type: I created an implementation of ParquetWriter.Builder that uses Spark's ParquetWriteSupport together with Iceberg's ParquetWriteAdapter, which avoids creating a SparkSession.
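
For illustration, here is a minimal sketch of what such a builder can look like. The class name and the schema-key handling are my assumptions, not necessarily the PR's exact code; the ParquetWriter.Builder and ParquetWriteSupport APIs are real.

import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.api.WriteSupport;
import org.apache.spark.sql.catalyst.InternalRow;
import org.apache.spark.sql.execution.datasources.parquet.ParquetWriteSupport;
import org.apache.spark.sql.types.StructType;

// Hypothetical builder that plugs Spark's ParquetWriteSupport into a plain
// ParquetWriter, so InternalRows can be written without a SparkSession.
class NativeSparkWriterBuilder
    extends ParquetWriter.Builder<InternalRow, NativeSparkWriterBuilder> {
  private final Map<String, String> config = new HashMap<>();

  NativeSparkWriterBuilder(Path path) {
    super(path);
  }

  // Collects Spark SQL properties to apply to the Hadoop Configuration.
  NativeSparkWriterBuilder set(String property, String value) {
    config.put(property, value);
    return self();
  }

  // Stores the Spark schema under the key Spark's write support reads it from.
  NativeSparkWriterBuilder schema(StructType schema) {
    config.put("org.apache.spark.sql.parquet.row.attributes", schema.json());
    return self();
  }

  @Override
  protected NativeSparkWriterBuilder self() {
    return this;
  }

  @Override
  protected WriteSupport<InternalRow> getWriteSupport(Configuration conf) {
    config.forEach(conf::set);
    return new ParquetWriteSupport();
  }
}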

final Schema schema = new Schema(required(1, "ts", Types.TimestampType.asSparkInt96()));
final StructType sparkSchema = SparkSchemaUtil.convert(schema);
final Path parquetFile = Paths.get(temp.getRoot().getAbsolutePath(), "parquet_int96.parquet");
final List<InternalRow> rows = Lists.newArrayList(RandomData.generateSpark(schema, 10, 0L));
Contributor

Nit: we don't use final for local variables.

Contributor Author

Done. Removed these final modifiers.

public void testInt96TimestampProducedBySparkIsReadCorrectly() throws IOException {
final Schema schema = new Schema(required(1, "ts", Types.TimestampType.asSparkInt96()));
final StructType sparkSchema = SparkSchemaUtil.convert(schema);
final Path parquetFile = Paths.get(temp.getRoot().getAbsolutePath(), "parquet_int96.parquet");
Contributor

Why not use temp.newFile?

Contributor Author

I initially tried it that way, but the writer fails because temp.newFile creates the file on disk and the Parquet writer refuses to overwrite an existing file.

.set("spark.sql.parquet.int96AsTimestamp", "true")
.set("spark.sql.parquet.writeLegacyFormat", "false")
.set("spark.sql.parquet.outputTimestampType", "INT96")
.schema(schema)
Contributor

I'd prefer to pass in a normal timestamp type and set a property, if needed, to enable INT96 support.

Contributor Author

I'm not sure I fully understand this comment.

I did change my approach here, though: while still writing InternalRow, I removed most of these properties and kept only the ones needed to make sure Spark writes these timestamps as INT96.
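
For instance, the write side of the test might then look roughly like this; a sketch assuming the hypothetical builder above, though the two Spark property keys are real configuration names.

// Only the properties that make Spark's write support emit INT96 are set.
try (ParquetWriter<InternalRow> writer =
    new NativeSparkWriterBuilder(new Path(parquetFile.toUri()))
        .set("spark.sql.parquet.writeLegacyFormat", "false")
        .set("spark.sql.parquet.outputTimestampType", "INT96")
        .schema(sparkSchema)
        .build()) {
  for (InternalRow row : rows) {
    writer.write(row);
  }
}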

Contributor Author

@gustavoatt gustavoatt left a comment

Thanks for the review @rdblue! I was able to keep Iceberg's types unchanged and only add read support for INT96 timestamps, so this PR should be ready for another look 👀

@Override
public LocalDateTime read(LocalDateTime reuse) {
final ByteBuffer byteBuffer = column.nextBinary().toByteBuffer().order(ByteOrder.LITTLE_ENDIAN);
Contributor

Note for reviewers (and future me): toByteBuffer returns a duplicate of the internal buffer, so it is safe for callers to modify the buffer's position with methods like getLong.
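
Illustrative fragment of how such a read typically proceeds (my annotation, following the variable names in the snippet above; not necessarily the PR's exact lines):

// Because toByteBuffer() returns a duplicate, these relative reads advance only
// the duplicate's position; the Binary's internal buffer is untouched.
long timeOfDayNanos = byteBuffer.getLong(); // consumes bytes 0-7
int julianDay = byteBuffer.getInt();        // consumes bytes 8-11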

@rdblue
Contributor

rdblue commented Jul 24, 2020

Nice work, @gustavoatt! Thank you for updating this so that the test is self-contained.

I'll merge this when tests are passing.

@rdblue rdblue merged commit 62ab6a7 into apache:master Jul 24, 2020
@rdblue
Contributor

rdblue commented Jul 24, 2020

Merged. Thanks for fixing this, @gustavoatt!

@thesquelched

Awesome possum, thanks for resolving this

@gustavoatt
Contributor Author

Thanks for merging and for the review @rdblue!

@gustavoatt gustavoatt deleted the gustavoatt--parquet-read-int96-timestamps branch July 24, 2020 21:17
rdblue pushed a commit to rdblue/iceberg that referenced this pull request Jul 29, 2020
aokolnychyi pushed a commit to aokolnychyi/iceberg that referenced this pull request Aug 18, 2020
cmathiesen pushed a commit to ExpediaGroup/iceberg that referenced this pull request Aug 19, 2020