Conversation

@rdblue (Contributor) commented Nov 21, 2020

This is a draft of how we can extend the current writer classes to handle deltas and file rolling.

  • Adds a DataWriter class, analogous to EqualityDeleteWriter and PositionDeleteWriter
  • Adds methods to FileAppenderFactory to create writers; it may be cleaner to move these to a separate WriterFactory
  • Adds a rolling writer implementation for equality deletes to BaseTaskWriter
  • Abstracts rolling writer logic into a base class
  • Adds an example UnpartitionedDeltaWriter

These changes are incomplete, so this probably doesn't work yet. This would still need a real task writer implementation and additional arguments passed to create the FileAppenderFactory.
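
For reference, here is a sketch of what the extended factory could look like. Everything below is an assumption based on the description above, not the final API:

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.StructLike;
import org.apache.iceberg.deletes.EqualityDeleteWriter;
import org.apache.iceberg.deletes.PositionDeleteWriter;
import org.apache.iceberg.encryption.EncryptedOutputFile;
import org.apache.iceberg.io.DataWriter;
import org.apache.iceberg.io.FileAppender;
import org.apache.iceberg.io.OutputFile;

// Sketch only: method names and signatures are assumptions, not the final API.
public interface FileAppenderFactory<T> {
  // existing: plain appenders for data files
  FileAppender<T> newAppender(OutputFile outputFile, FileFormat format);

  // proposed: writers that also produce content file metadata
  DataWriter<T> newDataWriter(EncryptedOutputFile outputFile, FileFormat format, StructLike partition);

  EqualityDeleteWriter<T> newEqDeleteWriter(EncryptedOutputFile outputFile, FileFormat format, StructLike partition);

  PositionDeleteWriter<T> newPosDeleteWriter(EncryptedOutputFile outputFile, FileFormat format, StructLike partition);
}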

switch (format) {
  case PARQUET:
    return Parquet.writeDeletes(outputFile.encryptingOutputFile())
        .createWriterFunc(msgType -> SparkParquetWriters.buildWriter(dsSchema, msgType))
Member

Here we will need to provide a schema that combines the writeSchema with the file and pos columns to construct the ParquetValueWriter, because the PositionDeleteStructWriter treats the user-provided row plus the file and pos columns as a single record to write into the target file.

Contributor Author

The builder here will automatically wrap the row schema and the function passed in to add the extra schema layer, so we just need to configure this for rows, not for the combined schema. That's part of why the position delete writer's delete method accepts the file and position independently of the row: it keeps the encapsulation and doesn't leak that concern to places like this.
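
To illustrate that encapsulation, a hedged usage sketch (the factory call and the variable names are assumptions from this draft):

// Sketch: the caller passes file and position separately from the row; the
// writer assembles the combined (file, pos, row) record internally, so the
// combined schema never leaks to the caller.
PositionDeleteWriter<InternalRow> deleteWriter =
    appenderFactory.newPosDeleteWriter(outputFile, FileFormat.PARQUET, partition);
try {
  deleteWriter.delete(dataFilePath, rowPosition, deletedRow);  // the row is optional
} finally {
  deleteWriter.close();
}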

Member

In this expression, createWriterFunc(msgType -> SparkParquetWriters.buildWriter(dsSchema, msgType)), the msgType already includes the file and pos columns, but dsSchema does not. I mean we need to provide a dsSchema that matches the msgType exactly. It's similar to this: https://github.com/apache/iceberg/pull/1663/files#diff-7f498f01885f6e813bc3892c8dfb02b8893365540438b78b3a0221f9c8667c8fR211

Contributor Author

I think that's a bug. My intent was to have the caller configure the writer just as they would for a data file or an equality delete file. There is no need to expose that complexity to the writer. So we should update the builder to extract the row type and pass it into the createWriterFunc.
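
A rough sketch of that fix inside the delete builder; the helpers below are hypothetical and only show the shape of the change:

// Hypothetical sketch: the builder wraps the caller's function so that it
// only ever sees the row portion of the combined (file, pos, row) type.
appenderBuilder.createWriterFunc(combinedType -> {
  MessageType rowType = extractRowType(combinedType);       // hypothetical helper
  ParquetValueWriter<?> rowWriter = rowWriterFunc.apply(rowType);
  return wrapWithFilePathAndPos(rowWriter);                 // hypothetical wrapper adds file + pos
});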

  currentWriter.add(record);
}

public void delete(T record) throws IOException {
Member

Would it be better to add this delete(T record) method to the TaskWriter interface?

Member

Maybe not TaskWriter; it belongs on a DeltaWriter.

Contributor Author

Probably. I just wanted to demonstrate that we can add a delete here that works with the rolling writer. What we actually expose will probably be different.
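
For illustration, a minimal sketch of the rolling delete path; the field and method names are assumptions about the rolling base class, not the actual code:

// Sketch: deletes flow through the same roll-over check as inserts.
public void delete(T record) throws IOException {
  currentWriter.delete(record);   // write an equality delete to the open file
  this.currentRecordCount += 1;
  if (shouldRollToNewFile()) {    // e.g. the current file passed the target size
    closeCurrentWriter();
    openNewWriter();
  }
}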

import org.apache.iceberg.FileFormat;
import org.apache.iceberg.PartitionSpec;

public class UnpartitionedDeltaWriter<T> extends BaseTaskWriter<T> {
Member

I think we would still need an extra DeltaWriter between the TaskWriter and the RollingFileWriters. A fanout TaskWriter will receive rows from different partitions or buckets, and the per-partition (or per-bucket) writer that accepts both data records and equality deletes is what we'd call the DeltaWriter.

That way, we could move all of the equality delete logic into a single common DeltaWriter class, and the TaskWriter would focus on dispatching records to the DeltaWriter methods according to a customized policy. For example, Flink's RowData has INSERT/DELETE/UPDATE_BEFORE/UPDATE_AFTER row kinds; if a row is a DELETE, the fanout policy could direct it to the DeltaWriter's delete method.
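
A sketch of that layering; the interface, the method names, and the lookup helper are assumptions to illustrate the proposal, with Flink's RowKind as the example dispatch policy:

import java.io.Closeable;
import java.io.IOException;
import org.apache.flink.table.data.RowData;

// Sketch: one DeltaWriter per partition or bucket; the TaskWriter only routes.
interface DeltaWriter<T> extends Closeable {
  void write(T row) throws IOException;    // data record
  void delete(T row) throws IOException;   // equality delete
}

// In a fanout TaskWriter for Flink, dispatch by row kind:
public void write(RowData row) throws IOException {
  DeltaWriter<RowData> writer = writerFor(partitionKey(row));  // hypothetical lookup
  switch (row.getRowKind()) {
    case INSERT:
    case UPDATE_AFTER:
      writer.write(row);
      break;
    case DELETE:
    case UPDATE_BEFORE:
      writer.delete(row);
      break;
  }
}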

@rdblue (Contributor Author) commented Nov 25, 2020

Closing this because it is incorporated in #1818, which is a full working implementation.

@rdblue closed this Nov 25, 2020
