Flink: use SerializableTable for source #6407
Conversation
@Override
public IncrementalAppendScan newIncrementalAppendScan() {
  TableOperations ops = new StaticTableOperations(metadataFileLocation, io, locationProvider);
  return new BaseIncrementalAppendScan(ops, lazyTable());
should this be lazyTable().newIncrementalAppendScan()?
nit: please move this method next to newScan() above, following the same order as the Table interface.
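A minimal sketch of what that delegation could look like (an illustration of the suggestion above, not the actual PR code):

// Hypothetical sketch: delegate to the lazily loaded table instead of
// building the scan from StaticTableOperations directly.
@Override
public IncrementalAppendScan newIncrementalAppendScan() {
  return lazyTable().newIncrementalAppendScan();
}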
Done.
  this.recordOffset = 0L;
}

public DataIterator(
should we remove the other constructor (also to avoid code duplication)? This is an @Internal class, so it should be OK.
Done.
private final EncryptionManager encryption;
private final ScanContext context;
private final RowDataFileScanTaskReader rowDataReader;
private final Table table;
nit: put this before context, following the same order of usage below.
return new FlinkInputFormat(
    tableLoader, icebergSchema, io, encryption, contextBuilder.build());
    (SerializableTable) SerializableTable.copyOf(table), contextBuilder.build());
Personally I like that in this PR FlinkInputFormat uses SerializableTable as the arg type, so that it is clear whether it is a SerializableTable or a regular table.
I see other places just use Table as the arg to avoid the type cast, e.g. RowDataTaskWriterFactory; hence just pointing it out.
static IcebergStreamWriter<RowData> createStreamWriter(
    Table table,
    FlinkWriteConf flinkWriteConf,
    RowType flinkRowType,
    List<Integer> equalityFieldIds) {
  Preconditions.checkArgument(table != null, "Iceberg table shouldn't be null");
  Table serializableTable = SerializableTable.copyOf(table);
  TaskWriterFactory<RowData> taskWriterFactory =
      new RowDataTaskWriterFactory(
          serializableTable,
          flinkRowType,
          flinkWriteConf.targetDataFileSize(),
          flinkWriteConf.dataFileFormat(),
          equalityFieldIds,
          flinkWriteConf.upsertMode());
  return new IcebergStreamWriter<>(table.name(), taskWriterFactory);
}
I see your point. Let me change those input parameters explicitly.
Is there a place where we specifically cannot work with Tables, just SerializableTables?
I usually try to stick to the lowest requirements in method arguments, so we can have higher reusability.
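A rough illustration of that principle (the helper class and method name are made up for this sketch): accepting the broad Table interface and copying inside keeps call sites free of casts.

import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;

// Hypothetical helper: take the lowest requirement (Table) and produce the
// serializable copy internally, so callers never need to cast or pre-copy.
class SerializableTables {
  static Table toSerializable(Table table) {
    // SerializableTable.copyOf captures the table state for serialization
    return SerializableTable.copyOf(table);
  }
}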
}

icebergSchema = table.schema();

nit: empty line not needed?
RowType flinkSchema = FlinkSchemaUtil.convert(table.schema());
this.taskWriterFactory =
    new RowDataTaskWriterFactory(
        SerializableTable.copyOf(table),
table seems like a regular table if we trace back the code, so we may not be able to change this.
Changed ctor input parameter to SerializableTable.
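For reference, a sketch of the constructor shape being discussed (field types are taken from the createStreamWriter call quoted above; the class name is hypothetical, not the actual RowDataTaskWriterFactory source):

import java.util.List;
import org.apache.flink.table.types.logical.RowType;
import org.apache.iceberg.FileFormat;
import org.apache.iceberg.SerializableTable;

// Hypothetical skeleton: the constructor states the serializable requirement
// explicitly, so a regular catalog-loaded Table cannot be passed in by accident.
class SerializableTableWriterFactory {
  private final SerializableTable table;
  private final RowType flinkSchema;
  private final long targetFileSizeBytes;
  private final FileFormat format;
  private final List<Integer> equalityFieldIds;
  private final boolean upsert;

  SerializableTableWriterFactory(
      SerializableTable table,
      RowType flinkSchema,
      long targetFileSizeBytes,
      FileFormat format,
      List<Integer> equalityFieldIds,
      boolean upsert) {
    this.table = table;
    this.flinkSchema = flinkSchema;
    this.targetFileSizeBytes = targetFileSizeBytes;
    this.format = format;
    this.equalityFieldIds = equalityFieldIds;
    this.upsert = upsert;
  }
}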
private final boolean caseSensitive;
private final FileIO io;
private final EncryptionManager encryption;
private final Table table;
nit: keep the same order as arg usage
    boolean caseSensitive,
    FileIO io,
    EncryptionManager encryption) {
    Table table, ReadableConfig config, Schema projectedSchema, boolean caseSensitive) {
if we want to be consistent, this should be SerializableTable
Done.
@ClassRule public static final TemporaryFolder TEMPORARY_FOLDER = new TemporaryFolder();

public static final HadoopTables tables = new HadoopTables(new Configuration());
can we use HadoopTableResource instead?
Sure, I didn't know that. It looks cleaner.
    flinkConfig,
    table.schema(),
    context.project(),
    context.nameMapping(),
this essentially deprecates ScanContext#nameMapping and only retrieves it from the table property. Technically, this breaks backward compatibility. Is there any use case where a Flink job needs to set a different name mapping than the table property?
I remember that when we were implementing historical queries in Hive (AS OF VERSION, AS OF TIMESTAMP), we found that the name mapping is bound to the table, not to the version. So if some specific schema evolution happens (migrate a table, rename a column, add back an old column, or something like this - sadly I do not remember the specifics), we were not able to restore the original mapping from the current one, and we were not able to query the data.
Being able to provide a name mapping could help here.
Arguably, this is a rare case, and the correct fix would be a spec change, but I still thought it was worth mentioning.
Also, it was long ago, so we might want to double check the current situation before acting on this 😄
IIRC, this one (#2275) does the job. Now we can use the schema id in the snapshot to track which schema it was written with. Anyway, let me revert this, since it should go in another PR even if we do want to remove it.
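For context, a sketch of the backward-compatible lookup the revert keeps (the helper class and method are illustrative only; it assumes the usual fallback to the table's default name-mapping property):

import org.apache.iceberg.Table;
import org.apache.iceberg.TableProperties;

// Hypothetical sketch: prefer a name mapping supplied through the scan
// context, otherwise fall back to the table's default name-mapping property.
class NameMappings {
  static String resolve(String contextNameMapping, Table table) {
    if (contextNameMapping != null) {
      return contextNameMapping;
    }
    return table.properties().get(TableProperties.DEFAULT_NAME_MAPPING);
  }
}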
} finally {
  workerPool.shutdown();
}
return FlinkSplitPlanner.planInputSplits(table, context, workerPool);
just to double check: the input format is only for batch queries, right? SerializableTable is read-only and can't be used for a long-running streaming source.
The Flink source uses the monitor function to monitor snapshots and forward splits to StreamingReaderOperator, and StreamingReaderOperator uses FlinkInputFormat to open the splits. Opening splits should work even if the table is read-only.
yeah. using a read-only table for the reader function is fine.
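A small sketch of why the read-only copy is enough for the reader path (the class and method names are hypothetical; only scan planning is exercised, no table mutation):

import java.io.IOException;
import java.io.UncheckedIOException;
import org.apache.iceberg.FileScanTask;
import org.apache.iceberg.SerializableTable;
import org.apache.iceberg.Table;
import org.apache.iceberg.io.CloseableIterable;

// Hypothetical sketch: planning and opening splits only needs read APIs,
// which a SerializableTable copy supports; no write or refresh calls involved.
class ReadOnlyScanExample {
  static long countScanTasks(Table table) {
    Table readOnlyCopy = SerializableTable.copyOf(table);
    long count = 0;
    try (CloseableIterable<FileScanTask> tasks = readOnlyCopy.newScan().planFiles()) {
      for (FileScanTask ignored : tasks) {
        count++;
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return count;
  }
}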
This revives the effort from #2987, copying most of it from @aokolnychyi's PR.