
Conversation

@chenjunjiedada (Collaborator) commented Dec 12, 2022

This revives the effort from #2987, copying most of it from @aokolnychyi's PR.

@Override
public IncrementalAppendScan newIncrementalAppendScan() {
  TableOperations ops = new StaticTableOperations(metadataFileLocation, io, locationProvider);
  return new BaseIncrementalAppendScan(ops, lazyTable());
}

Contributor:

should this be lazyTable().newIncrementalAppendScan?

nit: please move this method next to newScan() above, following the same order as the Table interface.

Collaborator Author:

Done.
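For reference, a minimal sketch of the delegating variant suggested above, assuming lazyTable() returns a fully loaded Table; whether the final change keeps the StaticTableOperations-based construction is a detail of the PR itself.

  @Override
  public IncrementalAppendScan newIncrementalAppendScan() {
    // Delegate to the lazily loaded table so the scan sees the same metadata
    // as newScan(); placed next to newScan() to mirror the Table interface order.
    return lazyTable().newIncrementalAppendScan();
  }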

this.recordOffset = 0L;
}

public DataIterator(

Contributor:

should we remove the other constructor (also to avoid code duplication)? This is an @Internal class, so it should be ok.

Collaborator Author:

Done.
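A rough sketch of what collapsing into a single constructor could look like; the parameter and field names below are illustrative, not the exact signature in the PR.

  // Hypothetical single constructor after removing the duplicated one.
  public DataIterator(
      FileScanTaskReader<T> fileScanTaskReader,
      CombinedScanTask task,
      FileIO io,
      EncryptionManager encryption) {
    this.fileScanTaskReader = fileScanTaskReader;
    this.io = io;
    this.encryption = encryption;
    // same position bookkeeping the removed constructor had
    this.recordOffset = 0L;
  }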

private final EncryptionManager encryption;
private final ScanContext context;
private final RowDataFileScanTaskReader rowDataReader;
private final Table table;

Contributor:

nit: put this before context, following the same order of usage below.


  return new FlinkInputFormat(
-     tableLoader, icebergSchema, io, encryption, contextBuilder.build());
+     (SerializableTable) SerializableTable.copyOf(table), contextBuilder.build());

Contributor:

Personally I like that in this PR FlinkInputFormat uses SerializableTable as the arg type, so it is clear whether it is a SerializableTable or a regular table.

I see other places just use Table as the arg to avoid the type cast, e.g. RowDataTaskWriterFactory, hence just pointing it out.

  static IcebergStreamWriter<RowData> createStreamWriter(
      Table table,
      FlinkWriteConf flinkWriteConf,
      RowType flinkRowType,
      List<Integer> equalityFieldIds) {
    Preconditions.checkArgument(table != null, "Iceberg table shouldn't be null");

    Table serializableTable = SerializableTable.copyOf(table);
    TaskWriterFactory<RowData> taskWriterFactory =
        new RowDataTaskWriterFactory(
            serializableTable,
            flinkRowType,
            flinkWriteConf.targetDataFileSize(),
            flinkWriteConf.dataFileFormat(),
            equalityFieldIds,
            flinkWriteConf.upsertMode());
    return new IcebergStreamWriter<>(table.name(), taskWriterFactory);
  }

Collaborator Author:

I see your point. Let me change those input parameters explicitly.

Contributor:

Is there a place where we specifically cannot work with Tables, only SerializableTables?
I usually try to stick to the lowest requirements in method arguments, so we can have higher reusability.
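One way to keep the lowest requirement while still getting a serializable copy where it matters is to accept a plain Table and copy internally; the factory method below is purely illustrative (its name and argument list are assumptions, not the PR's actual API).

  // Illustrative only: callers pass any Table; the serializable copy is made here.
  static FlinkInputFormat buildInputFormat(
      TableLoader tableLoader, Table table, ScanContext context) {
    Table serializableTable = SerializableTable.copyOf(table);
    return new FlinkInputFormat(tableLoader, serializableTable, context);
  }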

}

icebergSchema = table.schema();

Contributor:

nit: empty line not needed?

RowType flinkSchema = FlinkSchemaUtil.convert(table.schema());
this.taskWriterFactory =
    new RowDataTaskWriterFactory(
        SerializableTable.copyOf(table),

Contributor:

table seems like a regular table if we trace back the code, so we may not be able to change this.

Collaborator Author:

Changed ctor input parameter to SerializableTable.
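Roughly what the changed constructor signature looks like; apart from the table parameter type, the argument list below is illustrative.

  public RowDataTaskWriterFactory(
      SerializableTable table,   // was: Table table
      RowType flinkSchema,
      long targetFileSizeBytes,
      FileFormat format,
      List<Integer> equalityFieldIds,
      boolean upsert) {
    this.table = table;
    this.schema = table.schema();
    this.flinkSchema = flinkSchema;
    // remaining fields assigned as before
  }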

private final boolean caseSensitive;
private final FileIO io;
private final EncryptionManager encryption;
private final Table table;

Contributor:

nit: keep the same order as arg usage

-     boolean caseSensitive,
-     FileIO io,
-     EncryptionManager encryption) {
+     Table table, ReadableConfig config, Schema projectedSchema, boolean caseSensitive) {

Contributor:

if we want to be consistent, this should be SerializableTable

Collaborator Author:

Done.


@ClassRule public static final TemporaryFolder TEMPORARY_FOLDER = new TemporaryFolder();

public static final HadoopTables tables = new HadoopTables(new Configuration());

Contributor:

can we use HadoopTableResource instead?

Collaborator Author:

Sure, I didn't know that. It looks cleaner.
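A sketch of the test setup with the rule mentioned above; the database/table names, the schema constant, and the exact constructor arguments are placeholders.

  @ClassRule public static final TemporaryFolder TEMPORARY_FOLDER = new TemporaryFolder();

  // Hypothetical usage: the rule creates and cleans up a Hadoop-backed table per test.
  @Rule
  public final HadoopTableResource tableResource =
      new HadoopTableResource(TEMPORARY_FOLDER, "default", "t", TestFixtures.SCHEMA);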

flinkConfig,
table.schema(),
context.project(),
context.nameMapping(),

Contributor:

this essentially deprecates ScanContext#nameMapping and only retrieves it from the table property. Technically, this breaks backward compatibility. Is there any use case where a Flink job needs to set a different name mapping than the table property?

Contributor:

I remember that when we were implementing historical queries in Hive (AS OF VERSION, AS OF TIMESTAMP), we found that the name mapping is bound to the table, not to the version. So if some specific schema evolution happens (migrate a table, rename a column, add back an old column, or something like this; sadly I do not remember the specifics), we were not able to restore the original mapping from the current one, and we could not query the data.
Being able to provide a name mapping could help here.

Arguably this is a rare case, and the correct fix would be a spec change, but I still thought it worth mentioning.

Contributor:

Also, it was long ago, so we might want to double check the current situation before acting on this 😄

Collaborator Author:

IIRC, this one (#2275) does the job. Now we can use the schema id in the snapshot to track which schema it writes. Anyway, let me revert this since it should be in another PR even if we want to remove this.
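For context, reverting keeps the context-provided mapping authoritative. A sketch of the kind of lookup being discussed; the fallback to the table property is an assumption here, the PR itself simply keeps passing context.nameMapping() through.

  // Prefer the name mapping set on the ScanContext; only fall back to the
  // table property when the job did not set one (fallback is an assumption).
  String nameMapping =
      context.nameMapping() != null
          ? context.nameMapping()
          : table.properties().get(TableProperties.DEFAULT_NAME_MAPPING);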

} finally {
workerPool.shutdown();
}
return FlinkSplitPlanner.planInputSplits(table, context, workerPool);

Contributor:

Just to double check: the input format is only for batch queries, right? SerializableTable is read-only and can't be used for a long-running streaming source.

Collaborator Author:

The Flink source uses the monitor function to monitor snapshots and forward splits to StreamingReaderOperator, and StreamingReaderOperator uses FlinkInputFormat to open the splits. Opening splits should work even if the table is read-only.

Contributor:

Yeah, using a read-only table for the reader function is fine.
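In other words, the refreshable loader stays with the monitor function for snapshot discovery, while the per-split readers only need a read-only copy. A sketch of that split; the FlinkInputFormat argument list is illustrative, not the actual constructor.

  // Monitoring keeps the refreshable TableLoader; readers get the read-only copy.
  Table serializableTable = SerializableTable.copyOf(tableLoader.loadTable());
  FlinkInputFormat readerFormat =
      new FlinkInputFormat(tableLoader, serializableTable, context);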
