Contributor

@stevenzwu stevenzwu commented Aug 1, 2021

With this composition model, DataIterator deals with a CombinedScanTask and IteratorReader deals with individual FileScanTasks. DataIterator uses composition to reference an IteratorReader.

Here are the motivations behind this change:

  1. Address @openinx's review comment.
  2. Make it easier to extend. E.g., internally we would like to add another converter from RowData to some other type Foo. We can't define a FooDataIterator that extends RowDataIterator, because the generic type is already fixed to RowData. With this new composition model, we can define a FooIteratorReader that implements IteratorReader, and FooIteratorReader can use RowDataIteratorReader by composition.
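
A minimal sketch of the composition model described in item 2, assuming simplified stand-in types. `FileTask`, `RowIteratorReader`, `Foo`, and the `String` rows are hypothetical placeholders for illustration, not the real Iceberg or Flink classes, and the real interface in this PR returns a `CloseableIterator` rather than a plain `Iterator`.

```java
import java.io.Serializable;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical, simplified stand-ins; not the actual Iceberg classes.
class CompositionSketch {

  /** Stand-in for a FileScanTask: one "file" holding a few rows. */
  static final class FileTask {
    final List<String> rows;
    FileTask(List<String> rows) { this.rows = rows; }
  }

  /** Reads a single file task into an iterator of T. */
  interface IteratorReader<T> extends Serializable {
    Iterator<T> open(FileTask task);
  }

  /** Stand-in for RowDataIteratorReader; rows are plain strings here. */
  static final class RowIteratorReader implements IteratorReader<String> {
    @Override
    public Iterator<String> open(FileTask task) {
      return task.rows.iterator();
    }
  }

  /** Hypothetical target type Foo. */
  static final class Foo {
    final String value;
    Foo(String value) { this.value = value; }
  }

  /** Reuses RowIteratorReader by composition and converts each row to Foo. */
  static final class FooIteratorReader implements IteratorReader<Foo> {
    private final RowIteratorReader rowReader = new RowIteratorReader();

    @Override
    public Iterator<Foo> open(FileTask task) {
      Iterator<String> rows = rowReader.open(task);
      return new Iterator<Foo>() {
        @Override public boolean hasNext() { return rows.hasNext(); }
        @Override public Foo next() { return new Foo("foo:" + rows.next()); }
      };
    }
  }

  /** Walks the file tasks of a combined task, delegating each file to the reader. */
  static final class DataIterator<T> implements Iterator<T> {
    private final IteratorReader<T> reader;
    private final Iterator<FileTask> tasks;
    private Iterator<T> current = Collections.emptyIterator();

    DataIterator(IteratorReader<T> reader, List<FileTask> combinedTask) {
      this.reader = reader;
      this.tasks = combinedTask.iterator();
    }

    @Override
    public boolean hasNext() {
      // Move on to the next file whenever the current one is exhausted.
      while (!current.hasNext() && tasks.hasNext()) {
        current = reader.open(tasks.next());
      }
      return current.hasNext();
    }

    @Override
    public T next() {
      if (!hasNext()) {
        throw new NoSuchElementException();
      }
      return current.next();
    }
  }

  /** Drains a DataIterator<Foo> and returns the Foo values. */
  static List<String> readAllAsFoo(List<FileTask> combinedTask) {
    DataIterator<Foo> it = new DataIterator<>(new FooIteratorReader(), combinedTask);
    List<String> out = new ArrayList<>();
    while (it.hasNext()) {
      out.add(it.next().value);
    }
    return out;
  }

  public static void main(String[] args) {
    List<FileTask> combined = Arrays.asList(
        new FileTask(Arrays.asList("a", "b")),
        new FileTask(Arrays.asList("c")));
    System.out.println(readAllAsFoo(combined)); // prints [foo:a, foo:b, foo:c]
  }
}
```

In this sketch `DataIterator` stays generic and never needs a subclass per record type; a new output type only requires a new reader implementation.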

@github-actions github-actions bot added the flink label Aug 1, 2021
Contributor Author

@openinx @JingsongLi @rdblue can you help take a look?

/**
* Read a {@link FileScanTask} into a {@link CloseableIterator}
*/
public interface IteratorReader<T> extends Serializable {
Contributor

The name IteratorReader doesn't make sense to me because iterators aren't generally read. They are iterated through. It also seems strange to me to have a method that is essentially a way to create an iterator but is not a CloseableIterable. Is there a way to restructure this so that this is a CloseableIterable instead?

Last, why does this need to be Serializable?

Contributor Author

@stevenzwu stevenzwu Aug 13, 2021

what about FileReader as class name? Its purpose is to open/read a FileScanTask as a CloseableIterator or CloseableIterable.

It needs to be Serializable because it gets shipped/serialized from the jobmanager to the taskmanager during deployment.
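
To illustrate the serialization requirement, here is a rough sketch that simulates shipping a reader between processes with a plain Java serialization round trip. The nested `FileReader` interface, `PrefixingReader`, and `shipToWorker` are hypothetical names for this example only; Flink's actual deployment path is more involved.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins; not the actual Iceberg or Flink classes.
class SerializableReaderSketch {

  /** Simplified reader contract: configuration travels with the object, files are opened later. */
  interface FileReader<T> extends Serializable {
    Iterator<T> open(List<String> fileRows);
  }

  /** A reader whose only state is a small, serializable piece of configuration. */
  static final class PrefixingReader implements FileReader<String> {
    private static final long serialVersionUID = 1L;
    private final String prefix;

    PrefixingReader(String prefix) { this.prefix = prefix; }

    @Override
    public Iterator<String> open(List<String> fileRows) {
      return fileRows.stream().map(row -> prefix + row).iterator();
    }
  }

  /** Simulates the hop between processes: serialize to bytes, then deserialize a copy. */
  static <T extends Serializable> T shipToWorker(T obj) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
        out.writeObject(obj);
      }
      try (ObjectInputStream in =
          new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
        @SuppressWarnings("unchecked")
        T copy = (T) in.readObject();
        return copy;
      }
    } catch (IOException | ClassNotFoundException e) {
      throw new IllegalStateException("reader could not be serialized", e);
    }
  }

  public static void main(String[] args) {
    FileReader<String> created = new PrefixingReader("row:");
    FileReader<String> received = shipToWorker(created);
    System.out.println(received.open(Arrays.asList("a", "b")).next()); // prints row:a
  }
}
```

If the reader held a non-serializable field, `shipToWorker` would fail with a `NotSerializableException` wrapped in the `IllegalStateException`, which is the kind of failure the `Serializable` bound on the interface is meant to surface early.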

Contributor Author

renamed it to FileReader

@stevenzwu stevenzwu force-pushed the refactorDataIterator branch from 9ddaa6b to d96c737 Compare August 13, 2021 20:38
@github-actions github-actions bot added the core label Aug 13, 2021
this.io = io;
this.encryption = encryption;
this.context = context;
this.rowDataReader = new RowDataFileReader(tableSchema,
Member

Is it necessary to expose the RowDataFileReader to the FlinkInputFormat? The calling chain is actually: FlinkInputFormat (whole scan) -> DataIterator (CombinedScanTask reader) -> RowDataFileReader (FileScanTask reader). I mean the FlinkInputFormat won't call the FileScanTask reader directly, so I'd prefer to hide the internal details of how the RowDataFileReader is constructed inside DataIterator. That is more in line with good interface design.

Contributor Author

DataIterator is a generic type. The actual FileScanTaskReader needs to be provided when constructing DataIterator. Previously, FlinkInputFormat constructed the extended class RowDataIterator, which is removed in this PR. Now FlinkInputFormat constructs the RowDataFileScanTaskReader and passes it into the DataIterator constructor for composition.

Member

You mean there will be other data types applied to the generic DataIterator (for example, for flip-24)? There seems to be no other FileScanTaskReader implementation except the RowDataFileScanTaskReader. I think we may not want the generic type for the DataIterator; then we also don't need the extra FileScanTaskReader interface (we could just rename the RowDataFileScanTaskReader to FileScanTaskReader directly), and finally we could construct the FileScanTaskReader inside the DataIterator. What do you think?

Contributor Author

@stevenzwu stevenzwu Sep 3, 2021

It is correct that in iceberg-flink there is no impl of FileScanTaskReader other than RowDataFileScanTaskReader. But at Netflix and Apple, we use Avro schemas for the Kafka data and deserialize bytes into Avro Records. For the Iceberg source (e.g., for backfill or bootstrap), we also want to deserialize Iceberg data into Avro Records so that app code deals with the same data type whether it consumes from Kafka or Iceberg. Hence we will plug in an AvroGenericRecordFileScanTaskReader. That is one of the motivations for this refactoring.

Member

It makes sense to provide an abstracted FileScanTaskReader in the official Apache Iceberg repo. What concerns me is that this interface may get refactored or removed when other developers propose a new pull request, if we don't have another impl in the Apache repo. Will we have another impl for the flip-27 work in the official repo?

Contributor Author

@openinx I extracted AbstractFileScanTaskReader as you suggested.

In flip-27 source in apache iceberg repo, we won't have other FileScanTaskReader impl.

Member

Sorry, I think I did not describe this clearly. I mean the newly introduced FileScanTaskReader interface is nice to have in the Apache repo, so that others can provide their own implementations, such as AvroGenericRecordFileScanTaskReader. I don't think the latest commit's AbstractFileScanTaskReader will help share common code between AvroGenericRecordFileScanTaskReader and RowDataFileScanTaskReader, because the things it exposes, such as DeleteFilter, newAvroIterator, and InputFilesDecryptor, are not stable enough; we may change or refactor them in following PRs (the DeleteFilter will need a big refactor to support unified v2 compaction).

Let's just revert the commit 72348c7. All the other things look great to me now, and I plan to merge this PR. Thanks for the great work @stevenzwu!

Contributor Author

@openinx sorry for the misunderstanding. I have reverted the last commit by hard reset and force push

@stevenzwu stevenzwu force-pushed the refactorDataIterator branch from 7531913 to 7db0ec9 Compare September 2, 2021 16:16

  @Override
- protected CloseableIterator<RowData> openTaskIterator(FileScanTask task) {
+ public CloseableIterator<RowData> open(FileScanTask task, InputFilesDecryptor inputFilesDecryptor) {
Member

If we construct the RowDataFileScanTaskReader inside the DataIterator, then we don't need to pass the InputFilesDecryptor on every call to the open(FileScanTask) method. We also don't need to pass the InputFilesDecryptor to the newXXXIterable methods (I mean those changes could be reverted), because they could just use the class's private InputFilesDecryptor instance.

Contributor Author

This is also tied to the other comment where we need DataIterator to be generic.

What you described above is similar to the current state, where RowDataIterator extends DataIterator. Thanks to inheritance, RowDataIterator can access protected methods and variables of the base class.

If we switch to the composition model, we need to pass the InputFilesDecryptor in to the RowDataFileScanTaskReader.
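
A rough sketch of that trade-off, assuming simplified stand-in types: the iterator owns the decryptor and hands it to the reader on every `open` call, because a composed reader cannot reach into the iterator's fields the way a subclass could. `Decryptor`, `RowReader`, the `"enc:"` marker, and the use of `List<String>` as a file are hypothetical placeholders, not the real `InputFilesDecryptor` API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins; not the actual Iceberg classes.
class DecryptorPassingSketch {

  /** Stand-in for InputFilesDecryptor: "decrypts" by stripping an "enc:" marker. */
  static final class Decryptor {
    String decrypt(String raw) {
      return raw.startsWith("enc:") ? raw.substring(4) : raw;
    }
  }

  /** The reader receives the decryptor as an argument instead of inheriting it. */
  interface FileScanTaskReader<T> {
    Iterator<T> open(List<String> fileRows, Decryptor decryptor);
  }

  static final class RowReader implements FileScanTaskReader<String> {
    @Override
    public Iterator<String> open(List<String> fileRows, Decryptor decryptor) {
      List<String> out = new ArrayList<>();
      for (String raw : fileRows) {
        out.add(decryptor.decrypt(raw));
      }
      return out.iterator();
    }
  }

  /** Owns the decryptor for the combined task and passes it along for each file. */
  static final class DataIterator<T> {
    private final FileScanTaskReader<T> reader;
    private final Decryptor decryptor;
    private final Iterator<List<String>> files;
    private Iterator<T> current = Collections.emptyIterator();

    DataIterator(FileScanTaskReader<T> reader, Decryptor decryptor, List<List<String>> combinedTask) {
      this.reader = reader;
      this.decryptor = decryptor;
      this.files = combinedTask.iterator();
    }

    boolean hasNext() {
      while (!current.hasNext() && files.hasNext()) {
        current = reader.open(files.next(), decryptor);
      }
      return current.hasNext();
    }

    T next() {
      hasNext(); // advance to the next non-empty file if needed
      return current.next(); // throws NoSuchElementException when exhausted
    }
  }

  /** Reads every row of every file in the combined task. */
  static List<String> readAll(List<List<String>> combinedTask) {
    DataIterator<String> it = new DataIterator<>(new RowReader(), new Decryptor(), combinedTask);
    List<String> out = new ArrayList<>();
    while (it.hasNext()) {
      out.add(it.next());
    }
    return out;
  }

  public static void main(String[] args) {
    List<List<String>> combined = Arrays.asList(
        Arrays.asList("enc:a", "enc:b"),
        Arrays.asList("enc:c"));
    System.out.println(readAll(combined)); // prints [a, b, c]
  }
}
```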

Member

@openinx openinx left a comment

Thanks @stevenzwu for the refactoring; this PR almost looks good to me. I just left several comments.

@stevenzwu stevenzwu force-pushed the refactorDataIterator branch from 72348c7 to 7db0ec9 Compare September 6, 2021 15:51
Member

@openinx openinx left a comment

LGTM

@openinx openinx merged commit 5f90476 into apache:master Sep 7, 2021

3 participants