[WIP] Flink Iceberg sink #856
Conversation
IMO, this PR is too big to review; we should break it down into small tasks and submit them one by one. Besides, we should have a design doc for this.
@jerryshao Thanks for the comments! Yes, I will break it down into tasks/small pieces of code with separate functions, which will be more convenient for reviewers. For the design doc, please take a look here at what I just uploaded.
Review threads (resolved) on:
- flink/src/main/java/org/apache/iceberg/flink/connector/IcebergConnectorConstant.java
- flink/src/main/java/org/apache/iceberg/flink/connector/model/CommitMetadataUtil.java
- flink/src/main/java/org/apache/iceberg/flink/connector/sink/AvroSerializer.java
- flink/src/main/java/org/apache/iceberg/flink/connector/sink/AvroUtils.java
- flink/src/main/java/org/apache/iceberg/flink/connector/sink/FileWriter.java
```java
 *
 * @param serializer Serialize input data type to Avro GenericRecord
 */
public IcebergSinkAppender<IN> withSerializer(AvroSerializer<IN> serializer) {
```

This should be removed.
versions.lock
```
com.google.code.gson:gson:2.2.4 (2 constraints: 9518bfd2)
com.github.scopt:scopt_2.11:3.5.0 (1 constraints: 5a0ef868)
com.github.stephenc.findbugs:findbugs-annotations:1.3.9-1 (5 constraints: 693c9a5d)
com.google.code.findbugs:jsr305:3.0.2 (18 constraints: a0f346c9)
```

This dependency must be excluded because it has a questionable license.
versions.lock
```
ch.qos.logback:logback-core:1.0.9 (4 constraints: dd32435a)
co.cask.tephra:tephra-api:0.6.0 (3 constraints: 0828ded1)
co.cask.tephra:tephra-core:0.6.0 (2 constraints: 831cd90d)
co.cask.tephra:tephra-hbase-compat-1.0:0.6.0 (1 constraints: 370d6920)
```

Looks like we need to go through the dependencies and remove whatever isn't used.
openinx left a comment:
Thanks @waterlx for working on this.
I skimmed the patch; sharing two points:
- The current Flink connector seems bound to the Avro format. For Iceberg's connector, we had better abstract the mapping from the Flink schema to the Iceberg schema, so that we can decouple from the underlying data format.
- The patch seems to depend on a few services inside Netflix, such as S3, Metacat, etc. It would be better to remove those.
Thanks.
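The decoupling openinx suggests could look roughly like the sketch below: convert the Flink table schema to an Iceberg schema once, up front, instead of binding the sink to Avro. This is a hedged illustration only — the class `FlinkToIcebergTypes` is hypothetical, and the type names are plain strings standing in for the real Flink `LogicalType` and Iceberg `Type` classes.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: map Flink SQL type names to Iceberg type names so the
// sink is independent of any one serialization format. Real code would map
// Flink LogicalType instances to org.apache.iceberg.types.Type instances.
class FlinkToIcebergTypes {
  private static final Map<String, String> TYPE_MAP = new LinkedHashMap<>();
  static {
    TYPE_MAP.put("STRING", "string");
    TYPE_MAP.put("INT", "int");
    TYPE_MAP.put("BIGINT", "long");
    TYPE_MAP.put("DOUBLE", "double");
    TYPE_MAP.put("BOOLEAN", "boolean");
    TYPE_MAP.put("TIMESTAMP", "timestamp");
  }

  static String toIcebergType(String flinkType) {
    String iceberg = TYPE_MAP.get(flinkType);
    if (iceberg == null) {
      // fail fast on types the connector does not know how to translate
      throw new IllegalArgumentException("Unsupported Flink type: " + flinkType);
    }
    return iceberg;
  }
}
```

With a conversion layer like this in place, writers for Avro, Parquet, or ORC could all be driven from the same Iceberg schema.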
Review threads (resolved) on:
- flink/src/main/java/org/apache/iceberg/flink/connector/sink/IcebergCommitter.java
- flink/src/main/java/org/apache/iceberg/flink/connector/sink/IcebergWriter.java
@rdblue @openinx Thanks very much for your time reviewing and commenting! I am working on some of the items and have opened a few issues to track the relatively bigger ones. Future commits will be organized by issue, making review and tracking more convenient. @rdblue mind creating a milestone for the Flink Iceberg sink?
I created the milestone for a Flink sink. Looks like most of the issues are to remove things from this PR, but we don't want to commit this before those are done. You might consider breaking up the work into issues that describe a reasonable order for adding the needed parts of this PR to Iceberg, instead of removing unnecessary parts of this PR from your branch.
Some updates here in case you are interested:
More on item 2 above: @jerryshao @chenjunjiedada @openinx @aokolnychyi @bowenli86 @stevenzwu @rdblue FYI
Why do you need to serialize We don't want
Force-pushed from 07c94aa to f102d99.
@rdblue Thanks for sharing your thoughts on why Table/BaseTable is not serializable. I totally agree. But I am currently in a dilemma: there may be a need to call high-level operations in Flink tasks, like table.newTransaction(), when trying to commit DataFiles accumulated from the streaming input. The current code limits the parallelism to 1 so that the commit won't be performed in parallel. For now it is not a blocker, because I can pass the namespace and table name via config and call loadTable() on the Catalog to build the table when needed. But that implementation is not great, as table information (namespace, table name, whether it is a HiveCatalog or HadoopCatalog) gets passed everywhere, while some of it is not needed. I am also considering passing the path as a string (db.table for a Hive catalog, or a fully qualified path for HadoopTables) instead of passing the table instance, but the purpose is still to rebuild the table instance so as to call some high-level operations.
Some progress on the Flink Iceberg sink in case you are interested:
The PR is updated with basic functions verified, but it is still not ready for detailed review. Hopefully I will have a clean version by the end of this week.
…elete count to pass the pre-check of MergeAppend
…to load it again from Catalog
…pdate GenericFlinkManifestFile due to PR apache#1030
Oops. I clicked "close and comment" instead of "comment". I think it should be closed anyway, so I'll leave it as is. Feel free to reopen if this still has outstanding work. Thanks for all the great work! |
I think it's time to close now. I will take a look at this WIP PR again and update the improvement issues here if there's anything we've missed.
Flink Iceberg sink, trying to address #567, modeled after nfflink-connector-iceberg.
A workable version which passes checkstyle, compiles with the latest master code, and has basic functions verified:
Being improved and polished.
The design doc, which contains: