
Conversation

@JingsongLi (Contributor) commented Aug 13, 2020

Fixes #1303
Unlike Spark catalog tables (where the Table object is only required on the client/driver side), Flink needs to obtain the Table object in the Job Manager and in Tasks.

So we can introduce a CatalogLoader for the reader and writer; users can define a custom catalog loader in FlinkCatalogFactory.

public interface CatalogLoader extends Serializable {
  Catalog loadCatalog(Configuration hadoopConf);
}
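
For illustration, here is a minimal sketch of what a user-defined loader could look like. The class name, the warehouse-path argument, and the choice of HadoopCatalog are assumptions for the example, not part of this PR:

import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.catalog.Catalog;
import org.apache.iceberg.hadoop.HadoopCatalog;

// Hypothetical user-defined loader; class name and warehouse argument are
// illustrative only. The loader is serialized with the job and re-creates
// the catalog on the Job Manager / Task side.
public class WarehouseCatalogLoader implements CatalogLoader {

  private final String warehouseLocation;

  public WarehouseCatalogLoader(String warehouseLocation) {
    this.warehouseLocation = warehouseLocation;
  }

  @Override
  public Catalog loadCatalog(Configuration hadoopConf) {
    return new HadoopCatalog(hadoopConf, warehouseLocation);
  }
}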

To support Hadoop tables based on a location, we also introduce a TableLoader that covers both catalog tables and Hadoop location-based tables.
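
A rough sketch of the TableLoader shape, assuming the same serializable-loader pattern; the loadTable signature and the static factory shown here are illustrative, and the merged API may differ:

import java.io.Serializable;
import org.apache.hadoop.conf.Configuration;
import org.apache.iceberg.Table;
import org.apache.iceberg.hadoop.HadoopTables;

// Sketch only: a serializable loader that can produce a Table on the
// Job Manager / Task side, either from a catalog or from a Hadoop location.
public interface TableLoader extends Serializable {
  Table loadTable(Configuration hadoopConf);

  // Assumed factory for a Hadoop location-based table, mirroring the
  // TableLoader.fromHadoopTable name mentioned later in this thread.
  static TableLoader fromHadoopTable(String location) {
    return hadoopConf -> new HadoopTables(hadoopConf).load(location);
  }
}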

Q: Can/should we move CatalogLoader and TableLoader into an Iceberg common module?

Comment on lines +75 to +76
private final CatalogLoader catalogLoader;
private final Configuration hadoopConf;
Member commented:

These two don't seem to need to be private members, because I don't see them accessed anywhere except the constructor.

this.originalCatalog = icebergCatalog;
this.icebergCatalog = cacheEnabled ? CachingCatalog.wrap(icebergCatalog) : icebergCatalog;
this.hadoopConf = hadoopConf;
this.originalCatalog = catalogLoader.loadCatalog(hadoopConf);
Member commented:

Is it possible to track only one Catalog in this FlinkCatalog class? For example, we could keep only icebergCatalog as a member, which would make the code much easier to follow (having two members here confused me at times). When closing the catalog, we could make CachingCatalog implement the Closeable interface.

Contributor Author (@JingsongLi) commented:

CachingCatalog is just a wrapper around the original catalog, so I think it is better not to make it implement the Closeable interface.
One way to solve this is to keep a Closeable as a class member instead of the original Catalog:
closeable = originalCatalog instanceof Closeable ? (Closeable) originalCatalog : null;
What do you think?
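
For illustration, a minimal sketch of that idea as a small holder class; the class name, constructor shape, and accessor are assumptions for the example, not the actual FlinkCatalog code:

import java.io.Closeable;
import java.io.IOException;
import org.apache.iceberg.CachingCatalog;
import org.apache.iceberg.catalog.Catalog;

// Sketch only: track the (possibly cache-wrapped) catalog plus a Closeable
// handle to the original catalog, instead of keeping two Catalog members.
public class CatalogHandle implements Closeable {

  private final Catalog icebergCatalog;  // CachingCatalog wrapper when caching is enabled
  private final Closeable closeable;     // original catalog, if it is closeable

  public CatalogHandle(Catalog originalCatalog, boolean cacheEnabled) {
    this.icebergCatalog = cacheEnabled ? CachingCatalog.wrap(originalCatalog) : originalCatalog;
    this.closeable = originalCatalog instanceof Closeable ? (Closeable) originalCatalog : null;
  }

  public Catalog catalog() {
    return icebergCatalog;
  }

  @Override
  public void close() throws IOException {
    if (closeable != null) {
      closeable.close();
    }
  }
}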

Member commented:

Makes sense.

return new HadoopTableLoader(location);
}

class HadoopTableLoader implements TableLoader {
Contributor commented:

Do you really need this? Isn't this already covered by the CatalogTableLoader using Hadoop?

Contributor Author (@JingsongLi) commented:

I saw #1306; I don't know if I misunderstood something.
It seems that a simple path makes this easier for users to use.

@edgarRd (Contributor) commented Aug 20, 2020:

Right, I think part of the discussion is that providing a table via this API does not guarantee atomic commits on table changes. That's why we have HiveCatalog for tables backed by systems like S3, while HadoopCatalog is for HDFS and does not need HMS coordination, since the filesystem itself has atomic renames.

I'm not sure which case this specific Flink implementation will use, but as mentioned in #1306 (comment), it seems dangerous to provide tables loaded with HadoopTables if there's no awareness of the underlying limitations.

@JingsongLi (Contributor Author) commented Aug 20, 2020:

Yes, there are comments in HadoopCatalog: the HadoopCatalog requires that the underlying file system supports atomic rename, and the same applies to HadoopTables.
But I think if users use HadoopTables, HadoopCatalog, or TableLoader.fromHadoopTable, they should know that a file system with atomic rename is required, which rules out S3 and the like.

#1306 should be discussing how to support locations in the SQL layer.
For Flink, there are two kinds of users: DataStream users and SQL users.

So here are my thoughts:

  • Just like with Spark and MapReduce, a Flink DataStream programmer can simply specify a Hadoop location path for reading and writing, or use an Iceberg Catalog (see the usage sketch after this list).
  • But a Flink SQL user must specify an Iceberg Catalog to load tables.
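
To make the DataStream option concrete, here is a hedged usage fragment. TableLoader.fromHadoopTable is named earlier in this thread; the Configuration-based loadTable call and WarehouseCatalogLoader refer to the hypothetical sketches above, not the merged API.

// Fragment only; relies on the hypothetical sketches earlier in this thread.
Configuration hadoopConf = new Configuration();

// DataStream option 1: a plain Hadoop location path
// (requires a file system with atomic rename, so not S3).
TableLoader loader = TableLoader.fromHadoopTable("hdfs://namenode:8020/warehouse/db/table");
Table table = loader.loadTable(hadoopConf);

// DataStream option 2: load through an Iceberg catalog instead, by supplying a
// serializable CatalogLoader (e.g. the hypothetical WarehouseCatalogLoader above)
// to the Flink source/sink, which resolves the table on the cluster side.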

@rdblue rdblue merged commit e2d0d73 into apache:master Aug 20, 2020
@rdblue (Contributor) commented Aug 20, 2020:

Merged! Thanks @JingsongLi!

I think this is a reasonable way to load tables in tasks that need them. We should document how to supply your own catalog, though; I think it is common for people to override it.

@JingsongLi JingsongLi deleted the loader branch November 5, 2020 09:42
Linked issue: Flink: Introduce IcebergCatalogLoader for loading table at runtime