Conversation

@JingsongLi commented Sep 25, 2020

This is a subtask of #1293.
It also addresses the remaining review comments from #1346.


  // produce another timestamp
  -Thread.sleep(10);
  +waitUntilAfter(10);
Contributor
This should be called with timestampMillis because you want to wait until it is strictly after the current snapshot's timestamp.
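
A minimal sketch of the suggested fix, assuming the surrounding test holds the table in a `table` variable and `waitUntilAfter` is the helper from the diff above:

  // capture the current snapshot's timestamp, then wait until the wall
  // clock is strictly past it so the next snapshot gets a new timestamp
  long timestampMillis = table.currentSnapshot().timestampMillis();
  waitUntilAfter(timestampMillis);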

Contributor Author

Yes, you are right.

rdblue commented Sep 25, 2020

Mostly looks good. Just a couple of minor issues.

@JingsongLi (Contributor Author)

> Let's take a look at ScanOptions in the next PR then. I would prefer to keep user-facing APIs simple, rather than leaking a SQL concern (options come from the WITH clause) to users (who would otherwise need to use two builders). Since SQL will most likely use the fromProperties method, it may make sense to use a single builder, add withProperties, and pass properties from SQL as a map.

Got it. I can change ScanOptions to ScanContext, making it an internal helper class like TableScanContext, and then expose all of the setters on FlinkSource.Builder.
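
A rough sketch of that shape; every name here is illustrative except TableScanContext, which the internal holder mirrors:

  import java.util.HashMap;
  import java.util.Map;

  // internal holder for scan settings, analogous to TableScanContext
  class ScanContext {
    Long snapshotId;
    Long asOfTimestamp;
    Map<String, String> properties = new HashMap<>();
  }

  // FlinkSource.Builder exposes the setters; users never touch ScanContext
  class Builder {
    private final ScanContext context = new ScanContext();

    Builder snapshotId(Long newSnapshotId) {
      context.snapshotId = newSnapshotId;
      return this;
    }

    Builder asOfTimestamp(Long newTimestampMillis) {
      context.asOfTimestamp = newTimestampMillis;
      return this;
    }

    // properties from SQL (the WITH clause) arrive as a single map
    Builder properties(Map<String, String> newProperties) {
      context.properties.putAll(newProperties);
      return this;
    }
  }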


public Builder filters(List<Expression> newFilters) {
  this.filterExpressions = newFilters;
  return this;
}

public Builder hadoopConf(Configuration newConf) {
Contributor

We try to avoid using Hadoop classes in Iceberg APIs because they are hard to remove. Injecting a Hadoop Configuration is currently done in one place: to instantiate a catalog that requires it.

The catalog creates tables and tables are associated with a FileIO, so the configuration is passed down through that chain. The table's configuration should be used for any table configuration.

MR also has a Hadoop Configuration, but it's required by that API, so we can't remove it there. But we still prefer using the table's Configuration when it is needed for components like HadoopFileIO.
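
For example, a sketch of reading the Configuration through that chain; that HadoopFileIO exposes its Configuration via conf() is my assumption here:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.hadoop.HadoopFileIO;
  import org.apache.iceberg.io.FileIO;

  static Configuration confFromTable(Table table) {
    FileIO io = table.io();
    if (io instanceof HadoopFileIO) {
      // the Configuration travels with the table's FileIO, not through the Flink API
      return ((HadoopFileIO) io).conf();
    }
    throw new IllegalArgumentException("Table's FileIO does not carry a Hadoop Configuration");
  }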

Contributor Author

+1 for avoiding use of the Hadoop Configuration.

Contributor Author

The reason the Hadoop conf needs to be passed right now:

  • The JobManager needs the splits of the scan, so it needs to call Table.newScan.
  • That means the JobManager needs a Table object. Where can a table come from? TableLoader -> from a Catalog or from HadoopTables.
  • Creating a Catalog or HadoopTables requires a Hadoop Configuration.

Maybe we can pass an Iceberg object through this chain instead (FileIO doesn't look like a good fit for creating a catalog)?
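
For reference, a sketch of that chain with the Flink module's TableLoader (the table location is made up, and the exact factory-method signature is an assumption, not taken from this PR):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.flink.TableLoader;

  static Table loadForPlanning() {
    Configuration hadoopConf = new Configuration();  // the dependency in question
    TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl", hadoopConf);
    loader.open();
    return loader.loadTable();  // the JobManager can then call table.newScan()
  }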

Contributor

Maybe CatalogLoader should serialize its own Configuration? It would make sense to pass one when creating a CatalogLoader for HadoopCatalog or HiveCatalog because those need a Configuration to create the catalog.
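
A hedged sketch of that idea using Iceberg's SerializableConfiguration wrapper; the loader class and its constructor here are illustrative, not this PR's code:

  import java.io.Serializable;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.catalog.Catalog;
  import org.apache.iceberg.hadoop.HadoopCatalog;
  import org.apache.iceberg.hadoop.SerializableConfiguration;

  class HadoopCatalogLoader implements Serializable {
    private final SerializableConfiguration hadoopConf;  // serialized along with the loader
    private final String warehouseLocation;

    HadoopCatalogLoader(Configuration conf, String warehouseLocation) {
      this.hadoopConf = new SerializableConfiguration(conf);
      this.warehouseLocation = warehouseLocation;
    }

    Catalog loadCatalog() {
      // rebuild the catalog on the task side from the shipped Configuration
      return new HadoopCatalog(hadoopConf.get(), warehouseLocation);
    }
  }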

Contributor

Or maybe this could use the approach from the FlinkCatalogFactory:

  public static Configuration clusterHadoopConf() {
    return HadoopUtils.getHadoopConfiguration(GlobalConfiguration.loadConfiguration());
  }

Using clusterHadoopConf internally here would avoid the need to expose this in the API and we could add Configuration to the loader later.

Contributor Author

Using clusterHadoopConf may not be flexible enough.
I'm +1 for having CatalogLoader serialize its own Configuration. I will create a follow-up PR for the source and sink.

@rdblue merged commit 9d9f677 into apache:master Sep 29, 2020

rdblue commented Sep 29, 2020

Looks good, @JingsongLi! I'd like to remove the Hadoop Configuration from the API, but that was an existing problem so there's no need to block this PR. Thank you!

@JingsongLi deleted the flink_reader_sql branch November 5, 2020 09:39
@rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020