Conversation

@JingsongLi commented Sep 25, 2020

This is a subtask of #1293.
It also addresses the remaining review comments from #1346.


  // produce another timestamp
  -Thread.sleep(10);
  +waitUntilAfter(10);
Contributor
This should be called with timestampMillis because you want to wait until it is strictly after the current snapshot's timestamp.
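
A minimal sketch of the suggested fix, assuming the surrounding test holds the table in a `table` variable and `waitUntilAfter` is the helper from the diff above:

  // capture the current snapshot's timestamp, then wait until the wall
  // clock is strictly past it so the next snapshot gets a new timestamp
  long timestampMillis = table.currentSnapshot().timestampMillis();
  waitUntilAfter(timestampMillis);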

Contributor Author

Yes, you are right.

rdblue commented Sep 25, 2020

Mostly looks good. Just a couple of minor issues.

@JingsongLi (Contributor Author)

> Let's take a look at ScanOptions in the next PR then. I would prefer to keep user-facing APIs simple, rather than leaking a SQL concern (options come from the WITH clause) to users (who would otherwise need to use two builders). Since SQL will most likely use the fromProperties method, it may make sense to use a single builder, add withProperties, and pass properties from SQL as a map.

Got it. I can change ScanOptions to ScanContext, making it an internal helper class like TableScanContext, and then expose all of the setters on FlinkSource.Builder.
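
A rough sketch of that shape; every name here is illustrative except TableScanContext, which the internal holder mirrors:

  import java.util.HashMap;
  import java.util.Map;

  // internal holder for scan settings, analogous to TableScanContext
  class ScanContext {
    Long snapshotId;
    Long asOfTimestamp;
    Map<String, String> properties = new HashMap<>();
  }

  // FlinkSource.Builder exposes the setters; users never touch ScanContext
  class Builder {
    private final ScanContext context = new ScanContext();

    Builder snapshotId(Long newSnapshotId) {
      context.snapshotId = newSnapshotId;
      return this;
    }

    Builder asOfTimestamp(Long newTimestampMillis) {
      context.asOfTimestamp = newTimestampMillis;
      return this;
    }

    // properties from SQL (the WITH clause) arrive as a single map
    Builder properties(Map<String, String> newProperties) {
      context.properties.putAll(newProperties);
      return this;
    }
  }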


public Builder filters(List<Expression> newFilters) {
  this.filterExpressions = newFilters;
  return this;
}

public Builder hadoopConf(Configuration newConf) {
Contributor

We try to avoid using Hadoop classes in Iceberg APIs because they are hard to remove. Injecting a Hadoop Configuration is currently done in one place: to instantiate a catalog that requires it.

The catalog creates tables and tables are associated with a FileIO, so the configuration is passed down through that chain. The table's configuration should be used for any table configuration.

MR also has a Hadoop Configuration, but it's required by that API, so we can't remove it there. But we still prefer using the table's Configuration when it is needed for components like HadoopFileIO.
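
For example, a sketch of reading the Configuration through that chain; that HadoopFileIO exposes its Configuration via conf() is my assumption here:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.hadoop.HadoopFileIO;
  import org.apache.iceberg.io.FileIO;

  static Configuration confFromTable(Table table) {
    FileIO io = table.io();
    if (io instanceof HadoopFileIO) {
      // the Configuration travels with the table's FileIO, not through the Flink API
      return ((HadoopFileIO) io).conf();
    }
    throw new IllegalArgumentException("Table's FileIO does not carry a Hadoop Configuration");
  }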

Contributor Author

+1 for avoiding use of the Hadoop Configuration.

Contributor Author

The reason the Hadoop conf needs to be passed right now:

  • The JobManager needs the splits of the scan, so it needs to call Table.newScan.
  • That means the JobManager needs a Table object. Where can a table come from? TableLoader -> from a Catalog or from HadoopTables.
  • Creating a Catalog or HadoopTables requires a Hadoop Configuration.

Maybe we can pass an Iceberg object through this chain instead (FileIO doesn't look like a good fit for creating a catalog)?
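
For reference, a sketch of that chain with the Flink module's TableLoader (the table location is made up, and the exact factory-method signature is an assumption, not taken from this PR):

  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.Table;
  import org.apache.iceberg.flink.TableLoader;

  static Table loadForPlanning() {
    Configuration hadoopConf = new Configuration();  // the dependency in question
    TableLoader loader = TableLoader.fromHadoopTable("hdfs://nn:8020/warehouse/db/tbl", hadoopConf);
    loader.open();
    return loader.loadTable();  // the JobManager can then call table.newScan()
  }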

Contributor

Maybe CatalogLoader should serialize its own Configuration? It would make sense to pass one when creating a CatalogLoader for HadoopCatalog or HiveCatalog because those need a Configuration to create the catalog.
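
A hedged sketch of that idea using Iceberg's SerializableConfiguration wrapper; the loader class and its constructor here are illustrative, not this PR's code:

  import java.io.Serializable;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.iceberg.catalog.Catalog;
  import org.apache.iceberg.hadoop.HadoopCatalog;
  import org.apache.iceberg.hadoop.SerializableConfiguration;

  class HadoopCatalogLoader implements Serializable {
    private final SerializableConfiguration hadoopConf;  // serialized along with the loader
    private final String warehouseLocation;

    HadoopCatalogLoader(Configuration conf, String warehouseLocation) {
      this.hadoopConf = new SerializableConfiguration(conf);
      this.warehouseLocation = warehouseLocation;
    }

    Catalog loadCatalog() {
      // rebuild the catalog on the task side from the shipped Configuration
      return new HadoopCatalog(hadoopConf.get(), warehouseLocation);
    }
  }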

Contributor

Or maybe this could use the approach from the FlinkCatalogFactory:

  public static Configuration clusterHadoopConf() {
    return HadoopUtils.getHadoopConfiguration(GlobalConfiguration.loadConfiguration());
  }

Using clusterHadoopConf internally here would avoid the need to expose this in the API and we could add Configuration to the loader later.

Contributor Author

Using clusterHadoopConf may not be flexible enough.
I'm +1 for having CatalogLoader serialize its own Configuration. I will create a follow-up PR for the source and sink.

@rdblue merged commit 9d9f677 into apache:master Sep 29, 2020

rdblue commented Sep 29, 2020

Looks good, @JingsongLi! I'd like to remove the Hadoop Configuration from the API, but that was an existing problem so there's no need to block this PR. Thank you!

@JingsongLi deleted the flink_reader_sql branch November 5, 2020 09:39
@rdblue added this to the Java 0.10.0 Release milestone Nov 16, 2020