Skip to content

Add Iceberg RESTSessionCatalog Implementation#13294

Merged
electrum merged 3 commits intotrinodb:masterfrom
danielcweeks:iceberg-rest-catalog
Dec 22, 2022
Merged

Add Iceberg RESTSessionCatalog Implementation#13294
electrum merged 3 commits intotrinodb:masterfrom
danielcweeks:iceberg-rest-catalog

Conversation

@danielcweeks
Copy link
Copy Markdown
Contributor

Description

This PR adds and the REST Catalog implementation for Iceberg.

Is this change a fix, improvement, new feature, refactoring, or other?

New Feature / Improvement

Is this a change to the core query engine, a connector, client library, or the SPI interfaces? (be specific)

Connector: Iceberg

How would you describe this change to a non-technical end user or system administrator?

This allows Trino to access any data source that supports the Iceberg REST Catalog spec.

Related issues, pull requests, and links

Documentation

( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.

Release notes

( ) No release notes entries required.
() Release notes entries required with the following suggested text:

# Section
* Fix some things. ({issue}`issuenumber`)

@cla-bot cla-bot bot added the cla-signed label Jul 21, 2022
Copy link
Copy Markdown
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please add connector and product test. You can refer to #11772 as the example.

@danielcweeks danielcweeks force-pushed the iceberg-rest-catalog branch from 42b060b to 8fe1db8 Compare August 1, 2022 23:37
@nastra nastra force-pushed the iceberg-rest-catalog branch from 88be805 to 692d065 Compare August 3, 2022 14:21
@nastra nastra force-pushed the iceberg-rest-catalog branch 6 times, most recently from 194b117 to 7b02c1b Compare August 5, 2022 13:31
@nastra nastra closed this Aug 6, 2022
@nastra nastra reopened this Aug 6, 2022
@nastra nastra closed this Aug 9, 2022
@nastra nastra reopened this Aug 9, 2022
@nastra nastra force-pushed the iceberg-rest-catalog branch 2 times, most recently from a9f5498 to 1369e25 Compare August 10, 2022 15:45
@danielcweeks danielcweeks marked this pull request as ready for review August 10, 2022 21:41
@danielcweeks danielcweeks requested a review from ebyhr August 10, 2022 21:41
Copy link
Copy Markdown
Member

@alexjo2144 alexjo2144 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, looks pretty good. Mostly style/nit pick comments

@alexjo2144
Copy link
Copy Markdown
Member

Not having View or MV support is pretty limiting. Is that a restriction from the backend, or can we implement them as follow up?

@danielcweeks
Copy link
Copy Markdown
Contributor Author

Not having View or MV support is pretty limiting. Is that a restriction from the backend, or can we implement them as follow up?

We definitely want to add view support as a follow-up. However, the iceberg view spec is still in review, so we should wait until that is finalized.

@nastra nastra force-pushed the iceberg-rest-catalog branch 3 times, most recently from e9f524b to f8f7d32 Compare August 15, 2022 10:32
Copy link
Copy Markdown
Member

@ebyhr ebyhr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Lett initial comments. Is it possible to add a test for iceberg.rest.token and iceberg.rest.credential?

Also, could you update documentation iceberg.rst?

@nastra nastra force-pushed the iceberg-rest-catalog branch from f7b1f5d to fe10a21 Compare August 16, 2022 10:26
@github-actions github-actions bot added the docs label Aug 16, 2022
@danielcweeks
Copy link
Copy Markdown
Contributor Author

Lett initial comments. Is it possible to add a test for iceberg.rest.token and iceberg.rest.credential?

@ebyhr It may be possible to add tests to validate that the token and credential values are passed through, but would require a lot of additional infrastructure to support actually testing the OAuth2 flows. The logic for all of that is already handled by the underlying Iceberg library and there are extensive tests to validate the flow behaviors.

@nastra nastra force-pushed the iceberg-rest-catalog branch 2 times, most recently from 8acf3b7 to fdc6f6a Compare October 27, 2022 07:11
Copy link
Copy Markdown
Member

@electrum electrum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally looks good

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need this? If we're going to enable preview features, seems like we should make that decision for all of Trino and set it at the top level.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't recall whether we added that or whether this was a leftover after a rebase. I've removed it

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed, I like the iceberg.rest-catalog prefix

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this required? Can we file an issue in Iceberg to make this not depend on Hadoop? We're trying to eliminate Hadoop as a required dependency for Iceberg and other connectors.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we were asked to remove the FileIO implementations other than HadoopFileIO, which requires a Configuration. Should we add the iceberg-aws dependency back and make this optional?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is required unfortunately for ResolvingFileIO here

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just to clarify the other responses, the REST Catalog does not have any runtime Hadoop dependencies. However, Trino performs IO operations using TrinoFileSystem (via HadoopFileIO), which does requires a Hadoop configuration. I don't think we want to add scope to this PR by trying to change that path, but whenever alternatives are available, this won't be a blocker.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need a new test group? What is different between ICEBERG and ICEBERG_REST?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we mainly introduced a new test group that allows running only a subset of tests, otherwise we'd have to make more changes to existing product tests to make them work across different environments.
In the long run (and in a follow-up PR) however I think we could aim for removing that particular test group if that's desired.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is source generic in the REST catalog? How might it be used?

While it might be be useful for informational or auditing purposes, we need to ensure that it won't be used for access control, returning different information depending on the value, or that caching results with different source values is ok. Same concerns as the user.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Passing the user is impersonation, which has various implications. We need a config for this which is disabled by default. When disabled, we need a config to specify the shared system user to pass here.

If we're going to support impersonation, I think we need some guarantees in the REST catalog specification that will not prevent caching:

  • The user is only used for access control, not auditing.
  • Name mappings and table data are the same for all users. In other words, loadTable() will return the same information for all users, or will be denied. It won't return different information depending on the user.

@jebnix
Copy link
Copy Markdown

jebnix commented Nov 24, 2022

I think that it will be good to add some explanation about how the REST Catalog works. 3-4 sentences would help people very much with adopting this. General explanations of the implications of using this would also be a good addition (e.g. does the usage of this enables us to deploy Trino with Iceberg connector without Hive Metastore?).

Otherwise, seems awesome

@code-magician323
Copy link
Copy Markdown

code-magician323 commented Nov 25, 2022

@jebnix I think the current docs are OK. @danielcweeks correct me if I'm wrong but there is no need for Hive Metastore when configuring Iceberg with the REST Catalog, but it is required to use some Catalog behind the scenes (e.g. Nessie / JDBC)

@danielcweeks
Copy link
Copy Markdown
Contributor Author

@jebnix I think the current docs are OK. @danielcweeks correct me if I'm wrong but there is no need for Hive Metastore when configuring Iceberg with the REST Catalog, but it is required to use some Catalog behind the scenes (e.g. Nessie / JDBC)

That's correct. There's no dependency on Hive, but you need a server side implementation of the REST spec, which can just be a proxied version of one of the existing Iceberg catalog implementations.

@danielcweeks
Copy link
Copy Markdown
Contributor Author

danielcweeks commented Nov 27, 2022

@electrum I've updated to add the configuration for session info which defaults to NONE and there is a PR to clarify the REST behavior. Please take another look when you can.

@jebnix
Copy link
Copy Markdown

jebnix commented Dec 15, 2022

@danielcweeks @nastra Do you think this can get merged soon? Trino still cannot use Iceberg without a Hive Metastore. That's the most important feature of the Iceberg Connector by far.

@nastra
Copy link
Copy Markdown
Contributor

nastra commented Dec 15, 2022

I believe nothing is left to be done from our side, it is mainly waiting for a final approval

@code-magician323
Copy link
Copy Markdown

@bitsondatadev Why doesn't this get merged?

Co-authored-by: Eduard Tudenhoefner <etudenhoefner@gmail.com>
Co-authored-by: bryanck <bryanck@gmail.com>
@bitsondatadev
Copy link
Copy Markdown
Member

@bitsondatadev Why doesn't this get merged?

I'm gonna tag in @electrum and @findepi here. I'm seeing activity in general start to slow down for development as we approach the holidays so I wouldn't hold your breathe for too much happening until after new years.

Just like Santa 🎅 I'm making a list but my list contains PRs I'll be checking twice when I get back from break.

@bitsondatadev
Copy link
Copy Markdown
Member

@bitsondatadev Why doesn't this get merged?

I'm gonna tag in @electrum and @findepi here. I'm seeing activity in general start to slow down for development as we approach the holidays so I wouldn't hold your breathe for too much happening until after new years.

Just like Santa 🎅 I'm making a list but my list contains PRs I'll be checking twice when I get back from break.

I stand corrected! Happy holidays!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.