Skip to content

Reintroduce HiveTableRedirectionsProvider interface#19161

Open
architgyl wants to merge 13 commits intotrinodb:masterfrom
architgyl:reintroduceHiveTableRedirectionsProvider
Open

Reintroduce HiveTableRedirectionsProvider interface#19161
architgyl wants to merge 13 commits intotrinodb:masterfrom
architgyl:reintroduceHiveTableRedirectionsProvider

Conversation

@architgyl
Copy link
Copy Markdown

@architgyl architgyl commented Sep 26, 2023

Description

The HiveTableRedirectionsProvider interface has been reintroduced with some modifications to enhance its usability in addressing a broader range of use cases. Initially, this interface provided a valuable entry point for administrators to configure redirections from Hive to various table formats, including but not limited to open-source formats like Hudi, Iceberg, and Delta. However, the enhanced interface aims to provide even greater flexibility:

  • empower administrators to configure redirections to a wide range of table formats
  • leverage external metadata stores for redirection control
  • perform redirections even when tables are absent in the Hive Metastore

Reference PR which removed HiveTableRedirectionsProvider - #12252

Testing

./mvnw clean install -DskipTests is successful.

[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  07:43 min
[INFO] Finished at: 2023-09-21T13:52:41-07:00
[INFO] ------------------------------------------------------------------------

Also added new unit tests for the following scenarios:

  • table exists but no redirection applied
  • table exists and redirection applied
  • table doesn't exist and no redirection applied
  • table doesn't exist and redirection applied

Release notes

(X) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
( ) Release notes are required, with the following suggested text:

@cla-bot
Copy link
Copy Markdown

cla-bot bot commented Sep 26, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@github-actions github-actions bot added tests:hive hive Hive connector labels Sep 26, 2023
@cla-bot
Copy link
Copy Markdown

cla-bot bot commented Sep 26, 2023

Thank you for your pull request and welcome to the Trino community. We require contributors to sign our Contributor License Agreement, and we don't seem to have you on file. Continue to work with us on the review and improvements in this PR, and submit the signed CLA to cla@trino.io. Photos, scans, or digitally-signed PDF files are all suitable. Processing may take a few days. The CLA needs to be on file before we merge your changes. For more information, see https://github.com/trinodb/cla

@cla-bot cla-bot bot added the cla-signed label Sep 26, 2023
@xkrogen
Copy link
Copy Markdown
Member

xkrogen commented Sep 26, 2023

For a little more context on why this is needed ...

HiveTableRedirectionsProvider gave a very useful entrypoint for administrators to configure redirections from Hive to any other table format. The default/built-in behavior worked well if you used a standard open-source table format (Hudi/Iceberg/Delta), but by having a provider interface, it was also possible for an administrator to configure a seamless redirection to any other table format, e.g. a proprietary or experimental one. It also made it possible to use an external metadata store to control the redirection logic. For example an administrator may want to leverage some external metadata store to control whether the redirection is used, rather than relying on the specific table properties that are checked in the default implementation. So this is the motivation for reintroducing this provider interface.

However we have reintroduced it here with a slightly different interface, specifically to be able to tackle the situation where we may want to perform redirection even though the table no longer exists in HMS. The previous implementation required the table to exist in HMS and forced no-redirection if it did not, limiting the flexibility of a provider implementation to perform redirection regardless of the state of HMS. This can be useful, for example, if an external metadata store controls the redirection.

@xkrogen
Copy link
Copy Markdown
Member

xkrogen commented Sep 27, 2023

The new tests seem to be failing, can you take a look at them @architgyl ?

@architgyl architgyl marked this pull request as draft September 28, 2023 19:00
@architgyl architgyl marked this pull request as ready for review September 29, 2023 16:46
@architgyl
Copy link
Copy Markdown
Author

The new tests seem to be failing, can you take a look at them @architgyl ?

Fixed the test cases.

Copy link
Copy Markdown
Member

@xkrogen xkrogen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking good to me except for the minor comments and confirming about the importance of the redirect short-circuit behavior.

cc @findepi @findinpath @ebyhr @alexjo2144 since you have been involved in previous PRs related to HiveTableRedirectionsProvider (e.g. #12252) and the Hive redirection logic.

Comment on lines -3847 to -3854
Optional<String> icebergCatalogName = getIcebergCatalogName(session);
Optional<String> deltaLakeCatalogName = getDeltaLakeCatalogName(session);
Optional<String> hudiCatalogName = getHudiCatalogName(session);

if (icebergCatalogName.isEmpty() && deltaLakeCatalogName.isEmpty() && hudiCatalogName.isEmpty()) {
return Optional.empty();
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With this we lost the ability to short-circuit the redirection lookup when no redirect catalogs are configured, when using the default behavior. That was added recently in 4b4bb6f (#18341)

@findepi was that behavior important? I see it was added as part of a broader PR

I think we could preserve this behavior by adding a method to the HiveTableRedirectionsProvider interface like boolean mayRedirect() which says if the provider will possibly redirect -- then we could leverage it here for short-circuiting. But not sure if this is overkill.

@findinpath
Copy link
Copy Markdown
Contributor

@architgyl , @xkrogen please do rework a bit the description of the PR.
The testing part is obvious from the CI run.
However, the reasoning behind the PR is partly mentioned in a comment in the PR.
It is not clear to me yet whether your PR is thought to address a more generic use case or rather a custom use case specific for your organisation.
Before going further with the code details of the PR , make it clear that this is a generic enough use case.

Specifying #12252 PR for related PRs would also be a helpful context for eventual reviewers.

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 30, 2023

s, including open-source formats like Hudi, Iceberg, and Delta

This remains supported.
See hive.delta-lake-catalog-name, hive.iceberg-catalog-name hive.hudi-catalog-name configs

empower administrators to configure redirections to a wide range of table formats
leverage external metadata stores for redirection control
perform redirections even when tables are absent in the Hive Metastore

Trino is design to provide best out of the box experience and we are committed to evolving it. And this requires removing functionalities that are not used. Trino is not a library. Hive connector is not a library and should not be viewed as such.
If someone maintains a fork of the Hive connector and implements some non-standard functionality on top of it, they can also maintain things like HiveTableRedirectionsProvider.

@xkrogen
Copy link
Copy Markdown
Member

xkrogen commented Oct 2, 2023

including open-source formats like Hudi, Iceberg, and Delta

This remains supported. See hive.delta-lake-catalog-name, hive.iceberg-catalog-name hive.hudi-catalog-name configs

Agreed -- @architgyl can we phrase this as "including but not limited to open-source formats like Hudi, Iceberg, and Delta"? Hopefully this will better convey the intent.

Trino is design to provide best out of the box experience and we are committed to evolving it. And this requires removing functionalities that are not used. Trino is not a library. Hive connector is not a library and should not be viewed as such.
If someone maintains a fork of the Hive connector and implements some non-standard functionality on top of it, they can also maintain things like HiveTableRedirectionsProvider.

Thanks for the perspective here @findepi ! I am trying to square it with what I see with regards to other similar interfaces. Please excuse me if my questions are naive as my involvement with Trino is relatively new. In #2603 it was made possible for administrators to specify additional modules when configuring a HiveConnectorFactory. The discussion there is sparse but my interpretation was that the intention was to allow for trino-hive to be used as a library, then configured/customized via bindings, to be able to build a Hive connector with some custom behavior without needing to fork it, which is exactly our goal here. This seems to be backed up by the fact that, since then, 5 optional bindings have been added to HiveModule (e.g. here) that allow for customizing various portions of the behavior. For example HiveMaterializedViewMetadataFactory only has one implementation in the codebase which points to a no-op implementation NoneHiveMaterializedViewMetadata, so IIUC, the only way to leverage that functionality is to extend the Hive connector with an additional module that binds a custom implementation to this interface. Hence we are trying to do the same thing here.

My expectation is that, as enterprises are increasingly moving away from Hive both as a table format and as a metadata catalog, there will be an increased demand for the ability to provide a smooth transition onto other formats/catalogs with some customized logic controlling the routing. HiveTableRedirectionsProvider allows for exactly this.

cc @phd3 @losipiuk @electrum for perspectives on this -- since you had some involvement with the original PR that added the HiveTableRedirectionsProvider interface (#9870) and/or the custom module support (#2603).

@electrum
Copy link
Copy Markdown
Member

electrum commented Oct 3, 2023

We have to look at these on a case by case basis, but this seems like a clear win to me, as the interface moves the logic out of the ~4000 line HiveMetadata class and actually results in slightly less code overall. The end result is easier to read and understand.

@architgyl architgyl requested a review from electrum October 10, 2023 17:26
@architgyl
Copy link
Copy Markdown
Author

Gentle reminder if someone can help take a look at the PR.

@xkrogen
Copy link
Copy Markdown
Member

xkrogen commented Oct 20, 2023

cc @findepi @findinpath @ebyhr @alexjo2144 @phd3 @losipiuk @electrum - anyone able/willing to review this?

@github-actions
Copy link
Copy Markdown

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Jan 10, 2024
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Jan 11, 2024

👋 @findepi @findinpath @ebyhr @alexjo2144 @phd3 @losipiuk @electrum could someone please help out here..

@github-actions github-actions bot removed the stale label Jan 12, 2024
@github-actions
Copy link
Copy Markdown

github-actions bot commented Feb 2, 2024

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Feb 2, 2024
@github-actions
Copy link
Copy Markdown

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Feb 28, 2024
@electrum electrum reopened this Feb 28, 2024
@github-actions github-actions bot removed the stale label Feb 29, 2024
@github-actions
Copy link
Copy Markdown

This pull request has gone a while without any activity. Tagging the Trino developer relations team: @bitsondatadev @colebow @mosabua

@github-actions github-actions bot added the stale label Mar 21, 2024
@github-actions
Copy link
Copy Markdown

Closing this pull request, as it has been stale for six weeks. Feel free to re-open at any time.

@github-actions github-actions bot closed this Apr 11, 2024
@mosabua
Copy link
Copy Markdown
Member

mosabua commented Apr 11, 2024

Reopening again and adding stale-ignore label so that @electrum can follow up

@mosabua mosabua reopened this Apr 11, 2024
@mosabua mosabua added stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed. and removed stale labels Apr 11, 2024
@ebyhr ebyhr removed their request for review September 4, 2024 00:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cla-signed hive Hive connector stale-ignore Use this label on PRs that should be ignored by the stale bot so they are not flagged or closed.

Development

Successfully merging this pull request may close these issues.

6 participants