HIVE-26227: Add support of catalog related statements for Hive ql #3288

Open
wants to merge 8 commits into
base: master
from

Conversation

wecharyu
Contributor

What changes were proposed in this pull request?

Implement the catalog-related DDL statements; see HIVE-26227 for the list of statements.

Why are the changes needed?

To support basic catalog DDL operations through Hive QL.

Does this PR introduce any user-facing change?

Yes. These new statements should be added to the DDL documentation.

How was this patch tested?

Added a qtest, catalog.q, which can be run with the command:

mvn test -Dtest=TestMiniLlapLocalCliDriver -Dqfile=catalog.q

@wecharyu
Contributor Author

@pvary @deniskuzZ: Could you also review this PR?

@boneanxs

Great work! It's much easier for us to manage catalogs with DDL.

,catalog.return_ratio
,catalog.return_rank
,catalog.currency_rank
,`catalog`.item
Contributor

This is a backward incompatible change. Could we make the catalog a non-reserved keyword?

Contributor Author

Makes sense, done.
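
To illustrate the compatibility concern in this thread: a reserved keyword cannot appear as a bare identifier, so an existing query that references a column or alias named catalog would stop parsing unless the name is quoted with backticks. The sketch below is a toy model in Python, not Hive's ANTLR grammar; the keyword set and the function name are hypothetical.

```python
# Toy model of reserved vs. non-reserved keywords; NOT Hive's actual parser.
# The keyword set below is an illustrative assumption.
RESERVED = {"select", "from", "table"}


def is_valid_identifier(token: str, reserved: set[str]) -> bool:
    """Return True if `token` may be used as a column or alias name."""
    if len(token) > 2 and token.startswith("`") and token.endswith("`"):
        # Backtick-quoted names are always accepted as identifiers.
        return True
    return token.lower() not in reserved


# With `catalog` left out of the reserved set, an unquoted reference works.
works_unquoted = is_valid_identifier("catalog", RESERVED)

# If `catalog` were reserved, only the quoted form would still be accepted.
breaks_if_reserved = is_valid_identifier("catalog", RESERVED | {"catalog"})
quoted_still_ok = is_valid_identifier("`catalog`", RESERVED | {"catalog"})
```

Keeping the keyword non-reserved means test queries that use catalog as an alias can keep the unquoted form.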

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
Feel free to reach out on the [email protected] list if the patch is in need of reviews.

@zratkai
Contributor

zratkai commented Dec 23, 2024

As Iceberg REST catalogs are popular, it would make sense to add a catalog type as well, like HIVE_CATALOG, ICEBERG_REST_CATALOG, etc. When adding a new catalog, this must be mandatory.

@github-actions github-actions bot removed the stale label Dec 24, 2024
@wecharyu
Contributor Author

@zratkai Does the catalog name meet the requirements? For example, hive corresponds to HIVE_CATALOG, and iceberg_rest corresponds to ICEBERG_REST_CATALOG.

@deniskuzZ
Member

deniskuzZ commented Feb 2, 2025

@wecharyu, I think we need to provide catalog type & connection details:

CREATE CATALOG [IF NOT EXISTS] <catalog_name> TYPE 'iceberg'
PROPERTIES (
  'catalog-type'='rest',
  'uri'='https://iceberg-with-rest:8181/'
)

see https://iceberg.apache.org/docs/1.4.0/flink-ddl/

TYPES:

  • Hive (Hive External / ACID tables) - default
  • Iceberg (HMS-backed Hive catalog, Hadoop, Rest Catalog )

@zhangbutao, @okumin WDYT?
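
As a rough illustration of what the proposed statement would need to carry, here is a minimal Python sketch of a catalog definition with a type and connection properties. It is not Hive code: the class name, the set of supported types, and the rule that a REST-backed Iceberg catalog needs a 'uri' property are all assumptions made for the example. Only the "hive" default and the property keys come from the proposal above.

```python
from dataclasses import dataclass, field

# Catalog types named in the proposal; "hive" is described as the default.
SUPPORTED_TYPES = {"hive", "iceberg"}


@dataclass
class CatalogDefinition:
    """Hypothetical in-memory form of CREATE CATALOG ... TYPE ... PROPERTIES."""

    name: str
    catalog_type: str = "hive"
    properties: dict[str, str] = field(default_factory=dict)

    def validate(self) -> None:
        """Raise ValueError if the definition is inconsistent."""
        if self.catalog_type not in SUPPORTED_TYPES:
            raise ValueError(f"unsupported catalog type: {self.catalog_type}")
        # Assumption: a REST-backed Iceberg catalog needs a 'uri' property.
        is_rest = self.properties.get("catalog-type") == "rest"
        if self.catalog_type == "iceberg" and is_rest and "uri" not in self.properties:
            raise ValueError("iceberg REST catalog requires a 'uri' property")


# Mirrors the example statement from the proposal (the name is made up).
rest_catalog = CatalogDefinition(
    name="ice_rest",
    catalog_type="iceberg",
    properties={"catalog-type": "rest", "uri": "https://iceberg-with-rest:8181/"},
)
rest_catalog.validate()

# Omitting the type falls back to the default Hive catalog type.
default_catalog = CatalogDefinition(name="plain")
default_catalog.validate()
```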

@okumin
Contributor

okumin commented Feb 3, 2025

I assume we'd like to implement something similar to a federated catalog of Glue Catalog, stored in HMS and accessible from Hive. For example, it provides S3 Table integration. It sounds nice.

The type (Hive or Iceberg) plus properties make sense as a way to express arbitrary access to Iceberg REST catalogs.

@@ -932,7 +932,7 @@ nonReserved
:
KW_ABORT | KW_ADD | KW_ADMIN | KW_AFTER | KW_ANALYZE | KW_ARCHIVE | KW_ASC | KW_BEFORE | KW_BUCKET | KW_BUCKETS
| KW_CASCADE | KW_CBO | KW_CHANGE | KW_CHECK | KW_CLUSTER | KW_CLUSTERED | KW_CLUSTERSTATUS | KW_COLLECTION | KW_COLUMNS
-| KW_COMMENT | KW_COMPACT | KW_COMPACTIONS | KW_COMPUTE | KW_CONCATENATE | KW_CONTINUE | KW_COST | KW_DATA | KW_DAY
+| KW_COMMENT | KW_COMPACT | KW_COMPACTIONS | KW_COMPUTE | KW_CONCATENATE | KW_CONTINUE | KW_COST | KW_DATA | KW_DAY | KW_CATALOG | KW_CATALOGS
Contributor

I confirmed CATALOG is not a reserved word in SQL:2023 👍

POSTHOOK: query: DESC CATALOG test_cat
POSTHOOK: type: DESCCATALOG
POSTHOOK: Input: catalog:test_cat
#### A masked pattern was here ####
Contributor

I lean slightly toward not masking this, although it might not be trivial to show it.

Contributor Author

@wecharyu wecharyu Feb 13, 2025

It's masked because of the 'tmp' path. Splitting the result across multiple lines would let the test display some of the information.
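
The effect described here can be sketched as follows. This is a simplified Python model of line-based masking, not the qtest framework's real implementation: the masked fragments, the function name, and the sample output rows are made up for illustration. The point is only that masking works per line, so a path on the same line hides everything else on it.

```python
MASK = "#### A masked pattern was here ####"
# Assumption: a line containing any of these fragments is replaced as a whole.
MASKED_FRAGMENTS = ("file:/", "/tmp/")


def mask_output(lines: list[str]) -> list[str]:
    """Replace every line that contains a masked fragment with the mask text."""
    return [
        MASK if any(frag in line for frag in MASKED_FRAGMENTS) else line
        for line in lines
    ]


# Hypothetical DESC CATALOG output on one line: the path hides the name too.
single_line = mask_output(["test_cat\tfile:/tmp/example/test_cat\tHive test catalog"])

# The same made-up fields split across lines: only the location is masked.
multi_line = mask_output([
    "Catalog Name\ttest_cat",
    "Location\tfile:/tmp/example/test_cat",
    "Comment\tHive test catalog",
])
```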

@zhangbutao
Contributor

@wecharyu, I think we need to provide catalog type & connection details:

CREATE CATALOG [IF NOT EXISTS] <catalog_name> TYPE 'iceberg'
PROPERTIES (
  'catalog-type'='rest',
  'uri'='https://iceberg-with-rest:8181/'
)

see https://iceberg.apache.org/docs/1.4.0/flink-ddl/

TYPES:

  • Hive (Hive External / ACID tables) - default
  • Iceberg (HMS-backed Hive catalog, Hadoop, Rest Catalog )

@zhangbutao, @okumin WDYT?

I am fine with this syntax.

But this PR is really just a supplement to HIVE-18685. It only adds SQL capabilities on top of HIVE-18685 to make HMS catalog operations easier.

@deniskuzZ @okumin If we're on the same page, we want a multi-catalog capability like Trino's. That multi-catalog is different from the HMS catalog of HIVE-18685. Multi-catalog can be used for federated queries by using three-part identifiers like catalog_name.dbName.tblName. For example: select * from hive_catalog.testhivedb.testhivetbl join iceberg_catalog.testdb.testicetbl on testhivetbl.id = testicetbl.id; We can also add other data sources to the multi-catalog, such as a JDBC catalog. BTW, HIVE-24396 added the data connector, which can map a JDBC database instead of a JDBC table, but it cannot map all external databases. With multi-catalog, we can map all external databases at once, just like the Trino JDBC catalog.
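
To make the three-part identifier idea concrete, here is a small Python sketch of how a table reference could be resolved against a current catalog and database. It is not Hive's resolution logic: the function and type names are hypothetical, the defaults ("hive" and "default") are assumptions for the example, and backtick-quoted names containing dots are deliberately not handled.

```python
from typing import NamedTuple


class TableRef(NamedTuple):
    catalog: str
    database: str
    table: str


def resolve(identifier: str,
            current_catalog: str = "hive",
            current_db: str = "default") -> TableRef:
    """Expand a 1-, 2-, or 3-part name into catalog.database.table."""
    parts = identifier.split(".")
    if not all(parts) or len(parts) > 3:
        raise ValueError(f"invalid table identifier: {identifier!r}")
    if len(parts) == 3:
        return TableRef(*parts)
    if len(parts) == 2:
        # Missing catalog: fall back to the session's current catalog.
        return TableRef(current_catalog, parts[0], parts[1])
    # Bare table name: fall back to the current catalog and database.
    return TableRef(current_catalog, current_db, parts[0])


hive_side = resolve("hive_catalog.testhivedb.testhivetbl")
iceberg_side = resolve("iceberg_catalog.testdb.testicetbl")
two_part = resolve("testdb.testicetbl")
```

A federated join like the one in the comment would resolve each side to a different catalog, which is what distinguishes it from a query that stays inside a single HMS catalog.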

For now, I have not figured out how to achieve this multi-catalog capability, and I think it is beyond the scope of this PR. Of course, maybe we can implement multi-catalog on top of this PR and HIVE-18685. :)

@deniskuzZ
Member

deniskuzZ commented Feb 8, 2025

all external databases. With multi-catalog, we can map all external databases at once

Yes, I would pursue the multi-catalog capability. This PR mainly focused on new SQL for catalog registration.

@okumin
Contributor

okumin commented Feb 9, 2025

I would like a single source of the overview of the current strategy so that everyone, including new people, can understand it.

  • The SQL syntax of Hive can process the semantics of the "catalog" of ANSI SQL. It means Hive users can directly identify a table with the fully qualified triplet, e.g., SELECT * FROM {catalog name}.{schema(database) name}.{table name}
  • Hive Metastore can store multiple types of catalogs. I think HMS already supports it
  • A catalog in Hive Metastore can federate with an external Iceberg REST Catalog
  • Hive Metastore can behave as an Iceberg REST Catalog
  • Hive Query Engine can support another catalog, such as Glue Catalog

@wecharyu
Contributor Author

@deniskuzZ @zhangbutao @okumin Catalog type and properties are needed to support multiple catalogs in hive engine, I think it's better to raise a new PR for it.

@deniskuzZ
Member

deniskuzZ commented Feb 13, 2025

@deniskuzZ @zhangbutao @okumin Catalog type and properties are needed to support multiple catalogs in hive engine, I think it's better to raise a new PR for it.

Totally OK with that. I'll raise a new feature JIRA and add the subtasks discussed here.

9 participants