Improve performance of querying `system.jdbc.tables` for Hive, Iceberg, and Delta #24110

piotrrzysko · 2024-11-12T11:31:12Z

Description

For now, we are mainly focused on improving performance for connectors using Glue.

Locally, the execution time of the following query, for 6500 tables and 530 schemas, decreased from 3m 11s to 27s:

SELECT TABLE_CAT, TABLE_SCHEM, TABLE_NAME, TABLE_TYPE, REMARKS,
  TYPE_CAT, TYPE_SCHEM, TYPE_NAME, SELF_REFERENCING_COL_NAME, REF_GENERATION
FROM system.jdbc.tables
WHERE TABLE_CAT = 'hive' AND TABLE_TYPE IN ('TABLE', 'VIEW')
ORDER BY TABLE_TYPE, TABLE_CAT, TABLE_SCHEM, TABLE_NAME;

Additional context and related issues

Release notes

( ) This is not user-visible or is docs only, and no release notes are required.
( ) Release notes are required. Please propose a release note for me.
(x) Release notes are required, with the following suggested text:

## Hive, Iceberg, Delta
* Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta. ({issue}`24110`)

shohamyamin · 2024-11-12T17:12:56Z

Hi there!

I came across this PR and was curious if it might address a similar issue to the one in my PR #23909, which focuses on improving the retrieval of column metadata for Iceberg tables. In my PR, I added parallelization to the streamTableColumns method in IcebergMetadata.java, enhancing performance for users pulling column metadata.

I noticed there haven’t been any changes related to the REST catalog in this PR. Would you consider applying similar improvements to the REST catalog as well? This would allow users working with REST catalog setups to benefit from the enhanced metadata retrieval. Also, could you provide some details about the use cases this PR aims to support?

Thanks!

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

piotrrzysko · 2024-11-14T13:15:22Z

@shohamyamin

> I came across this PR and was curious if it might address a similar issue to the one in my #23909, which focuses on improving the retrieval of column metadata for Iceberg tables.

This PR focuses on improving the performance of table listing. It won't address column retrieval, but it should be relatively straightforward to extend the current changes to include columns as well.

> I noticed there haven’t been any changes related to the REST catalog in this PR. Would you consider applying similar improvements to the REST catalog as well?

Currently, I'm mainly focused on improving performance for the Glue catalog, but the same approach can be applied to other catalogs as well.

> Also, could you provide some details about the use cases this PR aims to support?

The goal is to improve the performance of queries like the one in the PR's description, which are often issued by DB tools to list tables.

lukasz-stec

LGTM, that will be a great improvement for Glue catalogs with many schemas!
Please provide benchmark results once you have them, both in the PR description and in the commit messages where applicable.

lukasz-stec · 2024-11-15T13:04:33Z

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeMetadata.java

@piotrrzysko I would drop the question from the commit message.
The question was:

> Is this resolution necessary? Could we instead use the
> existing mapping between ExtendedRelationType and RelationType that's
> already encapsulated in RelationType?

I guess the question here is, is OTHER_VIEW or OTHER_MATERIALIZED_VIEW possible in delta-lake? TRINO_MATERIALIZED_VIEW is not.
@raunaqmorarka @dain Do you know?

btw it is ok for me to have the resolveRelationType here to keep the existing functionality

plugin/trino-delta-lake/src/main/java/io/trino/plugin/deltalake/DeltaLakeConfig.java

lukasz-stec · 2024-11-15T13:26:34Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveConfig.java

I would drop the question from the commit message.
The question is:

> Question: Should we consider setting the default value of
> "hive.metadata.parallelism" to 1 when using the "file" metastore?

I would not worry about file metastore. @raunaqmorarka WDYT?

nit: seems super easy to do however (in the module set default config), but yes, I don't think it matters unless this change has affected test runtime for tests which use the FileHMS.

piotrrzysko · 2024-11-17T19:47:03Z

Benchmark results for the following query (master vs. 8 threads vs. 16 threads):

SELECT * FROM system.jdbc.tables WHERE table_cat = 'hive'

The queried catalog had 234 schemas with a total of 5674 tables.

hashhar · 2024-11-19T10:59:59Z

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveConfig.java

nit: seems super easy to do however (in the module set default config), but yes, I don't think it matters unless this change has affected test runtime for tests which use the FileHMS.

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergModule.java

plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadataFactory.java

hashhar · 2024-11-20T15:37:14Z

/test-with-secrets sha=a443d3f9c7a936a810aac5f5b2b0a0e552f9beaf

github-actions · 2024-11-20T15:38:36Z

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/11936682245

...ino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/glue/IcebergGlueCatalogConfig.java

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergModule.java

The parameter for specifying the maximum number of threads fetching tables ("hive.metadata.parallelism") follows the naming convention used in the BigQuery connector ("bigquery.metadata.parallelism"). Parallelization has been introduced in HiveMetadata rather than in specific metastore implementations, primarily to avoid reintroducing a cache storing tables for all schemas, which was removed in trinodb@cb4d168. This approach attempts to parallelize table retrieval for all metastore types, even though not all of them support concurrent access. Currently, only FileHiveMetastore does not support multithreaded access, so parallelization has no effect there.

Benchmark results for the query SELECT * FROM system.jdbc.tables WHERE table_cat = 'hive' show a decrease in execution time from 55s to 15s. The queried catalog had 234 schemas with a total of 5674 tables.
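As a rough illustration of the idea in this commit message, the sketch below fans per-schema table listing out over a bounded thread pool. It is not the code from this PR: the class `ParallelTableListing`, the method `listTables`, and the `listTablesInSchema` callback are hypothetical names, and `maxThreads` merely stands in for the value of "hive.metadata.parallelism".

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.function.Function;

// Hypothetical helper for illustration only; not the implementation in HiveMetadata.
final class ParallelTableListing
{
    private ParallelTableListing() {}

    // Lists tables for every schema, running at most maxThreads metastore calls at once.
    // Results are returned in schema order because futures are collected in submission order.
    static List<String> listTables(
            List<String> schemas,
            Function<String, List<String>> listTablesInSchema, // assumed per-schema metastore call
            int maxThreads)
    {
        ExecutorService executor = Executors.newFixedThreadPool(maxThreads);
        try {
            List<Future<List<String>>> futures = new ArrayList<>();
            for (String schema : schemas) {
                Callable<List<String>> task = () -> listTablesInSchema.apply(schema);
                futures.add(executor.submit(task));
            }
            List<String> tables = new ArrayList<>();
            for (Future<List<String>> future : futures) {
                tables.addAll(future.get());
            }
            return tables;
        }
        catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new RuntimeException(e);
        }
        catch (ExecutionException e) {
            throw new RuntimeException(e.getCause());
        }
        finally {
            executor.shutdownNow();
        }
    }
}
```

With `maxThreads` set to 1 the tasks run one after another, which matches the case described above of a metastore that does not benefit from concurrent access.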

Before introducing DeltaLakeMetadata::getRelationTypes, ConnectorMetadata::getRelationTypes was used to retrieve relation types for Delta Lake. The original implementation classified all tables as RelationType.TABLE, except those with the extended relation type TRINO_VIEW, which were classified as RelationType.VIEW. The resolveRelationType method was added in this commit to preserve that behavior.

Benchmark results for the query SELECT * FROM system.jdbc.tables WHERE table_cat = 'delta_lake' show a decrease in execution time from 55s to 16s. The queried catalog had 234 schemas with a total of 5674 tables.
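The classification rule described in this commit message can be sketched as follows. The two enums are simplified local stand-ins, not Trino's actual types, and the wrapper class `RelationTypeSketch` is hypothetical; only the rule itself (TRINO_VIEW maps to VIEW, everything else to TABLE) is taken from the text above.

```java
// Illustration only: simplified stand-ins for the types named in the commit message.
final class RelationTypeSketch
{
    private RelationTypeSketch() {}

    enum RelationType { TABLE, VIEW }

    enum ExtendedRelationType { TABLE, OTHER_VIEW, OTHER_MATERIALIZED_VIEW, TRINO_VIEW, TRINO_MATERIALIZED_VIEW }

    // Only TRINO_VIEW is reported as a view; every other extended type is reported as a table.
    static RelationType resolveRelationType(ExtendedRelationType type)
    {
        return type == ExtendedRelationType.TRINO_VIEW ? RelationType.VIEW : RelationType.TABLE;
    }
}
```

Under this rule, a value such as OTHER_VIEW would still be reported as a table, which is the case the review question about Delta Lake is probing.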

Parallelization has been implemented at the TrinoCatalog level, rather than in IcebergMetadata, because some catalogs (e.g., Nessie) seem to support optimized table retrieval across all schemas. Currently, parallelization has been added for the Glue and Hive catalogs, but it can easily be extended to other catalogs as well.

Benchmark results for the query SELECT * FROM system.jdbc.tables WHERE table_cat = 'iceberg' show a decrease in execution time from 54s to 17s. The queried catalog had 234 schemas with a total of 5674 tables.
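One way to picture the design choice described here is an interface whose default implementation lists tables schema by schema, while a catalog with a bulk call overrides it. This is a sketch under stated assumptions: `Catalog`, `InMemoryCatalog`, `BulkCatalog`, and the `backendCalls` counter are all hypothetical and do not mirror the real TrinoCatalog API.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustration only: hypothetical types, not the TrinoCatalog interface.
final class CatalogLevelListingSketch
{
    private CatalogLevelListingSketch() {}

    interface Catalog
    {
        List<String> listSchemas();

        List<String> listTables(String schema);

        // Default path: one backend call per schema. A catalog could parallelize this loop.
        default List<String> listAllTables()
        {
            List<String> tables = new ArrayList<>();
            for (String schema : listSchemas()) {
                tables.addAll(listTables(schema));
            }
            return tables;
        }
    }

    // Catalog that can only list one schema at a time.
    static class InMemoryCatalog implements Catalog
    {
        final TreeMap<String, List<String>> tablesBySchema;
        int backendCalls; // counts simulated round trips to the backend

        InMemoryCatalog(Map<String, List<String>> tablesBySchema)
        {
            this.tablesBySchema = new TreeMap<>(tablesBySchema);
        }

        @Override
        public List<String> listSchemas()
        {
            backendCalls++;
            return new ArrayList<>(tablesBySchema.keySet());
        }

        @Override
        public List<String> listTables(String schema)
        {
            backendCalls++;
            return tablesBySchema.getOrDefault(schema, List.of());
        }
    }

    // Catalog with an assumed bulk call: it overrides the default and needs a single round trip.
    static final class BulkCatalog extends InMemoryCatalog
    {
        BulkCatalog(Map<String, List<String>> tablesBySchema)
        {
            super(tablesBySchema);
        }

        @Override
        public List<String> listAllTables()
        {
            backendCalls++;
            List<String> tables = new ArrayList<>();
            tablesBySchema.values().forEach(tables::addAll);
            return tables;
        }
    }
}
```

Keeping the decision inside each catalog lets a bulk-capable catalog skip the per-schema loop entirely, which a parallel loop placed one level higher would not allow.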

piotrrzysko · 2024-11-22T13:46:58Z

Rebased and addressed comments.

hashhar · 2024-11-22T13:52:39Z

/test-with-secrets sha=c84ce622d138423d1d8cd51e7e34a2db3447eb33

github-actions · 2024-11-22T13:54:03Z

The CI workflow run with tests that require additional secrets finished as failure: https://github.com/trinodb/trino/actions/runs/11973910005

hashhar · 2024-11-22T15:52:37Z

unrelated failure on test with secrets, merging. Thanks a lot @piotrrzysko. 😄 🚀

~~I've reworded release notes a little bit, please take a look.~~

mosabua · 2024-11-25T19:43:43Z

@hashhar @raunaqmorarka @piotrrzysko - can you explain this in more detail so we can get a release notes entry that is more user-facing? What does it actually improve? Is this for federated queries where these connectors need to access JDBC info? Also, is the Hudi connector affected?

hashhar · 2024-11-26T10:01:51Z

This improves metadata retrieval performance when querying system.jdbc.tables and any Hive, Iceberg, or Delta catalogs are present, i.e. when their metadata is queried.

The proposed RN entry is indeed user-facing - people who were querying system.jdbc.tables will now see faster performance if they are using Hive, Iceberg or Delta.

cla-bot bot added the cla-signed label Nov 12, 2024

github-actions bot added iceberg Iceberg connector delta-lake Delta Lake connector hive Hive connector labels Nov 12, 2024

piotrrzysko changed the title ~~[WIP] Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta~~ [WIP] Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta Nov 12, 2024

raunaqmorarka reviewed Nov 12, 2024

piotrrzysko force-pushed the parallel-jdbc-table branch 2 times, most recently from e8bcc3f to f7f8579 Compare November 13, 2024 05:43

hashhar self-requested a review November 14, 2024 14:58

lukasz-stec approved these changes Nov 15, 2024

piotrrzysko force-pushed the parallel-jdbc-table branch from f7f8579 to 8f8bf88 Compare November 17, 2024 19:41

piotrrzysko changed the title ~~[WIP] Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta~~ Improve performance of querying system.jdbc.tables for Hive, Iceberg, and Delta Nov 17, 2024

piotrrzysko marked this pull request as ready for review November 17, 2024 19:48

piotrrzysko force-pushed the parallel-jdbc-table branch from 8f8bf88 to 39b23ae Compare November 17, 2024 19:55

piotrrzysko requested a review from raunaqmorarka November 17, 2024 20:00

github-actions bot added the docs label Nov 17, 2024

piotrrzysko mentioned this pull request Nov 18, 2024

Parallelize tables retrieval from multiple catalogs #24159

Merged

hashhar reviewed Nov 19, 2024

piotrrzysko force-pushed the parallel-jdbc-table branch from d6980ee to a443d3f Compare November 20, 2024 11:23

hashhar approved these changes Nov 20, 2024

raunaqmorarka reviewed Nov 21, 2024

...ino-iceberg/src/main/java/io/trino/plugin/iceberg/catalog/glue/IcebergGlueCatalogConfig.java (outdated, resolved)

plugin/trino-iceberg/src/main/java/io/trino/plugin/iceberg/IcebergModule.java (outdated, resolved)

piotrrzysko force-pushed the parallel-jdbc-table branch from a443d3f to 535ac27 Compare November 21, 2024 17:46

piotrrzysko added 3 commits November 22, 2024 14:43

Move ExecutorUtil to trino-plugin-toolkit

ebc34a8

piotrrzysko added 2 commits November 22, 2024 14:44

Document parallel metadata loading for Hive, Iceberg, and Delta

c84ce62

piotrrzysko force-pushed the parallel-jdbc-table branch from 535ac27 to c84ce62 Compare November 22, 2024 13:45

hashhar merged commit d34e358 into trinodb:master Nov 22, 2024
97 of 98 checks passed

github-actions bot added this to the 466 milestone Nov 22, 2024

raunaqmorarka added the performance label Nov 22, 2024

mosabua mentioned this pull request Nov 25, 2024

Add Trino 466 release notes #24208

Merged

shohamyamin mentioned this pull request Dec 12, 2024

Improve performance when listing columns in Iceberg #23909

Merged

dejangvozdenac mentioned this pull request Oct 3, 2025

Extremely slow performance of iceberg planning stage with statistics on #26563

Open

6 participants