-
Notifications
You must be signed in to change notification settings - Fork 3.4k
Get stats from hive metastore only for the necessary columns #16203
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Get stats from hive metastore only for the necessary columns #16203
Conversation
7a75b11 to
29b2138
Compare
5d763ff to
ee3f5fb
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
ee3f5fb to
5cfb83d
Compare
lukasz-stec
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CA
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
5cfb83d to
fede049
Compare
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
.../trino-hive/src/main/java/io/trino/plugin/hive/metastore/SemiTransactionalHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
fede049 to
d777bb7
Compare
lukasz-stec
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ca
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetastoreClosure.java
Outdated
Show resolved
Hide resolved
.../trino-hive/src/main/java/io/trino/plugin/hive/metastore/SemiTransactionalHiveMetastore.java
Outdated
Show resolved
Hide resolved
d777bb7 to
d954002
Compare
|
rebased on the latest master (auto rebase were causing compilation errors) |
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveTableHandle.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
Before applyProjection is called or when projection pushdown is disabled, we should assume that all table columns are to be projected for populating HiveTableHandle#projectedColumns
d954002 to
7341c8f
Compare
lukasz-stec
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ca
plugin/trino-hive/src/main/java/io/trino/plugin/hive/metastore/cache/CachingHiveMetastore.java
Outdated
Show resolved
Hide resolved
plugin/trino-hive/src/main/java/io/trino/plugin/hive/HiveMetadata.java
Outdated
Show resolved
Hide resolved
HiveTableHandle contains the list of projected columns. Since only projected columns will be available to the engine, HiveMetadata#getTableStatistics can return statistics limited to those columns. This also requires CachingHiveMetastore to support caching statistics for a subset of table columns.
7341c8f to
6280a6d
Compare
| List<Column> requestedColumns = table.getDataColumns().stream() | ||
| .filter(column -> requestedColumnNames.contains(column.getName())) | ||
| .collect(toImmutableList()); | ||
| table = Table.builder(table).setDataColumns(requestedColumns).build(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we're building fake table object, hoping the callee will trust the table, and not eg use knowledge about column list from some other place
that's an opaque design, something explicit would probably be better.
Why not pass Set<String> requestedColumns (or Optional thereof`) down the call chain?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not pass Set requestedColumns (or Optional thereof`) down the call chain?
That's an excellent idea. It also applies to the getPartitionStatistics, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
potentially yes
Description
HiveTableHandlecontains the list of projected columns. Since only projected columns will be available to the engine,HiveMetadata.getTableStatisticscan return statistics limited to those columns.This also requires
CachingHiveMetastoreto support caching statistics for a subset of table columns.Benchmarks
I tested the change on the glue metastore and table with 100 columns (details below) with caching stats disabled (
hive.metastore-stats-cache-ttl=0s)and a simple query (select count(c1) from test_col_stats_part) where planning dominates.There is a difference of about 20% for planning time/elapsed (so visible)
baseline 2.251s on average and only required stats 1.853s
This would have a potentially higher impact on a metastore under heavy load.
The table used:
I also ran standard tpch/tpcds on scale factor 1000, orc, partitioned. The results show mostly no change (except for some variability)
sf1k-orc-part-240223.pdf
Additional context and related issues
Release notes
( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text: