Implement partitioned tpch/tpcds cbo plan test #11738
sopel39 merged 5 commits into trinodb:master from gaurav8297:gaurav8297/partition_test
Conversation
@raunaqmorarka PTAL
Have we looked at making the TPC-H and TPC-DS connectors produce partitioned tables instead? I'm concerned about the amount of coupling between the benchmarks module and the Hive connector that this change is introducing. It will make it much harder to evolve the Hive connector in the future.
We started with that, but the way the Hive connector fetches statistics for partitioned tables is different. It samples partitions and then estimates statistics across the samples. It also stores statistics in a different format.
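To make the sampling point concrete, here is a minimal sketch of the idea: estimate a table-level row count from a sample of partitions by averaging the sampled partitions and scaling up. The class and method names are illustrative, not the actual Hive connector implementation.

```java
import java.util.List;

// Hypothetical sketch: estimate table-level statistics from a sample of
// partitions (not the real Hive connector code).
public class PartitionSampleEstimator
{
    // Average the row counts of the sampled partitions, then scale by the
    // total number of partitions to estimate the whole table.
    public static long estimateRowCount(List<Long> sampledPartitionRowCounts, int totalPartitions)
    {
        double averageRowsPerPartition = sampledPartitionRowCounts.stream()
                .mapToLong(Long::longValue)
                .average()
                .orElse(0);
        return Math.round(averageRowsPerPartition * totalPartitions);
    }

    public static void main(String[] args)
    {
        // 3 sampled partitions averaging 200 rows, 10 partitions total
        System.out.println(estimateRowCount(List.of(100L, 200L, 300L), 10)); // prints 2000
    }
}
```

Because estimates are derived this way, plans produced against a Hive metastore can differ from plans produced against the tpch/tpcds connectors even for the same data.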
IIUIC, the majority of coupling is introduced to generate the gzip statistics files in
@gaurav8297 Have you checked
@raunaqmorarka PTAL again
@raunaqmorarka PTAL
For huge partitioned tables, the recording file can grow very large because of partition-level statistics. Compressing the recording file keeps it small and also makes reads and writes faster.
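As an illustration of the compression step, the sketch below gzips a JSON recording before writing it and decompresses it on read, using the standard `java.util.zip` streams. The recording content shown is made up; the actual recording format in this PR may differ.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Illustrative sketch: gzip-compress a JSON recording in memory
// (a real implementation would stream to/from a file).
public class RecordingCompression
{
    public static byte[] compress(String json) throws Exception
    {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gzip = new GZIPOutputStream(out)) {
            gzip.write(json.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static String decompress(byte[] compressed) throws Exception
    {
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            return new String(gzip.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws Exception
    {
        // Hypothetical recording payload with partition-level statistics
        String recording = "{\"partitionStatistics\": {\"ds=2022-01-01\": {\"rowCount\": 100}}}";
        byte[] compressed = compress(recording);
        System.out.println(decompress(compressed).equals(recording)); // prints true
    }
}
```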
sopel39 left a comment
nit: we might consider renaming packages here in the future, as this is now more related to Hive stats than to the generic TPCH/TPCDS connectors
So, there are no min/max statistics for char-based columns.
Why?
I don't think we support min/max statistics for varchar and char columns in the Hive connector.
Q: That change was made in 2018. Is it still the case that min/max statistics for char columns are not used by the optimizer?
See StatsUtil#toStatsRepresentation; it's still the case. We rely more on NDV for those columns.
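To illustrate the idea behind this (a simplified, hypothetical sketch, not the real `StatsUtil#toStatsRepresentation`): numeric values map to a double the optimizer can use for min/max ranges, while char/varchar values get no such representation, so the optimizer falls back on NDV for them.

```java
import java.util.OptionalDouble;

// Hedged sketch of the idea: only types with a meaningful double
// representation produce min/max statistics. The real method covers
// many more types (dates, decimals, etc.).
public class StatsRepresentationSketch
{
    public static OptionalDouble toStatsRepresentation(Object value)
    {
        if (value instanceof Number number) {
            return OptionalDouble.of(number.doubleValue());
        }
        // char/varchar: no double representation, hence no min/max range
        return OptionalDouble.empty();
    }

    public static void main(String[] args)
    {
        System.out.println(toStatsRepresentation(42).isPresent());    // prints true
        System.out.println(toStatsRepresentation("abc").isPresent()); // prints false
    }
}
```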
@sopel39 PTAL
Currently the recording metastore caches partition statistics and values per set of partitions rather than per individual partition.
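A minimal illustration of that caching behavior (names are made up for this sketch): because entries are keyed by the whole set of requested partitions, asking for a different subset misses the cache even when each individual partition was already recorded.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: statistics cached per *set* of partitions,
// not per individual partition (not the real recording metastore code).
public class PartitionSetStatisticsCache
{
    private final Map<Set<String>, Map<String, Long>> rowCountsByPartitionSet = new HashMap<>();

    public void record(Set<String> partitions, Map<String, Long> rowCounts)
    {
        rowCountsByPartitionSet.put(partitions, rowCounts);
    }

    public boolean isCached(Set<String> partitions)
    {
        return rowCountsByPartitionSet.containsKey(partitions);
    }

    public static void main(String[] args)
    {
        PartitionSetStatisticsCache cache = new PartitionSetStatisticsCache();
        cache.record(Set.of("ds=2022-01-01", "ds=2022-01-02"),
                Map.of("ds=2022-01-01", 100L, "ds=2022-01-02", 200L));
        System.out.println(cache.isCached(Set.of("ds=2022-01-01", "ds=2022-01-02"))); // prints true
        System.out.println(cache.isCached(Set.of("ds=2022-01-01")));                  // prints false
    }
}
```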
Instead of using ConnectorTableHandle, use TableMetadata to find the table name in a generic way in JoinOrderPrinter.
Instead of using the tpch/tpcds connectors, use an in-memory Hive metastore with the corresponding tables, to better reflect the actual plans generated on the benchmark cluster. For instance, the algorithm the Hive metastore uses to calculate partition statistics differs from the one in the tpch/tpcds connectors.
Failed due to #11929
Description
Issue: #11466
Improvement
CBO and benchto
This includes two major changes:
Related issues, pull requests, and links
Documentation
( ) No documentation is needed.
( ) Sufficient documentation is included in this PR.
( ) Documentation PR is available with #prnumber.
( ) Documentation issue #issuenumber is filed, and can be handled later.
Release notes
( ) No release notes entries required.
( ) Release notes entries required with the following suggested text: