Support partition based metadata query optimization for Iceberg #22080
Conversation
Force-pushed a2cf6a9 to a922048
tdcmeehan left a comment
One thing I'm not sure about is: does the Iceberg spec guarantee reliability of the min/max values for all types, and will this code work in the presence of them being absent?
presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergLogicalPlanner.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
I think it would work well even in the absence of min/max values on some or all columns. Filters on columns other than identity partition columns only help reduce the set of files that can be safely filtered out; they are an enhancement rather than a requirement. If the min/max values are present and work well, great; if no files get filtered out, that is fine too. The only troublesome situation would be incorrect min/max values in a data file causing us to wrongly filter out valid partitions, and if that ever happened, I think it would be a bug in the Iceberg implementation that needs to be fixed.
Sorry, the previous response may have missed the point. The strict requirement for applying this optimization is that all the columns (including group-by keys, aggregation keys, and filter keys) used in the table scan below the aggregation must be identity partition columns, so it works well in this scenario.
Are there any scenarios where the min/max may be off slightly or corrupted? For example, if I have a huge string column, is it guaranteed that the max will not be truncated, and will be faithfully represented in the manifest files?
The Iceberg spec describes it as follows (referring to: https://iceberg.apache.org/spec/#scan-planning). So I think file filtering through min/max values should not make mistakes, that is, it should never filter out valid files. I will try reading the Iceberg code to confirm. But I don't think the scenario you mentioned can occur, because Iceberg writes a data file and its corresponding metadata in an atomic transaction. And if it does need to truncate the max value of a huge string column (I'm not sure whether it does), truncating the string to a shorter but larger bound would be fine.
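To illustrate why a truncated upper bound stays safe for file filtering, here is a minimal sketch (the function name is made up and this is not Iceberg's actual implementation): a truncated string upper bound must be "rounded up" so it still compares greater than or equal to every value it summarizes.

```python
def truncated_upper_bound(value: str, width: int) -> str:
    """Sketch: truncate a string upper bound to `width` characters while
    keeping it a valid upper bound (>= the original value)."""
    if len(value) <= width:
        return value
    prefix = value[:width]
    # Increment the last character so the shorter bound still dominates
    # the original value; a real implementation must also handle the
    # case where the last character is already the maximum code point.
    return prefix[:-1] + chr(ord(prefix[-1]) + 1)

# A file whose true max is "iceberg" can safely advertise "icf" as its
# upper bound: any predicate value greater than "icf" cannot match it.
```

Under this scheme, pruning with the truncated bound can only be conservative; it never filters out a file that could contain matching rows.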
@hantangwangd would you be able to double check this with the Iceberg community? If so, I'd actually be supportive of somehow ensuring metadata optimizations are enabled by default just for the Iceberg connector.
@tdcmeehan OK, I will try to figure it out.
Force-pushed 5b6ea5e to 38633b4
Hi @tdcmeehan, I found that the Iceberg community has supported aggregate push-down for Spark, referring to issue #6622. It relies on the min/max/count stats of files to push down the specified aggregations (min/max/count). It's a little different from this PR, and is more similar to issue #21885 that you raised (described later). Referring to https://github.com/apache/iceberg/blob/main/spark/v3.5/spark/src/main/java/org/apache/iceberg/spark/source/SparkScanBuilder.java#L202
That PR demonstrates two things:
It differs from this PR in several aspects:
To summarize, I think we can draw the following conclusions:
If I've misunderstood anything, please let me know; any information would be greatly appreciated! @nastra @rdblue
@hantangwangd thanks for the pointers. I also came across apache/iceberg#113 which confirms your findings. I think the above optimizations could be written as a connector-specific optimizer in Iceberg. What do you think? My worry about
@tdcmeehan Makes sense, it would be more flexible and safe to write it as a connector-specific optimizer in Iceberg. I will do it.
Force-pushed 3d14925 to bd9465f
Hi @tdcmeehan, I have written a connector-specific metadata optimizer in Iceberg. Please take a look, thanks!
ZacBlanco left a comment
Took a first pass! Really neat optimizations. Had a few minor things
```java
RowExpression result = evaluateMinMax(
        functionMetadataManager.getFunctionMetadata(node.getAggregations().get(outputVariable).getFunctionHandle()),
        inputColumnValues.get(inputVariable));
```
I'm curious about the benefit of calculating the aggregation result here vs just replacing the table scan with VALUES and letting the engine calculate the aggregation. I just worry about the complexity of maintaining this code.
Intuitively, I think it could be beneficial because it further optimizes min/max into greatest/least, which can be calculated in batch.
Have you done any benchmarks comparing running the aggregations here in the connector vs. only replacing the table scans?
I didn't, but it's worth a try. I will figure it out.
I wrote a benchmark test to compare min/max against least/greatest on a ValuesNode with different row counts. The results are as follows:

| Benchmark | (recordCount) | Mode | Cnt | Score | Error | Units |
|---|---|---|---|---|---|---|
| BenchmarkMinMaxWithLeastGreatest.baseline | 2 * 5 | avgt | 10 | 24.250 | ± 0.304 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.baseline | 2 * 50 | avgt | 10 | 24.429 | ± 0.383 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.baseline | 20 * 50 | avgt | 10 | 24.463 | ± 0.445 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.baseline | 200 * 50 | avgt | 10 | 24.233 | ± 0.382 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithFurtherFlag | 2 * 5 | avgt | 10 | 36.249 | ± 1.835 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithFurtherFlag | 2 * 50 | avgt | 10 | 36.407 | ± 1.333 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithFurtherFlag | 20 * 50 | avgt | 10 | 43.463 | ± 0.607 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithFurtherFlag | 200 * 50 | avgt | 10 | 191.785 | ± 2.027 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithoutFurtherFlag | 2 * 5 | avgt | 10 | 34.020 | ± 0.798 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithoutFurtherFlag | 2 * 50 | avgt | 10 | 34.072 | ± 0.894 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithoutFurtherFlag | 20 * 50 | avgt | 10 | 49.647 | ± 1.118 | ms/op |
| BenchmarkMinMaxWithLeastGreatest.testWithoutFurtherFlag | 200 * 50 | avgt | 10 | 286.988 | ± 72.066 | ms/op |
The benchmark test code could be found here: hantangwangd@9943a6a
It seems that when the row count is less than 1000, the two are roughly similar, with least/greatest having slightly higher overhead. But when the row count is larger, least/greatest is clearly faster. The reason might be that it has an effective time complexity of about n/100.
By the way, the problems with huge ValuesNodes make it all the more worthwhile to do further optimization by pre-calculating with least/greatest.
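The pre-calculation idea discussed here can be sketched as follows (a Python stand-in for illustration, not the actual Presto code): instead of emitting a large ValuesNode and letting the engine aggregate it, the optimizer can pre-fold min/max over the known partition values with a least/greatest-style reduction at plan time.

```python
from functools import reduce

def least(a, b):
    return a if a <= b else b

def greatest(a, b):
    return a if a >= b else b

def prefold_min_max(partition_values):
    # Fold the aggregation over the partition values at plan time, so the
    # rewritten plan carries a single-row ValuesNode instead of one row
    # per partition.
    return (reduce(least, partition_values), reduce(greatest, partition_values))
```

The design tradeoff is exactly the one debated above: folding at plan time shrinks the ValuesNode to a single row, at the cost of keeping aggregation-evaluation logic inside the connector.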
I think we need a way to disable these optimizations in this PR, in case they cause issues on the coordinator due to memory pressure/GC. Either through a flag which sets a max row count for the ValuesNode, or a boolean flag to disable it altogether.
Sure, makes sense. I think a limit on row count for this optimization would be better than a boolean flag, since queries over a modest number of partitions would still benefit when memory pressure is low, and we can explicitly define the value 0 as disabling this optimization completely and document that.
I think the default value should be larger than optimizer.optimize-metadata-queries-call-threshold's default of 100 used in MetadataQueryOptimizer, since Iceberg has no need to call the metastore to confirm whether each partition is valid, as Hive does. Maybe set it to 1000, any thoughts?
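Such a threshold could look something like the following configuration fragment (the property name and default shown here are illustrative, not necessarily the final names merged in this PR):

```properties
# Max number of rows materialized for metadata-based optimization;
# 0 disables the optimization entirely (illustrative name and default).
iceberg.rows-for-metadata-optimization-threshold=1000
```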
I have thought about limiting the number of partitions in this optimization, and I think we should apply the limit when fetching the partitions. When we get the table layout for an Iceberg table, we should not eagerly load all the partitions; instead we should return a lazily loaded data structure, something like LazyBlock. We can load the data by calling assureLoaded() when necessary, and if it returns false or throws an exception, we simply skip this optimization. When all the partitions are already loaded, we can just do the optimization. If that's a feasible solution, maybe a subsequent separate PR would be better, what's your opinion?
I think that sounds like a reasonable solution. I am okay with waiting for a PR as long as we have a config flag added.
> I think a limit on row count for this optimization would be better than a boolean flag, as the queries on not so large number partitions would still get benefits when memory pressure is not so high, and we can explicitly define the value 0 as completely disable this optimization and document that.
I agree with this.
> I think the default value should be larger than optimizer.optimize-metadata-queries-call-threshold's default value 100 used in MetadataQueryOptimizer, for Iceberg have no need to call metastore to confirm whether each partition is valid like hive does. Maybe set to 1000, any thoughts
I think the concern about metadata calls is an adjacent issue, but one that can't be avoided on the Iceberg side. Getting the list of partitions requires generating a scan plan, which means loading the manifests. As long as those manifests are cached and don't require additional metadata lookups, I think we don't need to worry, because during actual scan planning those manifests would have needed to be loaded anyway without this optimization.
The real issue is the memory-performance tradeoff. From the benchmarks we know that there is a benefit at ~1k rows, but I'm not sure how much a few ms is going to save in the grand scheme of things (at least according to your benchmark). I don't know the point at which memory usage could become an issue. I am happy to leave it at 1k for now. If you end up using this in prod and find a better value, we can always tweak the default later.
Related: If we are adding configuration, we should add documentation for it as well.
Sure, I will do this.
...ceberg/src/main/java/com/facebook/presto/iceberg/optimizer/IcebergPlanOptimizerProvider.java
presto-iceberg/src/test/java/com/facebook/presto/iceberg/TestIcebergLogicalPlanner.java
...to-iceberg/src/main/java/com/facebook/presto/iceberg/optimizer/IcebergMetadataOptimizer.java
Force-pushed bd9465f to 688df0d
Force-pushed 688df0d to b8c0bb2
Force-pushed 17cec68 to 1c57fc3
Force-pushed 1c57fc3 to e0155f1
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergConfig.java
...to-iceberg/src/main/java/com/facebook/presto/iceberg/optimizer/IcebergMetadataOptimizer.java
...to-iceberg/src/main/java/com/facebook/presto/iceberg/optimizer/IcebergMetadataOptimizer.java
Force-pushed e0155f1 to 4f76ba5
@aaneja Thanks for your review and message.
I think the answer to this question is no. We get partition information by planning all the valid files in the current snapshot of the table. Iceberg is based on immutable files; we can only delete data/rows by deleting whole data files or writing delete files, so there cannot be a valid file that contains no data. And if we find that any partition contains delete files, we return empty partition information directly and skip this optimization, since we might hit the situation where a partition contains both valid data files and delete files but ultimately contains no data. After these checks, there is no need to check the validity of partitions again.
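The guard described above can be sketched like this (a simplified Python model for illustration; the real check lives in the Java connector, and the field names here are invented):

```python
def partitions_for_optimization(data_files):
    """Collect partition tuples from the snapshot's files, bailing out
    (returning None) if any file has associated delete files, since a
    partition might then hold no live rows."""
    partitions = set()
    for f in data_files:
        if f.get("delete_files"):
            return None  # skip the metadata optimization entirely
        partitions.add(f["partition"])
    return partitions
```

Returning None rather than a partial answer is the key point: the optimization is only applied when the partition set is provably complete and live.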
I think this optimization is very closely linked to the implementation details of the underlying storage connector. One example is the question discussed above. Furthermore, for Iceberg, a series of upcoming changes could affect this optimization, for example:
To sum up, extracting common code is something we should always consider, and there is indeed some room for extraction. But I think we can hold off for a while and consider it once the evolution of the Iceberg connector stabilizes a bit. What's your opinion?
aaneja left a comment
The common pattern in the optimization is that if the Agg column(s) has a constrained TupleDomain, we can rewrite this Agg as a Values node
If we can extract this into a connector agnostic implementation, we reduce the test area significantly.
The Iceberg (or other connector) specific features that supply this TupleDomain can then evolve independently (with a much easier-to-cover test area)
I don't have strong opinions on doing this right away, but that's how I see this evolving.
Yes, as you said, there is currently a common pattern. I think this optimization is mainly about using metadata to optimize cardinality-insensitive aggregations. That is, if we can enumerate through metadata all the distinct tuples of the columns appearing in aggregation keys and group-by keys, and the total number of these tuples is not too large, we can apply this optimization and translate the TableScanNode into a ValuesNode to gain some benefit. So the connector needs to determine whether it can exactly retrieve all the distinct tuples of the specified columns, under any existing filter predicate, through metadata alone. This common pattern is somewhat coarse-grained. I think we can wait for it to stabilize and make sure no further optimizations will evolve the pattern; then we can re-check which parts are connector-independent and invariant at a finer granularity, and extract a common pattern that may not be exactly the same as the current one.
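The coarse-grained pattern described here could be modeled roughly as follows (an illustrative Python sketch, not the connector SPI):

```python
def maybe_rewrite_scan(enumerate_distinct_tuples, threshold):
    """If the connector can enumerate all distinct tuples of the referenced
    columns from metadata, and there are few enough of them, return the
    rows for a replacement ValuesNode; otherwise return None (keep the
    original TableScanNode)."""
    tuples = enumerate_distinct_tuples()
    if tuples is None or len(tuples) > threshold:
        return None
    return sorted(tuples)
```

A connector-agnostic rule would own this shell; each connector would only supply the `enumerate_distinct_tuples` part, which is the piece that depends on storage internals.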
ZacBlanco left a comment
Two unrelated questions, otherwise LGTM
presto-iceberg/src/main/java/com/facebook/presto/iceberg/IcebergUtil.java
```java
if (type.equals(DATE) || type.equals(TIME) || type.equals(TIMESTAMP)) {
    if (type.equals(TIMESTAMP) || type.equals(TIME)) {
        return MICROSECONDS.toMillis(parseLong(valueString));
    }
```
I noticed this earlier when authoring the histogram PR. Iceberg specifies TIME and TIMESTAMP as microsecond precision. However, while we might be able to read/write partition metadata with the correct values here, I think there are inter-engine interop issues.
If you write a TIME[STAMP] field in Spark to an Iceberg table, I believe it should be microsecond precision. But if you write it with Presto, won't it write with millisecond precision? I haven't done much investigation into this, but it seems like we could be inadvertently writing incompatible data with Presto if we don't resolve the data to microseconds.
I don't think it's something we need to fix in this PR, but it may be worth updating the supported types to only include TIMESTAMP_MICROSECONDS and similar, instead of standard TIMESTAMP, for Iceberg tables.
Agreed, the current TimestampType/TimeType handling indeed brings a lot of inconvenience. We really need to start considering updating the supported types in Iceberg to be consistent with the spec.
Force-pushed 4f76ba5 to 2bd2d5d
Force-pushed 2bd2d5d to 1c764d9
```java
// @@ -24,6 +24,8 @@ public interface RowExpressionService
ExpressionOptimizer getExpressionOptimizer();
```
Did you consider just using the existing getExpressionOptimizer and using the optimize overload that accepts a variable resolver? We can use an identity resolver for this case.
Thanks for the reminder, I had overlooked the method in ExpressionOptimizer that accepts the variableResolver parameter. Will fix the code to use this method!
Force-pushed 1c764d9 to 92fc1bd
@tdcmeehan I have changed the logic in
Nice work! |
…ng (#23645)

## Description

Currently, when querying Iceberg tables, we always eagerly load all partition values of the table in `PickTableLayout` or `IcebergFilterPushdown` during the optimization phase. Since these eagerly loaded partition values are currently only used in metadata-based optimization rules, in many cases the information is not used at all. This can waste resources and hurt performance in the following cases:

- For queries that cannot be optimized based on metadata, we do not use these partition values at all, so we shouldn't load them eagerly.
- For tables with a huge number of partitions that are not suitable for metadata optimization, we need to limit the max number that can be loaded in the loading phase, rather than loading all of them first and then determining whether they exceed the threshold. (to be implemented in a subsequent PR)

For details on queries that can be optimized using partition information, see: #22080.

This PR makes the partition loading behavior lazy and only executes the loading when necessary. In this way, we avoid a lot of unnecessary loading in many scenarios, as well as the resulting resource consumption and performance loss.

The benchmark results also support this conclusion. We execute regular query statements, and query statements which are applicable for further reducible metadata optimization (which always load and use all partitions), on tables with two different partition counts: `300 * 2` and `1600 * 4`. The selected queries are as follows:

```
// Table `iceberg_partition` is partitioned by columns `a` and `b`.
create table iceberg_partition(a int, b int, c double) with (partitioning = ARRAY['a', 'b']);

// This query will always be optimized using the partitions information, regardless of the partitions count.
select min(a), max(a), min(b), max(b) from iceberg_partition;

// This query does not need partitions information at all.
select a, c from iceberg_partition where b >= xxx;
```

The benchmark result before this change:

```
----Before this change, always load partitions eagerly----
Benchmark                                        (recordCount)  Mode  Cnt    Score    Error  Units
BenchmarkIcebergLazyLoading.testFurtherOptimize        300 * 2  avgt   10   39.971 ±  1.133  ms/op
BenchmarkIcebergLazyLoading.testFurtherOptimize       1600 * 4  avgt   10   66.593 ±  2.253  ms/op
BenchmarkIcebergLazyLoading.testNormalQuery            300 * 2  avgt   10  166.274 ±  5.041  ms/op
BenchmarkIcebergLazyLoading.testNormalQuery           1600 * 4  avgt   10  659.756 ± 69.708  ms/op
```

The benchmark result after this change:

```
----After this change, lazy load partitions and check the max threshold in loading phase----
Benchmark                                        (recordCount)  Mode  Cnt    Score    Error  Units
BenchmarkIcebergLazyLoading.testFurtherOptimize        300 * 2  avgt   10   50.806 ±  2.471  ms/op
BenchmarkIcebergLazyLoading.testFurtherOptimize       1600 * 4  avgt   10   77.579 ±  1.025  ms/op
BenchmarkIcebergLazyLoading.testNormalQuery            300 * 2  avgt   10  155.059 ±  4.522  ms/op
BenchmarkIcebergLazyLoading.testNormalQuery           1600 * 4  avgt   10  596.955 ± 41.619  ms/op
```

As shown, this change significantly improves the performance of queries that are not suitable for metadata optimization. This aligns with expectations: it introduces a minimal fixed cost for workloads that load all partitions, but for most queries that don't need to load the table partitions, the performance gain scales significantly with the partition count.

## Motivation and Context

Make partition loading for Iceberg tables lazy to avoid unnecessary loading

## Impact

N/A

## Test Plan

- Make sure the change does not affect existing tests
- Newly added test case in `TestIcebergLogicalPlanner` to show the behaviors with different max partition thresholds
- Newly added benchmark tests in `BenchmarkIcebergLazyLoading` to show the improvement in performance

## Contributor checklist

- [x] Please make sure your submission complies with our [development](https://github.com/prestodb/presto/wiki/Presto-Development-Guidelines#development), [formatting](https://github.com/prestodb/presto/wiki/Presto-Development-Guidelines#formatting), [commit message](https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#commit-formatting-and-pull-requests), and [attribution guidelines](https://github.com/prestodb/presto/wiki/Review-and-Commit-guidelines#attribution).
- [x] PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
- [x] Documented new properties (with their default values), SQL syntax, functions, or other functionality.
- [x] If release notes are required, they follow the [release notes guidelines](https://github.com/prestodb/presto/wiki/Release-Notes-Guidelines).
- [x] Adequate tests were added if applicable.
- [x] CI passed.

## Release Notes

```
== RELEASE NOTES ==

Iceberg Connector Changes
* Improve partition loading for Iceberg tables by making it lazy, preventing unnecessary loading.
```
Description
We can take advantage of metadata such as partition values in Iceberg to do some metadata optimizations. For example, we can optimize min/max/distinct aggregations on an Iceberg table's partition columns to a `ValuesNode`, with or without an `AggregationNode` above it. Or we can optimize a query with filters that do not match any partition to an empty `ValuesNode`. This PR tries to take advantage of Iceberg partition values to do metadata query optimizations, fixes various problems encountered when dealing with partition values, and ensures proper behavior in cases of partition evolution.
This optimization mainly uses metadata to optimize cardinality-insensitive aggregations. That is, if we can enumerate through metadata the distinct tuples consisting of all the columns appearing in aggregation keys and group-by keys, and the total number of these tuples is not too large, we can apply this optimization, translating the `TableScanNode` to a `ValuesNode` to gain some benefits. For example, for an Iceberg table partitioned by `ARRAY['a', 'b']`, the following queries could take advantage of this optimization. Note: the supported aggregation functions include the following: min, max, distinct, approx, approx_distinct. Additionally, all columns appearing in the aggregation keys and group-by keys must be identity partition columns.
Test Plan
Contributor checklist
Release Notes