Materialize tables during planning #23426
Conversation
Force-pushed from 6a2b8d6 to 64ff37b
Can you share benchmarks?
Force-pushed from 64ff37b to 6322739
Force-pushed from 6322739 to 93cd43e
Force-pushed from 58b509a to c01ebd9
Force-pushed from 6c38325 to 6da54da
Will this support dynamic filters? IMO it should; then it could potentially replace #22527. cc @raunaqmorarka @Dith3r
What would it mean to support them? This seems like an alternative.
> What would it mean to support them?
There are queries (both user queries and benchmarks) where there are cascading DFs. So you have three table scans:

```
-> date_dim#1
-> date_dim#2
-> super_large_fact_table
```

where date_dim#2 depends on the DF originating from date_dim#1, and super_large_fact_table depends on the DF originating from date_dim#2.
The super_large_fact_table scan size depends on date_dim#2 getting filtered by the DF from date_dim#1, which is what #22527 addresses by explicitly waiting for DFs.
To support DFs here you would have to collect them during planning when you materialize table scans (just as happens during actual execution) and apply them on dependent table scans.
Alternatively, we could still use #22527, but it would have to be adjusted so that it waits for DFs on top of a ValuesNode.
If this PR clashes with #22527, it will introduce pretty substantial regressions.
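The cascade described above can be sketched in Python (the table contents and join keys below are purely hypothetical; at runtime Trino propagates these narrowing sets as dynamic filters rather than materializing them like this):

```python
# Illustrative sketch of cascading dynamic filters (hypothetical data):
# each scan's output narrows the next scan through its join keys.
date_dim_1 = [{"d": 1}, {"d": 2}]
date_dim_2 = [{"d": 1, "k": "a"}, {"d": 3, "k": "b"}, {"d": 2, "k": "c"}]
fact = [{"k": "a", "v": 10}, {"k": "b", "v": 20}, {"k": "c", "v": 30}]

df1 = {row["d"] for row in date_dim_1}                   # DF from date_dim#1
dd2 = [row for row in date_dim_2 if row["d"] in df1]     # date_dim#2 narrowed by df1
df2 = {row["k"] for row in dd2}                          # DF from date_dim#2
fact_scanned = [row for row in fact if row["k"] in df2]  # fact scan narrowed by df2
```

The point of the example: fact_scanned only shrinks if date_dim#2 was itself narrowed by df1 first, which is why the waiting order matters.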
Thanks for the detailed explanation. I'm still confused as to how dynamic filters are relevant here as this happens during planning, so it's not dynamic at all. After the table is materialized, we should apply further pushdown based on the actual values.
My expectation is that there is lots of follow-up work to be done in the planner to take better advantage of values, and to fix bugs such as introducing unnecessary cross joins. This is follow-up work for someone such as yourself who understands the optimizer.
Also, how can conflicts with the other PR be a regression if it's not merged yet?
> Thanks for the detailed explanation. I'm still confused as to how dynamic filters are relevant here as this happens during planning
DFs in this case are used for narrowing materialized table content, which is important for downstream joins. This can be done "statically" during planning as you materialize tables.
> After the table is materialized, we should apply further pushdown based on the actual values.
Yes, that could work if the table is not too large.
I see that DynamicFilter is simply TupleDomain which is actually a subset of the pushdown that happens in applyFilter(), so we shouldn't need anything additional here. We seem to collect that properly for values today.
> I see that DynamicFilter is simply TupleDomain which is actually a subset of the pushdown that happens
Keep in mind that TupleDomain was optimized recently to be more memory efficient and to handle larger sets. When we start doing predicate pushdown using a large number of Expressions, I think this will kill the planner, so the materialization row limit should be quite low IMO.
JDBC queries can be huge while yielding only a few results. A DF can change JDBC query duration from hours to seconds.
Do we have a way to estimate that? We would want to skip those.
I don't think we have a way to estimate the cost of the work done by the pushed down operations into table scan.
Every pushdown into a JDBC table handle will result in io.trino.sql.planner.plan.TableScanNode#statistics being revised with the estimate of the resulting output, and we lose the information about the estimate of the original table at that point. One possibility is to remember the original table scan output estimate in a separate field and rely on that.
We had a somewhat similar need for avoiding a particular optimization on JDBC connectors in #22355 and ended up adding a new API io.trino.spi.connector.ConnectorMetadata#allowSplittingReadIntoMultipleSubQueries for that. Maybe we can explore generalizing/re-using that for this case too.
cc: @martint
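The "separate field" idea could be sketched roughly like this (class and field names are hypothetical, not Trino's actual stats types):

```python
# Hypothetical sketch: keep the pre-pushdown row-count estimate alongside
# the revised one, so later rules can still see the original table size.
class ScanStatsEstimate:
    def __init__(self, original_row_count):
        self.original_row_count = original_row_count  # before any pushdown
        self.row_count = original_row_count           # revised after pushdowns

    def with_pushdown(self, selectivity):
        # Revise the current estimate but never touch the original one.
        revised = ScanStatsEstimate(self.original_row_count)
        revised.row_count = self.row_count * selectivity
        return revised
```

With that, a rule could skip materialization whenever original_row_count is huge, even if pushdowns have made the revised estimate look small.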
Force-pushed from 6da54da to 27f3691
```java
private boolean materializeTable = true;
private int materializeTableMaxEstimatedRowCount = 50_000;
private int materializeTableMaxActualRowCount = 100_000;
private Duration materializeTableTimeout = new Duration(5, SECONDS);
```
5 seconds is quite long as a default; 1-2 seconds seems more reasonable.
```java
}
List<Split> batch = getFutureValue(splitSource.getNextBatch(1000)).getSplits();
for (Split split : batch) {
    if (!split.isRemotelyAccessible() && !split.getAddresses().contains(currentNode)) {
```
For distributed caching implementations, split.getAddresses() is never going to contain the current node when the coordinator is not included in scheduling (which is the normal case in production), so I suggest dropping that check.
We want to materialize as many tables as possible. Checking for the current node allows us to materialize certain system tables, which allows queries such as the following to work:

```sql
SELECT * FROM t WHERE ds = (SELECT max(ds) FROM "t$partitions")
```

```java
        implements Rule<T>
        permits MaterializeFilteredTableScan, MaterializeTableScan
{
    private static final int MAX_SPLITS = 10_000;
```
This seems rather big for 100K rows
```java
        }
    }
    splits.addAll(batch);
    if (splits.size() > MAX_SPLITS) {
```
Do we need to pull out a large number of splits upfront?
It would be nicer to iterate over smaller batches of splits as they become available. This way we can cut off at the maxRows threshold or timeout without having to generate a lot of splits, which also potentially take up a lot of memory.
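The suggested shape could look roughly like this (function and parameter names are hypothetical, not the PR's actual API):

```python
import time

def drain_splits_bounded(batches, max_rows, timeout_seconds):
    """Consume split batches incrementally, bailing out as soon as the
    row budget or time budget is exceeded (returns None on bail-out)."""
    deadline = time.monotonic() + timeout_seconds
    rows = []
    for batch in batches:  # each batch: rows read from a small set of splits
        rows.extend(batch)
        if len(rows) > max_rows or time.monotonic() > deadline:
            return None  # too large or too slow to materialize
    return rows
```

Because the check runs after every small batch, a too-large table is abandoned early instead of after enumerating all 10K splits.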
```java
TableScanNode tableScan = captures.get(TABLE_SCAN);

Constraint constraint = Optional.of(filter.getPredicate())
        .map(predicate -> filterConjuncts(predicate, expression -> !DynamicFilters.isDynamicFilter(expression)))
```

```java
    return Optional.empty();
}

private Optional<List<Expression>> doMaterializeTable(Session session, TableHandle table, List<ColumnHandle> columns, List<Type> types, Constraint constraint)
```
How do we ensure that we don't end up re-running this for the same table many times during the planning process?
I think any change to the Filter or TableScan nodes in the pushIntoTableScanRulesExceptJoins iterative optimizer loop would keep re-triggering this rule.
Wouldn't we want to run this again after a filter is pushed? A table might be too large initially, but becomes eligible when more information is available.
> A table might be too large initially, but becomes eligible when more information is available.
But that would mean multiple few-second delays to bail out, right? Splits would be enumerated multiple times, and JDBC queries would also be triggered on the remote system multiple times.
I think it should have just one shot after all eligible pushdowns, BUT then we also want the materialized predicate to be used for further pushdowns. It's a chicken-and-egg problem.
My guess is it should just try once, and predicates derived from materialized tables should be applied on ValuesNodes. It would solve 90% of small-table cases this way.
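The "try once" idea could be sketched like this (names are hypothetical; a real implementation would key on the table handle inside the optimizer context):

```python
# Hypothetical one-shot guard: attempt materialization at most once per
# table, so split enumeration / remote JDBC queries are not re-triggered
# on every iteration of the optimizer loop.
class OneShotMaterializer:
    def __init__(self, materialize):
        self.materialize = materialize  # expensive: enumerates splits, reads rows
        self.attempted = set()

    def try_materialize(self, table_handle):
        if table_handle in self.attempted:
            return None  # already tried; never re-run for this table
        self.attempted.add(table_handle)
        return self.materialize(table_handle)
```

A second invocation for the same table handle is a cheap no-op, so the rule can stay in the iterative loop without repeating the expensive work.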
```java
@Override
public Pattern<TableScanNode> getPattern()
{
    return tableScan()
```
We usually store the Pattern in a private static final member variable.
```java
public final Result apply(T node, Captures captures, Context context)
{
    int maxRows = getMaterializeTableMaxEstimatedRowCount(context.getSession());
    PlanNodeStatsEstimate stats = context.getStatsProvider().getStats(node);
```
I think this is potentially problematic for filtered tables.
The scan materialization is only going to benefit from the part of the filter that is enforced by the connector page or split source, but the estimate is going to include the filtering from the entire predicate.
Maybe this should just be the estimate of the output of the TableScan, as that should include any filtering that we gain from the enforcedConstraint.
That's a good point. How do I fetch the stats for that?
I don't think we have a way to estimate the cost of the work done by the pushed down operations into table scan.
Every pushdown into a JDBC table handle will result in io.trino.sql.planner.plan.TableScanNode#statistics being revised with the estimate of the resulting output, and we lose the information about the estimate of the original table at that point. One possibility is to remember the original table scan output estimate in a separate field and rely on that.
We had a somewhat similar need for avoiding a particular optimization on JDBC connectors in #22355 and ended up adding a new API io.trino.spi.connector.ConnectorMetadata#allowSplittingReadIntoMultipleSubQueries for that. Maybe we can explore generalizing/re-using that for this case too.
cc: @martint
```java
.add(new MaterializeFilteredTableScan(plannerContext, splitManager, pageSourceManager, nodeManager, executor))
.add(new MaterializeTableScan(plannerContext, splitManager, pageSourceManager, nodeManager, executor))
```
This might be too early in the planning process for this optimization. Ideally, we want to delay this until after as much predicate and projection pushdown as possible has happened.
I think we should at least do it after or near where io.trino.sql.planner.optimizations.MetadataQueryOptimizer runs, as that is a similar optimization to this one and we probably want that one to run first.
I have NO IDEA where to put this optimization. I just looked around and guessed. If you can tell me exactly where to put it, that would be great. Or if you want to take over this PR, I'm happy with that too.
Thanks @sopel39 and @raunaqmorarka for the valuable feedback. Based on your feedback and discussions with @dain, it seems like trying to do this generically in the engine for all connectors is problematic. I'll close this and go back to the original idea of letting connectors make this decision.