Skip to content

Pushdown dereference expression in query plan#1435

Closed
qqibrow wants to merge 2 commits intotrinodb:masterfrom
qqibrow:pushdown_rebase
Closed

Pushdown dereference expression in query plan#1435
qqibrow wants to merge 2 commits intotrinodb:masterfrom
qqibrow:pushdown_rebase

Conversation

@qqibrow
Copy link
Copy Markdown
Contributor

@qqibrow qqibrow commented Sep 3, 2019

it aims to do:

  1. Pushdown dereferences down above tableScan, which can be used by pushdownProjectionToTableScan
  2. Reduce shuffling cost:
    1. Dereference before exchange can avoid shuffle whoe struct in exchange.
    2. Dereference exact once can avoid shuffling the same subField twice in exchange. For example,
      query:
    explain WITH t(msg) AS (SELECT * FROM (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y 
    DOUBLE))))) 
    SELECT b.msg.x FROM t a, t b WHERE a.msg.x = b.msg.x
    
    One side of join outputs [field_6:row(x bigint, y double), expr_19:bigint] but in fact field_6 contains expr_19.

@cla-bot cla-bot bot added the cla-signed label Sep 3, 2019
@qqibrow
Copy link
Copy Markdown
Contributor Author

qqibrow commented Sep 3, 2019

PRODUCT_TESTS_SUITE=suite-4 failed but I didn't find which test case it failed. https://travis-ci.com/prestosql/presto/jobs/230815571

@findepi
Copy link
Copy Markdown
Member

findepi commented Sep 3, 2019

PRODUCT_TESTS_SUITE=suite-4 failed

@qqibrow it's unfortunately flaky. Retriggered.

@qqibrow qqibrow requested a review from findepi September 4, 2019 15:30
@findepi findepi requested review from martint and removed request for findepi September 5, 2019 07:12
@martint martint mentioned this pull request Sep 9, 2019
Copy link
Copy Markdown
Member

@martint martint left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial comments. I'm looking at the PushDownDereferences class now.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is this placed here? Ideally, it should be closer to (or part of) the IterativeOptimizer that calls all the other push/prune/merge/etc rules (around line 300 in this file).

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should be at least below TransformUncorrelatedInPredicateSubqueryToSemiJoin, where SimiJoinNode is first generated.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need to make sure it won't "undo" decisions made by InlineProjections. Otherwise, it's possible to end up with an infinite optimization loop. It may be necessary to adjust the heuristics in InlineProjections.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Place closing ) in previous line.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Capitalize SQL keywords (FROM, CROSS JOIN, etc)

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The intermediate SELECT * is unnecessary. Also, please format the query like this for better readability:

WITH t(msg) AS (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE))))
SELECT b.msg.x 
FROM t a, t b
WHERE a.msg.y = b.msg.y

Similar comment applies to all the queries below.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Implement this test or remove it.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's good to have these tests, but it would be ideal to also add rule-specific tests. See io.prestosql.sql.planner.iterative.rule.TestXXX for some examples.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Relying inheritance to reuse behavior makes the code harder to follow and reason about (is it just about code reuse? Is there shared state that's important? etc). It'd be better to model it via some form of delegation. E.g., The rule can take another object that performs the job of pushDownDereferences.

@phd3
Copy link
Copy Markdown
Member

phd3 commented Oct 8, 2019

I can also help with the review of this one, once it's updated with @martint 's comments.

@qqibrow
Copy link
Copy Markdown
Contributor Author

qqibrow commented Oct 13, 2019

@phd3 sure. thanks. Working on this.

@qqibrow qqibrow force-pushed the pushdown_rebase branch 4 times, most recently from a4f0734 to f921a6e Compare November 4, 2019 05:20
Co-authored-by: Zhenxiao Luo <luoz@uber.com>
@qqibrow
Copy link
Copy Markdown
Contributor Author

qqibrow commented Nov 4, 2019

@martint @findepi @phd3 Addressed all comments. Please take a look. thanks

Copy link
Copy Markdown
Member

@phd3 phd3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@qqibrow Thanks for your work on this. I've added some comments.

// When nested child and parent dereferences both exist, Pushdown rule will be trigger one more time
// and lead to runtime error. E.g. [msg.foo, msg.foo.bar] => [exp, exp.bar] (should stop here but
// since there are still dereferences, pushdown rule will trigger again)
if (dereferences.stream().anyMatch(exp -> baseExists(exp, dereferences))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if input variable expressions is [x.y.z + f(x.y) + a.b] then this code will not pushdown a.b, right?

I think there is an alternative approach to doing this.

Given the expressions, we only create required symbols. i.e. for [x.y.z + f(x.y) + a.b], we can create symbols for x.y and a.b.

We can return ImmutableMap.of() in cases where all bases of all dereference expressions are symbolreferences. In the example you provided, the optimizer will stop the second time around there since the base exp is a symbolreference.

Copy link
Copy Markdown
Contributor Author

@qqibrow qqibrow Nov 21, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes. it will not pushdown a.b. I don't know how to stop iteration in this case so this is just a walkaround, not final solution.

We can return ImmutableMap.of() in cases where all bases of all dereference expressions are symbolreferences

How does this work? It will stop the first time for [f(x.y) + a.b] because x and a are both symbol references.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, let's say the variable expressions is a single element list [x.y.z + f(x.y) + a.b] . dereferences would come out to be {x.y.z, x.y, a.b}. Now here's where we can change the implementation. We can return ImmutableMap.of() if ALL elements in this set of expressions dereferences are 1-level dereferences. i.e. the bases are symbolreferences. It doesn't happen the first time, since x.y (the base of x.y.z) is not a symbol reference. In this new implementation, we also need to make sure that we only create symbol references for "superset" references. i.e. in this example, we only create symbols for x.y and a.b.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think optimizing such pushdowns here will impact the effectiveness of all individual pushdown rules, since this method is being used everywhere.

private static boolean validPushDown(DereferenceExpression dereference)
{
Expression base = dereference.getBase();
return (base instanceof SymbolReference) || (base instanceof DereferenceExpression);
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you elaborate on why in other cases the pushdown should not be considered valid?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)).x

}
}

static class PushDownDereferenceThroughUnnest
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we also implement ExtractFromUnnest rule since this can have an optional predicate based on recent changes?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on when we run PushdownDereferences I guess. If it runs before filter is empty then that's fine.

@phd3
Copy link
Copy Markdown
Member

phd3 commented Nov 11, 2019

I've a concern about the placement of PushDownDereferences rule in the list of optimizers. If we want to be able to pushdown predicates+projections as in #1953, we need to make sure that all dereference projections are right above table scan. If there is a filter, that should be above the dereference projections.

For a query like the following, current PR produces a plan that has FilterNode in between ProjectNode and ValuesNode at the end of all optimizations.

WITH t(msg) AS (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE)))) SELECT msg.x FROM t where msg.y = 5

This happens because PredicatePushDown rule undoes the dereference projection pushdown done by this rule. Is this expected or we should be protecting the dereference projections against a filter pushdown?

If we go with the current placement of PushdownDereferences, we need to invoke the connector pushdown rules right after this rule is invoked, to capture the connector pushdown before the plan is mutated by PredicatePushdown (or any other optimizers later). What do you think @martint @qqibrow ?

@martint
Copy link
Copy Markdown
Member

martint commented Nov 13, 2019

Is this expected or we should be protecting the dereference projections against a filter pushdown?

Yes. We need to be careful about making sure rules don't undo each other's work, as this will lead to an infinite loop once the rules are placed in the same IterativeOptimizer. In particular, we'll need to adjust PredicatePushdown to not push filters below a dereference-expression-only projection. Ideally, we should migrate Predicate pushdown to a Rule-based implementation instead of a Visitor, but that's an orthogonal issue. We should also make sure InlineProjections doesn't undo a pushdown of dereference expressions into a separate projection.

@martint martint self-requested a review November 14, 2019 10:50
@qqibrow
Copy link
Copy Markdown
Contributor Author

qqibrow commented Nov 21, 2019

This happens because PredicatePushDown rule undoes the dereference projection pushdown done by this rule.

Your optimizer PushdownDereferenceToTableScan needs to run right after PushdownDereferences. You can refer the ordering to my previous diff

InlineProjections doesn't undo a pushdown of dereference expressions into a separate projection.

That's done in current work. Please review.

I am concerned that current PushdownDereferences doesn't cover all PlanNode. So there might be cases that this doesn't work.

Copy link
Copy Markdown
Member

@phd3 phd3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it will help to add tests that cover more complex expressions and multiple level of dereferences.

// When nested child and parent dereferences both exist, Pushdown rule will be trigger one more time
// and lead to runtime error. E.g. [msg.foo, msg.foo.bar] => [exp, exp.bar] (should stop here but
// since there are still dereferences, pushdown rule will trigger again)
if (dereferences.stream().anyMatch(exp -> baseExists(exp, dereferences))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, let's say the variable expressions is a single element list [x.y.z + f(x.y) + a.b] . dereferences would come out to be {x.y.z, x.y, a.b}. Now here's where we can change the implementation. We can return ImmutableMap.of() if ALL elements in this set of expressions dereferences are 1-level dereferences. i.e. the bases are symbolreferences. It doesn't happen the first time, since x.y (the base of x.y.z) is not a symbol reference. In this new implementation, we also need to make sure that we only create symbol references for "superset" references. i.e. in this example, we only create symbols for x.y and a.b.

leftNode,
rightNode,
joinNode.getCriteria(),
ImmutableList.<Symbol>builder().addAll(leftNode.getOutputSymbols()).addAll(rightNode.getOutputSymbols()).build(),
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This will add symbols being referenced in filters and equijoin clauses too as output symbols. (reverse of pruning). I think we should just add joinNode.getOutputSymbols() and pushdownDereferences.values() here.

// When nested child and parent dereferences both exist, Pushdown rule will be trigger one more time
// and lead to runtime error. E.g. [msg.foo, msg.foo.bar] => [exp, exp.bar] (should stop here but
// since there are still dereferences, pushdown rule will trigger again)
if (dereferences.stream().anyMatch(exp -> baseExists(exp, dereferences))) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think optimizing such pushdowns here will impact the effectiveness of all individual pushdown rules, since this method is being used everywhere.

@phd3
Copy link
Copy Markdown
Member

phd3 commented Nov 27, 2019

There is an interesting case of dereference expressions buried inside lambda expressions, which I believe we're missing here. For example, the following test case:

WITH t(msg) AS (VALUES ROW(CAST(ROW(1, 2.0) AS ROW(x BIGINT, y DOUBLE))))" +
                        "SELECT try(1.0/msg.x) FROM t where msg.y = 5

The try(...) function is converted to a lambda expression. so msg.x is not pushed down. I believe this is thew case with explicit lambdas and any functions that convert to lambdas. The reasons seems to be DefaultExpressionTraversalVisitor not taking into account lambda expressions, from which the visitor in extractDereferenceExpressions is extended.

@martint martint removed their assignment Jan 22, 2020
@phd3
Copy link
Copy Markdown
Member

phd3 commented Jan 29, 2020

superceded by #2672

@martint martint closed this Feb 11, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

4 participants