Evaluate project node on values node#23245
Conversation
|
Please add some actual query tests with a bit more complex expressions like map/array constructors etc. which is quite common. Also add multiple columns tests. |
9471378 to
9776440
Compare
|
Fixed @kaikalur @ZacBlanco |
ZacBlanco
left a comment
There was a problem hiding this comment.
Great job! Minor nits on my first pass
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...rc/test/java/com/facebook/presto/sql/planner/iterative/rule/TestEvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
| * <p/> | ||
| * Plan before optimizer: | ||
| * <pre> | ||
| * ProjectNode (outputVariables) |
There was a problem hiding this comment.
How about multiple projects - it could happen. Will this rule be applied again?
There was a problem hiding this comment.
Yes. It is an iterative optimizer. It could optimize iteratively.
There was a problem hiding this comment.
Add a test case for two projections for completeness?
| RowExpressionInterpreter elementInterpreter = new RowExpressionInterpreter(element, | ||
| functionAndTypeManager, | ||
| context.getSession().toConnectorSession(), | ||
| EVALUATED); |
There was a problem hiding this comment.
Does this handle non-deterministic functions? I think we should not apply this rule for non-deterministic like
select random(x) from (values 1,2) AS T(x)
There was a problem hiding this comment.
What's the reasoning not evaluate a non-deterministic function? Is there a big difference evaluating it on a worker later on vs the coordinator?
There was a problem hiding this comment.
In general there could be surprises so we generally short-circuit lot of optimizations when we see non-deterministic expressions. Also I'm not a fan of RowExpressionInterpreter :( it's implementing the "engine" for constants and I'm sure it works differently from the java and differently differently from native lol
There was a problem hiding this comment.
non-deterministic function excluded.
There was a problem hiding this comment.
please add a test like: select random(x) from (values 1,2) AS T(x)
9776440 to
9147462
Compare
| * <p/> | ||
| * Plan before optimizer: | ||
| * <pre> | ||
| * ProjectNode (outputVariables) |
There was a problem hiding this comment.
Add a test case for two projections for completeness?
presto-main/src/main/java/com/facebook/presto/sql/analyzer/FeaturesConfig.java
Outdated
Show resolved
Hide resolved
f87c93c to
38803a5
Compare
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
presto-tests/src/main/java/com/facebook/presto/tests/AbstractTestQueries.java
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
...in/src/main/java/com/facebook/presto/sql/planner/iterative/rule/EvaluateProjectOnValues.java
Outdated
Show resolved
Hide resolved
38803a5 to
6738069
Compare
| List<List<RowExpression>> rows = source.getRows(); | ||
| List<VariableReferenceExpression> valuesOutputVariables = source.getOutputVariables(); | ||
| Set<Map.Entry<VariableReferenceExpression, RowExpression>> projectAssignmentEntries = projectNode.getAssignments().entrySet(); | ||
| List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList()); |
There was a problem hiding this comment.
| List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList()); | |
| List<VariableReferenceExpression> projectOutputVariables = projectNode.getOutputVariables(); |
| List<VariableReferenceExpression> valuesOutputVariables = source.getOutputVariables(); | ||
| Set<Map.Entry<VariableReferenceExpression, RowExpression>> projectAssignmentEntries = projectNode.getAssignments().entrySet(); | ||
| List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList()); | ||
| List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList()); |
There was a problem hiding this comment.
| List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList()); | |
| List<RowExpression> projectRowExpressions = projectNode.getAssignments().getExpressions().stream.collect(toImmutableList()); |
There was a problem hiding this comment.
Althought I haven`t found an example,is it possible that projectOutputVariables and projectRowExpressions does not match because projectNode.getAssignments().getExpressions() returns a Collection of RowExpression rather than a List of RowExpression? @feilong-liu
There was a problem hiding this comment.
I think it should be fine. The map in assignment is backed by unmodifiable LinkedHashMap which is order preserving.
| if (!projectRowExpressions.stream() | ||
| .filter(expression -> expression instanceof CallExpression) | ||
| .map(callExpression -> functionAndTypeManager.getFunctionMetadata(((CallExpression) callExpression).getFunctionHandle())) | ||
| .allMatch(FunctionMetadata::isDeterministic)) { |
There was a problem hiding this comment.
This only checks if the function is deterministic or not, but not checking the arguments of the function. Use RowExpressionDeterminismEvaluator here.
| List<List<Object>> rowValues = rows.stream().map(row -> { | ||
| //Prepare to set up variable resolver | ||
| verify(row.size() == valuesOutputVariables.size(), "Output variable does not match its value in ValuesNode"); | ||
| Map<String, Object> valuesMapForResolver = Streams.zip(valuesOutputVariables.stream(), | ||
| row.stream(), | ||
| (valuesOutputVariable, element) -> { | ||
| RowExpressionInterpreter elementInterpreter = new RowExpressionInterpreter(element, | ||
| functionAndTypeManager, | ||
| context.getSession().toConnectorSession(), | ||
| EVALUATED); | ||
| return new AbstractMap.SimpleImmutableEntry<String, Object>(valuesOutputVariable.getName(), elementInterpreter.evaluate()); | ||
| }).collect(toImmutableMap(Map.Entry::getKey, Map.Entry::getValue)); | ||
| VariableResolver variableResolver = new Interpreters.LambdaVariableResolver(valuesMapForResolver); | ||
| //evaluate each row of the ProjectNode | ||
| return projectRowExpressions.stream().map(rowExpression -> new RowExpressionInterpreter( | ||
| rowExpression, | ||
| functionAndTypeManager, | ||
| context.getSession().toConnectorSession(), | ||
| OPTIMIZED).optimize(variableResolver)) | ||
| .collect(toImmutableList()); | ||
| }).collect(toImmutableList()); |
There was a problem hiding this comment.
This part seems a bit complex and I have a hard time to understand it. Can you check the ExpressionRewriter in CommonSubExpressionRewriter class, and see if you can rewrite the expression in project assignments with values in Values node, and it should be later evaluated by RowExpressionOptimizer.
There was a problem hiding this comment.
I used ExpressionRewriter to rewrite the expression in ProjectNode. However, evaluating ProjectNode requires evaluating ValuesNode first to provide the resolver to the optimize function of RowExpressionOptimizer. The overall implementation does not simplify too much.
There was a problem hiding this comment.
If I'm understanding correctly, I think what we actually want here is the RowExpressionVariableInliner? The ExpressionRewriter replaces a RowExpression with a VariableReferenceExpression. I think we want the other way around when constructing expressions to evaluate. If we can re-write the projection input expression to replace the reference coming from the ValuesNode, then we only need to run the Interpreter once. Is that correct @feilong-liu ?
There was a problem hiding this comment.
@ZacBlanco You are right, we need to use RowExpressionVariableInliner here.
@jackychen718 Can you simplify the logic here with the RowExpressionVariableInliner?
Also is it possible to get rid of the RowExpressionInterpreter in this optimizer? I think the RowExpressionOptimizer which ran later should be able to evaluate it (worth to try e2e to verify)
| List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList()); | ||
| List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList()); | ||
| // exclude non-deterministic function | ||
| if (!projectRowExpressions.stream() |
There was a problem hiding this comment.
Should also check deterministic for the expressions in Values node
|
I'm actually wondering if we should just fireup a localqueryrunner and get the results (after checking there are no non-deterministic expressions in the whole subtree) and create values node from it |
|
It is a good idea but I have difficulty to use localqueryrunner to evaluate a PlanNode. Mostly,it is use to execute a SQL string. Could you please give me some guidance? @kaikalur |
|
Looking at: And also executeInternal that just gets a plan executes it, looks like you can try and just refactor that out intp executePlan method. In fact, this could come in quite handy for long running queries say for example if we have reliable stats that a semijoin rhs is cheap and small, we can turn it into an IN clause etc. |
6738069 to
921b893
Compare
|
I can see how using a LocalQueryRunner could be beneficial because if you have an entirely deterministic subplan, you can optimize it entirely in the coordinator without having to juggle with the expressions or other plan nodes. However, I think we should keep the scope of this PR small. Using the LocalQueryRunner will be quite a large change since it also has to be moved out of the test-scope within The Another issue that could occur is the coordinator's expression evaluation result diverging on native and java clusters. I know there have been some discrepancies between results of native and Java function implementations that we've been trying to resolve, but I believe they are minimal, if any still exist. If the function is one of the known ones with diverging behavior, then there could be issues when this rule is applied. However, you would still run into this problem if you were to use the LocalQueryRunner since it is Java-based. The way I see it, the headache of the |
|
We are working on resolving these inconsistencies. Keeping a forked eval will lock in the inconsistencies forever. |
Why? Eval is eval is eval so if/when we switch it should be fixed. |
|
@kaikalur so this point, eval is a leaky abstraction, and impacts the runtime and the optimizer. As Presto moves from Java to C++, we are abstracting the leaky parts into plugins, and these plugins are implemented differently between the Java eval and the C++ eval. This would be considered to be a new leak in our eval abstraction, and we could hide this behind a plugin if we chose. The underlying implementation could use something like a local query runner for the Java eval, and perhaps the sidecar for C++. The important thing is that we simply don't hardcode the LocalQueryRunner, but so long as we don't do that and we are principled around the design of the plugin, then we could do it two different ways for both eval engines. |
Mechanism aside my bigger issue is we should have a single way to eavaluate something - I dont like the word plugin. Its integral to Presto it should be core. |
|
Presto doesn't work without plugins. Some plugins are enabled by default, and some can be overridden. Core plugins are enabled by default--for example, the system connector, the built in function namespace manager. It's no different. Plugins abstracting the eval are enabled by default, but can be overridden for early adopters of C++. Over time, C++ will become the default. |
I don' consider builtin function namespace manager as plugin - it's core. Eval should not be a plugin. Its like the kernel it does basic things and implements SQL semantics which plugins are free to break lol |
|
I don't know why you don't consider the built in function namespace manager a plugin, it implements |
| ExpressionOptimizer expressionOptimizer = new RowExpressionOptimizer(metadata); | ||
|
|
||
| //rewrite ProjectNode assignment expressions | ||
| List<List<RowExpression>> rowExpressionsList = rows.stream().map(row -> { |
There was a problem hiding this comment.
I see we're operating over lists of lists for the following three blocks. Can the outer list be converted to a for loop, and within this loop we operate over a single list? I.e. can we unnest this? It would simplify this code.
The way I see "plugins" - they are there to suppor different behavior from core and something like builtin functions are core. |
In fact, we have other hacks like "metadata query optimization" which eval even more differently. For the longterm project health, there should be one and only one way to eval the core sql constructs. Yes connector plugins do their own thing - that also bothers me. We need to find a way to make all these use the same eval engine consistently. |
@kaikalur we have been discussing exactly this at the native worker working group meetings, perhaps let's brainstorm there and see if our proposal matches the outcomes you are expecting. https://calendar.google.com/calendar/u/0/embed?src=linuxfoundation.org_vrjlva5b0u73ps75fvnv5sasi4@group.calendar.google.com&ctz=America/Los_Angeles |
73dbbc3 to
1b8c927
Compare
| for (List<RowExpression> rowExpressions : rows) { | ||
| verify(rowExpressions.size() == valuesOutputVariables.size(), "Output variable does not match its value in ValuesNode"); | ||
| Map<VariableReferenceExpression, RowExpression> mapping = Streams.zip( | ||
| valuesOutputVariables.stream(), | ||
| rowExpressions.stream(), | ||
| SimpleImmutableEntry::new) | ||
| .collect(toImmutableMap(Map.Entry::getKey, Map.Entry::getValue)); | ||
| List<RowExpression> rowExpressionsInProject = projectRowExpressions.stream() | ||
| .map(expression -> inlineVariables(mapping, expression)) | ||
| .collect(toImmutableList()); | ||
| rowExpressionsList.add(rowExpressionsInProject); | ||
| } | ||
|
|
||
| //Evaluate ProjectNode assignment expressions | ||
| List<List<Object>> rowValuesList = new ArrayList<>(); | ||
| for (List<RowExpression> rowExpressionsInProject : rowExpressionsList) { | ||
| List<Object> rowValues = rowExpressionsInProject.stream() | ||
| .map(rowExpression -> expressionOptimizer.optimize( | ||
| rowExpression, | ||
| OPTIMIZED, | ||
| context.getSession().toConnectorSession(), | ||
| variable -> variable)) | ||
| .collect(toImmutableList()); | ||
| rowValuesList.add(rowValues); | ||
| } | ||
|
|
||
| //Form the ValuesNode transformed from ProjectNode | ||
| List<List<RowExpression>> rowExpressionsListInValuesNode = new ArrayList<>(); | ||
| for (List<Object> rowValues : rowValuesList) { | ||
| List<RowExpression> rowExpressionsInValuesNode = Streams.zip( | ||
| rowValues.stream(), | ||
| projectOutputVariables.stream(), | ||
| (elementValue, projectOutputVariable) -> | ||
| (RowExpression) new ConstantExpression(elementValue, projectOutputVariable.getType()) | ||
| ).collect(toImmutableList()); | ||
| rowExpressionsListInValuesNode.add(rowExpressionsInValuesNode); | ||
| } |
There was a problem hiding this comment.
Can't these three individual for loops be collapsed into a single for loop?
There was a problem hiding this comment.
Sorry for the misunderstanding.
1b8c927 to
734abf5
Compare
| Optional<PlanNode> optionalSource = context.getLookup().resolveGroup(projectNode.getSource()).findFirst(); | ||
| if (!optionalSource.isPresent()) { | ||
| return Result.empty(); | ||
| } | ||
| ValuesNode source = (ValuesNode) optionalSource.get(); |
There was a problem hiding this comment.
Check the example here
It will help to simplify the code here
|
Am I correct in reading the current version of the code just inlines the projections, and delegates the evaluation to something else like |
734abf5 to
c163b25
Compare
|
Changed the rule to be |
|
@jackychen718 feel free to merge this PR. |
|
@elharo @ClarenceThreepwood @jaystarshot @presto-oss Please review the pr to merge. |
Description
Fix #23196
Motivation and Context
When someone projects an expression from values node, we should be able to inline and simplify those into the value node itself. #23196
Impact
Add another optimization rule: InlineProjectionsOnValues
Test Plan
unit test to verify it works properly.
Contributor checklist
Release Notes