Evaluate project node on values node #23245

Merged
ZacBlanco merged 1 commit into prestodb:master from jackychen718:EvaluateProjectOnValues
Aug 29, 2024

@jackychen718
Contributor

jackychen718 commented Jul 18, 2024

Description

Fix #23196

Motivation and Context

When someone projects an expression from a values node, we should be able to inline and simplify the projected expressions into the values node itself. #23196

Impact

Add another optimization rule: InlineProjectionsOnValues

Test Plan

Unit tests verify that the rule works properly.

Contributor checklist

  • Please make sure your submission complies with our development, formatting, commit message, and attribution guidelines.
  • PR description addresses the issue accurately and concisely. If the change is non-trivial, a GitHub Issue is referenced.
  • Documented new properties (with its default value), SQL syntax, functions, or other functionality.
  • If release notes are required, they follow the release notes guidelines.
  • Adequate tests were added if applicable.
  • CI passed.

Release Notes

== RELEASE NOTES ==

General Changes
* Add session property ``inline_projections_on_values`` and configuration property ``optimizer.inline-projections-on-values`` to evaluate project node on values node :pr:`23245`.

@kaikalur
Contributor

Please add some actual query tests with somewhat more complex expressions, such as map/array constructors, which are quite common. Also add tests with multiple columns.

jackychen718 force-pushed the EvaluateProjectOnValues branch from 9471378 to 9776440 on July 18, 2024 06:09
@jackychen718
Contributor Author

Fixed @kaikalur @ZacBlanco

Contributor

ZacBlanco left a comment

Great job! Minor nits on my first pass

* <p/>
* Plan before optimizer:
* <pre>
* ProjectNode (outputVariables)
Contributor

How about multiple projects - it could happen. Will this rule be applied again?

Contributor Author

Yes. This is an iterative optimizer rule, so it can be applied repeatedly.

Contributor

Add a test case for two projections for completeness?

Contributor Author

sure
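
The point about repeated application can be illustrated with a toy fixpoint loop. This is only a sketch under stated assumptions: `Plan`, `Values`, `Project`, `applyOnce`, and `optimize` are hypothetical names invented here, not Presto classes, and the toy rule evaluates the projection directly, whereas the merged rule only inlines expressions. It shows how a rule that matches a single Project over Values still collapses a stack of two projections when it is re-applied until nothing changes.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.LongUnaryOperator;

final class IterativeRuleSketch {
    // Hypothetical toy plan nodes; these are NOT Presto's PlanNode classes.
    interface Plan {}

    static final class Values implements Plan {
        final List<Long> rows;

        Values(List<Long> rows) {
            this.rows = rows;
        }
    }

    static final class Project implements Plan {
        final LongUnaryOperator function;
        final Plan source;

        Project(LongUnaryOperator function, Plan source) {
            this.function = function;
            this.source = source;
        }
    }

    // One rule application: rewrites the lowest Project-over-Values pair.
    // Returns the same instance when nothing matched.
    static Plan applyOnce(Plan plan) {
        if (!(plan instanceof Project)) {
            return plan;
        }
        Project project = (Project) plan;
        if (project.source instanceof Values) {
            List<Long> projected = new ArrayList<>();
            for (long value : ((Values) project.source).rows) {
                projected.add(project.function.applyAsLong(value));
            }
            return new Values(projected);
        }
        Plan newSource = applyOnce(project.source);
        return newSource == project.source ? plan : new Project(project.function, newSource);
    }

    // Re-applies the rule until the plan stops changing.
    static Plan optimize(Plan plan) {
        Plan current = plan;
        while (true) {
            Plan next = applyOnce(current);
            if (next == current) {
                return current;
            }
            current = next;
        }
    }
}
```

With two stacked projections over a values node, the first pass rewrites the inner pair and the second pass rewrites the outer one, which is the case the requested two-projection test is meant to cover.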

RowExpressionInterpreter elementInterpreter = new RowExpressionInterpreter(element,
        functionAndTypeManager,
        context.getSession().toConnectorSession(),
        EVALUATED);
Contributor

Does this handle non-deterministic functions? I think we should not apply this rule to non-deterministic expressions like

select random(x) from (values 1,2) AS T(x)

Contributor

What's the reasoning for not evaluating a non-deterministic function? Is there a big difference between evaluating it on a worker later on and evaluating it on the coordinator?

Contributor

In general there could be surprises, so we generally short-circuit a lot of optimizations when we see non-deterministic expressions. Also, I'm not a fan of RowExpressionInterpreter :( It implements the "engine" for constants, and I'm sure it works differently from the Java engine and differently differently from native lol

Contributor Author

Non-deterministic functions are now excluded.

Contributor

please add a test like: select random(x) from (values 1,2) AS T(x)

Contributor Author

Fixed.

jackychen718 force-pushed the EvaluateProjectOnValues branch from 9776440 to 9147462 on July 18, 2024 16:59
kaikalur previously approved these changes Jul 18, 2024
jackychen718 force-pushed the EvaluateProjectOnValues branch 3 times, most recently from f87c93c to 38803a5 on July 19, 2024 14:20
Contributor

ZacBlanco left a comment

last set of comments

jackychen718 force-pushed the EvaluateProjectOnValues branch from 38803a5 to 6738069 on July 19, 2024 20:31
jackychen718 marked this pull request as ready for review July 19, 2024 20:32
jackychen718 requested a review from presto-oss July 19, 2024 20:32
List<List<RowExpression>> rows = source.getRows();
List<VariableReferenceExpression> valuesOutputVariables = source.getOutputVariables();
Set<Map.Entry<VariableReferenceExpression, RowExpression>> projectAssignmentEntries = projectNode.getAssignments().entrySet();
List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList());
Contributor

Suggested change
- List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList());
+ List<VariableReferenceExpression> projectOutputVariables = projectNode.getOutputVariables();

List<VariableReferenceExpression> valuesOutputVariables = source.getOutputVariables();
Set<Map.Entry<VariableReferenceExpression, RowExpression>> projectAssignmentEntries = projectNode.getAssignments().entrySet();
List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList());
List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList());
Contributor

Suggested change
- List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList());
+ List<RowExpression> projectRowExpressions = projectNode.getAssignments().getExpressions().stream().collect(toImmutableList());

Contributor Author

jackychen718, Jul 23, 2024

Although I haven't found an example, is it possible that projectOutputVariables and projectRowExpressions do not match, because projectNode.getAssignments().getExpressions() returns a Collection of RowExpression rather than a List of RowExpression? @feilong-liu

Contributor

I think it should be fine. The map in Assignments is backed by an unmodifiable LinkedHashMap, which is order-preserving.
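
The ordering argument can be checked in isolation with plain JDK collections. The sketch below assumes only what the comment above states (an unmodifiable view over a `LinkedHashMap`); the class name, the string keys, and the string "expressions" are hypothetical stand-ins, not Presto's Assignments type. It shows that `keySet()` and `values()` iterate in the same insertion order, so position `i` of one list lines up with position `i` of the other.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class AssignmentOrderDemo {
    // An unmodifiable, insertion-ordered map standing in for the assignments map.
    static Map<String, String> sampleAssignments() {
        Map<String, String> assignments = new LinkedHashMap<>();
        assignments.put("out_c", "x + 1");
        assignments.put("out_a", "x * 2");
        assignments.put("out_b", "abs(x)");
        return Collections.unmodifiableMap(assignments);
    }

    // Output variables in iteration order.
    static List<String> keysInOrder(Map<String, String> map) {
        return new ArrayList<>(map.keySet());
    }

    // Expressions in iteration order; the values() view follows the same order as keySet().
    static List<String> valuesInOrder(Map<String, String> map) {
        return new ArrayList<>(map.values());
    }

    public static void main(String[] args) {
        Map<String, String> assignments = sampleAssignments();
        List<String> keys = keysInOrder(assignments);
        List<String> values = valuesInOrder(assignments);
        for (int i = 0; i < keys.size(); i++) {
            System.out.println(keys.get(i) + " := " + values.get(i));
        }
    }
}
```

The keys are deliberately inserted out of alphabetical order so that a sorted or hashed ordering would be visible if it occurred.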

if (!projectRowExpressions.stream()
        .filter(expression -> expression instanceof CallExpression)
        .map(callExpression -> functionAndTypeManager.getFunctionMetadata(((CallExpression) callExpression).getFunctionHandle()))
        .allMatch(FunctionMetadata::isDeterministic)) {
Contributor

This only checks whether the top-level function is deterministic; it does not check the function's arguments. Use RowExpressionDeterminismEvaluator here.
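
The gap described above can be shown on a toy expression tree. This is a sketch under stated assumptions: `Expr`, `topLevelOnly`, and `isDeterministic` are hypothetical names made up for illustration and do not reproduce RowExpressionDeterminismEvaluator. It contrasts a check that inspects only the outermost call with one that walks every argument.

```java
import java.util.Arrays;
import java.util.List;

final class DeterminismSketch {
    // Hypothetical toy expression: a named node with a determinism flag and arguments.
    // Leaves (variables, constants) are modelled as deterministic nodes without arguments.
    static final class Expr {
        final String name;
        final boolean deterministicFunction;
        final List<Expr> arguments;

        Expr(String name, boolean deterministicFunction, Expr... arguments) {
            this.name = name;
            this.deterministicFunction = deterministicFunction;
            this.arguments = Arrays.asList(arguments);
        }
    }

    // Looks only at the outermost node, analogous to the check the reviewer flagged.
    static boolean topLevelOnly(Expr expression) {
        return expression.deterministicFunction;
    }

    // Walks the whole tree: an expression is deterministic only if every node in it is.
    static boolean isDeterministic(Expr expression) {
        if (!expression.deterministicFunction) {
            return false;
        }
        for (Expr argument : expression.arguments) {
            if (!isDeterministic(argument)) {
                return false;
            }
        }
        return true;
    }
}
```

For an expression such as `abs(random(x))`, the top-level check passes because `abs` is deterministic, while the recursive check rejects it because of the nested `random` call.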

Contributor Author

fixed

Comment on lines 112 to 142
List<List<Object>> rowValues = rows.stream().map(row -> {
    // Prepare to set up the variable resolver
    verify(row.size() == valuesOutputVariables.size(), "Output variable does not match its value in ValuesNode");
    Map<String, Object> valuesMapForResolver = Streams.zip(valuesOutputVariables.stream(),
            row.stream(),
            (valuesOutputVariable, element) -> {
                RowExpressionInterpreter elementInterpreter = new RowExpressionInterpreter(element,
                        functionAndTypeManager,
                        context.getSession().toConnectorSession(),
                        EVALUATED);
                return new AbstractMap.SimpleImmutableEntry<String, Object>(valuesOutputVariable.getName(), elementInterpreter.evaluate());
            }).collect(toImmutableMap(Map.Entry::getKey, Map.Entry::getValue));
    VariableResolver variableResolver = new Interpreters.LambdaVariableResolver(valuesMapForResolver);
    // Evaluate each row of the ProjectNode
    return projectRowExpressions.stream().map(rowExpression -> new RowExpressionInterpreter(
            rowExpression,
            functionAndTypeManager,
            context.getSession().toConnectorSession(),
            OPTIMIZED).optimize(variableResolver))
            .collect(toImmutableList());
}).collect(toImmutableList());
Contributor

This part seems a bit complex, and I have a hard time understanding it. Can you check the ExpressionRewriter in the CommonSubExpressionRewriter class and see whether you can rewrite the expressions in the project assignments with the values from the Values node? They should then be evaluated later by RowExpressionOptimizer.

Contributor Author

I used ExpressionRewriter to rewrite the expressions in the ProjectNode. However, evaluating the ProjectNode requires evaluating the ValuesNode first, to provide the resolver to the optimize function of RowExpressionOptimizer. The overall implementation does not become much simpler.

Contributor

ZacBlanco, Jul 23, 2024

If I'm understanding correctly, I think what we actually want here is the RowExpressionVariableInliner? The ExpressionRewriter replaces a RowExpression with a VariableReferenceExpression. I think we want the other way around when constructing expressions to evaluate. If we can re-write the projection input expression to replace the reference coming from the ValuesNode, then we only need to run the Interpreter once. Is that correct @feilong-liu ?

Contributor

@ZacBlanco You are right, we need to use RowExpressionVariableInliner here.

@jackychen718 Can you simplify the logic here with the RowExpressionVariableInliner?

Also, is it possible to get rid of the RowExpressionInterpreter in this optimizer? I think the RowExpressionOptimizer, which runs later, should be able to evaluate it (worth trying end to end to verify).
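
The inlining idea discussed in this thread can be sketched on a toy expression tree. Everything below is an assumption made for illustration: `Expr`, `Variable`, `Constant`, `Call`, and `inline` are hypothetical names and do not reproduce RowExpressionVariableInliner or Presto's RowExpression types. The sketch replaces each variable reference in a projection with the expression that a given row of the values node supplies for it, leaving evaluation to a later step.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;

final class InlineSketch {
    // Hypothetical toy expression types; these are NOT Presto's RowExpression classes.
    interface Expr {}

    static final class Variable implements Expr {
        final String name;

        Variable(String name) {
            this.name = name;
        }

        @Override
        public String toString() {
            return name;
        }
    }

    static final class Constant implements Expr {
        final long value;

        Constant(long value) {
            this.value = value;
        }

        @Override
        public String toString() {
            return Long.toString(value);
        }
    }

    static final class Call implements Expr {
        final String function;
        final List<Expr> arguments;

        Call(String function, List<Expr> arguments) {
            this.function = function;
            this.arguments = Collections.unmodifiableList(new ArrayList<>(arguments));
        }

        @Override
        public String toString() {
            StringBuilder text = new StringBuilder(function).append('(');
            for (int i = 0; i < arguments.size(); i++) {
                if (i > 0) {
                    text.append(", ");
                }
                text.append(arguments.get(i));
            }
            return text.append(')').toString();
        }
    }

    // Returns a copy of the expression with every mapped variable replaced by its expression.
    // Unmapped variables and constants are returned unchanged; the input tree is not mutated.
    static Expr inline(Map<String, Expr> mapping, Expr expression) {
        if (expression instanceof Variable) {
            Expr replacement = mapping.get(((Variable) expression).name);
            return replacement != null ? replacement : expression;
        }
        if (expression instanceof Call) {
            Call call = (Call) expression;
            List<Expr> rewritten = new ArrayList<>();
            for (Expr argument : call.arguments) {
                rewritten.add(inline(mapping, argument));
            }
            return new Call(call.function, rewritten);
        }
        return expression;
    }
}
```

With a row that binds `x` to the constant `5`, the projection `add(x, 1)` becomes `add(5, 1)`, a variable-free expression that a later simplification pass could fold without any interpreter call inside the rule.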

Contributor Author

fixed

List<VariableReferenceExpression> projectOutputVariables = projectAssignmentEntries.stream().map(Map.Entry::getKey).collect(toImmutableList());
List<RowExpression> projectRowExpressions = projectAssignmentEntries.stream().map(Map.Entry::getValue).collect(toImmutableList());
// exclude non-deterministic function
if (!projectRowExpressions.stream()
Contributor

We should also check determinism for the expressions in the Values node.

Contributor Author

fixed

@kaikalur
Contributor

I'm actually wondering if we should just fire up a LocalQueryRunner, get the results (after checking there are no non-deterministic expressions in the whole subtree), and create a values node from them.

@jackychen718
Contributor Author

jackychen718 commented Jul 23, 2024

It is a good idea, but I am having difficulty using LocalQueryRunner to evaluate a PlanNode. It is mostly used to execute a SQL string. Could you please give me some guidance? @kaikalur

@kaikalur
Contributor

kaikalur commented Jul 23, 2024

Looking at:

return session.getRuntimeStats().profileNanos(

Also, executeInternal just takes a plan and executes it, so it looks like you could try refactoring that out into an executePlan method. In fact, this could come in quite handy for long-running queries: for example, if we have reliable stats that a semijoin RHS is cheap and small, we can turn it into an IN clause.

jackychen718 force-pushed the EvaluateProjectOnValues branch from 6738069 to 921b893 on July 23, 2024 18:06
jackychen718 requested a review from ZacBlanco July 23, 2024 18:20
@ZacBlanco
Contributor

I can see how using a LocalQueryRunner could be beneficial because if you have an entirely deterministic subplan, you can optimize it entirely in the coordinator without having to juggle with the expressions or other plan nodes. However, I think we should keep the scope of this PR small. Using the LocalQueryRunner will be quite a large change since it also has to be moved out of the test-scope within presto-main. It will probably bring some other headaches, not to mention the actual latency of using the query runner within the rule and figuring out whether it's acceptable or not for the optimizer.

The RowExpressionInterpreter, at least on the Java side, uses the exact same generated bytecode that the workers use to evaluate the functions, so I don't think there should be any worry about differences between execution results if you're sticking to Java.

Another issue that could occur is the coordinator's expression evaluation result diverging on native and java clusters. I know there have been some discrepancies between results of native and Java function implementations that we've been trying to resolve, but I believe they are minimal, if any still exist. If the function is one of the known ones with diverging behavior, then there could be issues when this rule is applied. However, you would still run into this problem if you were to use the LocalQueryRunner since it is Java-based.

The way I see it, the headache of the RowExpressionInterpreter is the complexity of using it within the rule to evaluate all of the VALUES and project expressions and having to juggle those around. If we confine the scope for this PR to match only Project->Values I think the complexity is acceptable. We could open another issue for evaluating entirely deterministic sub-plans.

@tdcmeehan
Contributor

We are working on resolving these inconsistencies. Keeping a forked eval will lock in the inconsistencies forever.

@kaikalur
Contributor

> We are working on resolving these inconsistencies. Keeping a forked eval will lock in the inconsistencies forever.

Why? Eval is eval is eval so if/when we switch it should be fixed.

@tdcmeehan
Contributor

tdcmeehan commented Jul 24, 2024

@kaikalur so this point, eval is a leaky abstraction, and impacts the runtime and the optimizer. As Presto moves from Java to C++, we are abstracting the leaky parts into plugins, and these plugins are implemented differently between the Java eval and the C++ eval. This would be considered to be a new leak in our eval abstraction, and we could hide this behind a plugin if we chose. The underlying implementation could use something like a local query runner for the Java eval, and perhaps the sidecar for C++. The important thing is that we simply don't hardcode the LocalQueryRunner, but so long as we don't do that and we are principled around the design of the plugin, then we could do it two different ways for both eval engines.

@kaikalur
Contributor

> @kaikalur so this point, eval is a leaky abstraction, and impacts the runtime and the optimizer. As Presto moves from Java to C++, we are abstracting the leaky parts into plugins, and these plugins are implemented differently between the Java eval and the C++ eval. This would be considered to be a new leak in our eval abstraction, and we could hide this behind a plugin if we chose. The underlying implementation could use something like a local query runner for the Java eval, and perhaps the sidecar for C++. The important thing is that we simply don't hardcode the LocalQueryRunner, but so long as we don't do that and we are principled around the design of the plugin, then we could do it two different ways for both eval engines.

Mechanism aside, my bigger issue is that we should have a single way to evaluate something. I don't like the word plugin. It's integral to Presto; it should be core.

@tdcmeehan
Contributor

tdcmeehan commented Jul 24, 2024

Presto doesn't work without plugins. Some plugins are enabled by default, and some can be overridden. Core plugins are enabled by default--for example, the system connector, the built in function namespace manager. It's no different. Plugins abstracting the eval are enabled by default, but can be overridden for early adopters of C++. Over time, C++ will become the default.

@kaikalur
Contributor

kaikalur commented Jul 24, 2024

> Presto doesn't work without plugins. Some plugins are enabled by default, and some can be overridden. Core plugins are enabled by default--for example, the system connector, the built in function namespace manager. It's no different. Plugins abstracting the eval are enabled by default, but can be overridden for early adopters of C++. Over time, C++ will become the default.

I don't consider the builtin function namespace manager a plugin; it's core. Eval should not be a plugin. It's like the kernel: it does basic things and implements SQL semantics, which plugins are free to break lol

@tdcmeehan
Contributor

I don't know why you don't consider the built in function namespace manager a plugin, it implements FunctionNamespaceManager (which is a plugin) and is instantiated alongside other function namespace managers. It is both core and a plugin. Same for the system connector, for our block encodings, types, and resource groups. Whatever is core is defined by us and what is loaded by default, so I wouldn't place so much emphasis on the word plugin--think rather in terms of outcomes.

ExpressionOptimizer expressionOptimizer = new RowExpressionOptimizer(metadata);

//rewrite ProjectNode assignment expressions
List<List<RowExpression>> rowExpressionsList = rows.stream().map(row -> {
Contributor

I see we're operating over lists of lists for the following three blocks. Can the outer list be converted to a for loop, and within this loop we operate over a single list? I.e. can we unnest this? It would simplify this code.

Contributor Author

fixed

@kaikalur
Contributor

> I don't know why you don't consider the built in function namespace manager a plugin, it implements FunctionNamespaceManager (which is a plugin) and is instantiated alongside other function namespace managers. It is both core and a plugin. Same for the system connector, for our block encodings, types, and resource groups. Whatever is core is defined by us and what is loaded by default, so I wouldn't place so much emphasis on the word plugin--think rather in terms of outcomes.

The way I see "plugins": they are there to support different behavior from core, and something like builtin functions is core.

@kaikalur
Contributor

> I don't know why you don't consider the built in function namespace manager a plugin, it implements FunctionNamespaceManager (which is a plugin) and is instantiated alongside other function namespace managers. It is both core and a plugin. Same for the system connector, for our block encodings, types, and resource groups. Whatever is core is defined by us and what is loaded by default, so I wouldn't place so much emphasis on the word plugin--think rather in terms of outcomes.

> The way I see "plugins": they are there to support different behavior from core, and something like builtin functions is core.

In fact, we have other hacks like "metadata query optimization" which eval even more differently. For long-term project health, there should be one and only one way to eval the core SQL constructs. Yes, connector plugins do their own thing, and that also bothers me. We need to find a way to make all of these use the same eval engine consistently.

@tdcmeehan
Contributor

> I don't know why you don't consider the built in function namespace manager a plugin, it implements FunctionNamespaceManager (which is a plugin) and is instantiated alongside other function namespace managers. It is both core and a plugin. Same for the system connector, for our block encodings, types, and resource groups. Whatever is core is defined by us and what is loaded by default, so I wouldn't place so much emphasis on the word plugin--think rather in terms of outcomes.

> The way I see "plugins": they are there to support different behavior from core, and something like builtin functions is core.

> In fact, we have other hacks like "metadata query optimization" which eval even more differently. For long-term project health, there should be one and only one way to eval the core SQL constructs. Yes, connector plugins do their own thing, and that also bothers me. We need to find a way to make all of these use the same eval engine consistently.

@kaikalur we have been discussing exactly this at the native worker working group meetings, perhaps let's brainstorm there and see if our proposal matches the outcomes you are expecting. https://calendar.google.com/calendar/u/0/embed?src=linuxfoundation.org_vrjlva5b0u73ps75fvnv5sasi4@group.calendar.google.com&ctz=America/Los_Angeles

jackychen718 force-pushed the EvaluateProjectOnValues branch from 73dbbc3 to 1b8c927 on July 24, 2024 16:45
jackychen718 requested a review from tdcmeehan July 24, 2024 18:06
Comment on lines 118 to 124
for (List<RowExpression> rowExpressions : rows) {
    verify(rowExpressions.size() == valuesOutputVariables.size(), "Output variable does not match its value in ValuesNode");
    Map<VariableReferenceExpression, RowExpression> mapping = Streams.zip(
            valuesOutputVariables.stream(),
            rowExpressions.stream(),
            SimpleImmutableEntry::new)
            .collect(toImmutableMap(Map.Entry::getKey, Map.Entry::getValue));
    List<RowExpression> rowExpressionsInProject = projectRowExpressions.stream()
            .map(expression -> inlineVariables(mapping, expression))
            .collect(toImmutableList());
    rowExpressionsList.add(rowExpressionsInProject);
}

// Evaluate ProjectNode assignment expressions
List<List<Object>> rowValuesList = new ArrayList<>();
for (List<RowExpression> rowExpressionsInProject : rowExpressionsList) {
    List<Object> rowValues = rowExpressionsInProject.stream()
            .map(rowExpression -> expressionOptimizer.optimize(
                    rowExpression,
                    OPTIMIZED,
                    context.getSession().toConnectorSession(),
                    variable -> variable))
            .collect(toImmutableList());
    rowValuesList.add(rowValues);
}

// Form the ValuesNode transformed from ProjectNode
List<List<RowExpression>> rowExpressionsListInValuesNode = new ArrayList<>();
for (List<Object> rowValues : rowValuesList) {
    List<RowExpression> rowExpressionsInValuesNode = Streams.zip(
            rowValues.stream(),
            projectOutputVariables.stream(),
            (elementValue, projectOutputVariable) ->
                    (RowExpression) new ConstantExpression(elementValue, projectOutputVariable.getType())
    ).collect(toImmutableList());
    rowExpressionsListInValuesNode.add(rowExpressionsInValuesNode);
}
Contributor

Can't these three individual for loops be collapsed into a single for loop?
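
The suggestion to fuse the loops can be sketched with plain collections. This is a toy under stated assumptions: `projectRows` and its parameters are hypothetical names, rows hold integers instead of RowExpression objects, and projections are modelled as Java functions rather than Presto expressions. It shows the shape of a single pass in which each row is bound, projected, and appended to the output before the next row is touched, so no intermediate list of lists is kept between stages.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

final class SingleLoopSketch {
    // Applies every projection to every row in one pass over the rows.
    static List<List<Integer>> projectRows(
            List<String> columns,
            List<List<Integer>> rows,
            List<Function<Map<String, Integer>, Integer>> projections) {
        List<List<Integer>> output = new ArrayList<>();
        for (List<Integer> row : rows) {
            if (row.size() != columns.size()) {
                throw new IllegalArgumentException("Row width does not match the number of columns");
            }
            // Stage 1: bind each column name to this row's value.
            Map<String, Integer> binding = new HashMap<>();
            for (int i = 0; i < columns.size(); i++) {
                binding.put(columns.get(i), row.get(i));
            }
            // Stages 2 and 3: apply each projection and collect the output row immediately.
            List<Integer> projected = new ArrayList<>();
            for (Function<Map<String, Integer>, Integer> projection : projections) {
                projected.add(projection.apply(binding));
            }
            output.add(projected);
        }
        return output;
    }
}
```

Because each stage only needs the current row, fusing them changes neither the result nor the row order; it only removes the intermediate collections.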

Contributor Author

Sorry for the misunderstanding.

jackychen718 force-pushed the EvaluateProjectOnValues branch from 1b8c927 to 734abf5 on July 24, 2024 21:17
jackychen718 requested a review from tdcmeehan July 25, 2024 00:46
Comment on lines 88 to 92
Optional<PlanNode> optionalSource = context.getLookup().resolveGroup(projectNode.getSource()).findFirst();
if (!optionalSource.isPresent()) {
    return Result.empty();
}
ValuesNode source = (ValuesNode) optionalSource.get();
Contributor

Check the example here

private static final Capture<ProjectNode> CHILD = newCapture();

It will help to simplify the code here

@tdcmeehan
Contributor

Am I correct in reading the current version of the code just inlines the projections, and delegates the evaluation to something else like SimplifyRowExpressions? If so, should we rename the optimizer rule to InlineProjectionsOnValues or something similar?

jackychen718 force-pushed the EvaluateProjectOnValues branch from 734abf5 to c163b25 on July 30, 2024 22:58
@jackychen718
Contributor Author

Changed the rule to be InlineProjectionsOnValues

Contributor

tdcmeehan left a comment

LGTM

@tdcmeehan
Contributor

@jackychen718 feel free to merge this PR.

@jackychen718
Contributor Author

@elharo @ClarenceThreepwood @jaystarshot @presto-oss Please review the PR so it can be merged.

ZacBlanco merged commit 539a448 into prestodb:master Aug 29, 2024
jaystarshot mentioned this pull request Nov 1, 2024

Development

Successfully merging this pull request may close these issues.

Pushdown projects into value node

6 participants