Skip to content

Add expression evaluation RFC#13

Merged
tdcmeehan merged 1 commit intoprestodb:mainfrom
tdcmeehan:exp
Aug 26, 2024
Merged

Add expression evaluation RFC#13
tdcmeehan merged 1 commit intoprestodb:mainfrom
tdcmeehan:exp

Conversation

@tdcmeehan
Copy link
Contributor

No description provided.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan : Had a quick look. Looks good overall. Had just one comment.

Copy link

@mbasmanova mbasmanova left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan Tim, thank you for working on this. Some comments.

>
>Body: JSON list of serialized row expressions
>
> Response: JSON list of optimized serialized row expressions

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I expected the response to be a list of constant values that represent the results of evaluating the expressions. Is this not the case? Why are you saying "optimized ... expressions" and not "values"?

Copy link
Contributor Author

@tdcmeehan tdcmeehan Jun 26, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For most cases, they will be values, however I think optimized expression better conveys that there are cases when they won't be simple constants (as I somehow understand from the word value). Some examples I have in mind:

  • CAST(int_column AS INT) => int_column
    • We would expect the expression optimizer to remove the unneeded cast
  • IF(condition, 123, 0/0) => IF(condition, 123, CAST(FAIL(...) AS INTEGER)) -- also, similar for CASE statement
    • We would expect the expression optimizer to only fail the query if the false condition is reached
  • COALESCE(val1, val2, val1, val3) => COALESCE(val1, val2, val3)
    • We would expect the expression optimizer to remove the unneeded and redundant statements in the coalesce

I planned to add the mechanism of how we determine what may be constant folded, and how we batch them and send them to the sidecar, as a part of this document (I am returning from PTO and wanted to get this out in draft state before I left).

Additionally, we have not yet gotten to testing this with Velox end to end. There is already a difference between the two that I know of, which is the Java expression optimizer will attempt to optimize CAST(JSON_PARSE(...) AS ROW/ARRAY/MAP) to a special CAST function that avoids materializing the JSON and just directly creates the ROW/ARRAY/MAP. I expect that there will be other differences, and we can configure the new SPI to announce optimizations that are not supported, as I am sure there will be other differences discovered during testing (and if there are any at the top of your mind, please do share them!).

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan Tim, thank you for clarifying. I feel this is going beyond constant folding and it not supported in Velox / Prestissimo. I wonder if it makes sense to reduce the scope of this proposal to just constant folding, e.g. taking an expression with no column references and returning a value.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While we wanted to push down as much as possible to the underlying evaluation engine, since we don't make any distinction between general expression optimization and the more specific action of reducing constant expressions to literal values, I understand that this might be considered to be out of scope for Velox, and we can make this distinction.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mbasmanova Consider the following expressions with references to other columns:

  • ABS(COALESCE(123, column1)) => 123
  • IF(true, 123, column2) => 123

Do we consider the simplification of these expressions within scope for constant folding in Velox?

Copy link
Contributor Author

@tdcmeehan tdcmeehan Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm wondering if it's possible to avoid having to make multiple passes for each expression in the case of nested expressions which are able to constant fold into values.

If we take the above examples and combine them in a way:

CppFlipSign(IF(MyCppFunc(COALESCE(123, column1)) < 456, 456, column2))

Then whether or not Velox supports constant folding of special form expressions where there is a reference to a different column seems to make a big difference. If Velox is able to constant fold those examples, then we can send the whole expression in one call. If it can't, then it would require multiple passes between Presto and Velox to constant fold the expression.

Presto: COALESCE(123, column1) => 123 (contains a column reference, yet because of the properties of this special form expression, we can infer that we can just collapse it to 123)
Velox: MyCppFunc(123) => 123 (suppose for simplicity this just returns the input)
Presto: IF(123 < 456, 456, column2) => 456 (contains a column reference, yet because of the properties of this special form expression, we can infer that we can just collapse it to 456)
Velox: CppFlipSign(456) => -456

In this case, at least two calls are unavoidable, because Velox cannot determine that the special form expressions are constant foldable, and the Presto coordinator can't execute the function by design. I briefly looked at the code in Velox, can confirmed that it appears we only check if something is constant foldable if all of the inputs are constant expressions. However, I'm wondering if we should add more sophistication to this process? I am guessing that the original motivation behind constant folding expressions with only constant inputs would also apply to inferring the unique properties of special form expressions and constant folding such expressions where possible?

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@tdcmeehan Tim, these are all good questions / points. My initial thinking is to restrict Velox to "true" constant folding, e.g. evaluating expressions that do not reference any columns and producing a "value" as a result. I see how this may result in extra trips between coordinator and side-car, but I feel the alternative adds a lot of complexity. If you feel it is important to push this logic into the side-car, than, perhaps, a next step would be to describe all the optimizations we would expect to implement and then consider whether it makes sense to do that in Velox (or Prestissimo layer) or not.

What do you think?

Copy link
Contributor Author

@tdcmeehan tdcmeehan Jul 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this design works best when there's only a single hop, so it seems like the question is do we do this in Prestissimo or Velox. I'll mention the optimizations needed to achieve general constant folding support which is at parity with Java (which includes expressions that contain column references), and leave it as a detail to be determined where the code lives, but that ultimately some optimizations should move to C++.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sounds good. We may need to build some optimizer logic in C++ in Prestissimo to make this work. Note that conversion of expressions to Velox is one-way. It is not possible to go back cleanly.

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think multiple passes in this use case is really an issue performance-wise. Making the constant-folding in Velox more powerful might be useful in other use cases on execution side though.

@tdcmeehan tdcmeehan force-pushed the exp branch 2 times, most recently from 7e0c9ba to 01e6178 Compare July 3, 2024 20:31
@tdcmeehan tdcmeehan marked this pull request as ready for review July 3, 2024 20:31
@tdcmeehan
Copy link
Contributor Author

@Yuhta @mbasmanova @pedroerp @aditi-pandit thank you very much for your thoughtful feedback. I have marked this RFC as ready for review and I have addressed your many points. Please let me know if you have more suggestions or feedback, thank you!


A key feature of unfenced functions is that they can be used to implement constant folding. This is because the function can be evaluated at query planning time, and the result can be used as a constant in the query plan. This can be a significant performance improvement, as it can reduce the amount of work that needs to be done at query execution time. For example, if an unfenced function is evaluated over a partition column, the result can be used to prune partitions before the query is executed.

Currently in Presto C++, there is no constant folding support during query optimization for functions which are written in C++. In practice, constant folding works for functions which have a corresponding Java implementation, but not for functions which are implemented in C++. It works by essentially just using a forked set of functions written in Java to evaluate the constant expressions, and then during execution relying on the C++ functions to evaluate the non-constant expressions. This has a few drawbacks:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit : It might be more accurate to say that constant folding is broken instead of saying no constant folding support as it does exist in a way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have reworded it, please let me know what you think.

#### Note on integration with Velox
Currently Velox only supports constant folding for expressions whose entire arguments are all constant expressions (i.e., constant values). Constant folding, however, entails many specific scenarios which may contain non-constant values (outlined below). It will be determined at a later time if these optimizations would go into Velox or into Prestissimo, however it is presumed they must be added in C++ in order to prevent multiple hops between the coordinator and the sidecar.

#### Optimizations required for feature parity with Java constant folding support
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks ! This is great to know.

Copy link
Contributor

@aditi-pandit aditi-pandit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the revisions !

@tdcmeehan tdcmeehan merged commit 8357a91 into prestodb:main Aug 26, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants