Push down dereference expression #14637

Closed
zhenxiao wants to merge 1 commit into prestodb:master from zhenxiao:dereference

@zhenxiao
Collaborator

Co-authored-by: qqibrow qqibrow@gmail.com

To fix #14517, we need to push down dereference expressions as a first step.

Continues #5547 and #13180.

== RELEASE NOTES ==

General Changes
* push down dereference expression

@mbasmanova mbasmanova self-assigned this Jun 11, 2020
@mbasmanova mbasmanova requested a review from a team June 11, 2020 23:49
Contributor

@vkorukanti vkorukanti left a comment

Thanks @zhenxiao. I left some comments (some of them are questions I got while trying to understand the patch).

Contributor

Why not just keep one dereference in this case, msg.foo?

Contributor

Sorry, I didn't understand this part. Why do we need an extra project on top? The project below the TargetNode should be enough, right?

Is it for the following case?
FilterNode(predicate(a.x = 5), output=a)
-->
Project(row(a_x) as a)
FilterNode(predicate(a_x = 5))
ProjectNode(deref(a.x) as a_x, output=a_x)

If this is the case, can we push the dereference through the Filter and avoid the extra project on top?
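
To make the question concrete, here is a toy model of the two plan shapes, evaluated over rows held as Python dicts. This is only a sketch under assumptions: the function names and the sample rows are hypothetical, and the rewritten plan passes `a` through the bottom project instead of rebuilding it with `row(a_x)`, so it is not the rule from this patch.

```python
# Toy model of the plans discussed above; not Presto code.
# Rows are dicts, and the struct column `a` is a nested dict.

SAMPLE_ROWS = [
    {"a": {"x": 5, "y": 1}},
    {"a": {"x": 7, "y": 2}},
]


def original_plan(rows):
    # Filter(predicate(a.x = 5), output = a)
    return [{"a": r["a"]} for r in rows if r["a"]["x"] == 5]


def rewritten_plan(rows):
    # Bottom Project: deref(a.x) as a_x, with `a` passed through (assumption).
    projected = [{"a": r["a"], "a_x": r["a"]["x"]} for r in rows]
    # Filter now references the plain symbol a_x instead of a.x.
    filtered = [r for r in projected if r["a_x"] == 5]
    # Top Project: drop a_x so nodes above see the same symbols as before.
    return [{"a": r["a"]} for r in filtered]
```

In this model the top project only restores the output symbols; both functions return the same rows for the sample input.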

Collaborator Author

The ProjectNode on top is there to keep all symbols seen by the upper nodes unchanged.

Contributor

For this case:
Filter(predicate(a.x = 5), output = a.y)
Project(a)
-->
Filter(predicate(a_x = 5), output = a_y)
Project(a, deref(a.x) as a_x, deref(a.y) as a_y)

Is the new project correct, given that we are adding the existing project symbols and the deref expressions as new projects?
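
The shape of the new project in this example can be sketched as follows: the existing project symbols stay as identity assignments, and every dereference gets a new symbol. This is a hedged illustration, not the patch's code. The helper names are hypothetical, the `a_x` naming simply follows the example above, and the expression rewrite is naive text substitution where a real rule would rewrite expression trees.

```python
# Toy sketch of building the pushed-down Project; not Presto code.

def symbol_for(deref):
    # a.x -> a_x (naming taken from the example in this thread)
    return deref.replace(".", "_")


def new_project_assignments(existing_symbols, derefs):
    # Keep existing symbols as identity assignments, then add one
    # new symbol per distinct dereference expression.
    assignments = {s: s for s in existing_symbols}
    for d in sorted(set(derefs)):
        assignments[symbol_for(d)] = d
    return assignments


def rewrite_expression(expression, derefs):
    # Naive substitution, longest dereference first so that a.x.y is
    # replaced before a.x. Only meant for simple examples like a.x = 5.
    for d in sorted(set(derefs), key=len, reverse=True):
        expression = expression.replace(d, symbol_for(d))
    return expression
```

For the case above, `new_project_assignments(["a"], ["a.x", "a.y"])` yields assignments for `a`, `a_x` and `a_y`, and the filter predicate becomes an expression over `a_x`.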

Contributor

Shouldn't we have a Project(msg.x as a_x, msg.y as a_y) in between the filter and values nodes?

Collaborator Author

Only a_y appears in the top ProjectNode, so a_y is pushed down. a_x appears only in the FilterNode, so it is not pushed down.

Contributor

@mbasmanova mbasmanova left a comment

@zhenxiao I took a first pass and made some comments. I don't understand this change completely yet. Thanks for adding tests for the new optimizer rule. I think end-to-end correctness tests are needed as well. Also, would you update the commit message to document the motivation for this change?

Contributor

I'm not sure I understand this TODO and the logic in the following if statement. It looks like [msg.foo, msg.foo.bar] is not being processed at all. Is this intentional? Shouldn't this case result in pushing down msg.foo?
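
One way to read the suggestion above is to keep only the dereferences that have no shorter dereference in the same set as a prefix, so that [msg.foo, msg.foo.bar] reduces to msg.foo. The sketch below illustrates that idea only. The function name is hypothetical, and it does not describe what the patch currently does.

```python
# Toy sketch of reducing a set of dereferences to the ones that are not
# covered by a shorter dereference on the same base; not Presto code.

def drop_covered_dereferences(derefs):
    paths = {tuple(d.split(".")) for d in derefs}
    kept = []
    for path in paths:
        # A dereference needs at least a base and one field, so the
        # shortest prefix worth checking has length 2.
        covered = any(path[:i] in paths for i in range(2, len(path)))
        if not covered:
            kept.append(".".join(path))
    return sorted(kept)
```

With this reduction, msg.foo.bar can still be computed above the pushed-down msg.foo symbol.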

Collaborator Author

My bad, this comment is wrong.
It is not a TODO; the comment should be:
DereferenceExpression with the same base will cause unnecessary rewrites
I will fix it.

Contributor

I ran this test with coverage and I see some code paths in the PushDownDereferences rule that are not covered. Would you add tests to ensure full coverage of the new rule?

@mbasmanova mbasmanova requested a review from a team June 16, 2020 13:49

Reading all subfields of a column as a single block cannot leverage lazy block loading efficiently.
e.g.
SELECT msg.a, msg.b, msg.c
FROM table
WHERE msg.a = 10

Currently, we read the entire msg.[a,b,c] as one Block. This doesn't utilize lazy block loading during filter evaluation for msg.a = 10.
Ideally, we should read msg.a, msg.b and msg.c as separate blocks, so that we load only msg.a first and then decide whether to load msg.b and msg.c.
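
The intended behavior can be illustrated with a small simulation. It is a sketch under assumptions: `LazyBlock` and `scan` here are hypothetical stand-ins written for this example, not Presto classes, and each subfield is modeled as its own lazily loaded column.

```python
# Toy simulation of lazy loading with one block per subfield; not Presto code.

class LazyBlock:
    def __init__(self, values):
        self._values = values
        self.loaded = False

    def load(self):
        # Stands in for the (expensive) read and decode of a column chunk.
        self.loaded = True
        return self._values


def scan(blocks, filter_column, predicate, output_columns):
    # Load only the filter column first.
    filter_values = blocks[filter_column].load()
    positions = [i for i, v in enumerate(filter_values) if predicate(v)]
    if not positions:
        # No row matched, so the remaining blocks are never loaded.
        return []
    columns = {c: blocks[c].load() for c in output_columns}
    return [tuple(columns[c][i] for c in output_columns) for i in positions]
```

When no row satisfies msg.a = 10, only the msg.a block is loaded. If msg is read as a single block, the filter cannot be evaluated without loading all three subfields.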

This is the first step toward pushing down dereference expressions. The following steps are:
* add a connector-specific optimizer to generate separate columns for subfields
* optimize file scans for columnar formats, e.g. Parquet

Co-authored-by: qqibrow qqibrow@gmail.com

Collaborator Author

@zhenxiao zhenxiao left a comment

thank you @mbasmanova @vkorukanti
comments are addressed

could you please take another look?

@zhenxiao zhenxiao requested a review from mbasmanova June 18, 2020 01:09
@zhenxiao
Collaborator Author

To be continued in #14829.

@zhenxiao zhenxiao closed this Jul 10, 2020
@zhenxiao zhenxiao deleted the dereference branch August 2, 2020 05:23

Development

Successfully merging this pull request may close these issues.

Lazy Block loading is not utilized efficiently in case of nested column project pushdown on Parquet tables

3 participants