Skip to content

Fix column in parquet dereference pushdown#16377

Merged
vkorukanti merged 1 commit intoprestodb:masterfrom
chliang71:columnname-dereference-pushdown
Jul 7, 2021
Merged

Fix column in parquet dereference pushdown#16377
vkorukanti merged 1 commit intoprestodb:masterfrom
chliang71:columnname-dereference-pushdown

Conversation

@chliang71
Copy link
Copy Markdown
Contributor

@chliang71 chliang71 commented Jul 2, 2021

This is to fix this issue.

Test plan - (Please fill in how you tested your changes)

Tested in our internal cluster. Updated unit test

== RELEASE NOTES ==

General Changes
* Fixes a bug in Parquet dereference pushdown. The issue causes column name not picked up properly, resulting in inconsistent query result in certain cases.

@chliang71
Copy link
Copy Markdown
Contributor Author

With this change, the check from this patch is failing, so removed for now, will update with new unit test.

Copy link
Copy Markdown
Contributor

@vkorukanti vkorukanti left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM (minor comments).

Please add a test. You can look at HiveLogicalPlanner to construct a specific plan without relying on getting SQL converted to a desired plan.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add check to make sure regularHiveColumnHandles.get(name) returns non-null?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a Preconditions check

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need this mapping (columnName in HiveColumnHandle -> HiveColumnHandle)? Expressions should be referring to the name in Assignments. Is that not the case in some cases?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry are you referring to the second putAll of the two here? If remove this mapping, I would get
nested column [X.Y]'s base column X is not present in table scan output
error in the immediate following logic. This places uses the subfield name to look up back to HiveColumnHandle, and now subfile has the same name as the original HiveColumnHandle, so I used the same map to do this lookup.

@chliang71 chliang71 force-pushed the columnname-dereference-pushdown branch from df155aa to b6fec0d Compare July 3, 2021 00:37
@chliang71 chliang71 force-pushed the columnname-dereference-pushdown branch from b6fec0d to b90b7c7 Compare July 3, 2021 02:16
@chliang71
Copy link
Copy Markdown
Contributor Author

Updated the unit test from this change, I think this test case actually shows the change. Instead of having column name of x_1.a, now it is x_a. cc. @vkorukanti

@chliang71
Copy link
Copy Markdown
Contributor Author

@vkorukanti @zhenxiao do you mind helping merge this PR? thanks!

@vkorukanti vkorukanti merged commit 109d9a0 into prestodb:master Jul 7, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Parquet Dereference Pushdown issue

2 participants