Skip to content

Validate no dangling columns in TableScanNode.enforcedConstraint#6963

Closed
findepi wants to merge 2 commits intotrinodb:masterfrom
findepi:findepi/debug-0c0235
Closed

Validate no dangling columns in TableScanNode.enforcedConstraint#6963
findepi wants to merge 2 commits intotrinodb:masterfrom
findepi:findepi/debug-0c0235

Conversation

@findepi
Copy link
Copy Markdown
Member

@findepi findepi commented Feb 19, 2021

No description provided.

@cla-bot cla-bot bot added the cla-signed label Feb 19, 2021
@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 19, 2021

cc @martint @sopel39 @kokosing @losipiuk @erichwang

i think we should have a contract about the enforcedConstraint and keeping some 'dangling' columns can backfire.
In particular my PR #6874 hit a check here

checkArgument(newHandle != null, "Mapping not found for handle %s", handle);

while it could be handled by defensive coding in that particular place, it feels a wrong thing to do. I am fixing the immediate obstacle in #6959.

Questions

  • should we make it a design contract? (unless it's already the case)
  • can we have a run-time check for the contract (my primary concern is about the cost of doing the check)

@findepi findepi force-pushed the findepi/debug-0c0235 branch from bc03f4b to 904fd5d Compare February 21, 2021 20:34
@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 22, 2021

Why do we keep dangling columns there?

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 22, 2021

@sopel39 that's exactly my question. I don't think there is a reason, but wanted to hear some confirmation.
Also, the other question is -- can we have this type of validation in TS ctor.

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 22, 2021

I don't think we should keep dangling columns, What would be their purpose if they cannot be referenced anywhere from the plan?

@erichwang
Copy link
Copy Markdown
Contributor

erichwang commented Feb 23, 2021

The enforced constraint caching was a side effect of previously not having an easy mechanism to retain to retain prior plan information after a given predicate push down optimization. For example, in this contrived demonstration:
SELECT COUNT(*) FROM TABLE a JOIN TABLE b ON a.col = b.partitioncol WHERE a.col = '2021-01-01'
You could have an optimization path where the inference b.partitioncol = '2021-01-01' get's pushed into TS(B). If the plan does not remember that TS(B) is already handling the partition pruning, than a subsequent run of the optimizer might decide to apply b.partitioncol = '2021-01-01' as a predicate on the TS(B) side redundantly. There are other variations of planning oddities that might occur if this information is lost. The dangling columns is a residual effect of this.

The hope was that the new iterative optimizer would allow secondary information (such as predicate relationships) to be retained alongside plans (but not as part of them). If this is possible now, then we should totally do that there. If not, than we should see if there is somewhere else to cache this type of information (since it has been some time since the original discussion). If the dangling columns are the main issue, it's possible they can be pruned in the current optimizer order (since I don't think we ever add a column back after it has been pruned), but this is not an impossibility in the future which we need to think about.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 23, 2021

You could have an optimization path where the inference b.partitioncol = '2021-01-01' get's pushed into TS(B). If the plan does not remember that TS(B) is already handling the partition pruning, than a subsequent run of the optimizer might decide to apply b.partitioncol = '2021-01-01' as a predicate on the TS(B) side redundantly.

In the case of your (contrived) example query, I would expect optimizer to push the predicate into Join's sources without retaining it, so it becomes

- Cross Join
   - a WHERE a.col = '2021-01-01'
   - b WHERE partitioncol = '2021-01-01'

once it's a Cross Join, it no longer matters whether partitioncol predicate is pushed into TS and removed from the plan, or retained. It will not be re-applied from the other side of the join.

Anyway, if there is case where Filter would be applied and re-applied, this would be concerning, as it could lead to overhead or optimizer loop.
Are you saying that the enforcedConstraint is protecting us against this?
How is the ColumnHandle associated back with symbols, once it's temporarily "disassociated"?

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 23, 2021

@findepi subsequent PPD executions might apply the same predicate over and over again. For example, partitioning predicate can be derived from join sides. See my old PR that tries to address that: prestodb/presto#11537. IIRC the problem there was getting expression equivalence for expressions like x IN (1, 2, ... n)

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 23, 2021

I think we should remove dangling columns from enforced constraint

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 23, 2021

I think we should remove dangling columns from enforcedConstraint. Do we actually need enforcedConstraint still?

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 23, 2021

Do we actually need enforcedConstraint still?

I believe so, but someone would need to go over the code and check the usags.

I think we should remove dangling columns from enforcedConstraint.

Thanks @sopel39 .
@erichwang are you still concerned about this? Or your concern was more about enforcedConstraint in general, not the dangling/unreferenced columns in it?

Anyway, this leads us to the second question:

  • can we have a run-time check for the contract (my primary concern is about the cost of doing the check)

@erichwang
Copy link
Copy Markdown
Contributor

Do we actually need enforcedConstraint still?

Yes, we still need this until we can come up with a better way to represent plan metadata that is not directly part of the plan.

I think we should remove dangling columns from enforcedConstraint.

This is more of a philosophical question than a practical one. Practically speaking, I don't think the dangling columns are needed because pruned columns are not currently brought back in any way in our planner. Philosophically, doing so would result in some lost information that in theory could be used again if old pruned columns were added back.

However, in later planner developments, it looks like we aren't preserving the full context anyways, so I have no problems just trimming out the dangling columns at this point. If we do so, please remember to clean up the assignments field too (which was intentionally not pruned before to handle the coverage of the unenforcedConstraints).

@sopel39
Copy link
Copy Markdown
Member

sopel39 commented Feb 24, 2021

Yes, we still need this until we can come up with a better way to represent plan metadata that is not directly part of the plan.

Technically we could use getTableProperties as in io.trino.sql.planner.EffectivePredicateExtractor.Visitor#visitTableScan.
Non-enforced constraint are still preserved as filters on top of TS. For example, we drop enforcedConstraint (don't translate it) as part of io.trino.sql.planner.iterative.rule.ApplyTableScanRedirection.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 24, 2021

@erichwang thanks for your feedback!

f we do so, please remember to clean up the assignments field too (which was intentionally not pruned before to handle the coverage of the unenforcedConstraints).

Interesting, i didn't know that.
I wish it was called out in the code comment.

Philosophically, doing so would result in some lost information that in theory could be used again if old pruned columns were added back.

@martint can you please comment ACK here that I am not removing something dear to you?

@erichwang
Copy link
Copy Markdown
Contributor

if we do so, please remember to clean up the assignments field too (which was intentionally not pruned before to handle the coverage of the unenforcedConstraints).

Interesting, i didn't know that.
I wish it was called out in the code comment.

Actually, let me walk that back about the assignments. It looks like the usage has expanded since I last saw this. The assignments are used in EffectivePredicateExtractor now to convert arbitrary table ColumnHandles returned by metadata (which don't know about what projections are needed): https://github.com/trinodb/trino/blob/master/core/trino-main/src/main/java/io/trino/sql/planner/EffectivePredicateExtractor.java#L255
So it does still need to contain everything. Sorry should have caught that.

@martint
Copy link
Copy Markdown
Member

martint commented Feb 25, 2021

@martint can you please comment ACK here that I am not removing something dear to you?

No, dangling columns should be removed. Once they are not present in the table scan they are, for all intents and purposes, "gone" from the plan.

@findepi
Copy link
Copy Markdown
Member Author

findepi commented Feb 25, 2021

thanks @erichwang @sopel39 @martint for your comments.

I will work on getting the tests succeed with the check and ask for proper review then.

@findepi findepi force-pushed the findepi/debug-0c0235 branch from 904fd5d to 8705994 Compare March 6, 2021 21:55
@findepi findepi force-pushed the findepi/debug-0c0235 branch from 8705994 to 07d94ea Compare March 9, 2021 12:07
@findepi findepi marked this pull request as ready for review March 9, 2021 12:08
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: can we move this to a checkNoDanglingColumns function?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i prefer not to call functions within constructor (in general at least), so if there isn't a compelling reason to do so, would prefer to keep it here

Copy link
Copy Markdown
Contributor

@erichwang erichwang Mar 10, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean to make it a static function. We call plenty of those here (like checkArg, checkState, etc). The reason is that (A) conveniently gives a name to the operation for documentation purposes (B) limits the scope of what this block of code is doing so that it doesn't have creep in the future.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sure, i can do this, if you feel strongly about it.
i'd prefer to keep as is, though

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yea let's do it. I usually prefer that than having to leave more comments, which this would need without that function context)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

let's also add a comment here that assignments can contain dangling columns (for now)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do you mean? Also, is it related to checkArgument(assignments.keySet().containsAll(outputs), "assignments does not cover all of outputs"); check above?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That check verifies that assignments contains outputs, but assignments can still have more symbols in it than what output produces. So in that sense assignments can have symbol to column handle mappings that are no longer referenced. But as I mentioned in a comment above, this is needed for some obscure reasons in the EffectivePredicateExtractor.

@findepi findepi force-pushed the findepi/debug-0c0235 branch from 07d94ea to 68475a0 Compare March 15, 2021 21:33
findepi added 2 commits March 15, 2021 22:35
The test was constructing `TableScan` with a `Domain` on a non-projected
column, but such columns should be removed from the plan together with
associated constraints. The test input is fixed to adhere to this.
@findepi findepi force-pushed the findepi/debug-0c0235 branch from 68475a0 to 6bc1821 Compare March 15, 2021 21:35
@findepi findepi closed this May 19, 2021
@findepi findepi deleted the findepi/debug-0c0235 branch May 19, 2021 14:31
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Development

Successfully merging this pull request may close these issues.

5 participants