Skip to content

Various aggregation/indexing fixes#2292

Merged
max-hoffman merged 11 commits intomainfrom
max/ambiguous-agg
Jan 30, 2024
Merged

Various aggregation/indexing fixes#2292
max-hoffman merged 11 commits intomainfrom
max/ambiguous-agg

Conversation

@max-hoffman
Copy link
Contributor

@max-hoffman max-hoffman commented Jan 24, 2024

re: #2271

Use expression ids to fix expression indexes most of the time. This makes it easier to resolve definitions once in planbuilder, rather than having to redo all that work a second time during assignExecIndexes. This should be more reliable for most queries and make it easier to make alias resolving refactors.

Many expressions now implement sql.IdExpression and absorb unique expression ids. Trigger variables and stored procedure params still lack ids and fallback to string matching behavior.

The biggest lingering issue relates to how dual tables, aliases, and subquery expressions interact. Dual tables have a schema with one column named with an empty string. Subquery expressions expect the +1 column offset from a dual table parent schema (if I'm reading correctly, it might depend on other context). On the other hand, that column can make it difficult to deduplicate projections and throw off execution indexing. There is one query that I haven't been able to get working that has a combination of dual table and alias issues that I think needs a heavier lift to manage correctly.

                        SELECT
			"testing" AS s,
			(SELECT max(i)
			 FROM (SELECT * FROM mytable) mytable
			 RIGHT JOIN
				((SELECT i2, s2 FROM othertable ORDER BY i2 ASC)
				 UNION ALL
				 SELECT CAST(4 AS SIGNED) AS i2, "not found" AS s2 FROM DUAL) othertable
				ON i2 = i) AS rj
			FROM DUAL`

This query splits the target projection into two levels to account for the dual table column and alias dependency. One thing that does fix this is inlining the alias reference to avoid computing the rj subquery a second time in the second alias. But that alias replacement breaks some of the TestOrderByGroupBy tests that also have alias issues.

Copy link
Contributor

@fulghum fulghum left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very cool change to get almost all expressions to have unique IDs. Code looks good, just a couple thoughts and a couple questions where I wasn't sure I understood the exact reason for changes.

}

func (e *Alias) WithId(id sql.ColumnId) *Alias {
func (e *Alias) WithId(id sql.ColumnId) sql.IdExpression {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) would be nice to update line 89 to a compiler check for sql.IdExpression to make sure this class stays compliant. Looks like some of the updated types declare the compiler check, but not all.


for i, col := range s {
ret[i] = NewGetFieldWithTable(i, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable)
ret[i] = NewGetFieldWithTable(i+1, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) You mentioned the +1 being for subqueries/dual in the PR description. Just wondering if it's worth explaining anything here, or if there's another way to model this. It seems a little bit magic/mysterious to adjust by 1 here. I imagine it would be hard for people to reverse engineer the intent reading this in the future.

I keep wondering if there's a better way for us to model the dual table that could tidy this up... it seems like that "fake" column with an empty name causes some wrinkles in other places. I know it's probably a separate follow up, but if you have any clever ideas here for modeling dual more cleanly I think they'd be worth tracking.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SchemasToGetFields is a weird short-term gap for update joins that I've failed to get rid of. Can add comment for dual table offset. The dual table generally is definitely a problem. I'm wondering if it can just be flattened with its parent node. Accounting for subquery expressions is probably the tricky part. Subqe, alias references in subqe, dual table columns all kind of collide in unsavory ways. They should be split out into LATERAL joins, eliminating unnecessary dual tables.

Comment on lines +787 to +789
//if c.col != "" {
proj = append(proj, c.scalarGf())
//}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

(minor) Might as well remove this commented out code if it's not needed.

Suggested change
//if c.col != "" {
proj = append(proj, c.scalarGf())
//}
proj = append(proj, c.scalarGf())

" └─ Having\n" +
" ├─ GreaterThan\n" +
" │ ├─ sum(lineitem.l_quantity):50!null\n" +
" │ ├─ sum(lineitem.l_quantity):0!null\n" +
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand why this index changed from 50 to 0. Is this expected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The motivating bug was that HAVING expressions resolve the first instance of a definition. Looks like there is an alias on this aggregate, and we now use the id to match the first instance.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the general case we try to match the latest definition of a variable, because that one is always in the closest scope. In this cases there is some weirdness where we're considering an alias and its getField child from an aggregate into a HAVING as equivalent. James needs this PR to keep making progress here.

}

func (s *scope) resolveColumn(db, table, col string, checkParent bool) (scopeColumn, bool) {
func (s *scope) resolveColumn(db, table, col string, checkParent, chooseFirst bool) (scopeColumn, bool) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How do callers know when to set chooseFirst to true? The functionality is pretty obvious from the name, but wondering if there's a way to help people know when to apply it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not totally clear to me either yet, I can add a comment. I think having and grouping scopes were the original motivation. There's more context here dolthub/dolt#6676

case *expression.Star:
tableName := strings.ToLower(e.Table)
if tableName == "" && len(inScope.cols) == 0 {
if tableName == "" && len(inScope.cols) == 1 && inScope.cols[0].col == "" {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is only to support dual, right? Might be worth a helper function, like isDualColumn() or even testing if cols[0].tableName == 'dual' or something, just to make it super obvious to future people reading this code in the future.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add a "dual" table name to this column, shouldn't hurt anything

Comment on lines +448 to +459
case *plan.Window:
for _, e := range n.SelectExprs {
if ide, ok := e.(sql.IdExpression); ok {
s.ids = append(s.ids, ide.Id())
}
}
case *plan.GroupBy:
for _, e := range n.SelectedExprs {
if ide, ok := e.(sql.IdExpression); ok {
s.ids = append(s.ids, ide.Id())
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are there any interfaces we can use to make sure we're grabbing all the right types and not missing any? Seems tricky to make sure to we have every struct listed and to keep it up to date as we add new ones. For example, could we use Projector to collapse Window, GroupBy, and Project into a single case?

Copy link
Contributor Author

@max-hoffman max-hoffman Jan 30, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Projector is a good idea. Covering all types would look something like the memo IR absorbing later half of transformation rules, or sql.Node implementing the relational properties* we track for memo nodes. Need everything to implement a common interface.

@max-hoffman max-hoffman merged commit 7d89166 into main Jan 30, 2024
@max-hoffman max-hoffman deleted the max/ambiguous-agg branch January 30, 2024 22:54
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants