Various aggregation/indexing fixes#2292
Conversation
fulghum
left a comment
There was a problem hiding this comment.
Very cool change to get almost all expressions to have unique IDs. Code looks good, just a couple thoughts and a couple questions where I wasn't sure I understood the exact reason for changes.
| } | ||
|
|
||
| func (e *Alias) WithId(id sql.ColumnId) *Alias { | ||
| func (e *Alias) WithId(id sql.ColumnId) sql.IdExpression { |
There was a problem hiding this comment.
(minor) would be nice to update line 89 to a compiler check for sql.IdExpression to make sure this class stays compliant. Looks like some of the updated types declare the compiler check, but not all.
|
|
||
| for i, col := range s { | ||
| ret[i] = NewGetFieldWithTable(i, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable) | ||
| ret[i] = NewGetFieldWithTable(i+1, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable) |
There was a problem hiding this comment.
(minor) You mentioned the +1 being for subqueries/dual in the PR description. Just wondering if it's worth explaining anything here, or if there's another way to model this. It seems a little bit magic/mysterious to adjust by 1 here. I imagine it would be hard for people to reverse engineer the intent reading this in the future.
I keep wondering if there's a better way for us to model the dual table that could tidy this up... it seems like that "fake" column with an empty name causes some wrinkles in other places. I know it's probably a separate follow up, but if you have any clever ideas here for modeling dual more cleanly I think they'd be worth tracking.
There was a problem hiding this comment.
SchemasToGetFields is a weird short-term gap for update joins that I've failed to get rid of. Can add comment for dual table offset. The dual table generally is definitely a problem. I'm wondering if it can just be flattened with its parent node. Accounting for subquery expressions is probably the tricky part. Subqe, alias references in subqe, dual table columns all kind of collide in unsavory ways. They should be split out into LATERAL joins, eliminating unnecessary dual tables.
sql/planbuilder/aggregates.go
Outdated
| //if c.col != "" { | ||
| proj = append(proj, c.scalarGf()) | ||
| //} |
There was a problem hiding this comment.
(minor) Might as well remove this commented out code if it's not needed.
| //if c.col != "" { | |
| proj = append(proj, c.scalarGf()) | |
| //} | |
| proj = append(proj, c.scalarGf()) |
| " └─ Having\n" + | ||
| " ├─ GreaterThan\n" + | ||
| " │ ├─ sum(lineitem.l_quantity):50!null\n" + | ||
| " │ ├─ sum(lineitem.l_quantity):0!null\n" + |
There was a problem hiding this comment.
I'm not sure I understand why this index changed from 50 to 0. Is this expected?
There was a problem hiding this comment.
The motivating bug was that HAVING expressions resolve the first instance of a definition. Looks like there is an alias on this aggregate, and we now use the id to match the first instance.
There was a problem hiding this comment.
In the general case we try to match the latest definition of a variable, because that one is always in the closest scope. In this cases there is some weirdness where we're considering an alias and its getField child from an aggregate into a HAVING as equivalent. James needs this PR to keep making progress here.
| } | ||
|
|
||
| func (s *scope) resolveColumn(db, table, col string, checkParent bool) (scopeColumn, bool) { | ||
| func (s *scope) resolveColumn(db, table, col string, checkParent, chooseFirst bool) (scopeColumn, bool) { |
There was a problem hiding this comment.
How do callers know when to set chooseFirst to true? The functionality is pretty obvious from the name, but wondering if there's a way to help people know when to apply it.
There was a problem hiding this comment.
It's not totally clear to me either yet, I can add a comment. I think having and grouping scopes were the original motivation. There's more context here dolthub/dolt#6676
sql/planbuilder/project.go
Outdated
| case *expression.Star: | ||
| tableName := strings.ToLower(e.Table) | ||
| if tableName == "" && len(inScope.cols) == 0 { | ||
| if tableName == "" && len(inScope.cols) == 1 && inScope.cols[0].col == "" { |
There was a problem hiding this comment.
This is only to support dual, right? Might be worth a helper function, like isDualColumn() or even testing if cols[0].tableName == 'dual' or something, just to make it super obvious to future people reading this code in the future.
There was a problem hiding this comment.
I can add a "dual" table name to this column, shouldn't hurt anything
sql/analyzer/fix_exec_indexes.go
Outdated
| case *plan.Window: | ||
| for _, e := range n.SelectExprs { | ||
| if ide, ok := e.(sql.IdExpression); ok { | ||
| s.ids = append(s.ids, ide.Id()) | ||
| } | ||
| } | ||
| case *plan.GroupBy: | ||
| for _, e := range n.SelectedExprs { | ||
| if ide, ok := e.(sql.IdExpression); ok { | ||
| s.ids = append(s.ids, ide.Id()) | ||
| } | ||
| } |
There was a problem hiding this comment.
Are there any interfaces we can use to make sure we're grabbing all the right types and not missing any? Seems tricky to make sure to we have every struct listed and to keep it up to date as we add new ones. For example, could we use Projector to collapse Window, GroupBy, and Project into a single case?
There was a problem hiding this comment.
Projector is a good idea. Covering all types would look something like the memo IR absorbing later half of transformation rules, or sql.Node implementing the relational properties* we track for memo nodes. Need everything to implement a common interface.
re: #2271
Use expression ids to fix expression indexes most of the time. This makes it easier to resolve definitions once in planbuilder, rather than having to redo all that work a second time during
assignExecIndexes. This should be more reliable for most queries and make it easier to make alias resolving refactors.Many expressions now implement
sql.IdExpressionand absorb unique expression ids. Trigger variables and stored procedure params still lack ids and fallback to string matching behavior.The biggest lingering issue relates to how dual tables, aliases, and subquery expressions interact. Dual tables have a schema with one column named with an empty string. Subquery expressions expect the +1 column offset from a dual table parent schema (if I'm reading correctly, it might depend on other context). On the other hand, that column can make it difficult to deduplicate projections and throw off execution indexing. There is one query that I haven't been able to get working that has a combination of dual table and alias issues that I think needs a heavier lift to manage correctly.
This query splits the target projection into two levels to account for the dual table column and alias dependency. One thing that does fix this is inlining the alias reference to avoid computing the
rjsubquery a second time in the second alias. But that alias replacement breaks some of theTestOrderByGroupBytests that also have alias issues.