Various aggregation/indexing fixes by max-hoffman · Pull Request #2292 · dolthub/go-mysql-server

max-hoffman · 2024-01-24T22:50:19Z

Use expression ids to fix expression indexes most of the time. This makes it easier to resolve definitions once in planbuilder, rather than having to redo all that work a second time during assignExecIndexes. This should be more reliable for most queries and make it easier to make alias resolving refactors.

Many expressions now implement sql.IdExpression and absorb unique expression ids. Trigger variables and stored procedure params still lack ids and fallback to string matching behavior.

The biggest lingering issue relates to how dual tables, aliases, and subquery expressions interact. Dual tables have a schema with one column named with an empty string. Subquery expressions expect the +1 column offset from a dual table parent schema (if I'm reading correctly, it might depend on other context). On the other hand, that column can make it difficult to deduplicate projections and throw off execution indexing. There is one query that I haven't been able to get working that has a combination of dual table and alias issues that I think needs a heavier lift to manage correctly.

                        SELECT
			"testing" AS s,
			(SELECT max(i)
			 FROM (SELECT * FROM mytable) mytable
			 RIGHT JOIN
				((SELECT i2, s2 FROM othertable ORDER BY i2 ASC)
				 UNION ALL
				 SELECT CAST(4 AS SIGNED) AS i2, "not found" AS s2 FROM DUAL) othertable
				ON i2 = i) AS rj
			FROM DUAL`

This query splits the target projection into two levels to account for the dual table column and alias dependency. One thing that does fix this is inlining the alias reference to avoid computing the rj subquery a second time in the second alias. But that alias replacement breaks some of the TestOrderByGroupBy tests that also have alias issues.

fulghum

Very cool change to get almost all expressions to have unique IDs. Code looks good, just a couple thoughts and a couple questions where I wasn't sure I understood the exact reason for changes.

fulghum · 2024-01-25T22:39:17Z

sql/expression/alias.go

 }

-func (e *Alias) WithId(id sql.ColumnId) *Alias {
+func (e *Alias) WithId(id sql.ColumnId) sql.IdExpression {


(minor) would be nice to update line 89 to a compiler check for sql.IdExpression to make sure this class stays compliant. Looks like some of the updated types declare the compiler check, but not all.

fulghum · 2024-01-25T22:50:04Z

sql/expression/get_field.go


 	for i, col := range s {
-		ret[i] = NewGetFieldWithTable(i, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable)
+		ret[i] = NewGetFieldWithTable(i+1, 0, col.Type, col.DatabaseSource, col.Source, col.Name, col.Nullable)


(minor) You mentioned the +1 being for subqueries/dual in the PR description. Just wondering if it's worth explaining anything here, or if there's another way to model this. It seems a little bit magic/mysterious to adjust by 1 here. I imagine it would be hard for people to reverse engineer the intent reading this in the future.

I keep wondering if there's a better way for us to model the dual table that could tidy this up... it seems like that "fake" column with an empty name causes some wrinkles in other places. I know it's probably a separate follow up, but if you have any clever ideas here for modeling dual more cleanly I think they'd be worth tracking.

SchemasToGetFields is a weird short-term gap for update joins that I've failed to get rid of. Can add comment for dual table offset. The dual table generally is definitely a problem. I'm wondering if it can just be flattened with its parent node. Accounting for subquery expressions is probably the tricky part. Subqe, alias references in subqe, dual table columns all kind of collide in unsavory ways. They should be split out into LATERAL joins, eliminating unnecessary dual tables.

fulghum · 2024-01-25T22:51:37Z

sql/planbuilder/aggregates.go

+		//if c.col != "" {
+		proj = append(proj, c.scalarGf())
+		//}


(minor) Might as well remove this commented out code if it's not needed.

Suggested change

//if c.col != "" {

proj = append(proj, c.scalarGf())

//}

proj = append(proj, c.scalarGf())

fulghum · 2024-01-26T01:07:00Z

enginetest/queries/tpch_plans.go

 			"                 └─ Having\n" +
 			"                     ├─ GreaterThan\n" +
-			"                     │   ├─ sum(lineitem.l_quantity):50!null\n" +
+			"                     │   ├─ sum(lineitem.l_quantity):0!null\n" +


I'm not sure I understand why this index changed from 50 to 0. Is this expected?

The motivating bug was that HAVING expressions resolve the first instance of a definition. Looks like there is an alias on this aggregate, and we now use the id to match the first instance.

In the general case we try to match the latest definition of a variable, because that one is always in the closest scope. In this cases there is some weirdness where we're considering an alias and its getField child from an aggregate into a HAVING as equivalent. James needs this PR to keep making progress here.

fulghum · 2024-01-26T01:12:45Z

sql/planbuilder/scope.go

 }

-func (s *scope) resolveColumn(db, table, col string, checkParent bool) (scopeColumn, bool) {
+func (s *scope) resolveColumn(db, table, col string, checkParent, chooseFirst bool) (scopeColumn, bool) {


How do callers know when to set chooseFirst to true? The functionality is pretty obvious from the name, but wondering if there's a way to help people know when to apply it.

It's not totally clear to me either yet, I can add a comment. I think having and grouping scopes were the original motivation. There's more context here dolthub/dolt#6676

fulghum · 2024-01-26T19:08:33Z

sql/planbuilder/project.go

 		case *expression.Star:
 			tableName := strings.ToLower(e.Table)
-			if tableName == "" && len(inScope.cols) == 0 {
+			if tableName == "" && len(inScope.cols) == 1 && inScope.cols[0].col == "" {


This is only to support dual, right? Might be worth a helper function, like isDualColumn() or even testing if cols[0].tableName == 'dual' or something, just to make it super obvious to future people reading this code in the future.

I can add a "dual" table name to this column, shouldn't hurt anything

fulghum · 2024-01-26T19:38:23Z

sql/analyzer/fix_exec_indexes.go

+		case *plan.Window:
+			for _, e := range n.SelectExprs {
+				if ide, ok := e.(sql.IdExpression); ok {
+					s.ids = append(s.ids, ide.Id())
+				}
+			}
+		case *plan.GroupBy:
+			for _, e := range n.SelectedExprs {
+				if ide, ok := e.(sql.IdExpression); ok {
+					s.ids = append(s.ids, ide.Id())
+				}
+			}


Are there any interfaces we can use to make sure we're grabbing all the right types and not missing any? Seems tricky to make sure to we have every struct listed and to keep it up to date as we add new ones. For example, could we use Projector to collapse Window, GroupBy, and Project into a single case?

Projector is a good idea. Covering all types would look something like the memo IR absorbing later half of transformation rules, or sql.Node implementing the relational properties* we track for memo nodes. Need everything to implement a common interface.

max-hoffman added 9 commits January 24, 2024 08:17

Fix agg ids

aabf1e1

most plans except updateSource working

cf074be

simple queries passing

a188012

fix tests

8189db6

comments

f5bf85f

various exprsesion id fixes

89f5130

cleanup

1d0c405

parse_test output

473da8c

comment for error query

173bf43

fulghum approved these changes Jan 26, 2024

View reviewed changes

max-hoffman added 2 commits January 30, 2024 13:53

Merge branch 'main' into max/ambiguous-agg

a59b811

jason's comments

ec28769

max-hoffman merged commit 7d89166 into main Jan 30, 2024

max-hoffman deleted the max/ambiguous-agg branch January 30, 2024 22:54

BrewTestBot mentioned this pull request Feb 1, 2024

dolt 1.32.5 Homebrew/homebrew-core#161477

Merged

Uh oh!

Conversation

max-hoffman commented Jan 24, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

fulghum left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

max-hoffman Jan 30, 2024 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

max-hoffman commented Jan 24, 2024 •

edited

Loading

max-hoffman Jan 30, 2024 •

edited

Loading