Conversation
bbfc312 to b239ec0 (…ver into max/join-estimation)
enginetest/histogram_test.go (outdated)
runSuite(t, tests, 1000, 10, false)
}

func runSuite(t *testing.T, tests []statsTest, rowCnt, bucketCnt int, debug bool) {
Can you add a docstring explaining what runSuite is evaluating? I gather that it's getting the histogram of the join between the tables, but what is it asserting about the histogram?
}
}

func testHistogram(ctx *sql.Context, table *plan.ResolvedTable, fields []int, buckets int) ([]sql.HistogramBucket, error) {
This feels like a lot of logic for a test. I'm worried we're going to spend a lot of time testing the test. Is there not some equivalent code that GMS uses to compute histograms?
The Dolt side is unique, and I made the GMS side use reservoir sampling to skip steps while bootstrapping. We could probably use this one for the GMS implementation as a default; I doubt anyone will earnestly use the memory stats soon.
expRight []sql.HistogramBucket
}{
{
left: []sql.HistogramBucket{
What determines how many buckets are in the expected output? I was hoping the tests would make it clear, but I'm not sure why, for instance, there's no 0 upper bound in the output of the second test.
Related to the other comment: I don't know how to split (-infinity, 10) into (-infinity, 0) and (0, 10). So I extend (-infinity, 0) to (-infinity, 10) by stealing (0, 10) from (0, 50).
Hmm... could we avoid this by requiring that the first bucket has a count of 0 (and thus establishes a lower bound for all buckets)? Or by using a struct { lowerBound sql.Row, buckets []sql.HistogramBucket }?
I'm not against tracking a lower bound; I just didn't see that in other DBs' implementations, so I was suspicious of whether it was practical. Pretty easy improvement later if so.
I ended up having to implement lower bounds for the blog demo to work; it makes a big difference. On the Dolt side I'm going to keep the bound in memory; storing it as a row with the rest of the histograms was a bit awkward.
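A sketch of the struct-with-lower-bound shape suggested in this thread; Bucket and Histogram are simplified stand-ins (int bounds instead of sql.Row), not the actual GMS types:

```go
package main

import "fmt"

// Bucket is a simplified histogram bucket with an inclusive upper bound.
type Bucket struct {
	RowCnt      uint64
	DistinctCnt uint64
	UpperBound  int
}

// Histogram carries the global lower bound explicitly, instead of
// encoding it as a synthetic zero-count first bucket.
type Histogram struct {
	LowerBound int // lower bound of the first bucket
	Buckets    []Bucket
}

// bucketWidth is well-defined for every bucket, including the first,
// once the histogram carries an explicit lower bound.
func (h Histogram) bucketWidth(i int) int {
	lo := h.LowerBound
	if i > 0 {
		lo = h.Buckets[i-1].UpperBound
	}
	return h.Buckets[i].UpperBound - lo
}

func main() {
	h := Histogram{
		LowerBound: 0,
		Buckets: []Bucket{
			{RowCnt: 20, DistinctCnt: 10, UpperBound: 10},
			{RowCnt: 10, DistinctCnt: 5, UpperBound: 50},
		},
	}
	fmt.Println(h.bucketWidth(0), h.bucketWidth(1)) // 10 40
}
```

With an explicit lower bound there is no (-infinity, x) first bucket to special-case when cutting or aligning histograms.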
exp sql.Histogram
}{
{
left: []sql.HistogramBucket{
I want to make sure I'm understanding this example right.
On the left side of the merge, there are 20 distinct values, assumed to be uniformly distributed between 0 and 10.
On the right side of the merge, there are 10 distinct values, assumed to be uniformly distributed between 0 and 20.
We expect the histogram for the merge to say that there are 10 distinct values uniformly distributed between 0 and 10.
This doesn't square with my intuition, since I would expect only half of the values from the right to be possible join candidates, so the expected number of rows should be at most 5.
I may not be properly understanding what the histogram represents or how it's being computed though.
I think what you are suggesting is that we could truncate the RHS to a bound value of 10 and delete half of the contents. But the valid ranges are negative infinity to the bound value. Most of the keys might be negative, all of the keys could overlap, or none could. Different databases handle the first-bucket special case differently; I haven't had a lot of time to test alternatives.
Ah right, these have -infinity as the minimum bound. The result here makes sense then.
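The expected result in this example lines up with the textbook per-bucket equijoin estimate under uniformity and containment assumptions. A hedged sketch of that formula (not necessarily the exact computation this PR implements): rows ~ rowsL * rowsR / max(dL, dR), distinct ~ min(dL, dR).

```go
package main

import "fmt"

// joinBucketEstimate is the classic equijoin estimate for one pair of
// aligned buckets: assume values are uniformly distributed and that the
// side with fewer distinct values is contained in the other side.
func joinBucketEstimate(rowsL, distinctL, rowsR, distinctR uint64) (rows, distinct uint64) {
	maxD, minD := distinctL, distinctR
	if distinctR > maxD {
		maxD = distinctR
	}
	if distinctL < minD {
		minD = distinctL
	}
	if maxD == 0 {
		return 0, 0
	}
	return rowsL * rowsR / maxD, minD
}

func main() {
	// The example above: left has 20 rows / 20 distinct, right has
	// 10 rows / 10 distinct, over the same (-inf, 10] bucket.
	rows, distinct := joinBucketEstimate(20, 20, 10, 10)
	fmt.Println(rows, distinct) // 10 10
}
```

Each of the 10 right-side values matches 20/20 = 1 left row on average, giving 10 output rows and 10 distinct values, which is the expected histogram above.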
sql/stats/join_test.go (outdated)
},
expLeft: []sql.HistogramBucket{
&Bucket{RowCnt: 12, DistinctCnt: 12, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 6, DistinctCnt: 6, BoundVal: sql.Row{20}, BoundCnt: 1},
Shouldn't this be 2 for the row bound by 20, and 6 for the row bound by 50?
If I'm reading left correctly, then the range from 0 to 50 has a density of 2 rows per 10 values. So the bucket with a range of 10 (10-20) should have 2, and the bucket with a range of 30 (20-50) should have 6.
Yeah, you're right, I'll see what's going on there.
Looks like I swapped the cut fractions, good eye.
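The fix amounts to splitting a bucket's row count in proportion to range width. A sketch under the uniformity assumption (a hypothetical helper with int bounds, not the PR's actual code):

```go
package main

import "fmt"

// cutBucket splits a bucket covering (lo, hi] with rowCnt rows at the
// value cut, assuming rows are spread uniformly over the range: each
// piece gets rows in proportion to its share of the range.
func cutBucket(lo, hi, cut int, rowCnt uint64) (left, right uint64) {
	left = rowCnt * uint64(cut-lo) / uint64(hi-lo)
	return left, rowCnt - left
}

func main() {
	// The case discussed above: a bucket (10, 50] holding 8 rows, cut
	// at 20, should yield 2 rows for (10, 20] and 6 rows for (20, 50],
	// matching the 2-rows-per-10-values density.
	l, r := cutBucket(10, 50, 20, 8)
	fmt.Println(l, r) // 2 6
}
```

Swapping the fractions here is exactly the bug flagged above: (10, 20] would get 6 rows and (20, 50] only 2.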
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{50}, BoundCnt: 1},
},
expLeft: []sql.HistogramBucket{
Similar to above, couldn't the result of this be:
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{0}, BoundCnt: 1},
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, BoundCnt: 1},
&Bucket{RowCnt: 23, DistinctCnt: 3, BoundVal: sql.Row{20}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{30}, BoundCnt: 1},
&Bucket{RowCnt: 3, DistinctCnt: 3, BoundVal: sql.Row{40}, BoundCnt: 1},
sql/stats/join_test.go (outdated)
&Bucket{RowCnt: 20, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{1}, {2}}, McvsCnt: []uint64{5, 5}, BoundCnt: 1},
},
right: []sql.HistogramBucket{
&Bucket{RowCnt: 10, DistinctCnt: 10, BoundVal: sql.Row{10}, McvVals: []sql.Row{{2}}, McvsCnt: []uint64{4}, BoundCnt: 1},
My first thought looking at this test was: "Every row in the right is distinct, so there can't be more than 20 rows in the join." Then I was surprised that the expected result was 30.
I think the bucket here is impossible, since it's not possible to have 10 values, all distinct, and also have one value with a count of 4.
Yeah this tracks, will fix
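The inconsistency flagged here can be caught mechanically: the MCV rows, plus at least one row for every other distinct value, cannot exceed the bucket's total row count. A hypothetical sanity check (not part of the PR):

```go
package main

import "fmt"

// validBucket reports whether a bucket's MCV counts are consistent with
// its row and distinct counts. Each MCV contributes its full count of
// rows; every remaining distinct value needs at least one row.
func validBucket(rowCnt, distinctCnt uint64, mcvCnts []uint64) bool {
	var mcvRows uint64
	for _, c := range mcvCnts {
		if c == 0 {
			return false // an MCV with zero occurrences is meaningless
		}
		mcvRows += c
	}
	if uint64(len(mcvCnts)) > distinctCnt {
		return false // more MCVs than distinct values
	}
	// Non-MCV distinct values need at least one row each.
	return mcvRows+(distinctCnt-uint64(len(mcvCnts))) <= rowCnt
}

func main() {
	// The impossible bucket above: 10 rows, 10 distinct, one value with
	// count 4 would require at least 4 + 9 = 13 rows.
	fmt.Println(validBucket(10, 10, []uint64{4})) // false
	// The left bucket in this test is fine: 5 + 5 + 8 = 18 <= 20 rows.
	fmt.Println(validBucket(20, 10, []uint64{5, 5})) // true
}
```

A check like this on test fixtures would have rejected the bucket before it skewed the expected join estimate.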
Code and tests related to estimating output distributions for joins. Limited to numeric types.