[SPARK-19408][SQL] filter estimation on two columns of same table #17415
ron8hu wants to merge 11 commits into apache:master
Conversation
ok to test
cc @sameeragarwal @cloud-fan @gatorsmile This Jira is not on the Spark 2.2 blocker list. If time permits, we can include it in Spark 2.2; if not, it can wait for a maintenance release. Thanks.
Test build #75186 has finished for PR 17415 at commit
retest this please
Test build #75223 has finished for PR 17415 at commit
I think for EqualTo, there is no complete overlap.
I just revised the code to handle EqualNullSafe separately from EqualTo.
Why does the new ndv only look at ndvLeft?
My bad. I should have removed the line that was commented out. That line is replaced by the following code:
if (rowCountValue != 0) {
  // Need to check attributeStats one by one because we may have multiple output columns.
  // Due to the update operation, the output columns may be in a different order.
  expectedColStats.foreach { kv =>
    val filterColumnStat = filterStats.attributeStats(kv._1)
    assert(filterColumnStat == kv._2)
  }
}
It sounds like we do not yet support a very common constant filter. Let me make a quick fix for that. : ) Our optimizer rule
d6a53ef to 9830a8f
We use Equality above, but we did not handle EqualNullSafe. That will cause a strange MatchError from the unmatched case.
nullCount might not be simply set to zero if we also support EqualNullSafe.
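For context, here is a minimal sketch of why the two operators differ on nulls (illustrative semantics only, with made-up function names; this is not Spark's estimation code):

```scala
// EqualTo (=) evaluates to null when either side is null, so such rows are
// filtered out and the estimated nullCount can drop to 0.
// EqualNullSafe (<=>) evaluates null <=> null to true, so rows where both
// columns are null DO survive, and nullCount must not simply be zeroed.
def equalTo(l: Option[Int], r: Option[Int]): Boolean =
  (l, r) match {
    case (Some(a), Some(b)) => a == b
    case _                  => false // a null comparison filters the row out
  }

def equalNullSafe(l: Option[Int], r: Option[Int]): Boolean =
  (l, r) match {
    case (Some(a), Some(b)) => a == b
    case (None, None)       => true  // null <=> null is true
    case _                  => false
  }
```

So an EqualNullSafe filter on two nullable columns keeps the rows where both values are null, which is exactly the case an EqualTo-style estimate would miss.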
Could we use a white list here? It also makes it easy to see which data types are assumed to be supported in the implementation.
I am afraid we might easily forget to update this if we support a new data type in the future.
The current code is written this way so that we do not have too deep indentation; some engineers dislike deep indentation because they often orient their monitors vertically.
Let's handle it when the need occurs. I think that, with good test case coverage, we will be able to catch anything we miss.
Test build #75288 has finished for PR 17415 at commit
Test build #75284 has finished for PR 17415 at commit
9830a8f to 7abed99
a given column -> two given columns. Both columns' ColumnStat are updated.
(maxLeft <= minRight, minLeft > maxRight)?
Good catch. Fixed.
(maxLeft < minRight, minLeft >= maxRight)?
(minLeft >= maxRight, maxLeft < minRight)?
Good catch. Fixed.
(minLeft > maxRight, maxLeft <= minRight)?
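To make the boundary conditions being reviewed above concrete, here is a small sketch (illustrative only, not Spark's actual code) of when two column ranges rule out any match for each comparison operator:

```scala
// For a predicate "left (op) right" over two columns of one table with
// range statistics [minLeft, maxLeft] and [minRight, maxRight], the filter
// selects no rows ("no overlap") when the predicate is impossible for
// every pair of values:
//   left <  right : impossible when minLeft >= maxRight
//   left <= right : impossible when minLeft >  maxRight
//   left >  right : impossible when maxLeft <= minRight
//   left >= right : impossible when maxLeft <  minRight
def noOverlap(op: String,
              minLeft: Double, maxLeft: Double,
              minRight: Double, maxRight: Double): Boolean = op match {
  case "<"  => minLeft >= maxRight
  case "<=" => minLeft > maxRight
  case ">"  => maxLeft <= minRight
  case ">=" => maxLeft < minRight
  case _    => false // other operators not modeled in this sketch
}
```

Note how the strict and non-strict operators swap `>=` and `>` (and `<=` and `<`) in the impossibility test, which is exactly the off-by-one boundary the review comments above are pointing out.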
Test build #75335 has finished for PR 17415 at commit
64bf43e to 70ac70c
Test build #75367 has finished for PR 17415 at commit
70ac70c to 9b98ff1
Test build #75369 has finished for PR 17415 at commit
Test build #75375 has finished for PR 17415 at commit
Once there is no overlap, is it still meaningful to keep min and max?
we need one more condition: minLeft == minRight, like @gatorsmile suggested #17415 (comment)
You said "we need one more condition: minLeft == minRight". Note that this condition is already included.
@gatorsmile was suggesting "(minRight == maxRight) && (minLeft == minRight) && (maxLeft == maxRight)". This implies that all four values (minLeft, maxLeft, minRight, maxRight) are equal. That is not what I mean by complete overlap. I initially defined "complete overlap" as "complete range overlap". For example, we have the test case test("cint = cint4"). Under @gatorsmile's definition, the case "cint = cint4" becomes a partial overlap with selectivity 0.33, which under-estimates the selectivity. To avoid out-of-memory errors, I prefer over-estimating to under-estimating.
It should also be noted that "complete range overlap" covers "complete point overlap".
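As a rough model of this trade-off (an illustrative sketch in the spirit of the classic 1/max(ndv) equality rule; the names and constants are assumptions, not Spark's exact formula):

```scala
// Illustrative estimator for "col1 = col2" on one table. With complete
// range overlap we use 1 / max(ndvLeft, ndvRight); a stricter overlap
// definition would shrink this toward a partial-overlap estimate, i.e.
// under-estimate. Over-estimating is the safer failure mode: the planner
// reserves more memory rather than less.
def equalitySelectivity(ndvLeft: BigInt, ndvRight: BigInt,
                        rangesOverlap: Boolean): Double =
  if (!rangesOverlap) 0.0
  else 1.0 / ndvLeft.max(ndvRight).toDouble
```

Using the larger of the two distinct counts keeps the estimate conservative: the more distinct values a column has, the fewer rows any single equality match can select.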
Estimation is always hard to make accurate. That is why user-provided hints are very useful for getting the right plan.
I did a search: there are many different estimators, such as uniform estimators, length estimators, digram estimators, minimum-variance estimators, and histogram estimators. Would it be possible to consider the data distribution when deciding the selectivity?
Yes, estimation is always hard to make accurate. We may consider supporting hints in the future.
Good point. We changed the condition to:
(minLeft == minRight) && (maxLeft == maxRight) && allNotNull &&
  (colStatLeft.distinctCount == colStatRight.distinctCount)
I doubt that, when this condition is true, there is a complete overlap between the two columns.
Complete equality between the values of two columns also depends on their order. E.g., when the left values are (1, 2, 3, 4) and the right values are (4, 3, 2, 1), the condition is true, but no row passes the filter predicate left_col = right_col.
Am I missing something?
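The counterexample above can be checked directly (a minimal sketch; the column names are illustrative):

```scala
// Both columns have the same min (1), max (4), no nulls, and the same
// distinct count (4), so the "complete overlap" condition holds -- yet
// no row satisfies left_col = right_col, because the values are reversed.
val leftCol  = Seq(1, 2, 3, 4)
val rightCol = Seq(4, 3, 2, 1)
val matching = leftCol.zip(rightCol).count { case (l, r) => l == r }
// matching == 0: identical column-level statistics say nothing about
// row-level alignment
```

This is why the estimate is only a heuristic upper bound, not a guarantee; min/max/ndv statistics are column-level and carry no information about per-row correlation.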
This is empirical. Without more statistics, it's really hard to do it mathematically.
Agreed. We prefer over-estimation to under-estimation in order to avoid out-of-memory errors.
Test build #75470 has finished for PR 17415 at commit
Test build #75472 has started for PR 17415 at commit
retest this please
&& (colStatLeft.distinctCount == colStatRight.distinctCount)
The indentation is wrong here.
This only checks that expectedColStats is a subset of filterStats.attributeStats; shall we also check the size?
LGTM except some minor comments
Test build #75477 has finished for PR 17415 at commit
Test build #75484 has finished for PR 17415 at commit
LGTM
1 similar comment
LGTM
Thanks, merging to master.
What changes were proposed in this pull request?
In SQL queries we also see predicate expressions involving two columns, such as "column-1 (op) column-2", where column-1 and column-2 belong to the same table. Note that if column-1 and column-2 belong to different tables, handling the predicate is a join operator's job, not a filter operator's.
This PR estimates filter selectivity on two columns of the same table. For example, multiple TPC-H queries contain the predicate "WHERE l_commitdate < l_receiptdate".
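To illustrate the kind of estimate this enables (an assumption-level sketch with illustrative names and a placeholder default, not Spark's exact formula), a range-based model for "col1 < col2" on one table might look like:

```scala
// With only min/max statistics for each column, a simple heuristic
// distinguishes three cases for "col1 < col2":
//   no possible match      -> selectivity 0.0
//   every pair must match  -> selectivity 1.0
//   partial range overlap  -> fall back to a fixed default (1/3 here is
//                             a placeholder assumption, not Spark's value)
def lessThanSelectivity(minLeft: Double, maxLeft: Double,
                        minRight: Double, maxRight: Double): Double =
  if (minLeft >= maxRight) 0.0      // even the smallest left >= largest right
  else if (maxLeft < minRight) 1.0  // even the largest left < smallest right
  else 1.0 / 3.0                    // ranges overlap: use a default guess
```

For "l_commitdate < l_receiptdate", the two date columns' ranges overlap heavily, so such a predicate lands in the partial-overlap case and gets the default selectivity rather than 0 or 1.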
How was this patch tested?
We added 6 new test cases to test various logical predicates involving two columns of same table.
Please review http://spark.apache.org/contributing.html before opening a pull request.