[SPARK-27905] [SQL] Add higher order function 'forall' #24761
nvander1 wants to merge 6 commits into apache:master from nvander1:feature/for_all
|
Do you know if any DBMS has this function? |
|
I don't feel strongly about it. If |
|
@srowen The latest commit should address that issue. Thanks for pointing it out! :) |
|
@ueshin maybe you want to look as you added |
|
ok to test cc @hvanhovell |
|
ok to test |
|
@nvander1 can you clarify whether there's any reference for this function? I'm asking about its name and behaviour, and whether users actually need it. |
    assert(ex4.getMessage.contains("cannot resolve '`a`'"))
  }

  test("forall function - array for primitive type not containing null") {
I think most/all of this should be covered by unit tests. You can add a single test to validate that the function registry works, if you must. I know you mirrored the tests for exists and I think they have the same problem. In general we should test the interface here (including the errors), and not so much the underlying functionality (that should be covered by UTs).
I agree, but I remember that the behavior used to differ depending on whether whole-stage codegen was on or off, so we decided to test it this way as a workaround (#21795).
ueshin left a comment
The implementation itself LGTM.
I'm not sure whether we need this or not yet. We had a short discussion before (SPARK-25068).
Also I'm wondering about the comment above (#24761 (comment)).
|
Test build #106139 has finished for PR 24761 at commit
|
|
Can we just logically rewrite this to use exists, rather than having two physical implementations? |
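One way to read the suggestion above, as a minimal sketch on plain Scala collections rather than Catalyst expressions: forall can be derived from exists through De Morgan's law. The name `forallViaExists` is hypothetical, and the sketch only covers two-valued Booleans; it makes no claim about how null elements should behave.

```scala
object ForAllViaExistsSketch {
  // Hypothetical helper, not Spark code: "p holds for every element"
  // is the same as "no element exists for which p fails".
  def forallViaExists[A](xs: Seq[A])(p: A => Boolean): Boolean =
    !xs.exists(x => !p(x))
}
```

On an empty sequence `exists` finds nothing, so the negation yields true, which matches the usual convention for forall.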
|
@ueshin Re: the references for forall, it's a fairly standard higher-order function, like exists, filter, and map: https://www.scala-lang.org/api/current/scala/collection/IterableLike.html#forall(p:A=%3EBoolean):Boolean |
|
@ueshin Re: the whole-stage codegen, do you mean to use this checkResult2 to test whether the results are the same with and without it? |
      i += 1
    }
    !check(continue)
  }
I think the name for continue makes sense here, but I can see that the final !check(continue) returned value may be confusing to read. Anyone have suggestions to make the intent more clear here?
What is the intent of the !check(continue) ?
@yeikel perhaps the following implementation would be more clear?
    override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
      val arr = argumentValue.asInstanceOf[ArrayData]
      val f = functionForEval
      var res = emptyRes
      var i = 0
      while (!isConfirmed(res) && i < arr.numElements) {
        elementVar.value.set(arr.get(i, elementVar.dataType))
        res = f.eval(inputRow).asInstanceOf[Boolean]
        i += 1
      }
      res
    }
Where isConfirmed represents whether we can break out early from our while loop:
For ArrayExists, we can break out early as soon as we find an element that matches the predicate.
For ArrayForAll, we can break out early as soon as we find an element that does NOT match the predicate.
So for ArrayExists, we define the emptyRes to be false since there are no elements in the array to satisfy the predicate. And we define the isConfirmed to be just the result of the predicate on the most recent element.
For ArrayForAll, we define the emptyRes to be true since the predicate holds for every element of an empty array. And we define the isConfirmed to be the negation of the result of the predicate on the most recent element.
This is similar to the approach used by the Scala stdlib: https://github.com/scala/scala/blob/v2.13.0/src/library/scala/collection/IterableOnce.scala#L587-L606
The stdlib does not abstract the operation over forall and exists, though. I'm all for keeping the code DRY, as @rxin's suggestion prompted, but if we can't find a way to do so that is easy to understand, maybe we should just have two similar implementations.
Here is a branch that I can merge into this one if needed with the changes I described above:
https://github.com/nvander1/spark/commit/aa5c94f5fb5ce9d677a65af7184c35752d2ca491
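To make the emptyRes / isConfirmed idea described above concrete, here is a minimal standalone sketch over a plain `Seq`, not the Catalyst expression itself. The names `scan`, `exists`, and `forall` and the object `ShortCircuitSketch` are hypothetical and only mirror the shape of the proposed `nullSafeEval`; null handling is deliberately left out.

```scala
object ShortCircuitSketch {
  // Hypothetical helper, not Spark code: one loop shared by exists and forall.
  // emptyRes is the answer for an empty input; isConfirmed says whether the
  // most recent predicate result already settles the final answer.
  def scan[A](xs: Seq[A], emptyRes: Boolean, isConfirmed: Boolean => Boolean)(
      p: A => Boolean): Boolean = {
    var res = emptyRes
    var i = 0
    while (!isConfirmed(res) && i < xs.length) {
      res = p(xs(i))
      i += 1
    }
    res
  }

  // exists: starts at false, stops at the first element that matches.
  def exists[A](xs: Seq[A])(p: A => Boolean): Boolean =
    scan(xs, emptyRes = false, isConfirmed = r => r)(p)

  // forall: starts at true, stops at the first element that does NOT match.
  def forall[A](xs: Seq[A])(p: A => Boolean): Boolean =
    scan(xs, emptyRes = true, isConfirmed = r => !r)(p)
}
```

For example, `ShortCircuitSketch.forall(Seq(2, 3, 8))(_ % 2 == 0)` stops at the second element, because the false result for 3 already confirms the answer.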
  extends ArrayExistsForAllBase {
  override def prettyName: String = "forall"
  override def check(cond: Boolean): Boolean = !cond
  override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArrayForAll = {
I tried to factor out the bind definition as well here, but there seems to be a known issue on trying to rely on the generated copy method of a case class in its parent trait: https://www.scala-lang.org/old/node/6369
Aside: this is also the same bind method for ArrayFilter
|
@rxin Refactored so ArrayExists and ArrayForAll share an implementation |
|
Test build #106278 has finished for PR 24761 at commit
|
|
re: #24761 (comment) i'm okie with this. |
|
cc @ueshin |
|
@nvander1 I'm sorry for the delay. After the discussion, I'm now okay with providing this function: even if the rewrite works under three-valued boolean logic, a dedicated function is easier for users and gives them more confidence. Let me fix it first. Then, although I'm not sure whether the two can still share an implementation, could you apply the same fix here as well? Thanks. |
|
I submitted a PR #24873. |
|
@nvander1 My PR #24873 was merged.
See also: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21197 Thanks! |
|
@ueshin Should there be a conf setting for following three valued logic on
|
Also apply 3 valued logic to ArrayForAll
|
Added in the three valued logic to ArrayForAll. |
|
Test build #106562 has finished for PR 24761 at commit
|
|
@rxin @ueshin @gatorsmile @HyukjinKwon @hvanhovell I don't think the build failure is related to my changes. It is a failure in a Spark Streaming fault-tolerance test. |
      > SELECT _FUNC_(array(2, 4, 8), x -> x % 2 == 0);
       true
      > SELECT _FUNC_(array(1, null, 3), x -> x % 2 == 0);
       NULL
if the right-hand array contains any null elements and no false comparison result is obtained, the result of ALL will be null, not true
    presto> SELECT 1 > ALL (VALUES 2, null, 3);
     _col0
    -------
     false
    (1 row)
Shouldn't the result of the last example be false?
yeah, it should be false.
For example,
SELECT _FUNC_(array(2, null, 8), x -> x % 2 == 0);
should be null.
    var forall = true
    var foundNull = false
    var i = 0
    while (i < arr.numElements && (forall || !foundNull)) {
We can't break on foundNull, since there may still be a false after the null. We need to keep looking for a false even after we have found a null.
      null
    } else {
      forall
    }
How about this?
    if (!forall) {
      false
    } else if (foundNull) {
      null
    } else {
      true
    }
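The review points above (a null must not end the scan early, and a false wins over a null) can be sketched on plain Scala values. This is an illustration under stated assumptions, not the merged Catalyst code: `None` stands in for SQL NULL, the predicate is assumed to yield NULL for a null element (as `x % 2 == 0` does), and the object and method names are hypothetical.

```scala
object ThreeValuedForAllSketch {
  // Hypothetical stand-in for the expression: None models SQL NULL.
  def forall[A](xs: Seq[Option[A]])(p: A => Boolean): Option[Boolean] = {
    var allTrue = true
    var foundNull = false
    var i = 0
    // Only a false result may end the scan early. A null may not,
    // because a later false would still make the overall answer false.
    while (i < xs.length && allTrue) {
      xs(i) match {
        case None    => foundNull = true
        case Some(v) => if (!p(v)) allTrue = false
      }
      i += 1
    }
    if (!allTrue) Some(false)   // some element failed the predicate
    else if (foundNull) None    // no failure, but at least one unknown
    else Some(true)             // every element passed
  }
}
```

Under these assumptions, `forall(Seq[Option[Int]](Some(1), None, Some(3)))(_ % 2 == 0)` gives `Some(false)`, while `forall(Seq[Option[Int]](Some(2), None, Some(8)))(_ % 2 == 0)` gives `None`, matching the two cases discussed in the review.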
|
Should this be null or false? It seems that this should be false, but following your suggestion would result in null. |
|
I think this summarizes the expected behavior for forall with three valued logic: And this is captured by the following check What do you think? |
I don't think it returns null; it properly returns false. Thanks for the summary. That is the expected behavior, and the logic should work as well. |
|
LGTM pending tests. cc @viirya |
|
Test build #108490 has finished for PR 24761 at commit
|
|
Test build #108534 has finished for PR 24761 at commit
|
HyukjinKwon left a comment
Looks fine to me too but let me leave it to @ueshin
|
I'm sorry for the delay. |
|
Jenkins, retest this please. |
|
Test build #108728 has finished for PR 24761 at commit
|
|
Thanks! merging to master. |
### What changes were proposed in this pull request?
This is a follow-up of #24761 which added a higher-order function `ArrayForAll`. The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`.
### Why are the changes needed?
This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around.
### Does this PR introduce any user-facing change?
No.
### How was this patch tested?
Existing tests.
Closes #25501 from ueshin/issues/SPARK-27905/pretty_names.
Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
What changes were proposed in this pull request?
Adds the higher-order function forall, which tests an array to see if a predicate holds for every element.
The function is implemented in org.apache.spark.sql.catalyst.expressions.ArrayForAll.
The function is added to the function registry under the pretty name forall.
How was this patch tested?
I've added appropriate unit tests for the new ArrayForAll expression in sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala.
Also added tests for the function in sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala.
Not sure who is best to ask about this PR, so:
@HyukjinKwon @rxin @gatorsmile @ueshin @srowen @hvanhovell @gatorsmile