Skip to content

Comments

[SPARK-27905] [SQL] Add higher order function 'forall'#24761

Closed
nvander1 wants to merge 6 commits intoapache:masterfrom
nvander1:feature/for_all
Closed

[SPARK-27905] [SQL] Add higher order function 'forall'#24761
nvander1 wants to merge 6 commits intoapache:masterfrom
nvander1:feature/for_all

Conversation

@nvander1
Copy link
Contributor

What changes were proposed in this pull request?

Add's the higher order function forall, which tests an array to see if a predicate holds for every element.
The function is implemented in org.apache.spark.sql.catalyst.expressions.ArrayForAll.
The function is added to the function registry under the pretty name forall.

How was this patch tested?

I've added appropriate unit tests for the new ArrayForAll expression in
sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/HigherOrderFunctionsSuite.scala.

Also added tests for the function in sql/core/src/test/scala/org/apache/spark/sql/DataFrameFunctionsSuite.scala.

Not sure who is best to ask about this PR so:
@HyukjinKwon @rxin @gatorsmile @ueshin @srowen @hvanhovell @gatorsmile

@HyukjinKwon
Copy link
Member

Do you know if any DBMS has this function?

@srowen
Copy link
Member

srowen commented Jun 1, 2019

I don't feel strongly about it. If exists is implemented, this seems reasonable to add, even if it's niche. I think you'd have to modify ReplaceNullWithFalseInPredicateSuite too? just looking at how ArrayExists works.

@nvander1
Copy link
Contributor Author

nvander1 commented Jun 3, 2019

@srowen The latest commit should address that issue. Thanks for pointing it out! :)

@srowen
Copy link
Member

srowen commented Jun 3, 2019

@ueshin maybe you want to look as you added ArrayExists, etc

@gatorsmile
Copy link
Member

ok to test

cc @hvanhovell

@hvanhovell
Copy link
Contributor

ok to test

@HyukjinKwon
Copy link
Member

@nvander1 can you clarify if there's any reference for this function? I am asking this for its name, behaviour and see if users actually need it.

assert(ex4.getMessage.contains("cannot resolve '`a`'"))
}

test("forall function - array for primitive type not containing null") {
Copy link
Contributor

@hvanhovell hvanhovell Jun 4, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think most/all of this should be covered by unit tests. You can add a single test to validate that the function registry works, if you must. I know you mirrored the tests for exists and I think they have the same problem. In general we should test the interface here (including the errors), and not so much the underlying functionality (that should be covered by UTs).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with it, but I remember there was a problem that the behavior was different between the whole stage codegen on/off before, then we decided to do like this as a workaround (#21795).

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The implementation itself LGTM.
I'm not sure whether we need this or not yet. We had a short discussion before (SPARK-25068).
Also I'm wondering about the comment above (#24761 (comment)).

assert(ex4.getMessage.contains("cannot resolve '`a`'"))
}

test("forall function - array for primitive type not containing null") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with it, but I remember there was a problem that the behavior was different between the whole stage codegen on/off before, then we decided to do like this as a workaround (#21795).

@SparkQA
Copy link

SparkQA commented Jun 4, 2019

Test build #106139 has finished for PR 24761 at commit abef9cf.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rxin
Copy link
Contributor

rxin commented Jun 5, 2019

Can we just logically rewrite this to use exists, rather than having two physical implementations?

@nvander1
Copy link
Contributor Author

nvander1 commented Jun 7, 2019

@ueshin re the whole stage code gen, do you meant to use this checkResult2 to test if the results are the same with and without it?

i += 1
}
!check(continue)
}
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the name for continue makes sense here, but I can see that the final !check(continue) returned value may be confusing to read. Anyone have suggestions to make the intent more clear here?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the intent of the !check(continue) ?

Copy link
Contributor Author

@nvander1 nvander1 Jun 12, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yeikel perhaps the following implementation would be more clear?

  override def nullSafeEval(inputRow: InternalRow, argumentValue: Any): Any = {
    val arr = argumentValue.asInstanceOf[ArrayData]
    val f = functionForEval
    var res = emptyRes
    var i = 0
    while (!isConfirmed(res) && i < arr.numElements) {
      elementVar.value.set(arr.get(i, elementVar.dataType))
      res = f.eval(inputRow).asInstanceOf[Boolean]
      i += 1
    }
    res
  }

Where isConfirmed represents whether we can break out early from our while loop:
For ArrayExists, we can break out early as soon as we find an element that matches the predicate.
For ArrayForAll, we can break out early as soon as we find an element that does NOT match the predicate.

So for ArrayExists, we define the emptyRes to be false since there are no elements in the array to satisfy the predicate. And we define the isConfirmed to be just the result of the predicate on the most recent element.

For ArrayForAll, we define the emptyRes to be true since the predicate holds for every element of an empty array. And we define the isConfirmed to be the negation of the result of the predicate on the most recent element.

This is similar to the approach employed by the scala stdlib: https://github.com/scala/scala/blob/v2.13.0/src/library/scala/collection/IterableOnce.scala#L587-L606

Although they do not abstract out the operation over forall and exists. I'm all for keeping the code DRY like @rxin 's suggestion prompted, but if we can't find a way to do so that is easy to understand, maybe we should just have two implementations that are similar.

Here is a branch that I can merge into this one if needed with the changes I described above:
https://github.com/nvander1/spark/commit/aa5c94f5fb5ce9d677a65af7184c35752d2ca491

extends ArrayExistsForAllBase {
override def prettyName: String = "forall"
override def check(cond: Boolean): Boolean = !cond
override def bind(f: (Expression, Seq[(DataType, Boolean)]) => LambdaFunction): ArrayForAll = {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to factor out the bind definition as well here, but there seems to be a known issue on trying to rely on the generated copy method of a case class in its parent trait: https://www.scala-lang.org/old/node/6369

Aside: this is also the same bind method for ArrayFilter

@nvander1
Copy link
Contributor Author

nvander1 commented Jun 7, 2019

@rxin Refactored so ArrayExists and ArrayForAll share an implementation

@SparkQA
Copy link

SparkQA commented Jun 7, 2019

Test build #106278 has finished for PR 24761 at commit 1775c03.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Copy link
Member

re: #24761 (comment) i'm okie with this.

@gatorsmile
Copy link
Member

cc @ueshin

@ueshin
Copy link
Member

ueshin commented Jun 14, 2019

@nvander1 I'm sorry for the delay.
Actually I had an offline discussion about this, and during the discussion we found that ArrayExists (also ArrayForAll if we provide) should follow the three-valued boolean logic.

After the discussion, I became okay to provide this since, even if the rewrite under the three-valued boolean logic works, it's easier and more confident for users.

Let me fix it first, then, although I'm not sure whether we can still share an implementation between the two, could you follow the fix here as well?

Thanks.

@ueshin
Copy link
Member

ueshin commented Jun 14, 2019

I submitted a PR #24873.

@ueshin
Copy link
Member

ueshin commented Jun 15, 2019

@nvander1 My PR #24873 was merged.
Could you continue this PR and follow the three-valued logic for forall as well please?

  • false if the predicate holds at least one false
  • otherwise, null if the predicate holds null
  • otherwise, true

See also: https://www.postgresql.org/docs/9.6/functions-comparisons.html#AEN21197

Thanks!

@nvander1
Copy link
Contributor Author

nvander1 commented Jun 16, 2019

@ueshin Should there be a conf setting for following three valued logic on forall as well, or should it always be three-valued since there isn't any legacy code that would expect that?

Also, should we open another PR to make the ArrayFilter + MapFilter follow three valued logic for their predicates as well? Now that I think about it, that wouldn't make much sense in that context.

Also, in the case of forall([1, null, 3], x -> x % 2 = 0) Should this evaluate null or false? I see why this should be null now

Also apply 3 valued logic to ArrayForAll
@nvander1
Copy link
Contributor Author

Added in the three valued logic to ArrayForAll.

@SparkQA
Copy link

SparkQA commented Jun 17, 2019

Test build #106562 has finished for PR 24761 at commit 04822ff.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@nvander1
Copy link
Contributor Author

@rxin @ueshin @gatorsmile @HyukjinKwon @hvanhovell

I don't think the build failure is related to my changes. It is a failure in a spark streaming fault tolerance test

> SELECT _FUNC_(array(2, 4, 8), x -> x % 2 == 0);
true
> SELECT _FUNC_(array(1, null, 3), x -> x % 2 == 0);
NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the right-hand array contains any null elements and no false comparison result is obtained, the result of ALL will be null, not true

presto> SELECT 1 > ALL (VALUES 2, null, 3);
 _col0 
-------
 false 
(1 row)

The result of last example isn't false?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it should be false.

For example,

SELECT _FUNC_(array(2, null, 8), x -> x % 2 == 0);

should be null.

Copy link
Member

@ueshin ueshin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@nvander1 Thanks for updating.
I don't think we need the config for this, and forall should always follow three-valued logic.
I left some comments. Could you update this please?
Thanks.

var forall = true
var foundNull = false
var i = 0
while (i < arr.numElements && (forall || !foundNull)) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can't break with foundNull since there is still a false after null was found. We need to find false first even if we found null.

null
} else {
forall
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if (!forall) {
  false
} else if (foundNull) {
  null
} else {
  true
}

?

> SELECT _FUNC_(array(2, 4, 8), x -> x % 2 == 0);
true
> SELECT _FUNC_(array(1, null, 3), x -> x % 2 == 0);
NULL
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, it should be false.

For example,

SELECT _FUNC_(array(2, null, 8), x -> x % 2 == 0);

should be null.

@nvander1
Copy link
Contributor Author

Should this be null or false?

SELECT _FUNC_(array(1, null, 3), x -> false)

It seems that this would be false, but following your suggested

if (!forall) {
  false
} else if (foundNull) {
  null
} else {
  true
}

would result in null.

@ueshin

@nvander1
Copy link
Contributor Author

nvander1 commented Jul 31, 2019

I think this summarizes the expected behavior for forall with three valued logic:

  /*
   * true for all non null elements foundNull      result
   *    F                              F             F
   *    F                              T             F
   *    T                              F             T
   *    T                              T             N
   */

And this is captured by the following check

    if (foundNull && forall) {
      null
    } else {
      forall
    }

What do you think?

@ueshin
Copy link
Member

ueshin commented Jul 31, 2019

It seems that this would be false, but following your suggested would result in null.

I don't think it returns null, but false properly.

Thanks for the summary, which is the expected behavior, and the logic should work as well.

@ueshin
Copy link
Member

ueshin commented Jul 31, 2019

LGTM pending tests. cc @viirya

@SparkQA
Copy link

SparkQA commented Jul 31, 2019

Test build #108490 has finished for PR 24761 at commit c93e701.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Aug 2, 2019

Test build #108534 has finished for PR 24761 at commit af77619.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks fine to me too but let me leave it to @ueshin

@ueshin
Copy link
Member

ueshin commented Aug 6, 2019

I'm sorry for the delay.
Let me retrigger the tests since it's been a while since the last build.

@ueshin
Copy link
Member

ueshin commented Aug 6, 2019

Jenkins, retest this please.

@SparkQA
Copy link

SparkQA commented Aug 6, 2019

Test build #108728 has finished for PR 24761 at commit af77619.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Member

ueshin commented Aug 6, 2019

Thanks! merging to master.

@ueshin ueshin closed this in 9e931e7 Aug 6, 2019
dongjoon-hyun pushed a commit that referenced this pull request Aug 19, 2019
### What changes were proposed in this pull request?

This is a follow-up of #24761 which added a higher-order function `ArrayForAll`.
The PR mistakenly removed the `prettyName` from `ArrayExists` and forgot to add it to `ArrayForAll`.

### Why are the changes needed?

This reverts the `prettyName` back to `ArrayExists` not to affect explained plans, and adds it to `ArrayForAll` to clarify the `prettyName` as the same as the expressions around.

### Does this PR introduce any user-facing change?

No.

### How was this patch tested?

Existing tests.

Closes #25501 from ueshin/issues/SPARK-27905/pretty_names.

Authored-by: Takuya UESHIN <ueshin@databricks.com>
Signed-off-by: Dongjoon Hyun <dhyun@apple.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.