[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary by jiangxb1987 · Pull Request #18540 · apache/spark

jiangxb1987 · 2017-07-05T09:14:13Z

What changes were proposed in this pull request?

Long values can be passed to rangeBetween as range frame boundaries, but we silently convert them to Int values. This can cause wrong results, and we should fix it.
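
To make the failure mode concrete, here is a minimal sketch of what narrowing a Long boundary to an Int does. It only illustrates the general JVM narrowing behavior; the object and method names are hypothetical and this is not the exact Spark code path.

```scala
object LongBoundaryTruncation {
  // Narrowing a Long to an Int keeps only the low 32 bits, so a value outside
  // the Int range silently wraps around instead of raising an error.
  def truncate(boundary: Long): Int = boundary.toInt

  def main(args: Array[String]): Unit = {
    val boundary = 2147483648L // Int.MaxValue + 1
    // The truncated value is Int.MinValue, so a "following" offset
    // would turn into a large "preceding" offset.
    println(truncate(boundary))
  }
}
```

A boundary that fits in an Int is unaffected, which is why the problem only shows up for large offsets.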

Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make this possible for Long values and make it easy to add support for other DataTypes.

This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c

After this has been merged, we can close #16818.

How was this patch tested?

Add new tests in DataFrameWindowFunctionsSuite and TypeCoercionSuite.

hvanhovell

Thanks for taking this over. I left a few comments.

hvanhovell · 2017-07-05T09:42:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

NIT: typo Cannot

hvanhovell · 2017-07-05T09:48:55Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

I made lower and upper AnyRef to allow the use of both (foldable) Expressions and SpecialFrameBoundary. This works rather well with things like constant folding. The reason for not making SpecialFrameBoundary an Expression is that it cannot have a type (unless you make it a case class, I suppose) and that it showed some weird behavior during analysis/optimization.

hvanhovell · 2017-07-05T10:01:08Z

sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala

Both DateType and TimestampType expressions are going to need a time zone. I was wondering whether we can use GMT, because these are just offset calculations. cc @ueshin

If we can't then we either need to thread through the session local timezone, or it might be easier to put the entire offset calculation in the frame.

hvanhovell · 2017-07-05T10:01:38Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala

We should also create APIs that allow for other types of literals, at least one for CalendarIntervals.

I totally agree that's what we should do, but I'd suggest we address this in a follow-up and focus this PR on resolving the overflow issue with Long frame boundaries in rangeBetween.

One major concern is that the WindowSpec API is marked Stable, so I'm wondering what the proper procedure is for changing this interface.

It might be easiest to make it take a Column, i.e. rangeBetween(begin: Column, end: Column). The only downside is that we need some way to express the special boundaries (current row, unbounded). Also cc @rxin

Maybe we can use Literal(0) to represent CurrentRow? And a sufficiently large number (like Literal(Long.MaxValue)) for Unbounded?

I was trying to avoid introducing a special value, but maybe you can do that.

How important is it to fix this?

Let's rule it out of the scope of this PR and address this in a follow-up.
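
For readers following the sentinel-value idea above, here is a minimal sketch of how such an encoding could be decoded. Everything in it is an assumption made for illustration: the `Boundary` types, the `decode` helper, and the choice of `Long.MinValue` for the preceding side are hypothetical and only follow the suggestion in this thread.

```scala
object BoundarySentinels {
  sealed trait Boundary
  case object CurrentRow extends Boundary
  case object UnboundedPreceding extends Boundary
  case object UnboundedFollowing extends Boundary
  final case class Offset(value: Long) extends Boundary

  // Hypothetical sentinel encoding: 0 means current row, and the extreme
  // Long values mean unbounded. Any other value is a plain offset.
  def decode(raw: Long): Boundary = raw match {
    case 0L            => CurrentRow
    case Long.MinValue => UnboundedPreceding
    case Long.MaxValue => UnboundedFollowing
    case v             => Offset(v)
  }
}
```

The cost of this design is the one raised in the thread: the sentinels are special values, so a genuine offset equal to one of them cannot be expressed separately from the special boundary it encodes.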

hvanhovell · 2017-07-05T10:02:52Z

sql/core/src/test/scala/org/apache/spark/sql/DataFrameWindowFunctionsSuite.scala

We should also add tests for dates/doubles/timestamps.

Will do this tomorrow.

We can only add the test cases after we have finalized the API change.

SparkQA · 2017-07-05T11:51:33Z

Test build #79210 has finished for PR 18540 at commit 52c5289.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait FrameType
sealed trait WindowFrame extends Expression with Unevaluable
case class SpecifiedWindowFrame(

SparkQA · 2017-07-05T18:36:21Z

Test build #79231 has finished for PR 18540 at commit 1135053.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2017-07-06T21:07:34Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

do we really need this? can we just require strict types?

Hmmm, that would be kind of weird. A user would get type coercion in the select list but not in the range clause.

SparkQA · 2017-07-11T18:52:43Z

Test build #79531 has finished for PR 18540 at commit a761f63.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

jiangxb1987 · 2017-07-12T15:51:03Z

cc @cloud-fan @gatorsmile @gengliangwang

cloud-fan · 2017-07-13T07:11:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

-}
+/**
+ * A specified Window Frame. The val lower/uppper can be either a foldable [[Expression]] or a
+ * [[SpecialFrameBoundary]].


can we make SpecialFrameBoundary an expression?

I tried that. The problem is that you will need to give them a proper data type. I tried to make them case object .. {} with data type null, but I ended up with these being replaced by a null literal.

All I am saying is that this will require a little more coding, since you need to resolve the data type of the boundary.

can we do this in WindowFrameCoercion?

cloud-fan · 2017-07-13T07:18:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+
+  private def isValidFrameType(ft: DataType): Boolean = (orderSpec.head.dataType, ft) match {
+    case (DateType, IntegerType) => true
+    case (TimestampType, CalendarIntervalType) => true


Shall we support DateType and TimestampType in a follow-up PR? Let's focus on the refactor/cleanup in this PR.

SparkQA · 2017-07-21T15:13:20Z

Test build #79835 has finished for PR 18540 at commit 427753d.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
sealed trait SpecialFrameBoundary extends Expression with Unevaluable

SparkQA · 2017-07-21T18:28:36Z

Test build #79841 has finished for PR 18540 at commit 43b2399.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-07-22T10:03:30Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

-  }
-
-  override def children: Seq[Expression] = partitionSpec ++ orderSpec
+  override def children: Seq[Expression] = partitionSpec ++ orderSpec ++ Seq(frameSpecification)


nit: partitionSpec ++ orderSpec :+ frameSpecification

cloud-fan · 2017-07-22T10:56:44Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

- * Represents a window frame.
- */
-sealed trait WindowFrame
+  def isValueBound: Boolean = valueBoundary.nonEmpty


nit: def isValueBound: Boolean = !isUnbounded

oh, is it possible that both lower and upper are current row?

yea we can have ROWS CURRENT ROW

Having rows between current row and current row is kinda dumb: the aggregate will contain only one value. However, range between current row and current row can be very useful, because you can aggregate over all the observations with the same ordering value.
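
The difference between the two frames can be sketched in plain Scala, without Spark. This is only a model of the semantics described above; the `Row` alias and both function names are hypothetical.

```scala
object CurrentRowFrames {
  // Each row is (orderingValue, amount).
  type Row = (Int, Long)

  // ROWS BETWEEN CURRENT ROW AND CURRENT ROW: the frame is exactly the
  // current row, so a sum over the frame is just the row's own amount.
  def sumRowsCurrentRow(rows: Seq[Row]): Seq[Long] = rows.map(_._2)

  // RANGE BETWEEN CURRENT ROW AND CURRENT ROW: the frame is every row that
  // shares the current row's ordering value (its peers).
  def sumRangeCurrentRow(rows: Seq[Row]): Seq[Long] = {
    val totals = rows.groupBy(_._1).map { case (key, peers) =>
      key -> peers.map(_._2).sum
    }
    rows.map { case (key, _) => totals(key) }
  }
}
```

For the rows (1, 10), (1, 20), (2, 5), the row-based sums are 10, 20, 5, while the range-based sums are 30, 30, 5.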

cloud-fan · 2017-07-22T10:59:37Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+  }
+
+  private def checkBoundary(b: Expression, location: String): TypeCheckResult = b match {
+    case e: Expression if !e.foldable && !e.isInstanceOf[SpecialFrameBoundary] =>


nit: we can mark SpecialFrameBoundary as foldable.

I don't think you should. What would you replace the boundary with during optimization?

sorry I was wrong.

cloud-fan · 2017-07-22T11:06:39Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercionSuite.scala

+        Seq(UnresolvedAttribute("a")),
+        Seq(SortOrder(Literal.default(DateType), Ascending)),
+        SpecifiedWindowFrame(RangeFrame, Literal(10.0), Literal(2147483648L)))
+    )


can we add some more test cases with special window frame boundary?

cloud-fan · 2017-07-22T11:07:21Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala

-      ("between unbounded preceding and current row", UnboundedPreceding, CurrentRow),
+      ("10 preceding", -Literal(10), CurrentRow),
+      ("2147483648 preceding", -Literal(2147483648L), CurrentRow),
+      ("3 + 1 following", Add(Literal(3), Literal(1)), CurrentRow), // Will fail during analysis


Why will this fail during analysis?

The lower boundary would be higher than the upper boundary. Previously this would fail, but we have removed that check; we should add it back.

Do you think that will improve the UX?

cloud-fan · 2017-07-22T11:07:51Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/parser/ExpressionParserSuite.scala

+      ("2147483648 preceding", -Literal(2147483648L), CurrentRow),
+      ("3 + 1 following", Add(Literal(3), Literal(1)), CurrentRow), // Will fail during analysis
+      ("unbounded preceding", Unbounded, CurrentRow),
+      ("unbounded following", Unbounded, CurrentRow), // Will fail during analysis


ditto, why?

In fact this is problematic: we would generate the same result for both unbounded preceding and unbounded following. @hvanhovell, any idea how to resolve this?

Well, the idea was that unboundedness was tied to the position in which it was used; for example, unbounded in the first position would mean unbounded preceding. However, this is the complete opposite of how we interpret literal bounds, so it might be better to reintroduce special boundaries for unbounded preceding and unbounded following.
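
A minimal sketch of that suggestion: with one object per direction, the two keywords no longer collapse into the same value, and a misplaced boundary can be detected. The names mirror the ones discussed in this thread, but the `parse` helper and the validity checks are hypothetical and are not the actual AstBuilder or analyzer code.

```scala
object UnboundedBoundaries {
  sealed trait SpecialBoundary
  case object UnboundedPreceding extends SpecialBoundary
  case object UnboundedFollowing extends SpecialBoundary
  case object CurrentRow extends SpecialBoundary

  // Hypothetical mini-parser for the special boundary keywords only.
  def parse(text: String): Option[SpecialBoundary] =
    text.trim.toLowerCase match {
      case "unbounded preceding" => Some(UnboundedPreceding)
      case "unbounded following" => Some(UnboundedFollowing)
      case "current row"         => Some(CurrentRow)
      case _                     => None
    }

  // Assumed rule for illustration: a frame cannot start at UNBOUNDED
  // FOLLOWING or end at UNBOUNDED PRECEDING.
  def isValidLower(b: SpecialBoundary): Boolean = b != UnboundedFollowing
  def isValidUpper(b: SpecialBoundary): Boolean = b != UnboundedPreceding
}
```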

cloud-fan · 2017-07-22T11:10:07Z

sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/trees/TreeNodeSuite.scala

+          "num-children" -> 2,
+          "frameType" -> JObject("object" -> JString(RowFrame.getClass.getName)),
+          "lower" -> 0,
+          "upper" -> 1),


Why are lower and upper 0 and 1?

After this PR, SpecialFrameBoundary and WindowFrame are Expressions and therefore TreeNodes, so these fields are serialized as the index of the corresponding child in TreeNode.children.
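
The effect can be modeled with a toy tree, shown below. This is a sketch of the idea only, under the assumption that a field holding a child node is rendered as that child's position; `Node` and `renderFields` are hypothetical and are not Spark's TreeNode JSON code.

```scala
object ChildIndexFields {
  // A toy tree node: `fields` are the constructor arguments by name and
  // `children` are the sub-nodes.
  final case class Node(
      name: String,
      fields: Seq[(String, Any)],
      children: Seq[Node])

  // Render each field. If its value is one of the node's children, emit the
  // child's position instead of the nested node; otherwise emit the value.
  def renderFields(node: Node): Seq[(String, String)] =
    node.fields.map { case (fieldName, value) =>
      val idx = node.children.indexWhere(child => child == value)
      fieldName -> (if (idx >= 0) idx.toString else value.toString)
    }
}
```

With a frame whose children are the lower and upper boundaries, in that order, "lower" renders as 0 and "upper" as 1, which matches the expected values in the test above.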

cloud-fan · 2017-07-22T11:15:08Z

LGTM except some minor comments

SparkQA · 2017-07-24T13:13:52Z

Test build #79905 has finished for PR 18540 at commit 5c9a992.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-07-25T04:30:12Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+        TypeCheckFailure(
+          "Cannot use an UnspecifiedFrame. This should have been converted during analysis. " +
+            "Please file a bug report.")
+      case f: SpecifiedWindowFrame if f.frameType == RangeFrame && !f.isUnbounded &&


I think this should be !f.isValueBound? Basically, current row and current row is not unbounded but should be allowed here.

Sorry I was wrong

SparkQA · 2017-07-27T17:54:08Z

Test build #80006 has finished for PR 18540 at commit a1f91cd.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-07-28T12:09:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

-        "Frame bound value must be a constant integer.",
-        ctx)
-      e.eval().asInstanceOf[Int]
+      validate(e.resolved && e.foldable, "Frame bound value must be a literal.", ctx)


is it necessary? I think analyzer can detect and report this failure too?

How about keeping this so we can fail earlier?

cloud-fan · 2017-07-28T12:38:48Z

LGTM, pending tests

SparkQA · 2017-07-28T15:10:57Z

Test build #80020 has finished for PR 18540 at commit 9b8a19b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2017-07-28T15:21:48Z

Test build #80021 has finished for PR 18540 at commit fbcea1e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-07-29T06:24:37Z

You still missed resolving this comment

gatorsmile · 2017-07-29T06:27:11Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

-                 FrameBoundary(l),
-                 FrameBoundary(h))))
-             if order.isEmpty || frame != RowFrame || l != h =>
+               frame: SpecifiedWindowFrame))


Nit: Cutting in the middle looks weird. Please keep them in the same line.
WindowSpecDefinition(_, order, frame: SpecifiedWindowFrame))

gatorsmile · 2017-07-29T06:30:14Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/CheckAnalysis.scala

                failAnalysis(s"Expression '$e' not supported within a window function.")
            }
-            // Make sure the window specification is valid.
-            s.validate match {


The verification is moved to checkInputDataTypes, right?

gatorsmile · 2017-07-29T06:41:24Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+    private def createBoundaryCast(boundary: Expression, dt: DataType): Expression = {
+      boundary match {
+        case e: Expression if e.dataType != dt && Cast.canCast(e.dataType, dt) &&
+          !e.isInstanceOf[SpecialFrameBoundary] =>


Splitting if in the middle looks weird. How about?

case e: SpecialFrameBoundary => e
case e: Expression if e.dataType != dt && Cast.canCast(e.dataType, dt) => Cast(e, dt)
case o => o
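
The suggested match can be tried out on a small stand-in model. The sketch below is an assumption-laden illustration: the `DataType`, `Expr`, `Lit`, and `Cast` types and the `canCast` rule are hypothetical simplifications, not the Catalyst classes.

```scala
object BoundaryCastSketch {
  sealed trait DataType
  case object IntType extends DataType
  case object LongType extends DataType
  case object IntervalType extends DataType

  sealed trait Expr
  // Special boundaries carry no meaningful data type, so they are never cast.
  sealed trait SpecialBoundary extends Expr
  case object CurrentRow extends SpecialBoundary
  sealed trait Typed extends Expr { def dataType: DataType }
  final case class Lit(value: Long, dataType: DataType) extends Typed
  final case class Cast(child: Typed, dataType: DataType) extends Typed

  // Assumed cast rule for the model: integral types cast to each other.
  def canCast(from: DataType, to: DataType): Boolean = (from, to) match {
    case (IntType, LongType) | (LongType, IntType) => true
    case (a, b)                                    => a == b
  }

  // Same three-way shape as the suggestion: pass special boundaries through,
  // wrap castable mismatches in a Cast, and leave everything else alone.
  def createBoundaryCast(boundary: Expr, dt: DataType): Expr = boundary match {
    case s: SpecialBoundary => s
    case e: Typed if e.dataType != dt && canCast(e.dataType, dt) => Cast(e, dt)
    case other => other
  }
}
```

Putting the special-boundary case first removes the need for the negated isInstanceOf check in the guard.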

gatorsmile · 2017-07-29T06:43:22Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/TypeCoercion.scala

+  object WindowFrameCoercion extends Rule[LogicalPlan] {
+    def apply(plan: LogicalPlan): LogicalPlan = plan resolveExpressions {
+      case s @ WindowSpecDefinition(_, Seq(order), SpecifiedWindowFrame(RangeFrame, lower, upper))
+        if order.resolved =>


Nit: add two more spaces before if

gatorsmile · 2017-07-29T06:57:53Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+          s"A range window frame with value boundaries cannot be used in a window specification " +
+            s"with multiple order by expressions: ${orderSpec.mkString(",")}")
+      case f: SpecifiedWindowFrame if f.frameType == RangeFrame && f.isValueBound &&
+        !isValidFrameType(f.valueBoundary.head.dataType) =>


Personally, I do not like long if conditions. Let us add two extra spaces on lines 71, 66, and 62.

gatorsmile · 2017-07-29T07:05:27Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+        TypeCheckFailure(
+          s"The data type '${orderSpec.head.dataType}' used in the order specification does " +
+            s"not match the data type '${f.valueBoundary.head.dataType}' which is used in the " +
+            "range frame.")


Just want to confirm: do we have at least four negative test cases that cover each of these cases?

The first case is a defensive guard. I'll add SQL tests for the two negative cases related to orderBy in RangeFrame.

gatorsmile · 2017-07-29T07:10:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

-sealed trait FrameBoundary {
-  def notFollows(other: FrameBoundary): Boolean
+case object RangeFrame extends FrameType {
+  override def inputType: AbstractDataType = TypeCollection.NumericAndInterval


uh, we also support CalendarInterval. Do we have a test case to verify it works on CalendarInterval?

gatorsmile · 2017-07-29T07:13:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+            s"'${l.dataType.catalogString}' <> '${u.dataType.catalogString}'")
+      case (l: Expression, u: Expression) if isGreaterThan(l, u) =>
+        TypeCheckFailure(
+          "The lower bound of a window frame must less than or equal to the upper bound")


Another question. Do we have test cases for the above three negative cases?

The second is a defensive check; I added SQL tests for the rest.

gatorsmile · 2017-07-29T07:16:16Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/windowExpressions.scala

+    if (check.isFailure) {
+      check
+    } else if (!offset.foldable) {
+      TypeCheckFailure(s"Offset expression '$offset' must be a literal.")


Having a test case?

Currently it's only used by the lead()/lag() functions, which both check their input types, so we're not able to test this from SQL.

gatorsmile · 2017-07-29T07:21:28Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

-        "Frame bound value must be a constant integer.",
-        ctx)
-      e.eval().asInstanceOf[Int]
+      validate(e.resolved && e.foldable, "Frame bound value must be a literal.", ctx)


Any test case?

gatorsmile · 2017-07-29T07:28:03Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

        UnboundedFollowing
      case SqlBaseParser.FOLLOWING =>
-        ValueFollowing(value)
+        value


It should be an unsigned-integer based on ANSI SQL

May I ask how we should parse it into an unsigned integer?

It sounds like we already allowed it in the previous release. Thus, we need to follow what we have now.

gatorsmile · 2017-07-29T07:28:25Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/parser/AstBuilder.scala

        UnboundedPreceding
      case SqlBaseParser.PRECEDING =>
-        ValuePreceding(value)
+        UnaryMinus(value)


The same here. Do we allow users to assign a negative value?
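
The two parser changes above amount to a sign convention: a FOLLOWING value is kept as is and a PRECEDING value is negated. The sketch below models that convention on plain Long values; `Direction` and `toSignedOffset` are hypothetical names, and the real parser builds expressions such as UnaryMinus(value) rather than computing numbers.

```scala
object FrameOffsets {
  sealed trait Direction
  case object Preceding extends Direction
  case object Following extends Direction

  // PRECEDING negates the value and FOLLOWING keeps it, so every boundary
  // becomes a signed offset relative to the current row.
  // Note: negating Long.MinValue overflows; this sketch does not guard it.
  def toSignedOffset(value: Long, direction: Direction): Long =
    direction match {
      case Preceding => -value
      case Following => value
    }
}
```

Under this convention a negative value is not rejected: "-3 preceding" becomes an offset of +3, which is the behavior the question above is probing.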

gatorsmile · 2017-07-29T07:33:03Z

sql/core/src/main/scala/org/apache/spark/sql/expressions/WindowSpec.scala

@@ -174,28 +191,22 @@ class WindowSpec private[sql](
   */
  // Note: when updating the doc for this method, also update Window.rangeBetween.
  def rangeBetween(start: Long, end: Long): WindowSpec = {


What happens if start is larger than end?

We'll get empty frames and compute over them.
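
Why the frame ends up empty can be seen in a small model of a range frame. This is a sketch under the assumption that a range frame selects the rows whose ordering value lies in [current + start, current + end]; the `frame` helper is hypothetical and ignores overflow in the additions.

```scala
object RangeFrameSketch {
  // `ordered` holds the ordering column values. The frame for the row with
  // ordering value `current` is every value within [current + start,
  // current + end]. Overflow of the additions is not handled here.
  def frame(ordered: Seq[Long], current: Long, start: Long, end: Long): Seq[Long] = {
    val lo = current + start
    val hi = current + end
    ordered.filter(v => v >= lo && v <= hi)
  }
}
```

When start is larger than end, the lower limit exceeds the upper limit, no value can satisfy both comparisons, and every row gets an empty frame.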

gatorsmile · 2017-07-29T07:35:47Z

It looks pretty solid. Thanks!

SparkQA · 2017-07-29T12:06:38Z

Test build #80041 has finished for PR 18540 at commit 38b04df.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gatorsmile · 2017-07-29T17:10:47Z

LGTM

gatorsmile · 2017-07-29T17:12:03Z

Thanks! Merging to master

…undary ## What changes were proposed in this pull request? Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert it to Int values, this can cause wrong results and we should fix this. Further more, we should accept any legal literal values as range frame boundaries. In this PR, we make it possible for Long values, and make accepting other DataTypes really easy to add. This PR is mostly based on Herman's previous amazing work: hvanhovell@596f53c After this been merged, we can close #16818 . ## How was this patch tested? Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`. Author: Xingbo Jiang <xingbo.jiang@databricks.com> Closes #18540 from jiangxb1987/rangeFrame. (cherry picked from commit 92d8563) Signed-off-by: gatorsmile <gatorsmile@gmail.com>

hvanhovell reviewed Jul 5, 2017

rxin reviewed Jul 6, 2017

jiangxb1987 added 3 commits July 11, 2017 22:46

rangeBetween accept literal values.

f7baf92

update comments

3c0718b

add test cases.

a761f63

jiangxb1987 force-pushed the rangeFrame branch from 1135053 to a761f63 Compare July 11, 2017 16:10

cloud-fan reviewed Jul 13, 2017

make SpecialFrameBoundary an Expression

427753d

bugfix

43b2399

cloud-fan reviewed Jul 22, 2017

address comments

5c9a992

cloud-fan reviewed Jul 25, 2017

refactor Unbounded

a1f91cd

update data check for CurrentRow

9abdb5e

cloud-fan reviewed Jul 28, 2017

jiangxb1987 added 2 commits July 28, 2017 20:23

remove notFollows function

9b8a19b

fix a typo

fbcea1e

gatorsmile reviewed Jul 29, 2017

update tests

38b04df

asfgit closed this in 92d8563 Jul 29, 2017

jiangxb1987 deleted the rangeFrame branch August 9, 2017 08:38
