Skip to content

Conversation

@wangyum
Copy link
Member

@wangyum wangyum commented Jun 16, 2017

What changes were proposed in this pull request?

Add build-in SQL function - WIDTH_BUCKET

Ref: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717

How was this patch tested?

unit tests and manual tests.

manual tests(compare spark-sql results with oracle):

select width_bucket(5.35, 0.024, 10.06, 5) from dual; -- 3 --

select width_bucket(3.14, 0, 4, 3) from dual;         -- 3 --
select width_bucket(2, 0, 4, 3) from dual;            -- 2 --
select width_bucket(-1, 0, 3.2, 4) from dual;         -- 0 -- 

select width_bucket(3.14, 4, 0, 3) from dual;         -- 1 --
select width_bucket(2, 4, 0, 3) from dual;            -- 2 --
select width_bucket(-1, 3.2, 0, 4) from dual;         -- 5 --

@SparkQA
Copy link

SparkQA commented Jun 16, 2017

Test build #78160 has started for PR 18323 at commit 20fe567.

wangyum added 2 commits June 16, 2017 14:32
Conflicts:
	sql/core/src/test/resources/sql-tests/inputs/operators.sql
	sql/core/src/test/resources/sql-tests/results/operators.sql.out
@SparkQA
Copy link

SparkQA commented Jun 16, 2017

Test build #78161 has started for PR 18323 at commit 3a0eaf7.

@wangyum
Copy link
Member Author

wangyum commented Jun 16, 2017

Jenkins, retest this please

@SparkQA
Copy link

SparkQA commented Jun 16, 2017

Test build #78181 has finished for PR 18323 at commit 3a0eaf7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Conflicts:
	sql/core/src/test/resources/sql-tests/inputs/operators.sql
	sql/core/src/test/resources/sql-tests/results/operators.sql.out
@SparkQA
Copy link

SparkQA commented Jun 17, 2017

Test build #78196 has finished for PR 18323 at commit 3a5d46e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Copy link
Member

cc @wzhfy

@SparkQA
Copy link

SparkQA commented Jun 19, 2017

Test build #78245 has finished for PR 18323 at commit 7407541.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

ev: ExprCode,
f: (String, String, String, String) => String): ExprCode = {
nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => {
s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the last eval3 -> eval4

override def nullable: Boolean = children.exists(_.nullable)

/**
* Default behavior of evaluation according to the default nullability of TernaryExpression.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TernaryExpression -> QuaternaryExpression, please also fix other places

* If either of the sub-expressions is null, the result of this computation
* is assumed to be null.
*
* @param f function that accepts the 3 non-null evaluation result names of children
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

3 -> 4

}

/**
* The bucket number into which
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Returns the bucket number

* Returns the bucket number into which
* the value of this expression would fall after being evaluated.
*
* @param expr id the expression for which the histogram is being created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

id -> is

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: id -> is

* @param numBucket is an An expression that resolves to
* a constant indicating the number of buckets
* @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by
* the range [minValue, maxValue]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both endpoints are included? Can you check with other databases and add a comment here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, test on Oracle:

width_Bucket(0, 1, 1, 1) -> 0, width_Bucket(20, 1, 1, 1) -> 2

}

val result = if (minValue > maxValue) (numBucket - preResult) + 1 else preResult
result
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: if (minValue > maxValue) (numBucket - preResult) + 1 else preResult

val preResult: Long = if (expr < lower) {
0
} else if (expr >= upper) {
Math.addExact(numBucket, 1)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we really need to use addExact? In Oracle's doc, numBucket is an integer, then we can use numBucket + 1L here

} else if (expr >= upper) {
Math.addExact(numBucket, 1)
} else {
(numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what if upper == lower?

assert(widthBucket(100, 100, 5000, 10) === 1)
assert(widthBucket(590, 100, 5000, 10) === 2)
assert(widthBucket(5000, 100, 5000, 10) === 11)
assert(widthBucket(6000, 100, 5000, 10) === 11)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use the same cases from Oracle's doc? just to make sure we are getting the same results here.

wangyum added 2 commits June 23, 2017 13:26
Conflicts:
	sql/core/src/test/resources/sql-tests/results/operators.sql.out
@SparkQA
Copy link

SparkQA commented Jun 23, 2017

Test build #78515 has finished for PR 18323 at commit 5c3ecee.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns the bucket number into which
* the value of this expression would fall after being evaluated.
*
* @param expr id the expression for which the histogram is being created
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: id -> is


-- width_bucket
select width_bucket(5.35, 0.024, 10.06, 5);
select width_bucket(5.35, 0.024, 10.06, -5);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

add a case for wrong input type: select width_bucket(5.35, 0.024, 10.06, 0.5);


if (numBucket <= 0) {
throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}")
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we consider minValue == maxValue and numBucket > 1 valid input or not?
Please also add a test case for this.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If minValue == maxValue , then lower==upper, result is numBucket + 1L:

    val lower: Double = Math.min(minValue, maxValue)
    val upper: Double = Math.max(minValue, maxValue)

    val result: Long = if (expr < lower) {
      0
    } else if (expr >= upper) {
      numBucket + 1L
    } else {
      (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
    }

    if (minValue > maxValue) (numBucket - result) + 1 else result

Copy link
Contributor

@wzhfy wzhfy Jun 26, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know, but my question is: should the possible result when minValue == maxValue only be 0, 1, or 2? e.g. should width_bucket(2, 1, 1, 100) = 101 or just throw an error due to invalid input?

* a constant indicating the number of buckets
* @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by
* the range [minValue, maxValue]. For example:
* widthBucket(0, 1, 1, 1) -> 0, widthBucket(20, 1, 1, 1) -> 2.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's remove these examples in the description, they are just corner cases. My previous comment was just to make sure both ends should be included.

assert(widthBucket(400, 100, 5000, 10) === 1)

// minValue == maxValue
assert(widthBucket(10, 4, 4, 15) === 16)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this behavior consistent with other databases? If so, then I'm fine with this and please ignore my previous comment.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, This is Oracle's behavior:
oracle-width_bucket

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks!

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78611 has finished for PR 18323 at commit 9e2cabb.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wangyum
Copy link
Member Author

wangyum commented Jun 26, 2017

retest this please.

@SparkQA
Copy link

SparkQA commented Jun 26, 2017

Test build #78627 has finished for PR 18323 at commit 9e2cabb.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@wzhfy
Copy link
Contributor

wzhfy commented Jun 27, 2017

This error appears in a few builds. Let's wait for the fix. Currently it looks good to me.
ping @gatorsmile

@wzhfy
Copy link
Contributor

wzhfy commented Jun 28, 2017

retest this please

@SparkQA
Copy link

SparkQA commented Jun 28, 2017

Test build #78769 has finished for PR 18323 at commit 9e2cabb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* to the minimum end point of the acceptable range for expr
* @param maxValue is an expression that resolves
* to the maximum end point of the acceptable range for expr
* @param numBucket is an An expression that resolves to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

an An -> an


override def nullSafeEval(ex: Any, min: Any, max: Any, num: Any): Any = {
MathUtils.widthBucket(
ex.asInstanceOf[Double],
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happened if the input is not a constant, but an foldable expression?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

foldable expression is correct.

@SparkQA
Copy link

SparkQA commented Jul 16, 2017

Test build #79641 has finished for PR 18323 at commit 507fcfb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

*/
// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(expr, min_value, max_value, num_bucket) - Returns an long between 0 and `num_buckets`+1 by mapping the `expr` into buckets defined by the range [`min_value`, `max_value`].",
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: an -> a

Copy link
Member

@gatorsmile gatorsmile Jul 16, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Returns the bucket to which operand would be assigned in an equidepth histogram with num_bucket buckets, in the range min_value to max_value.

@SparkQA
Copy link

SparkQA commented Aug 1, 2017

Test build #80112 has finished for PR 18323 at commit e0478f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
#	sql/core/src/test/resources/sql-tests/inputs/operators.sql
#	sql/core/src/test/resources/sql-tests/results/operators.sql.out
@SparkQA
Copy link

SparkQA commented Sep 16, 2017

Test build #81843 has finished for PR 18323 at commit 2e2b2ca.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdehouss
Copy link

rdehouss commented Jul 9, 2018

Any news? What's blocking this PR to move on except the conflict to resolve since it's been 1 year?
This addition would be really great.

@SparkQA
Copy link

SparkQA commented Aug 7, 2018

Test build #94376 has finished for PR 18323 at commit 2e2b2ca.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@wangyum wangyum closed this Nov 10, 2018
@rdehouss
Copy link

Closed? Not merged? Will it come via another way?

@gatorsmile gatorsmile reopened this May 1, 2020
@gatorsmile
Copy link
Member

@wangyum Can we rebase it? We should add this support for compatibility.

@SparkQA
Copy link

SparkQA commented May 1, 2020

Test build #122180 has finished for PR 18323 at commit 2e2b2ca.

  • This patch fails Scala style tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

wangyum added 2 commits May 3, 2020 10:59
# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala
#	sql/core/src/test/resources/sql-tests/results/operators.sql.out
@SparkQA
Copy link

SparkQA commented May 3, 2020

Test build #122223 has finished for PR 18323 at commit 31ee943.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Copy link
Member

maropu commented May 3, 2020

You need to run SPARK_GENERATE_GOLDEN_FILES=1 build/sbt "sql/test-only *ExpressionsSchemaSuite" to update the golden file.

@SparkQA
Copy link

SparkQA commented May 3, 2020

Test build #122229 has finished for PR 18323 at commit 44ef7df.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

expr: Expression,
minValue: Expression,
maxValue: Expression,
numBucket: Expression) extends Expression with ImplicitCastInputTypes {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cloud we do WidthBucket (...) extends QuaternaryExpression with ImplicitCastInputTypes with CodegenFallback for simplicity? If we need the codegen support, I think its okay to do so in follow-up.


/**
* Returns the bucket number into which
* the value of this expression would fall after being evaluated.
Copy link
Member

@maropu maropu May 29, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: how about the format and the rephrasing below?

/**
 * Returns the bucket number into which the value of this expression would fall
 * after being evaluated.
 *
 * @param expr is the expression to compute a bucket number in the histogram
 * @param minValue is the minimum value of the histogram
 * @param maxValue is the maximum value of the histogram
 * @param numBucket is the number of buckets
 */

// scalastyle:off line.size.limit
@ExpressionDescription(
usage = "_FUNC_(expr, min_value, max_value, num_bucket) - Returns the `bucket` to which operand would be assigned in an equidepth histogram with `num_bucket` buckets, in the range `min_value` to `max_value`.",
extended = """
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extended -> examples. Also, plz add a since tag.

} else if (evals(0) == null) {
null
} else {
MathUtils.widthBucket(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MathUtils.widthBucket is used only for this new expr now? If so, could we move the method to this expr?

@maropu
Copy link
Member

maropu commented May 29, 2020

Could you solve the conflict?

@maropu
Copy link
Member

maropu commented Jun 9, 2020

@wangyum If you don't time to update this, I could take this over. Thought?

@wangyum
Copy link
Member Author

wangyum commented Jun 9, 2020

@maropu Please take it over. Thank you.

@maropu
Copy link
Member

maropu commented Jun 9, 2020

ok

@wangyum wangyum closed this Jun 9, 2020
dongjoon-hyun pushed a commit that referenced this pull request Jun 11, 2020
### What changes were proposed in this pull request?

This PR intends to add a build-in SQL function - `WIDTH_BUCKET`.
It is the rework of #18323.

Closes #18323

The other RDBMS references for `WIDTH_BUCKET`:
 - Oracle: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
 - PostgreSQL: https://www.postgresql.org/docs/current/functions-math.html
 - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/width_bucket.html
 - Prestodb: https://prestodb.io/docs/current/functions/math.html
 - Teradata: https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/Wa8vw69cGzoRyNULHZeudg
 - DB2: https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061483.html?pos=2

### Why are the changes needed?

For better usability.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Added unit tests.

Closes #28764 from maropu/SPARK-21117.

Lead-authored-by: Takeshi Yamamuro <[email protected]>
Co-authored-by: Yuming Wang <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

8 participants