-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-21117][SQL] Built-in SQL Function Support - WIDTH_BUCKET #18323
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #78160 has started for PR 18323 at commit |
Conflicts: sql/core/src/test/resources/sql-tests/inputs/operators.sql sql/core/src/test/resources/sql-tests/results/operators.sql.out
|
Test build #78161 has started for PR 18323 at commit |
|
Jenkins, retest this please |
|
Test build #78181 has finished for PR 18323 at commit
|
Conflicts: sql/core/src/test/resources/sql-tests/inputs/operators.sql sql/core/src/test/resources/sql-tests/results/operators.sql.out
|
Test build #78196 has finished for PR 18323 at commit
|
|
cc @wzhfy |
|
Test build #78245 has finished for PR 18323 at commit
|
| ev: ExprCode, | ||
| f: (String, String, String, String) => String): ExprCode = { | ||
| nullSafeCodeGen(ctx, ev, (eval1, eval2, eval3, eval4) => { | ||
| s"${ev.value} = ${f(eval1, eval2, eval3, eval3)};" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the last eval3 -> eval4
| override def nullable: Boolean = children.exists(_.nullable) | ||
|
|
||
| /** | ||
| * Default behavior of evaluation according to the default nullability of TernaryExpression. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TernaryExpression -> QuaternaryExpression, please also fix other places
| * If either of the sub-expressions is null, the result of this computation | ||
| * is assumed to be null. | ||
| * | ||
| * @param f function that accepts the 3 non-null evaluation result names of children |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
3 -> 4
| } | ||
|
|
||
| /** | ||
| * The bucket number into which |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Returns the bucket number
| * Returns the bucket number into which | ||
| * the value of this expression would fall after being evaluated. | ||
| * | ||
| * @param expr id the expression for which the histogram is being created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
id -> is
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: id -> is
| * @param numBucket is an An expression that resolves to | ||
| * a constant indicating the number of buckets | ||
| * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by | ||
| * the range [minValue, maxValue] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Both endpoints are included? Can you check with other databases and add a comment here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, test on Oracle:
width_Bucket(0, 1, 1, 1) -> 0, width_Bucket(20, 1, 1, 1) -> 2
| } | ||
|
|
||
| val result = if (minValue > maxValue) (numBucket - preResult) + 1 else preResult | ||
| result |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: if (minValue > maxValue) (numBucket - preResult) + 1 else preResult
| val preResult: Long = if (expr < lower) { | ||
| 0 | ||
| } else if (expr >= upper) { | ||
| Math.addExact(numBucket, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need to use addExact? In Oracle's doc, numBucket is an integer, then we can use numBucket + 1L here
| } else if (expr >= upper) { | ||
| Math.addExact(numBucket, 1) | ||
| } else { | ||
| (numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what if upper == lower?
| assert(widthBucket(100, 100, 5000, 10) === 1) | ||
| assert(widthBucket(590, 100, 5000, 10) === 2) | ||
| assert(widthBucket(5000, 100, 5000, 10) === 11) | ||
| assert(widthBucket(6000, 100, 5000, 10) === 11) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you use the same cases from Oracle's doc? just to make sure we are getting the same results here.
Conflicts: sql/core/src/test/resources/sql-tests/results/operators.sql.out
|
Test build #78515 has finished for PR 18323 at commit
|
| * Returns the bucket number into which | ||
| * the value of this expression would fall after being evaluated. | ||
| * | ||
| * @param expr id the expression for which the histogram is being created |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: id -> is
|
|
||
| -- width_bucket | ||
| select width_bucket(5.35, 0.024, 10.06, 5); | ||
| select width_bucket(5.35, 0.024, 10.06, -5); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
add a case for wrong input type: select width_bucket(5.35, 0.024, 10.06, 0.5);
|
|
||
| if (numBucket <= 0) { | ||
| throw new AnalysisException(s"The num of bucket must be greater than 0, but got ${numBucket}") | ||
| } |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we consider minValue == maxValue and numBucket > 1 valid input or not?
Please also add a test case for this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If minValue == maxValue , then lower==upper, result is numBucket + 1L:
val lower: Double = Math.min(minValue, maxValue)
val upper: Double = Math.max(minValue, maxValue)
val result: Long = if (expr < lower) {
0
} else if (expr >= upper) {
numBucket + 1L
} else {
(numBucket.toDouble * (expr - lower) / (upper - lower) + 1).toLong
}
if (minValue > maxValue) (numBucket - result) + 1 else result
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know, but my question is: should the possible result when minValue == maxValue only be 0, 1, or 2? e.g. should width_bucket(2, 1, 1, 100) = 101 or just throw an error due to invalid input?
| * a constant indicating the number of buckets | ||
| * @return Returns an long between 0 and numBucket+1 by mapping the expr into buckets defined by | ||
| * the range [minValue, maxValue]. For example: | ||
| * widthBucket(0, 1, 1, 1) -> 0, widthBucket(20, 1, 1, 1) -> 2. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's remove these examples in the description, they are just corner cases. My previous comment was just to make sure both ends should be included.
| assert(widthBucket(400, 100, 5000, 10) === 1) | ||
|
|
||
| // minValue == maxValue | ||
| assert(widthBucket(10, 4, 4, 15) === 16) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this behavior consistent with other databases? If so, then I'm fine with this and please ignore my previous comment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks!
|
Test build #78611 has finished for PR 18323 at commit
|
|
retest this please. |
|
Test build #78627 has finished for PR 18323 at commit
|
|
This error appears in a few builds. Let's wait for the fix. Currently it looks good to me. |
|
retest this please |
|
Test build #78769 has finished for PR 18323 at commit
|
| * to the minimum end point of the acceptable range for expr | ||
| * @param maxValue is an expression that resolves | ||
| * to the maximum end point of the acceptable range for expr | ||
| * @param numBucket is an An expression that resolves to |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
an An -> an
|
|
||
| override def nullSafeEval(ex: Any, min: Any, max: Any, num: Any): Any = { | ||
| MathUtils.widthBucket( | ||
| ex.asInstanceOf[Double], |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What happened if the input is not a constant, but an foldable expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
foldable expression is correct.
|
Test build #79641 has finished for PR 18323 at commit
|
| */ | ||
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(expr, min_value, max_value, num_bucket) - Returns an long between 0 and `num_buckets`+1 by mapping the `expr` into buckets defined by the range [`min_value`, `max_value`].", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: an -> a
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns the
bucketto which operand would be assigned in an equidepth histogram withnum_bucketbuckets, in the rangemin_valuetomax_value.
|
Test build #80112 has finished for PR 18323 at commit
|
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala # sql/core/src/test/resources/sql-tests/inputs/operators.sql # sql/core/src/test/resources/sql-tests/results/operators.sql.out
|
Test build #81843 has finished for PR 18323 at commit
|
|
Any news? What's blocking this PR to move on except the conflict to resolve since it's been 1 year? |
|
Test build #94376 has finished for PR 18323 at commit
|
|
Closed? Not merged? Will it come via another way? |
|
@wangyum Can we rebase it? We should add this support for compatibility. |
|
Test build #122180 has finished for PR 18323 at commit
|
# Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/mathExpressions.scala # sql/core/src/test/resources/sql-tests/results/operators.sql.out
|
Test build #122223 has finished for PR 18323 at commit
|
|
You need to run |
|
Test build #122229 has finished for PR 18323 at commit
|
| expr: Expression, | ||
| minValue: Expression, | ||
| maxValue: Expression, | ||
| numBucket: Expression) extends Expression with ImplicitCastInputTypes { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cloud we do WidthBucket (...) extends QuaternaryExpression with ImplicitCastInputTypes with CodegenFallback for simplicity? If we need the codegen support, I think its okay to do so in follow-up.
|
|
||
| /** | ||
| * Returns the bucket number into which | ||
| * the value of this expression would fall after being evaluated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: how about the format and the rephrasing below?
/**
* Returns the bucket number into which the value of this expression would fall
* after being evaluated.
*
* @param expr is the expression to compute a bucket number in the histogram
* @param minValue is the minimum value of the histogram
* @param maxValue is the maximum value of the histogram
* @param numBucket is the number of buckets
*/
| // scalastyle:off line.size.limit | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(expr, min_value, max_value, num_bucket) - Returns the `bucket` to which operand would be assigned in an equidepth histogram with `num_bucket` buckets, in the range `min_value` to `max_value`.", | ||
| extended = """ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
extended -> examples. Also, plz add a since tag.
| } else if (evals(0) == null) { | ||
| null | ||
| } else { | ||
| MathUtils.widthBucket( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MathUtils.widthBucket is used only for this new expr now? If so, could we move the method to this expr?
|
Could you solve the conflict? |
|
@wangyum If you don't time to update this, I could take this over. Thought? |
|
@maropu Please take it over. Thank you. |
|
ok |
### What changes were proposed in this pull request? This PR intends to add a build-in SQL function - `WIDTH_BUCKET`. It is the rework of #18323. Closes #18323 The other RDBMS references for `WIDTH_BUCKET`: - Oracle: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717 - PostgreSQL: https://www.postgresql.org/docs/current/functions-math.html - Snowflake: https://docs.snowflake.com/en/sql-reference/functions/width_bucket.html - Prestodb: https://prestodb.io/docs/current/functions/math.html - Teradata: https://docs.teradata.com/reader/kmuOwjp1zEYg98JsB8fu_A/Wa8vw69cGzoRyNULHZeudg - DB2: https://www.ibm.com/support/producthub/db2/docs/content/SSEPGG_11.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0061483.html?pos=2 ### Why are the changes needed? For better usability. ### Does this PR introduce _any_ user-facing change? No. ### How was this patch tested? Added unit tests. Closes #28764 from maropu/SPARK-21117. Lead-authored-by: Takeshi Yamamuro <[email protected]> Co-authored-by: Yuming Wang <[email protected]> Signed-off-by: Dongjoon Hyun <[email protected]>

What changes were proposed in this pull request?
Add build-in SQL function -
WIDTH_BUCKETRef: https://docs.oracle.com/cd/B28359_01/olap.111/b28126/dml_functions_2137.htm#OLADM717
How was this patch tested?
unit tests and manual tests.
manual tests(compare
spark-sqlresults withoracle):