[SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCENTILE_DISC #44397

beliefer · 2023-12-18T07:55:37Z

What changes were proposed in this pull request?

This PR translates the aggregate functions PERCENTILE_CONT and PERCENTILE_DISC for pushdown.

This PR adds Expression[] orderingWithinGroups to GeneralAggregateFunc, so that the DS V2 pushdown framework can compile WITHIN GROUP (ORDER BY ...) easily.
This PR also splits visitInverseDistributionFunction out of visitAggregateFunction, so that the DS V2 pushdown framework can generate the WITHIN GROUP (ORDER BY ...) syntax easily.
This PR also fixes a bug where JdbcUtils cannot handle the precision and scale of decimals returned from JDBC.

Why are the changes needed?

So that DS V2 can push down PERCENTILE_CONT and PERCENTILE_DISC.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

beliefer · 2023-12-18T07:58:31Z

sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala

The original code throws an exception.

    Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
        at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
        at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
        at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
        at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)

Based on DecimalType, the precision is 38 and the scale is 38.
In fact, the decimal returned from JDBC is BigDecimal(7, 3).
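The "precision 42" in the error can be reproduced with plain `java.math.BigDecimal`. The sketch below assumes the value is rescaled to the declared scale (38) before its precision is checked; the class name, helper name, and the sample value `1234.567` are illustrative and not taken from the Spark code.

```java
import java.math.BigDecimal;

// Illustrative sketch only: shows why a BigDecimal with precision 7 and scale 3
// needs 42 digits once it is forced to scale 38.
final class DecimalRescaleSketch {

    // Precision required to hold `value` after rescaling it to `scale`.
    // Increasing the scale never rounds, so setScale cannot throw here.
    static int precisionAtScale(BigDecimal value, int scale) {
        return value.setScale(scale).precision();
    }

    public static void main(String[] args) {
        BigDecimal fromJdbc = new BigDecimal("1234.567"); // hypothetical value
        System.out.println(fromJdbc.precision());           // 7 total digits
        System.out.println(fromJdbc.scale());               // 3 fractional digits
        // 4 integer digits + 38 fractional digits
        System.out.println(precisionAtScale(fromJdbc, 38));
    }
}
```

With 4 integer digits kept and 38 fractional digits required, the rescaled value needs 42 digits, which exceeds the maximum precision of 38.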

…ialect

### What changes were proposed in this pull request?
This PR fixes a bug by letting the JDBC dialect decide the decimal precision and scale.

**How to reproduce the bug?**
#44397 proposed DS V2 push down of `PERCENTILE_CONT` and `PERCENTILE_DISC`.
The bug is triggered when the SQL below is pushed down to H2 JDBC.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`

**The root cause**
`getQueryOutputSchema` gets the output schema of the query by calling `JdbcUtils.getSchema`.
The query for the H2 database is shown below.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
We can get five variables from `ResultSetMetaData`:
```
columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
```
Then we get the catalyst schema with `JdbcUtils.getCatalystType`, which actually calls `DecimalType.bounded(precision, scale)`.
`DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`.
Finally, `makeGetter` throws an exception.
```
Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
    at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
    at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
    at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
```

### Why are the changes needed?
This PR fixes the bug where `JdbcUtils` cannot get the correct decimal type.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
Manual tests in #44397

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>

…ialect

(cherry picked from commit a921da8)
Signed-off-by: Wenchen Fan <[email protected]>

beliefer · 2023-12-22T06:07:46Z

ping @cloud-fan cc @huaxingao

beliefer · 2024-01-04T13:23:07Z

ping @cloud-fan

cloud-fan · 2024-01-05T16:01:55Z

...src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java

    this.name = name;
    this.isDistinct = isDistinct;
    this.children = children;
+    this.orderingWithinGroups = null;


empty array is a better default
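A minimal sketch of the suggestion, using a hypothetical stand-in class rather than the real GeneralAggregateFunc: when the constructor without orderings defaults to an empty array, callers can read the length or iterate without a null check. All names below are illustrative.

```java
// Illustrative stand-in, not the Spark class: the field defaults to an empty
// array instead of null when no WITHIN GROUP ordering is supplied.
final class AggregateFuncSketch {
    private static final String[] NO_ORDERINGS = new String[0];

    private final String name;
    private final String[] orderingWithinGroups;

    // Constructor for functions without a WITHIN GROUP clause.
    AggregateFuncSketch(String name) {
        this(name, NO_ORDERINGS);
    }

    AggregateFuncSketch(String name, String[] orderingWithinGroups) {
        this.name = name;
        this.orderingWithinGroups = orderingWithinGroups;
    }

    String name() {
        return name;
    }

    // Safe to call for every instance: never dereferences null.
    int orderingCount() {
        return orderingWithinGroups.length;
    }
}
```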

cloud-fan · 2024-01-05T16:03:29Z

sql/catalyst/src/main/java/org/apache/spark/sql/connector/util/V2ExpressionSQLBuilder.java

+      String funcName, boolean isDistinct, String[] inputs, String[] orderingWithinGroups) {
+    assert(isDistinct == false);
+    String withinGroup =
+      joinArrayToString(orderingWithinGroups, ", ", "WITHIN GROUP (ORDER BY ", ")");


how do we translate ASC/DESC?

Please refer to visitSortOrder.
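As a rough illustration of the answer, the sketch below assumes each ordering reaches the builder already rendered as a string that carries its direction and null ordering (for example `"SALARY" ASC NULLS FIRST`), so the builder only has to join the strings. The class and method names are hypothetical, and `java.util.StringJoiner` stands in for the `joinArrayToString` helper shown in the diff.

```java
import java.util.StringJoiner;

// Illustrative sketch only: assembles the WITHIN GROUP clause from orderings
// that were rendered earlier, direction included.
final class WithinGroupSketch {

    // Joins pre-rendered orderings into "WITHIN GROUP (ORDER BY a, b)".
    static String withinGroup(String[] renderedOrderings) {
        StringJoiner joiner = new StringJoiner(", ", "WITHIN GROUP (ORDER BY ", ")");
        for (String ordering : renderedOrderings) {
            joiner.add(ordering);
        }
        return joiner.toString();
    }

    // Renders the full call, e.g. PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ...).
    static String inverseDistribution(
            String funcName, String[] inputs, String[] renderedOrderings) {
        return funcName + "(" + String.join(", ", inputs) + ") "
            + withinGroup(renderedOrderings);
    }

    public static void main(String[] args) {
        System.out.println(inverseDistribution(
            "PERCENTILE_CONT",
            new String[] {"0.5"},
            new String[] {"\"SALARY\" ASC NULLS FIRST"}));
    }
}
```

For the inputs in `main`, this produces the same `PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST)` fragment that appears in the pushed-down H2 query.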

beliefer · 2024-01-09T04:28:03Z

The GA failure is unrelated.

cloud-fan · 2024-01-09T11:29:43Z

...src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java

shall we use SortOrder[]?

cloud-fan · 2024-01-09T11:29:53Z

...src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java

cloud-fan · 2024-01-09T11:32:02Z

sql/core/src/main/scala/org/apache/spark/sql/jdbc/H2Dialect.scala

why do we need this change? H2 dialect deals with these two functions in visitInverseDistributionFunction

Because the H2 dialect overrides visitInverseDistributionFunction and checks the function name with isSupportedFunction.

    override def visitInverseDistributionFunction(
        funcName: String,
        isDistinct: Boolean,
        inputs: Array[String],
        orderingWithinGroups: Array[String]): String = {
      if (isSupportedFunction(funcName)) {
        super.visitInverseDistributionFunction(
          dialectFunctionName(funcName), isDistinct, inputs, orderingWithinGroups)
      } else {
        throw new UnsupportedOperationException(
          s"${this.getClass.getSimpleName} does not support " +
          s"inverse distribution function: $funcName")
      }
    }

cloud-fan · 2024-01-10T04:24:02Z

thanks, merging to master!

beliefer · 2024-01-10T04:44:02Z

@cloud-fan Thank you!

github-actions bot added the SQL label Dec 18, 2023

beliefer mentioned this pull request Dec 18, 2023

[SPARK-46443][SQL] Decimal precision and scale should decided by H2 dialect. #44398

Closed

beliefer force-pushed the SPARK-46442 branch 2 times, most recently from da47b85 to 1c985b3 Compare December 20, 2023 12:06

[SPARK-46442][SQL] DS V2 supports push down PERCENTILE_CONT and PERCE…

87561b7

…NTILE_DISC

beliefer force-pushed the SPARK-46442 branch from 1c985b3 to 87561b7 Compare December 22, 2023 02:13

cloud-fan reviewed Jan 5, 2024

cloud-fan reviewed Jan 9, 2024

...src/main/java/org/apache/spark/sql/connector/expressions/aggregate/GeneralAggregateFunc.java Outdated

cloud-fan Jan 9, 2024

ditto

cloud-fan reviewed Jan 9, 2024

cloud-fan approved these changes Jan 9, 2024

Update code

7a89d77

beliefer force-pushed the SPARK-46442 branch from e085d72 to 7a89d77 Compare January 9, 2024 13:47

cloud-fan closed this in 85b504d Jan 10, 2024
