@beliefer beliefer commented Dec 18, 2023

What changes were proposed in this pull request?

This PR translates the aggregate functions PERCENTILE_CONT and PERCENTILE_DISC for pushdown.

  • This PR adds Expression[] orderingWithinGroups to GeneralAggregateFunc, so that the DS V2 pushdown framework can compile the WITHIN GROUP (ORDER BY ...) clause easily.

  • This PR also splits visitInverseDistributionFunction out of visitAggregateFunction, so that the DS V2 pushdown framework can generate the WITHIN GROUP (ORDER BY ...) syntax easily.

  • This PR also fixes a bug where JdbcUtils cannot handle the precision and scale of a decimal returned from JDBC.

Why are the changes needed?

So that DS V2 supports pushing down PERCENTILE_CONT and PERCENTILE_DISC.

Does this PR introduce any user-facing change?

'No'.
New feature.

How was this patch tested?

New test cases.

Was this patch authored or co-authored using generative AI tooling?

'No'.

@github-actions github-actions bot added the SQL label Dec 18, 2023
Contributor Author

The original code throws an exception:

Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)

Contributor Author

Based on DecimalType, the precision is 38 and the scale is 38.
In fact, the decimal returned from JDBC is BigDecimal(7, 3).
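
The 42 in the error message is consistent with these numbers. Below is a minimal sketch, assuming the conversion rescales the JDBC value to the declared scale of 38 before checking its precision. The value 1234.567 is hypothetical, chosen only because it has precision 7 and scale 3, and `DecimalRescaleSketch` is not part of Spark.

```java
import java.math.BigDecimal;

// Illustrative only: shows why a BigDecimal with precision 7 and scale 3
// needs 42 digits once it is rescaled to scale 38.
final class DecimalRescaleSketch {
    // Returns the precision of `value` after rescaling it to `targetScale`.
    static int precisionAfterRescale(BigDecimal value, int targetScale) {
        // Increasing the scale only pads zeros, so no rounding mode is needed.
        return value.setScale(targetScale).precision();
    }

    public static void main(String[] args) {
        BigDecimal fromJdbc = new BigDecimal("1234.567"); // hypothetical value
        System.out.println(fromJdbc.precision());                // 7
        System.out.println(fromJdbc.scale());                    // 3
        System.out.println(precisionAfterRescale(fromJdbc, 38)); // 42
    }
}
```

Four integer digits plus 38 fractional digits give 42, which exceeds the maximum precision of 38.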

@beliefer beliefer force-pushed the SPARK-46442 branch 2 times, most recently from da47b85 to 1c985b3 Compare December 20, 2023 12:06
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…ialect

### What changes were proposed in this pull request?
This PR fixes a bug by letting the JDBC dialect decide the decimal precision and scale.

**How to reproduce the bug?**
#44397 proposes DS V2 pushdown of `PERCENTILE_CONT` and `PERCENTILE_DISC`.
The bug is triggered when the SQL below is pushed down to H2 through JDBC.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`

**The root cause**
`getQueryOutputSchema` is used to get the output schema of a query by calling `JdbcUtils.getSchema`.
The query sent to the H2 database is shown below.
`SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"`
We get the following five values from `ResultSetMetaData`:
```
columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
```
Then we get the Catalyst schema with `JdbcUtils.getCatalystType`, which actually calls `DecimalType.bounded(precision, scale)`.
`DecimalType.bounded(100000, 50000)` returns `DecimalType(38, 38)`.
Finally, `makeGetter` throws an exception.
```
Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)
```
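
The clamping step in the root cause can be sketched as follows. This assumes `DecimalType.bounded` caps precision and scale independently at 38, which is what the `DecimalType(38, 38)` result above implies. `boundedSketch` is a hypothetical stand-in, not Spark's code.

```java
// Illustrative only: caps precision and scale independently, so the
// original ratio between them is lost.
final class BoundedDecimalSketch {
    static final int MAX_PRECISION = 38; // assumed cap, per the error message
    static final int MAX_SCALE = 38;     // assumed cap, per DecimalType(38, 38)

    // Returns {precision, scale} after capping each at its maximum.
    static int[] boundedSketch(int precision, int scale) {
        return new int[] {
            Math.min(precision, MAX_PRECISION),
            Math.min(scale, MAX_SCALE)
        };
    }

    public static void main(String[] args) {
        // fieldSize and fieldScale reported by H2 for the PERCENTILE_CONT column.
        int[] t = boundedSketch(100000, 50000);
        System.out.println("DecimalType(" + t[0] + ", " + t[1] + ")");
    }
}
```

With precision and scale both equal to 38, no digits remain for the integer part, so any result with an absolute value of 1 or more cannot fit.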

### Why are the changes needed?
This PR fixes the bug where `JdbcUtils` cannot get the correct decimal type.

### Does this PR introduce _any_ user-facing change?
'Yes'.
Fix a bug.

### How was this patch tested?
Manual tests in #44397

### Was this patch authored or co-authored using generative AI tooling?
'No'.

Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…ialect

(cherry picked from commit a921da8)
Signed-off-by: Wenchen Fan <[email protected]>
@beliefer
Contributor Author

ping @cloud-fan cc @huaxingao

beliefer commented Jan 4, 2024

ping @cloud-fan

this.name = name;
this.isDistinct = isDistinct;
this.children = children;
this.orderingWithinGroups = null;
Contributor

empty array is a better default
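
A minimal sketch of why an empty array is the safer default. It uses a hypothetical `AggFuncSketch` class rather than the real GeneralAggregateFunc, and String[] in place of the actual element type. Callers can read or iterate the ordering without a null check.

```java
// Hypothetical stand-in for the class under review; only the defaulting is shown.
final class AggFuncSketch {
    private final String name;
    private final String[] orderingWithinGroups;

    // Constructor without ordering: default to an empty array instead of null.
    AggFuncSketch(String name) {
        this(name, new String[0]);
    }

    AggFuncSketch(String name, String[] orderingWithinGroups) {
        this.name = name;
        this.orderingWithinGroups = orderingWithinGroups;
    }

    String name() {
        return name;
    }

    // Works for both constructors; with a null default this would throw
    // a NullPointerException.
    int orderingCount() {
        return orderingWithinGroups.length;
    }

    public static void main(String[] args) {
        System.out.println(new AggFuncSketch("SUM").orderingCount());
    }
}
```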

String funcName, boolean isDistinct, String[] inputs, String[] orderingWithinGroups) {
assert(isDistinct == false);
String withinGroup =
joinArrayToString(orderingWithinGroups, ", ", "WITHIN GROUP (ORDER BY ", ")");
Contributor

how do we translate ASC/DESC?

Contributor Author

Please refer to visitSortOrder.
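
As a rough sketch of how the pieces fit together, assume each sort order has already been rendered to a string (expression, direction, null ordering) before the join. `sortOrderSketch` and `inverseDistributionSketch` are hypothetical helpers, not the methods in this PR.

```java
import java.util.StringJoiner;

// Illustrative only: assembles FUNC(inputs) WITHIN GROUP (ORDER BY ...).
final class WithinGroupSketch {
    // Renders one ordering item, e.g. "SALARY" ASC NULLS FIRST.
    static String sortOrderSketch(String expr, String direction, String nullOrdering) {
        return expr + " " + direction + " " + nullOrdering;
    }

    // Joins the already-rendered orderings into the WITHIN GROUP clause.
    static String inverseDistributionSketch(
            String funcName, String[] inputs, String[] orderings) {
        StringJoiner withinGroup =
            new StringJoiner(", ", "WITHIN GROUP (ORDER BY ", ")");
        for (String ordering : orderings) {
            withinGroup.add(ordering);
        }
        return funcName + "(" + String.join(", ", inputs) + ") " + withinGroup;
    }

    public static void main(String[] args) {
        String ordering = sortOrderSketch("\"SALARY\"", "ASC", "NULLS FIRST");
        System.out.println(inverseDistributionSketch(
            "PERCENTILE_CONT", new String[] {"0.5"}, new String[] {ordering}));
    }
}
```

In this sketch the direction and null ordering travel inside each ordering string, so the joining step itself does not need to know about ASC or DESC.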

beliefer commented Jan 9, 2024

The GA failure is unrelated.

Contributor

shall we use SortOrder[]?

Contributor

ditto

Contributor

why do we need this change? H2 dialect deals with these two functions in visitInverseDistributionFunction

Contributor Author

Because the H2 dialect overrides visitInverseDistributionFunction and checks the function name with isSupportedFunction.

     override def visitInverseDistributionFunction(
         funcName: String,
         isDistinct: Boolean,
         inputs: Array[String],
         orderingWithinGroups: Array[String]): String = {
       if (isSupportedFunction(funcName)) {
         super.visitInverseDistributionFunction(
           dialectFunctionName(funcName), isDistinct, inputs, orderingWithinGroups)
       } else {
         throw new UnsupportedOperationException(
           s"${this.getClass.getSimpleName} does not support " +
             s"inverse distribution function: $funcName")
       }
     }

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 85b504d Jan 10, 2024
@beliefer
Contributor Author

@cloud-fan Thank you!
