
Conversation

@beliefer
Contributor

@beliefer beliefer commented Dec 18, 2023

What changes were proposed in this pull request?

This PR fixes a bug by letting the JDBC dialect decide the decimal precision and scale.

How to reproduce the bug?
#44397 proposed DS V2 push down of PERCENTILE_CONT and PERCENTILE_DISC.
The bug fires when pushing down the SQL below to the H2 JDBC data source.
SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"

The root cause
getQueryOutputSchema obtains the output schema of the query by calling JdbcUtils.getSchema.
The query sent to H2 is shown below.
SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"
ResultSetMetaData reports the following five values for the percentile column:

columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
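These are the standard JDBC metadata fields. A standalone sketch of reading them directly through plain JDBC, roughly what JdbcUtils.getSchema does per column (the connection URL and credentials are assumptions, not taken from the PR):

```scala
import java.sql.DriverManager

// Illustrative only: inspect the percentile column's metadata on an assumed H2 connection.
val conn = DriverManager.getConnection("jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1", "sa", "")
val rs = conn.createStatement().executeQuery(
  """SELECT "DEPT", PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST)
    |FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT" """.stripMargin)
val md = rs.getMetaData
val col = 2                          // the percentile column
println(md.getColumnLabel(col))      // columnName
println(md.getColumnType(col))       // dataType = 2 (java.sql.Types.NUMERIC)
println(md.getColumnTypeName(col))   // typeName = "NUMERIC"
println(md.getPrecision(col))        // fieldSize = 100000
println(md.getScale(col))            // fieldScale = 50000
conn.close()
```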

Then JdbcUtils.getCatalystType builds the Catalyst schema; under the hood it calls DecimalType.bounded(precision, scale).
DecimalType.bounded(100000, 50000) returns DecimalType(38, 38).
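DecimalType.bounded caps precision and scale at Spark's maximum of 38 independently, so the ratio between them is lost. A rough sketch of the equivalent clamp (boundedSketch is a stand-in name, since the real helper is internal to Spark):

```scala
import scala.math.min
import org.apache.spark.sql.types.DecimalType

// Both precision and scale are capped at 38 *independently*, losing the 2:1 ratio.
def boundedSketch(precision: Int, scale: Int): DecimalType =
  DecimalType(min(precision, DecimalType.MAX_PRECISION), min(scale, DecimalType.MAX_SCALE))

println(boundedSketch(100000, 50000))  // DecimalType(38,38): no digits left for the integral part
```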
Finally, makeGetter throws the exception below.

Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)

Why are the changes needed?

This PR fixes the bug that JdbcUtils cannot derive the correct decimal type.

Does this PR introduce any user-facing change?

'Yes'.
It fixes a bug.

How was this patch tested?

Manual tests in #44397

Was this patch authored or co-authored using generative AI tooling?

'No'.

Contributor Author

I don't know the background, so I left it as the default implementation.


@cloud-fan
Contributor

Where does Decimal precision 42 come from?

@beliefer
Contributor Author

Where does Decimal precision 42 come from?

It comes from

nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))

The schema is DecimalType(38, 38) and the value returned from H2 is java.math.BigDecimal(7, 2).
d = java.math.BigDecimal(7, 2)
p = 38
s = 38
Decimal(d, p, s) causes the exception.
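To make the overflow concrete: Decimal(d, p, s) rescales d to scale s before checking that the result fits in p digits, so with scale forced to 38 any integral digits push the precision past 38. A standalone sketch with an illustrative value (chosen so the numbers line up with the error message; the actual H2 value is not shown in this PR):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Illustrative value with 4 integral digits.
val fromH2 = new JBigDecimal("1234.56")

// Decimal(d, 38, 38) effectively rescales the value to scale 38 before checking precision.
val rescaled = fromH2.setScale(38, RoundingMode.HALF_UP)

println(rescaled.precision)       // 42 = 4 integral digits + 38 fractional digits
println(rescaled.precision > 38)  // true -> DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION
```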

@cloud-fan
Contributor

So what we need is a cast? It seems Decimal(d, p, s) is not safe.

Contributor

This seems wrong. I think we should make sure the final Decimal instance we return has the same precision and scale as the JDBC column type.

@cloud-fan
Contributor

The DecimalType.bounded(100000, 50000) returns DecimalType(38, 38).

I think this is already wrong. We should update the H2 dialect to return decimal(38, 19), so that we have half digits for the integral part and half digits for the fraction part.

@beliefer
Contributor Author

So what we need is a cast? seems Decimal(d, p, s) is not safe.

Yes.

@beliefer
Contributor Author

I think this is already wrong. We should update the H2 dialect to return decimal(38, 19), so that we have half digits for the integral part and half digits for the fraction part.

Let me try this way.

@beliefer beliefer changed the title from "[SPARK-46443][SQL] Decimal precision and scale should decided by JDBC dialect." to "[SPARK-46443][SQL] Decimal precision and scale should decided by H2 dialect." on Dec 20, 2023
Contributor

this is too specific. Can we do it if precision > 38?

Contributor Author

I suspect that only H2 has this particular situation.
Other cases with precision greater than 38 have not actually been verified. Can we wait until we encounter other exceptions in the future before generalizing the fix?

Contributor

let's make the comment a bit clearer:

H2 supports very large decimal precision like 100000. The max precision in Spark is only 38.
Here we shrink both the precision and scale of H2 decimal to fit Spark, and still keep the ratio between them.
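A sketch of what that could look like as a dialect override (illustrative, not the exact merged patch; the object name is made up, and reading the scale from the column metadata's "scale" key follows the pattern other built-in dialects use):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch only: shrink an oversized H2 NUMERIC to Spark's max precision of 38 while
// keeping the scale-to-precision ratio, e.g. NUMERIC(100000, 50000) -> decimal(38, 19).
object H2DialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    sqlType match {
      case Types.NUMERIC if size > DecimalType.MAX_PRECISION =>
        val scale = if (md != null) md.build().getLong("scale").toInt else 0
        val shrunkScale = (DecimalType.MAX_PRECISION * (scale.toDouble / size)).toInt
        Some(DecimalType(DecimalType.MAX_PRECISION, shrunkScale))
      case _ => None
    }
  }
}
```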

@beliefer
Contributor Author

The GA failure is unrelated.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in a921da8 Dec 22, 2023
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…ialect


Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a921da8)
Signed-off-by: Wenchen Fan <[email protected]>
@beliefer
Contributor Author

@cloud-fan Thank you!
