
Conversation

@beliefer
Contributor

@beliefer beliefer commented Dec 18, 2023

What changes were proposed in this pull request?

This PR fixes a bug by letting the JDBC dialect decide the decimal precision and scale.

How to reproduce the bug?
#44397 proposed DS V2 push down of PERCENTILE_CONT and PERCENTILE_DISC.
The bug fires when pushing down the SQL below to the H2 JDBC data source.
SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"

The root cause
getQueryOutputSchema obtains the output schema of the query by calling JdbcUtils.getSchema.
The query sent to H2 is shown below.
SELECT "DEPT",PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST) FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT"
ResultSetMetaData reports the following five values for the percentile column:

columnName = "PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY SALARY NULLS FIRST)"
dataType = 2
typeName = "NUMERIC"
fieldSize = 100000
fieldScale = 50000
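These are the standard JDBC metadata fields. A standalone sketch of reading them directly through plain JDBC, roughly what JdbcUtils.getSchema does per column (the connection URL and credentials are assumptions, not taken from the PR):

```scala
import java.sql.DriverManager

// Illustrative only: inspect the percentile column's metadata on an assumed H2 connection.
val conn = DriverManager.getConnection("jdbc:h2:mem:testdb;DB_CLOSE_DELAY=-1", "sa", "")
val rs = conn.createStatement().executeQuery(
  """SELECT "DEPT", PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY "SALARY" ASC NULLS FIRST)
    |FROM "test"."employee" WHERE 1=0 GROUP BY "DEPT" """.stripMargin)
val md = rs.getMetaData
val col = 2                          // the percentile column
println(md.getColumnLabel(col))      // columnName
println(md.getColumnType(col))       // dataType = 2 (java.sql.Types.NUMERIC)
println(md.getColumnTypeName(col))   // typeName = "NUMERIC"
println(md.getPrecision(col))        // fieldSize = 100000
println(md.getScale(col))            // fieldScale = 50000
conn.close()
```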

Then JdbcUtils.getCatalystType builds the Catalyst schema; under the hood it calls DecimalType.bounded(precision, scale).
DecimalType.bounded(100000, 50000) returns DecimalType(38, 38).
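DecimalType.bounded caps precision and scale at Spark's maximum of 38 independently, so the ratio between them is lost. A rough sketch of the equivalent clamp (boundedSketch is a stand-in name, since the real helper is internal to Spark):

```scala
import scala.math.min
import org.apache.spark.sql.types.DecimalType

// Both precision and scale are capped at 38 *independently*, losing the 2:1 ratio.
def boundedSketch(precision: Int, scale: Int): DecimalType =
  DecimalType(min(precision, DecimalType.MAX_PRECISION), min(scale, DecimalType.MAX_SCALE))

println(boundedSketch(100000, 50000))  // DecimalType(38,38): no digits left for the integral part
```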
Finally, makeGetter throws the exception below.

Caused by: org.apache.spark.SparkArithmeticException: [DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION] Decimal precision 42 exceeds max precision 38. SQLSTATE: 22003
	at org.apache.spark.sql.errors.DataTypeErrors$.decimalPrecisionExceedsMaxPrecisionError(DataTypeErrors.scala:48)
	at org.apache.spark.sql.types.Decimal.set(Decimal.scala:124)
	at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:577)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$4(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.nullSafeConvert(JdbcUtils.scala:552)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3(JdbcUtils.scala:408)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$3$adapted(JdbcUtils.scala:406)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:358)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:339)

Why are the changes needed?

This PR fixes the bug that JdbcUtils cannot derive the correct decimal type.

Does this PR introduce any user-facing change?

'Yes'.
It fixes a bug.

How was this patch tested?

Manual tests in #44397

Was this patch authored or co-authored using generative AI tooling?

'No'.

Contributor Author

I don't know the background, so I left it as the default implementation.


@cloud-fan
Contributor

Where does Decimal precision 42 come from?

@beliefer
Contributor Author

Where does Decimal precision 42 come from?

It comes from

nullSafeConvert[java.math.BigDecimal](rs.getBigDecimal(pos + 1), d => Decimal(d, p, s))

The schema is DecimalType(38, 38) and the value returned from H2 is java.math.BigDecimal(7, 2).
d = java.math.BigDecimal(7, 2)
p = 38
s = 38
Decimal(d, p, s) causes the exception.
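To make the overflow concrete: Decimal(d, p, s) rescales d to scale s before checking that the result fits in p digits, so with scale forced to 38 any integral digits push the precision past 38. A standalone sketch with an illustrative value (chosen so the numbers line up with the error message; the actual H2 value is not shown in this PR):

```scala
import java.math.{BigDecimal => JBigDecimal, RoundingMode}

// Illustrative value with 4 integral digits.
val fromH2 = new JBigDecimal("1234.56")

// Decimal(d, 38, 38) effectively rescales the value to scale 38 before checking precision.
val rescaled = fromH2.setScale(38, RoundingMode.HALF_UP)

println(rescaled.precision)       // 42 = 4 integral digits + 38 fractional digits
println(rescaled.precision > 38)  // true -> DECIMAL_PRECISION_EXCEEDS_MAX_PRECISION
```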

@cloud-fan
Contributor

So what we need is a cast? It seems Decimal(d, p, s) is not safe.

Contributor

This seems wrong. I think we should make sure the final Decimal instance we return has the same precision and scale as the JDBC column type.

@cloud-fan
Contributor

The DecimalType.bounded(100000, 50000) returns DecimalType(38, 38).

I think this is already wrong. We should update the H2 dialect to return decimal(38, 19), so that we have half digits for the integral part and half digits for the fraction part.

@beliefer
Contributor Author

So what we need is a cast? seems Decimal(d, p, s) is not safe.

Yes.

@beliefer
Contributor Author

I think this is already wrong. We should update the H2 dialect to return decimal(38, 19), so that we have half digits for the integral part and half digits for the fraction part.

Let me try this way.

@beliefer beliefer changed the title from "[SPARK-46443][SQL] Decimal precision and scale should decided by JDBC dialect." to "[SPARK-46443][SQL] Decimal precision and scale should decided by H2 dialect." on Dec 20, 2023
Contributor

this is too specific. Can we do it if precision > 38?

Contributor Author

I suspect that only H2 has this particular situation.
Other cases with precision greater than 38 have not actually been verified. Can we wait until we encounter other exceptions in the future before generalizing the fix?

Contributor

let's make the comment a bit clearer:

H2 supports very large decimal precision like 100000. The max precision in Spark is only 38.
Here we shrink both the precision and scale of H2 decimal to fit Spark, and still keep the ratio between them.
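A sketch of what that could look like as a dialect override (illustrative, not the exact merged patch; the object name is made up, and reading the scale from the column metadata's "scale" key follows the pattern other built-in dialects use):

```scala
import java.sql.Types
import org.apache.spark.sql.jdbc.JdbcDialect
import org.apache.spark.sql.types.{DataType, DecimalType, MetadataBuilder}

// Sketch only: shrink an oversized H2 NUMERIC to Spark's max precision of 38 while
// keeping the scale-to-precision ratio, e.g. NUMERIC(100000, 50000) -> decimal(38, 19).
object H2DialectSketch extends JdbcDialect {
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:h2")

  override def getCatalystType(
      sqlType: Int, typeName: String, size: Int, md: MetadataBuilder): Option[DataType] = {
    sqlType match {
      case Types.NUMERIC if size > DecimalType.MAX_PRECISION =>
        val scale = if (md != null) md.build().getLong("scale").toInt else 0
        val shrunkScale = (DecimalType.MAX_PRECISION * (scale.toDouble / size)).toInt
        Some(DecimalType(DecimalType.MAX_PRECISION, shrunkScale))
      case _ => None
    }
  }
}
```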

@beliefer
Contributor Author

The GA failure is unrelated.

@cloud-fan
Contributor

thanks, merging to master/3.5!

@cloud-fan cloud-fan closed this in a921da8 Dec 22, 2023
cloud-fan pushed a commit that referenced this pull request Dec 22, 2023
…ialect


Closes #44398 from beliefer/SPARK-46443.

Authored-by: Jiaan Geng <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
(cherry picked from commit a921da8)
Signed-off-by: Wenchen Fan <[email protected]>
@beliefer
Contributor Author

@cloud-fan Thank you!
