
[SPARK-34333][SQL] Fix PostgresDialect to handle money types properly #31442

Closed

sarutak wants to merge 6 commits into apache:master from sarutak:fix-for-money-type

Conversation

@sarutak
Member

@sarutak sarutak commented Feb 2, 2021

What changes were proposed in this pull request?

This PR changes the type mapping for the money and money[] types for PostgreSQL.
Currently, Spark tries to convert those types to DoubleType and ArrayType of DoubleType respectively, but the JDBC driver does not seem to handle those types properly.

pgjdbc/pgjdbc#100
pgjdbc/pgjdbc#1405

Due to these issues, we can get errors like the following.

For the money type:

```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```

For the money[] type:

```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : $2,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] 	at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] 	at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
[info] 	at org.postgresql.jdbc.PgArray.buildArray(PgArray.java:310)
[info] 	at org.postgresql.jdbc.PgArray.getArrayImpl(PgArray.java:171)
[info] 	at org.postgresql.jdbc.PgArray.getArray(PgArray.java:111)
```

For the money type, a known workaround is to treat it as a string, so this PR does that.
For money[], however, there is no reasonable workaround, so this PR removes the support.
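The failure can be reproduced without a database: pgjdbc hands Spark locale-formatted text such as 1,000.00 for money values, and that text cannot be parsed as a double. A minimal, self-contained sketch (plain Java, no PostgreSQL involved; the value mirrors the one in the stack trace above) of why reading the value as a string is safe while reading it as a double is not:

```java
// Standalone sketch: why getDouble-style parsing fails on money output
// while getString-style handling always works.
public class MoneyParseSketch {
    // Returns true if the text parses as a double, false otherwise.
    static boolean parsesAsDouble(String text) {
        try {
            Double.parseDouble(text);
            return true;
        } catch (NumberFormatException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String moneyText = "1,000.00"; // text form of a money value from the server
        // The thousands separator makes this unparsable as a double,
        // matching the "Bad value for type double : 1,000.00" failure.
        System.out.println("as double: " + parsesAsDouble(moneyText));
        // Treating the value as a string is always safe:
        System.out.println("as string: " + moneyText);
    }
}
```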

Why are the changes needed?

This is a bug.

Does this PR introduce any user-facing change?

Yes. Once this PR is merged, the money type is mapped to StringType rather than DoubleType, and support for money[] is dropped.
For the money type, values smaller than one thousand ($100.00, for instance) work without this change, so I also updated the migration guide because this is a behavior change for such small values.
money[], on the other hand, does not seem to work with any value, but it is mentioned in the migration guide just in case.

How was this patch tested?

New test.

@SparkQA

SparkQA commented Feb 2, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39381/

@SparkQA

SparkQA commented Feb 2, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39381/

@SparkQA

SparkQA commented Feb 2, 2021

Test build #134793 has finished for PR 31442 at commit 59dbb59.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading from Spark SQL 3.1 to 3.2

- In Spark 3.2, the money type in a PostgreSQL table is converted to `StringType` and the money[] type is not supported because the JDBC driver for PostgreSQL can't handle those types properly.
Member

Yeah, DecimalType fixes the accuracy problem with money, but can't account for units. I suppose we could one day support it as a struct of DecimalType and StringType for currency, but string seems fine now.
Why not a string array for a money array?

Member Author

@sarutak sarutak Feb 3, 2021

> Why not a string array for a money array?

For the money type, Spark SQL calls PgResultSet.getDouble, causing the error.

```
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0) (192.168.1.204 executor driver): org.postgresql.util.PSQLException: Bad value for type double : 1,000.00
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.PgResultSet.getDouble(PgResultSet.java:2432)
[info] 	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.$anonfun$makeGetter$5(JdbcUtils.scala:418)
```

So, we can avoid this issue by mapping the money type to StringType, which makes Spark SQL call getString rather than getDouble.

For the money[] type, on the other hand, the PostgreSQL JDBC driver calls PgResultSet.toDouble internally.

```
[info] 	at org.postgresql.jdbc.PgResultSet.toDouble(PgResultSet.java:3104)
[info] 	at org.postgresql.jdbc.ArrayDecoding$5.parseValue(ArrayDecoding.java:235)
[info] 	at org.postgresql.jdbc.ArrayDecoding$AbstractObjectStringArrayDecoder.populateFromString(ArrayDecoding.java:122)
[info] 	at org.postgresql.jdbc.ArrayDecoding.readStringArray(ArrayDecoding.java:764)
```

We can control how Spark SQL reads values from the array obtained by PgResultSet.getArray, but it's difficult to control how the JDBC driver handles the elements of the array it returns.
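In dialect terms, the fix amounts to a small mapping decision. A hypothetical sketch (not the actual Spark source; type names follow PostgreSQL's JDBC metadata, where money[] is reported as "_money", and the Catalyst type is represented here as a plain string):

```java
import java.util.Optional;

// Hypothetical sketch of the mapping decision described above. Returning a
// string mapping for "money" means the reader ends up calling getString,
// which is safe; returning no mapping for "_money" drops support, since the
// driver parses array elements with toDouble internally and that cannot be
// overridden from the Spark side.
public class PgMoneyMappingSketch {
    static Optional<String> catalystTypeFor(String pgTypeName) {
        if ("money".equals(pgTypeName)) {
            return Optional.of("StringType"); // scalar money: read as text
        }
        // "_money" (money[]) and anything unrecognized: no mapping here.
        return Optional.empty();
    }
}
```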

Contributor

In the migration guide, we should also mention the previous behavior.

Member Author

Thanks. I've updated it.

@sarutak
Member Author

sarutak commented Feb 8, 2021

cc: @cloud-fan

@cloud-fan
Contributor

LGTM except one comment about migration guide.

@SparkQA

SparkQA commented Feb 9, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39642/

@SparkQA

SparkQA commented Feb 9, 2021

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39642/

@SparkQA

SparkQA commented Feb 9, 2021

Test build #135060 has finished for PR 31442 at commit c63121a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@maropu
Member

maropu commented Feb 10, 2021

Could you resolve the conflict? Looks fine otherwise.


## Upgrading from Spark SQL 3.1 to 3.2

- In Spark 3.2, the PostgreSQL JDBC dialect uses StringType for MONEY, and MONEY[] is not supported because the JDBC driver for PostgreSQL can't handle those types properly. Previously, DoubleType and ArrayType of DoubleType were used respectively.
Member

nit: Previously, => In Spark 3.1 or earlier,

@sarutak
Member Author

sarutak commented Feb 10, 2021

@maropu Thanks. I've updated it.

@SparkQA

SparkQA commented Feb 10, 2021

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39673/

@SparkQA

SparkQA commented Feb 10, 2021

Kubernetes integration test status failure
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/39673/

@SparkQA

SparkQA commented Feb 10, 2021

Test build #135091 has finished for PR 31442 at commit b838218.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


## Upgrading from Spark SQL 3.1 to 3.2

- In Spark 3.2, the PostgreSQL JDBC dialect uses StringType for MONEY, and MONEY[] is not supported because the JDBC driver for PostgreSQL can't handle those types properly. In Spark 3.1 or earlier, DoubleType and ArrayType of DoubleType were used respectively.
Member

I have the same comment here with #31491 (comment)

Member Author

I've replied at #31491 (comment) .

Member Author

@maropu Any more concerns?

@sarutak
Member Author

sarutak commented Feb 16, 2021

If there is no objection, I'll merge soon.
If we find that the word "dialect" in the migration guide is an internal term, I'll open another PR to fix all the occurrences in the guide.

@asfgit asfgit closed this in dd6383f Feb 17, 2021
@sarutak
Member Author

sarutak commented Feb 17, 2021

Merged to master. Thanks all.

@dongjoon-hyun
Member

Thank you, @sarutak and all.
