Conversation

@amanomer (Contributor)

What changes were proposed in this pull request?

This PR makes Spark SQL's cast-to-timestamp behavior consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL.

Why are the changes needed?

By default, Spark SQL and PostgreSQL differ in their cast behavior between many types. We should make Spark SQL's cast behavior consistent with PostgreSQL when spark.sql.dialect is configured as PostgreSQL.

Does this PR introduce any user-facing change?

Yes. If a user switches to the PostgreSQL dialect, they will

  • get an AnalysisException when casting to timestamp from any data type except StringType and DateType.
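A minimal sketch of how this could look in the cast expression (hypothetical shape, assuming Catalyst's Expression and TypeCheckResult APIs; the PR's actual code may differ):

```scala
// Sketch only, not the PR's actual code: a Pg-style cast-to-timestamp
// expression that rejects every source type except StringType and DateType
// at analysis time, so the analyzer raises an AnalysisException for
// other inputs.
case class PostgreCastToTimestamp(child: Expression, timeZoneId: Option[String])
  extends UnaryExpression {

  override def dataType: DataType = TimestampType

  override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
    case StringType | DateType => TypeCheckResult.TypeCheckSuccess
    case other =>
      TypeCheckResult.TypeCheckFailure(s"cannot cast type $other to timestamp")
  }
}
```

The failure message here mirrors the one asserted in the test snippet quoted later in this thread.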

How was this patch tested?

Added test cases.

@amanomer amanomer changed the title [WIP][SPARK-29838][SQL] PostgreSQL dialect: cast to timestamp [SPARK-29838][SQL] PostgreSQL dialect: cast to timestamp Nov 14, 2019
@amanomer (Contributor Author)

cc @maropu @cloud-fan

@amanomer (Contributor Author)

@maropu Updated as per the review comments. withTimeZone() is repeated because it cannot be overridden in the abstract class PostgreCastBase.

@amanomer amanomer requested a review from maropu November 15, 2019 04:55
@amanomer (Contributor Author)

@maropu Can you review the latest changes?

@amanomer (Contributor Author)

@maropu Kindly review. I have updated as per your suggestion.

if (conf.usePostgreSQLDialect) {
  plan.transformExpressions {
    case Cast(child, dataType, timeZoneId)
        if dataType == TimestampType =>
Member

We can leave the cast-timestamp-to-timestamp case to the Optimizer to optimize.

@Ngone51 (Member) commented Nov 22, 2019

Oh, sorry. Here I mean that we should change the if condition to: if child.dataType != TimestampType && dataType == TimestampType =>, because the Optimizer currently cannot optimize the Pg cast.

Contributor Author

Got this. I will update this.
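With that condition, the resolution rule quoted earlier would become something like the following (a rough sketch against Catalyst; the exact final code may differ):

```scala
// Sketch: rewrite casts to TimestampType into the Pg-specific expression,
// but skip the no-op timestamp-to-timestamp case, since the Optimizer
// currently cannot optimize the Pg cast away.
if (conf.usePostgreSQLDialect) {
  plan.transformExpressions {
    case Cast(child, dataType, timeZoneId)
        if child.dataType != TimestampType && dataType == TimestampType =>
      PostgreCastToTimestamp(child, timeZoneId)
  }
}
```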

override def sql: String = s"CAST(${child.sql} AS ${dataType.sql})"

override def castToTimestamp(from: DataType): Any => Any = from match {
  case StringType =>
    buildCast[UTF8String](_, utfs => DateTimeUtils.stringToTimestamp(utfs, zoneId)
Member

I believe that Postgres can correctly parse the strings 19700101, 1970/01/01, and January 1 04:05:06 1970 PST, while Spark can't. So I think we may also need to support them in PostgreCastToTimestamp.

Contributor Author

Thanks for your suggestion. I will check this.

Contributor Author

postgres# select cast('19700101' as timestamp);
01.01.1970 00:00:00
postgres# select cast('1970/01/01' as timestamp);
01.01.1970 00:00:00
postgres# select cast('January 1 04:05:06 1970 PST' as timestamp);
01.01.1970 04:05:06

Spark returns NULL for all of them.
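For reference, a standalone sketch (plain JVM code, not Spark's actual implementation; the object name and pattern list are hypothetical) of how such extra date patterns could be tried in order:

```scala
import java.time.LocalDate
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// Hypothetical helper, not Spark code: try a list of extra date patterns
// that PostgreSQL accepts but Spark's stringToTimestamp rejects.
object PgLikeDateParse {
  private val extraFormats = Seq(
    DateTimeFormatter.ofPattern("yyyyMMdd"),   // matches 19700101
    DateTimeFormatter.ofPattern("yyyy/MM/dd")  // matches 1970/01/01
  )

  // Returns the first successful parse, or None if no pattern matches.
  def parseDate(s: String): Option[LocalDate] =
    extraFormats.view.flatMap { fmt =>
      try Some(LocalDate.parse(s, fmt))
      catch { case _: DateTimeParseException => None }
    }.headOption
}
```

A real implementation would also need the month-name and time-zone forms (e.g. January 1 04:05:06 1970 PST), which requires locale-aware patterns and zone handling on top of this.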

Contributor Author

@maropu Kindly review the latest changes and give your feedback on supporting the above queries.

Do we need to support them in this PR? If yes, we need to list all the timestamp formats that Postgres supports but Spark doesn't.

Member

I personally think the support above is not the main issue of this PR, so it is better to separate the two pieces of work: the timestamp cast support and the timestamp format support for the Pg dialect.

@amanomer (Contributor Author) commented Nov 19, 2019

cc @cloud-fan @srowen

@amanomer (Contributor Author) commented Nov 20, 2019

@maropu Kindly review this. Other PRs are depending on this.

@amanomer amanomer requested a review from maropu November 20, 2019 05:56
    val expectedResult = s"cannot cast type ${value.getClass} to timestamp"
    assert(actualResult.contains(expectedResult))
  }
}
Member

Can you move these tests to SQLQueryTestSuite, e.g., input/postgreSQL/cast.sql?

Contributor Author

I have moved these test cases. cast.sql.out needs to be updated.

Contributor Author

Need to delete this test case.

}

override def dataType: DataType = BooleanType
case class PostgreCastToTimestamp(child: Expression, timeZoneId: Option[String])
Member

BTW, do we need to define a new rule and a new cast expression for each Pg cast pattern? I mean, can't we define all the Pg cast patterns in a single rule and a single cast expression? cc: @cloud-fan @Ngone51

Member
The reason will be displayed to describe this comment to others. Learn more.

I think we can and should combine them into a single one (both rule and expression) when more types come in, just like the original Cast does. But I'm not sure where we should start. Maybe this one?
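One possible shape for that combined version (purely illustrative; the class name and structure here are assumptions, not code from this PR):

```scala
// Sketch: a single Pg cast expression that dispatches on the target type,
// instead of one class per target (PostgreCastToBoolean,
// PostgreCastToTimestamp, ...). New target types extend the outer match.
case class PostgreCast(child: Expression, dataType: DataType,
    timeZoneId: Option[String]) extends UnaryExpression {

  override def checkInputDataTypes(): TypeCheckResult = dataType match {
    case TimestampType => child.dataType match {
      case StringType | DateType => TypeCheckResult.TypeCheckSuccess
      case from => TypeCheckResult.TypeCheckFailure(
        s"cannot cast type $from to timestamp")
    }
    case BooleanType =>
      // boolean-specific input checks would go here
      TypeCheckResult.TypeCheckSuccess
    case other =>
      TypeCheckResult.TypeCheckFailure(s"unsupported Pg cast target: $other")
  }
}
```

A single analyzer rule could then rewrite Cast into PostgreCast for every supported target type at once.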

Member
The reason will be displayed to describe this comment to others. Learn more.

Yeah, I personally think so. @cloud-fan

@amanomer (Contributor Author)

Jenkins, test this please.

@amanomer amanomer requested a review from maropu November 23, 2019 02:53
@amanomer (Contributor Author)

cc @maropu @Ngone51

@Ngone51 (Member) commented Nov 25, 2019

Hi @amanomer , have you addressed #26472 (comment) ?

@amanomer (Contributor Author)

@Ngone51 I have updated this PR as per your reviews. Kindly review.

cc @maropu @cloud-fan

@maropu (Member) commented Nov 26, 2019

Can you merge the current two rules into one, as mentioned above? #26472 (comment)

@amanomer (Contributor Author)

Yes, I will merge these two rules, CastToTimestamp and CastToBoolean.

@amanomer (Contributor Author)

@maropu Please check latest changes. Thanks

@maropu (Member) commented Nov 29, 2019

IIUC, the community is planning to hold off the Pg dialect PRs for a while (until 3.0 is released?). Please check http://apache-spark-developers-list.1001551.n3.nabble.com/DISCUSS-PostgreSQL-dialect-td28417.html cc: @cloud-fan

@dongjoon-hyun (Member)

Thank you for your contribution, @amanomer. As you know, unfortunately, we decided to remove the PostgreSQL dialect via SPARK-30125 (#26763). Sorry about that. I'll close this PR, too.
