
[SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis #30902

Closed

saikocat wants to merge 14 commits into apache:master from saikocat:SPARK-33888

Conversation

@saikocat
Contributor

@saikocat saikocat commented Dec 23, 2020

What changes were proposed in this pull request?

The JDBC SQL TIME type is incorrectly represented as TimestampType; we change it to a physical Int in millis for now.

Why are the changes needed?

Currently, for JDBC, the SQL TIME type is incorrectly represented as Spark's TimestampType. It should be represented as a physical int in millis: a time of day, with no reference to a particular calendar, time zone, or date, with a precision of one millisecond, storing the number of milliseconds after midnight, 00:00:00.000.
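The proposed representation is easy to state concretely: a TIME value becomes the count of milliseconds elapsed since midnight. A language-neutral sketch of the arithmetic (in Python here; the PR itself is Scala, and the helper name is illustrative):

```python
from datetime import time

def time_to_millis(t: time) -> int:
    """SQL TIME -> milliseconds after midnight, 00:00:00.000."""
    return ((t.hour * 60 + t.minute) * 60 + t.second) * 1000 + t.microsecond // 1000

print(time_to_millis(time(0, 0, 1)))     # 1000
print(time_to_millis(time(12, 34, 56)))  # 45296000
```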

Does this PR introduce any user-facing change?

No

How was this patch tested?

Close #30902

@gengliangwang
Member

Hi @saikocat ,

Thanks for reporting the issue.
As per the Avro spec (https://avro.apache.org/docs/1.8.2/spec.html#Time+%28millisecond+precision%29):

Time (microsecond precision)
The time-micros logical type represents a time of day, with no reference to a particular calendar, time zone or date, with a precision of one microsecond.

A time-micros logical type annotates an Avro long, where the long stores the number of microseconds after midnight, 00:00:00.000000.

Timestamp (millisecond precision)
The timestamp-millis logical type represents an instant on the global timeline, independent of a particular time zone or calendar, with a precision of one millisecond.

A timestamp-millis logical type annotates an Avro long, where the long stores the number of milliseconds from the unix epoch, 1 January 1970 00:00:00.000 UTC.

So, we can't map the time-millis/time-micros type to the Timestamp type. Instead, we should map it as CalendarIntervalType.
In Spark, CalendarIntervalType is an internal type and we can't use it for table creation. So we didn't support the Avro logical types time-millis and time-micros.
In the current stage, I am +0 for supporting it.
cc @cloud-fan

@saikocat
Contributor Author

saikocat commented Dec 23, 2020

Hi @gengliangwang,

Thank you so much for your reply. I understand the issue now.

Then I guess the bug is in the Spark JDBC Utils schema conversion part (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L229)? It converts a TIME(6) in MySQL into TimestampType. Should it be IntegerType then?
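The mapping under discussion is essentially a lookup from JDBC type codes to catalyst types. A hypothetical sketch of that lookup (type-code values are the real java.sql.Types constants; the table itself is illustrative, not Spark's actual JdbcUtils code):

```python
# java.sql.Types constants (values fixed by the JDBC spec)
TIME, TIMESTAMP, INTEGER = 92, 93, 4

# Before the fix, TIME mapped to "TimestampType" (wrong); this PR
# changes it to "IntegerType" (millis after midnight).
JDBC_TO_CATALYST = {
    TIME: "IntegerType",
    TIMESTAMP: "TimestampType",
    INTEGER: "IntegerType",
}

print(JDBC_TO_CATALYST[TIME])  # IntegerType
```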

@saikocat saikocat marked this pull request as ready for review December 23, 2020 10:52
@gengliangwang
Member

Then I guess the bug is in the Spark JDBC Utils schema conversion part (https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/jdbc/JdbcUtils.scala#L229)? It converts a TIME(6) in MySQL into TimestampType. Should it be IntegerType then?

Yes, it should be IntegerType or CalendarIntervalType.

@cloud-fan
Contributor

TIME is a standard SQL data type, distinct from TIMESTAMP and INTERVAL. Unfortunately, Spark doesn't support this data type yet, and can only fail (or read it as the physical type int).

The JDBC mapping is wrong and we should fix it.

@saikocat
Contributor Author

Noted with thanks. Should I just make the change/fix in JdbcUtils.scala then? It should be an easy fix if we choose to go with IntegerType.

@cloud-fan
Contributor

Yes, and we can also fix Avro to produce a better error message (or read time as int).

@saikocat
Contributor Author

I think I've fixed the type mapping and tests. However, I can't seem to pass the test CliSuite.SPARK-29022: Commands using SerDe provided in --hive.aux.jars.path, which looks unrelated to the changes I made: https://github.com/apache/spark/pull/30902/checks?check_run_id=1607685366.

Any help would be much appreciated!

Merry Christmas!

@cloud-fan
Contributor

ok to test

@saikocat saikocat changed the title [SPARK-33888][SQL] Add support for TimeMillis logicalType to TimestampType [SPARK-33888][SQL] JDBC SQL TIME type represents incorrectly as TimestampType, it should be physical Int in millis Dec 25, 2020
@saikocat
Contributor Author

Seems like the Amplab Jenkins CI ran out of space during assembly :(

@SparkQA

SparkQA commented Dec 25, 2020

Test build #133380 has finished for PR 30902 at commit 68e2afe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

// Represents a time of day, with no reference to a particular calendar,
// time zone or date, with a precision of one millisecond.
// It stores the number of milliseconds after midnight, 00:00:00.000.
if (rs.getMetaData.getColumnType(columnIdx) == java.sql.Types.TIME) {
Contributor

do we run the if check for every input row?

Contributor Author

Yes. You're right.

Originally, I made changes to the private method makeGetters and passed in the ResultSetMetaData as an additional parameter, so we could have a separate pattern match: case IntegerType if rsMetadata.getColumnType(columnIdx) == TIME =>. But right now, the column index/position is not accessible at the pattern-matching level, only inside the closure function :(

Any advice is deeply appreciated here.

Contributor Author

Actually, let me see if I can work with the metadata from getSchema function.

val expectedTimeMillis = java.util.concurrent.TimeUnit.SECONDS.toMillis(
expectedTimeRaw.toLocalTime().toSecondOfDay()
).toInt
assert(rows(0).getAs[java.sql.Time](0) === expectedTimeMillis)
Contributor

what was the result before? null?

Contributor Author

The result before was expectedTimeMillis for rows(0).
For rows(1), it was null.

@cloud-fan
Contributor

@saikocat can you update the PR description? It's not related to Avro now.

@saikocat
Contributor Author

@cloud-fan I have addressed the code review comments. It took a while because of a bug where the metadata builder was not being built.

I will use the metadata field to store a logicaltimetype boolean key and check for it in the pattern matching.
I also fixed a bug where the metadata was not being built/propagated for StructField in getSchema(). Ideally that would be another MR, but the fix is required for the extra metadata field.
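The flow described here — tag the column once at schema-inference time, then branch on the tag when building the per-column getter, instead of consulting ResultSetMetaData on every row — can be sketched as follows (field shapes and the key name mirror the discussion; this is illustrative, not Spark's StructField/Metadata API):

```python
TIME = 92  # java.sql.Types.TIME

def field_for(name, sql_type):
    """Schema side: record a metadata flag once, while reading column metadata."""
    metadata = {"logical_time_type": True} if sql_type == TIME else {}
    return {"name": name, "type": "IntegerType", "metadata": metadata}

def make_getter(field):
    """Getter side: branch on metadata, not on per-row metadata lookups."""
    if field["metadata"].get("logical_time_type"):
        return "read TIME as millis-after-midnight int"
    return "read plain int"

print(make_getter(field_for("t", TIME)))  # read TIME as millis-after-midnight int
print(make_getter(field_for("n", 4)))     # read plain int
```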

@SparkQA

SparkQA commented Dec 28, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38035/

@SparkQA

SparkQA commented Dec 28, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38035/

@SparkQA

SparkQA commented Dec 29, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38092/

@SparkQA

SparkQA commented Dec 29, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38092/

@SparkQA

SparkQA commented Dec 29, 2020

Test build #133503 has finished for PR 30902 at commit f55a7d6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

val rawTime = rs.getTime(pos + 1)
if (rawTime != null) {
val rawTimeInNano = rawTime.toLocalTime().toNanoOfDay()
val timeInMillis = TimeUnit.NANOSECONDS.toMillis(rawTimeInNano).toInt
Contributor

Can we use Math.toIntExact to avoid overflow?

Contributor Author

👍 Thanks for this! Good catch!
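Math.toIntExact fails loudly (ArithmeticException) on values outside the 32-bit range, whereas .toInt silently wraps. A Python sketch of the same guard (Python ints don't overflow, so the range check is written out explicitly):

```python
INT_MIN, INT_MAX = -2**31, 2**31 - 1

def to_int_exact(value: int) -> int:
    """Mimics Java's Math.toIntExact: raise on overflow rather than wrap."""
    if not (INT_MIN <= value <= INT_MAX):
        raise ArithmeticError("integer overflow")
    return value

# A day holds at most 86,400,000 ms, which fits comfortably in 32 bits,
# but the guard still protects against a corrupt or out-of-range value.
MILLIS_PER_DAY = 24 * 60 * 60 * 1000
print(to_int_exact(MILLIS_PER_DAY - 1))  # 86399999

try:
    to_int_exact(INT_MAX + 1)
except ArithmeticError as e:
    print(e)  # integer overflow
```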

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38108/

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38108/

@SparkQA

SparkQA commented Dec 30, 2020

Test build #133519 has finished for PR 30902 at commit 6d5a3c4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

assert(rows(0).getAs[java.sql.Timestamp](2).getNanos === 543543000)
}

test("SPARK-33888 : test TIME types") {
Contributor

nit: SPARK-33888: ...

// scalastyle:off
case java.sql.Types.NUMERIC => metadata.putLong("scale", fieldScale)
case java.sql.Types.DECIMAL => metadata.putLong("scale", fieldScale)
case java.sql.Types.TIME => metadata.putBoolean("logicaltimetype", true)
Contributor

how about logical_time_type?

expectedTimeRaw.toLocalTime().toSecondOfDay()
).toInt
assert(rows(0).getAs[java.sql.Time](0) === expectedTimeMillis)
assert(rows(1).getAs[java.sql.Time](0) === expectedTimeMillis)
Contributor

are we comparing java.sql.Time and int? How does it work?

Contributor Author

Oops. The value is already converted to an Int via makeGetter in JdbcUtils, so these should be getAs[Int]. Sorry about that.

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38126/

@SparkQA

SparkQA commented Dec 30, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/38126/

@SparkQA

SparkQA commented Dec 30, 2020

Test build #133537 has finished for PR 30902 at commit 63fad20.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@saikocat
Contributor Author

saikocat commented Jan 2, 2021

Anything else that is needed before this can be merged?

@AmplabJenkins

Can one of the admins verify this patch?

@cloud-fan
Contributor

cloud-fan commented Jan 4, 2021

thanks, merging to master! (not backporting because TIME is rarely used)

@dongjoon-hyun
Member

Hi, All.
This seems to break the JDBC integration tests. Please see @sarutak's PR.

// It stores the number of milliseconds after midnight, 00:00:00.000.
case IntegerType if metadata.contains("logical_time_type") =>
(rs: ResultSet, row: InternalRow, pos: Int) => {
val rawTime = rs.getTime(pos + 1)
Contributor

@sarutak do you mean that what's returned here is seconds (with a certain precision) from a starting timestamp, while the starting timestamp differs between databases? I'd be a bit surprised if the JDBC protocol was designed this way, but if this is true, then this PR doesn't make sense...

Contributor

What's returned here is java.sql.Time, and its doc says:

The date components should be set to the "zero epoch"
value of January 1, 1970 and should not be accessed.

Maybe some databases don't follow the requirement, but it doesn't matter, as we call rawTime.toLocalTime, which only accesses the hour:minute:second components.
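The point generalizes beyond Java: extracting the time-of-day discards whatever date component the driver attached. In Python terms, with datetime standing in for a java.sql.Time whose date part was not zeroed:

```python
from datetime import datetime, time

# A driver may hand back a value whose date part is NOT the zero epoch.
raw = datetime(1999, 12, 31, 12, 34, 56)

# The equivalent of rawTime.toLocalTime(): only hour/minute/second survive.
print(raw.time())  # 12:34:56
assert raw.time() == time(12, 34, 56)
```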

Contributor

That said, I think reading the SQL TIME type as an integer has a well-defined semantic in Spark (after this PR): the integer represents the milliseconds of the time from 00:00:00.

Member

OK, so the test seems wrong.
Actually, I'll fix it in another PR.

HyukjinKwon pushed a commit that referenced this pull request Jan 22, 2021
…gresDialect

### What changes were proposed in this pull request?

This PR fixes the regression bug brought by SPARK-33888 (#30902).
After that PR merged, `PostgresDIalect#getCatalystType` throws Exception for array types.
```
[info] - Type mapping for various types *** FAILED *** (551 milliseconds)
[info]   java.util.NoSuchElementException: key not found: scale
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:106)
[info]   at scala.collection.immutable.Map$EmptyMap$.apply(Map.scala:104)
[info]   at org.apache.spark.sql.types.Metadata.get(Metadata.scala:111)
[info]   at org.apache.spark.sql.types.Metadata.getLong(Metadata.scala:51)
[info]   at org.apache.spark.sql.jdbc.PostgresDialect$.getCatalystType(PostgresDialect.scala:43)
[info]   at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$.getSchema(JdbcUtils.scala:321)
```

### Why are the changes needed?

To fix the regression bug.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

I confirmed the test case `SPARK-22291: Conversion error when transforming array types of uuid, inet and cidr to StingType in PostgreSQL` in `PostgresIntegrationSuite` passed.

I also confirmed whether all the `v2.*IntegrationSuite` pass because this PR changed them and they passed.

Closes #31262 from sarutak/fix-postgres-dialect-regression.

Authored-by: Kousuke Saruta <sarutak@oss.nttdata.com>
Signed-off-by: HyukjinKwon <gurwls223@apache.org>

// SPARK-33888 - sql TIME type represents as physical int in millis
// Represents a time of day, with no reference to a particular calendar,
// time zone or date, with a precision of one millisecond.
Contributor

After a second thought, why do we pick millisecond precision? Why not microsecond? Is there a standard for it?

Contributor Author

Since we are converting to/from java.sql.Time, and according to the javadoc (https://docs.oracle.com/javase/8/docs/api/java/sql/Time.html), its constructor supports up to millisecond precision.

Contributor

It may confuse Spark users, as Spark's timestamp has microsecond precision.

After more thought, it's probably better to return a timestamp when reading JDBC time, with a clear rule: we convert the time to a timestamp by using the "zero epoch" as the date part. It's also more useful, as users can call the hour function or similar ones to extract field values. What do you think?
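The proposed rule pins the date part so the result is deterministic: time-of-day plus 1970-01-01. A sketch of the conversion (Python datetime standing in for Spark's TimestampType; the helper name is illustrative):

```python
from datetime import date, datetime, time

ZERO_EPOCH = date(1970, 1, 1)

def time_to_timestamp(t: time) -> datetime:
    """JDBC TIME -> timestamp with the 'zero epoch' as the date part."""
    return datetime.combine(ZERO_EPOCH, t)

ts = time_to_timestamp(time(12, 34, 56))
print(ts)       # 1970-01-01 12:34:56
print(ts.hour)  # 12 -- an hour() call on the column now gives the expected value
```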

Contributor Author

Hmm, I agree with you on the user-experience part (the hour function and all). I think it is hard (near impossible) to introduce a new data type (Time, with an HH:MM:SS.sss display) and another function to convert an int without a date portion to Timestamp (most conversions parse strings), as it would have to go through multiple levels of approvals and testing.

Another reason I went with Int was that TimestampType breaks compatibility with the Avro logical type & Spark type converter (https://github.com/apache/spark/blob/master/external/avro/src/main/scala/org/apache/spark/sql/avro/SchemaConverters.scala).

I'm not sure whether a time_to_timestamp() helper function would be a better compromise, or whether we should revert the 2 MRs.

Contributor

It looks better if the Avro schema converter can convert timestamp to time. After reading a time column from JDBC, it becomes IntegerType, and there is no context to indicate that this int comes from JDBC time and means milliseconds. What if the Avro logical type is time-micros? With the timestamp type, at least we know the precision is microsecond.

Contributor Author

Your suggestion makes sense. We could also stuff the info into the metadata field of the struct field. Let me find some time this weekend or over the new year to try out your suggestion. I'll ping you back if I can't find the time, or once I have a solution, so we can decide whether to revert the MRs. Thanks a lot for helping me out!

Should I create a new JIRA ticket and cut a new MR or do you prefer to consolidate in here?

Contributor

Let's create a new JIRA to track it. The PR is merged to master only so we have plenty of time to fix it before the 3.2 release :)

Contributor Author

@cloud-fan btw, I did a quick check, and it seems that if we use TimestampType, the time will always be converted to the JVM system timezone. So for the JDBCSuite test, given the 12:34:56 time value, .select(hour("time")) will always reflect my local timezone's hour instead of 12. So I don't know if we should proceed in this case.

Contributor

We can do whatever we want. We can use the JDBC getTime API to get the time value and construct the timestamp value in a reasonable way. It's under our control.

skestle pushed a commit to skestle/spark that referenced this pull request Feb 3, 2021
cloud-fan pushed a commit that referenced this pull request Feb 4, 2021
…portion fixed regardless of timezone

### What changes were proposed in this pull request?

Due to user experience (confusing to Spark users: java.sql.Time uses milliseconds vs. Spark's microseconds, and users lose useful functions like hour(), minute(), etc. on the column), we have decided to revert back to TimestampType, but this time we will enforce the hour to be consistent across system timezones (via offset manipulation) and the date part to be fixed to the zero epoch.

Full discussion with Wenchen Fan regarding this ticket is here: #30902 (comment)

### Why are the changes needed?

Revert and improvement to sql.Time handling

### Does this PR introduce _any_ user-facing change?

No

### How was this patch tested?

Unit tests and integration tests

Closes #31473 from saikocat/SPARK-34357.

Authored-by: Hoa <hoameomu@gmail.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
