
Conversation

@MaxGekk
Member

MaxGekk commented Feb 16, 2019

What changes were proposed in this pull request?

In the PR, I propose to add new Catalyst type converter for TimestampType. It should be able to convert java.time.Instant to/from TimestampType.

Main motivations for the changes:

  • Smoothly support Java 8 time API
  • Avoid inconsistency of calendars used inside of Spark 3.0 (Proleptic Gregorian calendar) and java.sql.Timestamp (hybrid calendar - Julian + Gregorian).
  • Make conversion independent from current system timezone.

By default, Spark converts values of TimestampType to java.sql.Timestamp instances, but the SQL config spark.sql.catalyst.timestampType can change the behavior. It accepts two values: Timestamp (default) and Instant. If the latter is set, Spark returns java.time.Instant instances for timestamp values.
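
A minimal usage sketch (assuming an active SparkSession named spark; the config name and values are the ones proposed in this PR, and the flag was later renamed, as noted at the end of the thread):

// Default: values of TimestampType come back as java.sql.Timestamp.
spark.sql("SELECT timestamp '2019-02-16 10:00:00' AS ts")
  .collect()(0).getAs[java.sql.Timestamp]("ts")

// With the flag flipped, the same query yields java.time.Instant.
spark.conf.set("spark.sql.catalyst.timestampType", "Instant")
spark.sql("SELECT timestamp '2019-02-16 10:00:00' AS ts")
  .collect()(0).getAs[java.time.Instant]("ts")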

How was this patch tested?

Added new tests to CatalystTypeConvertersSuite to check conversion of TimestampType to/from java.time.Instant.

@SparkQA

SparkQA commented Feb 16, 2019

Test build #102420 has finished for PR 23811 at commit 682f769.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • .doc("Java class to/from which an instance of TimestampType is converted.")

.booleanConf
.createWithDefault(true)

val TIMESTAMP_EXTERNAL_TYPE = buildConf("spark.sql.catalyst.timestampType")
Member

We can support reading from both types at the same time right?
I don't know if it's worth changing what it is written to; not worth a flag IMHO.

Member Author

We can support reading from both types at the same time right?

At Spark side, we can read both.

I don't know if it's worth changing what it is written to; not worth a flag IMHO.

Timestamps can be loaded from a datasource, cast from other types, etc. If a user wants to import (collect) non-legacy timestamps (I mean java.time.Instant), how can they do that without the flag?

Member

Import is fine; we could potentially read both types to TimestampType. Can we just be opinionated about the right way to write it back out, and keep current behavior? It may be 'legacy', but I'm not sure it's worth the behavior change. You may have more context on why that's important though.

As with many things I just don't know how realistically people will understand the issue, find the flag, set it, and maintain it across deployments.

Member Author

It may be 'legacy', but I'm not sure it's worth the behavior change.

The SQL config spark.sql.catalyst.timestampType has the default value Timestamp, which preserves current behavior. When a user wants to import java.time.Instant from Spark, they can change the config to point to the Java 8 timestamp class.

@kiszk
Member

kiszk commented Feb 18, 2019

Just curious: is it OK to convert TimestampType from/to a Java class? I think I have seen discussions that it is not good to expose TimestampType. cc @gatorsmile

@MaxGekk
Member Author

MaxGekk commented Feb 22, 2019

is it OK to convert TimestampType from/to a Java class? I think I have seen discussions that it is not good to expose TimestampType.

Actually, we are not exposing TimestampType here. In the current implementation, the internal TimestampType is converted to java.sql.Timestamp (when you do collect(), for example). Java 8 brought a new set of time-related classes, including java.time.Instant, which directly reflects Spark's TimestampType. I just want to allow users to import Spark's timestamp as a modern Java timestamp class, to avoid possible problems with java.sql.Timestamp - like the hybrid calendar and the default time zone used inside java.util.Date in some places.
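
For illustration, a minimal JDK-only sketch (not code from this PR) of how the two external types relate; Instant is a plain point on the UTC timeline, matching Spark's TimestampType semantics:

import java.sql.Timestamp
import java.time.Instant

val instant: Instant = Instant.parse("2019-02-22T12:00:00Z")
val legacy: Timestamp = Timestamp.from(instant) // new -> legacy
val back: Instant = legacy.toInstant            // legacy -> new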

@srowen
Member

srowen commented Feb 22, 2019

I still have a preference for keeping it simple and returning one type, being opinionated. That would probably argue for the newer type. However, I can imagine that could break a lot of code, even though this is only a major version upgrade. Right? Or would most user code see the same methods exposed on Instant and Timestamp and not care much?

It's a case where I do understand having a flag. I'd even be OK with defaulting to Instant, with this as a safety valve, to push people to better timestamp implementations. The Java 8 class has been out for years.

@MaxGekk
Member Author

MaxGekk commented Feb 25, 2019

I still have a preference for keeping it simple and returning one type, being opinionated.

This is what I propose in this PR - return java.sql.Timestamp by default, but a user can ask for java.time.Instant instead by setting the SQL config spark.sql.catalyst.timestampType to Instant.

even though this is only a major version upgrade. Right?

If we stick with Timestamp as the value of the SQL config, this won't break anything, but it definitely breaks user apps if we make Instant the default value for the config.

... but if we are brave enough and make java.time.Instant the default "external" type for Catalyst's TimestampType, migration on the user side should not be so hard - Timestamp.from(instant).

I'd even be OK with defaulting to instant with this as a safety-valve, to push people to better timestamp implementations. The Java 8 class has been out for years.

I would prefer this way too, but I am just afraid to break users' apps even in a major version - 3.0.

@cloud-fan Could you take a look at the PR as well, please?

@cloud-fan
Contributor

The RowEncoder also needs to create timestamp objects for timestamp columns; we should apply the config there too.

@cloud-fan
Contributor

BTW do we have end-to-end tests for this feature? I'd like to see df.collect() and UDF test cases.

@MaxGekk
Member Author

MaxGekk commented Feb 25, 2019

BTW do we have end-to-end tests for this feature?

Not yet. I just wanted to be sure this will be accepted in general before investing time in this.

MaxGekk changed the title from [SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType to [WIP][SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType on Feb 25, 2019
@cloud-fan
Contributor

As long as it's protected by a config, I think we are fine. We can have more discussion about whether or not to make it the default in a follow-up PR.

@MaxGekk
Member Author

MaxGekk commented Feb 25, 2019

As long as it's protected by a config, I think we are fine.

ok. I'll continue and support java.time.Instant in other places (under the SQL config).

@MaxGekk
Member Author

MaxGekk commented Feb 26, 2019

The RowEncoder also needs to create timestamp objects for timestamp columns, we should apply the config there too.
I'd like to see df.collect() and UDF test cases.

Added such tests - the UDF test contains collect as well.

MaxGekk changed the title from [WIP][SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType to [SPARK-26902][SQL] Support java.time.Instant as an external type of TimestampType on Feb 26, 2019
@SparkQA

SparkQA commented Feb 27, 2019

Test build #102797 has finished for PR 23811 at commit 64a632b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

can you rebase your branch? We need to make changes in DeserializerBuildHelper instead of ScalaReflection now.

@cloud-fan
Contributor

otherwise LGTM

@SparkQA

SparkQA commented Feb 27, 2019

Test build #102816 has finished for PR 23811 at commit a2ec027.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Contributor

thanks, merging to master!

cloud-fan closed this in b0450d0 on Feb 27, 2019
@MaxGekk
Member Author

MaxGekk commented Feb 27, 2019

@cloud-fan Thanks. I will do similar changes for DateType and java.time.LocalDate.

@HeartSaVioR
Contributor

HeartSaVioR commented Feb 27, 2019

@MaxGekk
Just curious, is there a specific reason for the Instant serializer to be added only to ScalaReflection? I haven't found the same for JavaTypeInference, so I'd just like to know whether it is a missed spot or intended to be excluded.

FYI: I'm proposing the same refactoring for serializerFor as well (#23908) and just became aware of this PR while rebasing.

@cloud-fan
Contributor

@HeartSaVioR can you add it in #23908? Thanks!

@HeartSaVioR
Contributor

Ah OK, it was a missed spot. I wasn't sure from my side. I'll add it. Thanks!

@MaxGekk
Member Author

MaxGekk commented Feb 27, 2019

Ah OK, it was a missed spot.

Probably I missed it because the code is not covered by any of my tests. Just for the future, when I add the new type for DateType, which tests cover the code in JavaTypeInference?

@HeartSaVioR
Contributor

There doesn't look to be an exact counterpart of ScalaReflectionSuite for JavaTypeInference (better to have one, I guess).
Maybe JavaBeanDeserializationSuite (closer) or JavaDatasetSuite?

cloud-fan pushed a commit that referenced this pull request Mar 21, 2019
## What changes were proposed in this pull request?

In the PR, I propose to extend `Literal.apply` to support constructing literals of `TimestampType` and `DateType` from `java.time.Instant` and `java.time.LocalDate`. The Java classes have already been supported as external types for `TimestampType` and `DateType` by PRs #23811 and #23913.

## How was this patch tested?

Added new tests to `LiteralExpressionSuite`.

Closes #24161 from MaxGekk/literal-instant-localdate.

Authored-by: Maxim Gekk <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>
MaxGekk deleted the timestamp-instant branch on September 18, 2019
@rxin
Contributor

rxin commented Jan 24, 2020

How useful is this change? Wouldn't it break a lot of user code that uses the Timestamp type when upgrading to 3.0?

It seems like we wouldn't be able to ever remove the config flag.

@MaxGekk
Member Author

MaxGekk commented Jan 25, 2020

How useful is this change?

Please take a look at the motivation points in the PR description.

Wouldn't it break a lot of user code that uses the Timestamp type when upgrading to 3.0?

No, it will not break because Spark still returns java.sql.Timestamp by default.

It seems like we wouldn't be able to ever remove the config flag.

The flag has been removed already and replaced by spark.sql.datetime.java8API.enabled, which is disabled by default:

val DATETIME_JAVA8API_ENABLED = buildConf("spark.sql.datetime.java8API.enabled")
  .doc("If the configuration property is set to true, java.time.Instant and " +
    "java.time.LocalDate classes of Java 8 API are used as external types for " +
    "Catalyst's TimestampType and DateType. If it is set to false, java.sql.Timestamp " +
    "and java.sql.Date are used for the same purpose.")
  .booleanConf
  .createWithDefault(false)
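
A sketch of the renamed flag in action (again assuming an active SparkSession named spark):

spark.conf.set("spark.sql.datetime.java8API.enabled", true)
val ts = spark.sql("SELECT timestamp '2020-01-25 00:00:00' AS ts")
  .collect()(0).get(0)
// ts is now a java.time.Instant rather than a java.sql.Timestamp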

@rxin
Contributor

rxin commented Jan 29, 2020

But if it is off by default, it means almost nobody will be using the new type, right? My point is that it's not great when you have a feature that's almost never on and is mostly just dead code. Is there a plan to transition over to the new type? Wouldn't that plan involve breaking a lot of user code?

@MaxGekk
Member Author

MaxGekk commented Jan 29, 2020

it means almost nobody will be using the new type, right?

From my point of view, this is a debatable statement. Java 8 is 6 years old already. I would guess a significant amount of modern apps, including Spark apps, is written on top of the Java 8 time API. I do think users will look for how to parallelize Java 8 time-related values into Spark. Maybe we should highlight more clearly in the Spark SQL docs how to do that by using the flag.

My point is that it's not great when you have a feature that's almost never on and mostly just be dead code.

Actually, the spark.sql.datetime.java8API.enabled config controls only output types (from Spark). Regarding input types, Spark accepts both the legacy type and the new java.time.Instant, as sketched below. If an app or UDF is written using the Java 8 API, the user will look for ways to get Instant out of Spark as well. There is not so much "dead" code in the changes.
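
For instance, a sketch of Instant values as input (assuming the Instant encoder available via spark.implicits in Spark 3.0):

import java.time.Instant
import spark.implicits._

// Accepted as input regardless of the output flag; the column is TimestampType.
val ds = Seq(Instant.parse("2020-01-29T00:00:00Z")).toDS()
ds.printSchema()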

Is there a plan to transition over to the new type?

Spark 3.0 only introduces the Java 8 time classes so far. In this release, we could keep both, with the old return types by default. I would switch the Java 8 time API on by default in the next release - Spark 3.1 or 3.2. On the other hand, a major release is a good time for switching, since the Java 8 time API is mature enough.

Wouldn't that plan involve breaking a lot of user code?

I don't have statistics at hand. Switching the Java 8 time API on by default will definitely break: 1. UDFs and 2. apps that collect results from Spark. In the first case, we could try to detect the input types of a UDF and maybe avoid failures by passing legacy types, but in the case of collecting datasets from Spark, it depends on the user's app. In any case, the code can be easily fixed by converting java.time.Instant to java.sql.Timestamp via java.sql.Timestamp.from(instant), as in the sketch below.
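
For example, a hypothetical adapter (the helper name is illustrative) that keeps legacy collect-side code working after such a switch:

import java.sql.Timestamp
import java.time.Instant
import org.apache.spark.sql.Row

// Convert back to the legacy type at the boundary where rows are consumed.
def legacyTimestamp(row: Row, field: String): Timestamp =
  Timestamp.from(row.getAs[Instant](field))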

@marmbrus
Contributor

As a somewhat heavy user of Datasets, I'm actually +1 on letting users get nicer implementations of Timestamp out of Spark. The flag is not the most elegant way to expose this, but in some cases I think it's our only choice (i.e. when Spark is returning GenericRows without any other hint from the users about the type they want).

Apologies if I misread the PR, but couldn't we seamlessly support both types in many cases without the flag? (i.e. when reflection can tell us which type the user is expecting)

That said, I'm -1 on making this the default anytime soon. Sounds like it will break a lot of programs.

@cloud-fan
Contributor

For df.collect, there is no hint that can tell whether users want the old timestamp or the Java 8 one, except for the flag. So I agree that we can't make this the default anytime soon.
