Conversation

@HyukjinKwon (Member) commented Jun 26, 2016

What changes were proposed in this pull request?

This PR corrects the CSV data source to write DateType and TimestampType correctly, just like the JSON data source.

Currently, it writes the data as below:

+----------------+
|            date|
+----------------+
|1440637200000000|
|1414459800000000|
|1454040000000000|
+----------------+

This PR adds the support to write dates and timestamps in a formatted string as below:

  • TimestampType
    • With dateFormat option (e.g. yyyy/MM/dd HH:mm)

      +----------------+
      |            date|
      +----------------+
      |2015/08/26 18:00|
      |2014/10/27 18:30|
      |2016/01/28 20:00|
      +----------------+
      
    • With dateFormat (e.g. yyyy-MM-dd HH:mm z) and timezone option (e.g. GMT)

      +--------------------+
      |                date|
      +--------------------+
      |2015-08-26 18:00 GMT|
      |2014-10-27 18:30 GMT|
      |2016-01-28 20:00 GMT|
      +--------------------+
      
  • DateType
    • With dateFormat option (e.g. yyyy/MM/dd)

      +----------+
      |      date|
      +----------+
      |2015/08/26|
      |2014/10/27|
      |2016/01/28|
      +----------+
      
    • With dateFormat (e.g. yyyy/MM/dd z) and timezone option (e.g. GMT)

      +--------------+
      |          date|
      +--------------+
      |2015/08/26 GMT|
      |2014/10/27 GMT|
      |2016/01/28 GMT|
      +--------------+
      

This PR also adds support for parsing the timezone appropriately when reading, so round-tripping timestamps and dates through writing and reading with the options above works correctly.
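
For illustration, a minimal round-trip sketch using the options proposed in this PR (the dateFormat-on-write behaviour is this PR's proposal, not a released API; the path and app name are placeholders):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("csv-date-roundtrip").getOrCreate()
import spark.implicits._

val df = Seq(java.sql.Timestamp.valueOf("2015-08-26 18:00:00")).toDF("date")

// Write timestamps as formatted strings instead of raw long values.
df.write
  .option("dateFormat", "yyyy/MM/dd HH:mm")
  .csv("/tmp/csv-dates")

// Read them back with the same format; the explicit schema pins the column type.
val back = spark.read
  .schema(df.schema)
  .option("dateFormat", "yyyy/MM/dd HH:mm")
  .csv("/tmp/csv-dates")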

How was this patch tested?

Unit tests in CSVSuite.scala.


case dt: DataType =>
  (row: InternalRow, ordinal: Int) =>
    row.get(ordinal, dt).toString
@HyukjinKwon (Member Author):

The schema is validated here, so exception cases do not have to be handled again at this point.

@srowen (Member) commented Jun 26, 2016

This has a big problem though -- the times are now wrong, because there is no timezone.
Why would we write formatted dates in a data file, though? That's for human consumption. A timestamp is an appropriate way to render a point in time.

@HyukjinKwon (Member Author) commented Jun 26, 2016

Thank you for your comment @srowen!
I did this for two reasons. First, if dates or timestamps are written as long values, they cannot be read back directly; users would have to convert them manually. The other reason is consistency with the JSON data source, which writes dates and timestamps in the format shown above.

Another reason is the dateFormat option. I guess users would want to use it for writing as well; currently it is only used for reading.

@srowen (Member) commented Jun 26, 2016

Hm, that's a bug, isn't it? Any time is ambiguous without a timezone. These things can make sense for local display, but not for data output.

@HyukjinKwon (Member Author) commented Jun 26, 2016

I understand the problem of omitting a timezone, but, to be honest, I have seen many cases of times being read from and written to CSV files without a timezone (e.g. by Microsoft Excel).

@HyukjinKwon (Member Author) commented Jun 26, 2016

Anyway, do you think I should close this and deal with the JSON one instead? I can't decide.

@srowen (Member) commented Jun 26, 2016

I suppose that if it's configurable, and not the default, then you're at least letting people do what they want, including specifying a format with a timezone. I think it's virtually always asking for trouble not to store a timezone, but hey. If Spark anywhere forces you to write data in a human-readable format without one, that seems like a problem.

@SparkQA commented Jun 26, 2016

Test build #61258 has finished for PR 13912 at commit 79a290a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

private def rowToString(row: InternalRow): Seq[String] = {
  var i = 0
  val values = new Array[String](row.numFields)
  while (i < row.numFields) {
Contributor:

Please use values.indices and then map or foreach to make it more functional (and hopefully more readable). With the change, you'll link values to its indices (instead of relying on row.numFields, which is used twice).

BTW, do we need values to be initialized first before adding elements? I'd vote for foldLeft as a better alternative, e.g.

def appendRowValue(arr: Array[String], i: Int) = {
  // do the if here
  arr
}
(0 until row.numFields).foldLeft(Array[String]()) { case (arr, i) => appendRowValue(arr, i) }

@HyukjinKwon (Member Author) Jun 26, 2016:

Using a functional transformation is generally slower due to virtual function calls. This part is executed a lot, so such overhead would become really heavy. See https://github.com/databricks/scala-style-guide#traversal-and-zipwithindex
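
For context, the shape being defended is roughly the following sketch (the nullValue field and the valueConverters lookup are illustrative placeholders, not this PR's exact code):

private def rowToString(row: InternalRow): Seq[String] = {
  val values = new Array[String](row.numFields)
  var i = 0
  // A while loop over a preallocated array avoids the closure allocation and
  // virtual dispatch that map/foldLeft would add on this per-record hot path.
  while (i < row.numFields) {
    values(i) = if (row.isNullAt(i)) params.nullValue // placeholder for the configured null string
                else valueConverters(i)(row, i)       // hypothetical per-column converter array
    i += 1
  }
  values
}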

@HyukjinKwon changed the title from "[SPARK-16216][SQL] CSV data source does not write date and timestamp correctly" to "[SPARK-16216][SQL] CSV data source supports custom date format for writing and timezone option for both reading and writing" on Jun 27, 2016
@HyukjinKwon (Member Author) commented Jun 27, 2016

@srowen Hm, I just wonder if I should correct the JSON data source to be consistent with this, meaning the below:

  1. The default is to write times as numeric timestamps.
  2. Both dateFormat and timezone are supported for reading and writing.

Currently, the JSON data source writes both Date and Timestamp as below:

// TimestampType
1970-01-01 11:46:40.0

// DateType
1970-01-01

@HyukjinKwon (Member Author) commented:

@rxin, could you please take a look?

@SparkQA commented Jun 27, 2016

Test build #61280 has finished for PR 13912 at commit 649ab85.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2016

Test build #61444 has finished for PR 13912 at commit d03e7a0.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Jun 29, 2016

Test build #61443 has finished for PR 13912 at commit 0e1901e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • public final class JavaStructuredNetworkWordCount
    • class OptionUtils(object):
    • class DataFrameReader(OptionUtils):
    • class DataFrameWriter(OptionUtils):
    • class DataStreamReader(OptionUtils):
    • case class ShowFunctionsCommand(

@HyukjinKwon (Member Author) commented:

I will update this if it is decided to be worth adding.

@deanchen (Contributor) commented:

@srowen @rxin Would love to see this get merged, as this has been a pain point for us. As an engineer I'm not a fan of timezoneless dates, but passing through or writing timezoneless dates to CSVs has been a necessary task for a variety of reasons in the past.

We use Spark to parse large amounts of daily financial asset data with pre-agreed-upon date conventions. Most of the dates we deal with don't belong to a timezone and are treated more like indices with a maximum granularity of a day. As a result, the CSV reports we produce for our customers never contain a timezone, and almost all dates are passed through from the original timezoneless dates.

This is not just a common pattern with financial data, but also common practice at large companies I've worked at. One large tech company I worked for used the convention of MST dates and datetimes for all internal systems (databases included).

which means trying to parse times and dates by
``java.sql.Timestamp.valueOf()`` and ``java.sql.Date.valueOf()``.
:param timezone: defines the timezone to be used for both date type and timestamp type.
    If a timezone is specified in the data, this will load them after
Member:

I don't get what this is saying -- if an input string specifies a time zone, then that is the time zone to use to parse it. That's it, right?

@HyukjinKwon (Member Author) Jul 18, 2016:

Yes, it will use the timezone specified in the input. Since Date and Timestamp do not keep timezone information, it calculates the difference between the timezone specified in the input and the one in the option, but the reference is the one specified in the option.

I meant to say, for example, that if timezone is set to GMT, all the read Date and Timestamp values are in the GMT timezone after the differences are calculated. So..

If the CSV data is as below:

27/08/2015 00:00 PDT
27/08/2015 01:00 PDT
27/08/2015 02:00 PDT

If this is read as below:

val df = spark.read
  .format("csv")
  .option("timezone", "GTM")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .load("path")
df.show()

it will become as below in the dataframe (the difference between GMT and PDT is 7 hours). The values below will be timestamp objects, but I have represented them as strings.

+-----------------------+
|                     _1|
+-----------------------+
|2015-08-27 07:00:00.000|
|2015-08-27 08:00:00.000|
|2015-08-27 09:00:00.000|
+-----------------------+

EDITED: I added an example for writing as well, just to be clear.

Then, if you write this as below:

df.write
  .format("csv")
  .option("header", "true")
  .option("timezone", "GTM")
  .option("dateFormat", "dd/MM/yyyy HH:mm z")
  .save("path")

the results will be:

27/08/2015 07:00 GMT
27/08/2015 08:00 GMT
27/08/2015 09:00 GMT

@HyukjinKwon (Member Author):

If this behaviour looks fine, I will just change the documentation to be clearer.

Member:

In the example above, the input specifies a timezone, and that must be used to interpret it. You say "it becomes" in the dataframe, but what it becomes is a timestamp, which is unambiguous and has no timezone. Timezone matters when converting back to a string for display, but your example only shows the parameter used on reading, and shows no timezone in the output. I am not sure that this is the intended behavior?

@HyukjinKwon (Member Author) Jul 18, 2016:

I thought it loses the timezone information after being loaded into Spark; I mean, Timestamp and Date instances don't carry timezone information in them. The timezone specified in the input is being used in the example...

I am sorry; I don't think I understood clearly. Do you mind if I ask what you would expect before reading, after reading (in the dataframe), and after writing?

(I just added an example.)

@HyukjinKwon (Member Author):

I see. Thank you for the detailed explanation. I will correct the comments and the behaviour.

@HyukjinKwon (Member Author) Jul 18, 2016:

It seems this is the default behaviour of SimpleDateFormat. I will look into this more deeply and will fix it or leave some more comments tomorrow. Thanks again!
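
For reference, a small standalone sketch of that SimpleDateFormat behaviour (assuming a GMT-configured formatter; the values follow the GMT/PDT example above):

import java.text.SimpleDateFormat
import java.util.TimeZone

val fmt = new SimpleDateFormat("dd/MM/yyyy HH:mm z")
fmt.setTimeZone(TimeZone.getTimeZone("GMT"))

// parse() honours the zone present in the input string...
val parsed = fmt.parse("27/08/2015 00:00 PDT")
// ...while format() uses the formatter's own configured zone.
println(fmt.format(parsed)) // prints 27/08/2015 07:00 GMT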

Contributor:

Why not just have the timezone as part of the dateFormat, so users can specify it?

Contributor:

Ah, I see -- you want to control the zone this gets converted to.

@HyukjinKwon (Member Author):

I will clean up the PR description and all of this soon with a better proposal. Thanks!

@srowen (Member) commented Jul 18, 2016

This needs a rebase.
Pardon, but where does this affect the parsing of date strings? I'm missing it, but I'm sure it's here.


case TimestampType if params.dateFormat != null =>
  (row: InternalRow, ordinal: Int) =>
    params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
@HyukjinKwon (Member Author):

@srowen Maybe you are looking for this case. I had to do it like this to avoid per-record type dispatch.
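
In other words, the DataType match runs once per column to build a converter, and only the returned closure runs per record. A sketch of that shape, reusing the identifiers from the snippet above (the makeConverter and ValueConverter names are illustrative, and schema is assumed to be the write schema):

type ValueConverter = (InternalRow, Int) => String

private def makeConverter(dataType: DataType): ValueConverter = dataType match {
  case TimestampType if params.dateFormat != null =>
    (row: InternalRow, ordinal: Int) =>
      params.dateFormat.format(DateTimeUtils.toJavaTimestamp(row.getLong(ordinal)))
  case dt: DataType =>
    (row: InternalRow, ordinal: Int) =>
      row.get(ordinal, dt).toString
}

// Built once per schema, then applied per record without further type dispatch.
val converters: Array[ValueConverter] = schema.fields.map(f => makeConverter(f.dataType))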

Member:

This is formatting to a string, rather than parsing from a string, right?

@HyukjinKwon (Member Author):

Oh, sorry. That part is not from this PR. The parsing is here:

case _: TimestampType if options.dateFormat != null =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  options.dateFormat.parse(datum).getTime * 1000L
case _: TimestampType =>
  // This one will lose microseconds parts.
  // See https://issues.apache.org/jira/browse/SPARK-10681.
  DateTimeUtils.stringToTime(datum).getTime * 1000L
case _: DateType if options.dateFormat != null =>
  DateTimeUtils.millisToDays(options.dateFormat.parse(datum).getTime)
case _: DateType =>
  DateTimeUtils.millisToDays(DateTimeUtils.stringToTime(datum).getTime)

@HyukjinKwon (Member Author) Jul 18, 2016:

This was merged before, here, but the timezone was not taken into account.

So, this PR adds support for:

  • timezone for both reading and writing
  • dateFormat for writing

@HyukjinKwon (Member Author) commented Jul 19, 2016

I am closing this for now. I will open a new one, covering only the default format, later today, and I will cc you all as soon as I open it. Thanks!

@HyukjinKwon deleted the SPARK-16216 branch on January 2, 2018.