8 changes: 4 additions & 4 deletions docs/sql-data-sources-csv.md
Original file line number Diff line number Diff line change
Expand Up @@ -109,9 +109,9 @@ Data source options of CSV can be set via:
<td>read</td>
</tr>
<tr>
<td><code>inferDate</code></td>
<td>false</td>
<td>Whether or not to infer columns that satisfy the <code>dateFormat</code> option as <code>Date</code>. Requires <code>inferSchema</code> to be <code>true</code>. When <code>false</code>, columns with dates will be inferred as <code>String</code> (or as <code>Timestamp</code> if it fits the <code>timestampFormat</code>).</td>
<td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
Contributor: "contain dates or timestamps" reads a bit confusing. How about just "... infer string columns as Date ..."?

Contributor: How to understand "failed to be parsed by the respective formatter"?

Contributor (author): Let me update.

Contributor: Suggested change:
- <td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>
+ <td>Attempts to infer string columns that contain dates or timestamps as <code>Date</code> if the values satisfy <code>dateFormat</code> option and failed to be parsed by the respective formatter during schema inference (<code>inferSchema</code>). When used in conjunction with a user-provided schema, attempts to parse timestamp columns as dates using <code>dateFormat</code> if they fail to conform to <code>timestampFormat</code>, the parsed values will be cast to timestamp type afterwards.</td>

Contributor (author): Updated, thanks.

<td>read</td>
</tr>
<tr>
Expand Down Expand Up @@ -176,8 +176,8 @@ Data source options of CSV can be set via:
</tr>
<tr>
<td><code>enableDateTimeParsingFallback</code></td>
<td>Enabled if the time parser policy is legacy or no custom date or timestamp pattern was provided</td>
<td>Allows to fall back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
<td>Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided.</td>
<td>Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
<td>read</td>
</tr>
<tr>
Expand Down
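To make the documented inference rules concrete, here is a minimal sketch of how a column could be classified under this option. This is a hypothetical standalone helper, not Spark's actual inference code, and it assumes a fixed `dateFormat` of `yyyy-MM-dd` and `timestampFormat` of `yyyy-MM-dd HH:mm:ss`:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.List;

// Hypothetical helper mirroring the option description above: a column is
// Date only when inferDate is on and every entry matches dateFormat;
// otherwise Timestamp if every entry matches timestampFormat; else String.
public class FieldTypeSketch {
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");

    static boolean parses(String v, DateTimeFormatter f, boolean asDate) {
        try {
            if (asDate) LocalDate.parse(v, f);
            else LocalDateTime.parse(v, f);
            return true;
        } catch (RuntimeException e) {
            return false;
        }
    }

    public static String inferColumn(List<String> values, boolean inferDate) {
        boolean allDates = values.stream().allMatch(v -> parses(v, DATE, true));
        boolean allTs = values.stream().allMatch(v -> parses(v, TS, false));
        if (inferDate && allDates) return "date";  // every entry satisfied dateFormat
        if (allTs) return "timestamp";             // every entry satisfied timestampFormat
        return "string";                           // mixed or unparseable values
    }
}
```

Note that with `inferDate` disabled, a column of date-only strings stays `String` here, matching the old behavior the docs describe.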
4 changes: 2 additions & 2 deletions docs/sql-data-sources-json.md
Expand Up @@ -204,8 +204,8 @@ Data source options of JSON can be set via:
</tr>
<tr>
<td><code>enableDateTimeParsingFallback</code></td>
<td>Enabled if the time parser policy is legacy or no custom date or timestamp pattern was provided</td>
<td>Allows to fall back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
<td>Enabled if the time parser policy has legacy settings or if no custom date or timestamp pattern was provided.</td>
<td>Allows falling back to the backward compatible (Spark 1.x and 2.0) behavior of parsing dates and timestamps if values do not match the set patterns.</td>
<td>read</td>
</tr>
<tr>
Expand Down
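The fallback behavior described for `enableDateTimeParsingFallback` can be sketched outside Spark as "strict pattern first, lenient legacy parse second". This is an illustrative approximation, not Spark's internal code; the strict pattern `yyyy-MM-dd'T'HH:mm:ss` and the use of `Timestamp.valueOf` as the lenient stand-in are assumptions:

```java
import java.sql.Timestamp;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;
import java.util.Optional;

// Illustrative sketch: try the configured strict pattern; only when the
// fallback is enabled, retry with lenient Timestamp.valueOf parsing, which
// roughly resembles the permissive Spark 1.x/2.0 behavior.
public class FallbackParseSketch {
    static final DateTimeFormatter STRICT = DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss");

    public static Optional<Timestamp> parse(String value, boolean enableFallback) {
        try {
            return Optional.of(Timestamp.valueOf(LocalDateTime.parse(value, STRICT)));
        } catch (RuntimeException strictFailure) {
            if (!enableFallback) return Optional.empty();
            try {
                // Lenient JDBC escape format: yyyy-[m]m-[d]d hh:mm:ss[.f...]
                return Optional.of(Timestamp.valueOf(value));
            } catch (RuntimeException fallbackFailure) {
                return Optional.empty();
            }
        }
    }
}
```

With the fallback disabled, a value that misses the strict pattern simply fails to parse, which is why the option defaults depend on whether a custom pattern was provided.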
Expand Up @@ -149,14 +149,20 @@ class CSVOptions(
val locale: Locale = parameters.get("locale").map(Locale.forLanguageTag).getOrElse(Locale.US)

/**
* Infer columns with all valid date entries as date type (otherwise inferred as timestamp type).
* Disabled by default for backwards compatibility and performance. When enabled, date entries in
* timestamp columns will be cast to timestamp upon parsing. Not compatible with
* legacyTimeParserPolicy == LEGACY since legacy date parser will accept extra trailing characters
* Infer columns with all valid date entries as date type (otherwise inferred as timestamp type)
* if schema inference is enabled. When being used with user-provided schema, tries to parse
* timestamp values as dates if the values do not conform to the timestamp formatter before
* falling back to the backward compatible parsing - the parsed values will be cast to timestamp
* afterwards.
*
* Disabled by default for backwards compatibility and performance.
*
* Not compatible with legacyTimeParserPolicy == LEGACY since legacy date parser will accept
* extra trailing characters.
*/
val inferDate = {
val inferDateFlag = getBool("inferDate")
if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY && inferDateFlag) {
if (inferDateFlag && SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
throw QueryExecutionErrors.inferDateWithLegacyTimeParserError()
}
inferDateFlag
Expand Down
Expand Up @@ -254,7 +254,7 @@ class UnivocityParser(
try {
timestampNTZFormatter.parseWithoutTimeZone(datum, false)
} catch {
case NonFatal(e) if (options.inferDate) =>
case NonFatal(e) if options.inferDate =>
daysToMicros(dateFormatter.parse(datum), TimeZoneUTC.toZoneId)
}
}
Expand Down
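The catch branch above can be reproduced in isolation: parse the value as a timestamp first, and only when that fails and `inferDate` is on, parse it as a date and widen the day count to microseconds since the epoch, analogous to what `daysToMicros` does. This is a standalone sketch with assumed patterns, not the UnivocityParser code itself:

```java
import java.time.LocalDate;
import java.time.LocalDateTime;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Sketch of the fallback path: timestamp parse first; on failure, if
// inferDate is enabled, treat the value as a date at UTC midnight and
// convert epoch days to epoch microseconds.
public class DateWideningSketch {
    static final DateTimeFormatter TS = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss");
    static final DateTimeFormatter DATE = DateTimeFormatter.ofPattern("yyyy-MM-dd");
    static final long MICROS_PER_DAY = 24L * 60 * 60 * 1_000_000L;

    public static long parseMicros(String datum, boolean inferDate) {
        try {
            return LocalDateTime.parse(datum, TS).toEpochSecond(ZoneOffset.UTC) * 1_000_000L;
        } catch (RuntimeException e) {
            if (!inferDate) throw e; // without inferDate the original failure propagates
            return LocalDate.parse(datum, DATE).toEpochDay() * MICROS_PER_DAY;
        }
    }
}
```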
Expand Up @@ -2840,6 +2840,42 @@ abstract class CSVSuite
}
}

test("SPARK-39904: Parse incorrect timestamp values with inferDate=true") {
withTempPath { path =>
Seq(
"2020-02-01 12:34:56",
"2020-02-02",
"invalid"
).toDF()
.repartition(1)
.write.text(path.getAbsolutePath)

val schema = new StructType()
.add("ts", TimestampType)

val output = spark.read
.schema(schema)
.option("inferDate", "true")
.csv(path.getAbsolutePath)

if (SQLConf.get.legacyTimeParserPolicy == LegacyBehaviorPolicy.LEGACY) {
val msg = intercept[IllegalArgumentException] {
output.collect()
}.getMessage
assert(msg.contains("CANNOT_INFER_DATE"))
} else {
checkAnswer(
output,
Seq(
Row(Timestamp.valueOf("2020-02-01 12:34:56")),
Row(Timestamp.valueOf("2020-02-02 00:00:00")),
Row(null)
)
)
}
}
}

test("SPARK-39731: Correctly parse dates and timestamps with yyyyMMdd pattern") {
withTempPath { path =>
Seq(
Expand Down