
Conversation

@MaxGekk (Member) commented Nov 1, 2019

What changes were proposed in this pull request?

In this PR, I propose to take the left bound (the from interval unit) into account and reset all unit values outside the specified range of interval units. Currently, IntervalUtils.fromDayTimeString() honors only the right bound, not the left one.

Note: the reset is not performed if spark.sql.dialect = PostgreSQL.

Why are the changes needed?

This fix makes fromDayTimeString() consistent in handling both bounds of the interval qualifier.

Does this PR introduce any user-facing change?

Yes, before:

spark-sql> SELECT interval '1 2:03:04' hour to minute;
interval 1 days 2 hours 3 minutes

After:

spark-sql> SELECT interval '1 2:03:04' hour to minute;
interval 2 hours 3 minutes

How was this patch tested?

  • Added new tests to IntervalUtilsSuite
  • Fixed expected results for literals.sql and interval.sql
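The proposed reset can be sketched as follows. This is a minimal Python illustration, not the actual Scala code in IntervalUtils; the `truncate` helper and unit names are hypothetical:

```python
# Day-time interval units, most to least significant.
UNITS = ["day", "hour", "minute", "second"]

def truncate(values, from_unit, to_unit):
    """Zero out every unit outside the [from_unit, to_unit] range."""
    lo, hi = UNITS.index(from_unit), UNITS.index(to_unit)
    return {u: (v if lo <= UNITS.index(u) <= hi else 0)
            for u, v in values.items()}

# '1 2:03:04' parses as day=1, hour=2, minute=3, second=4;
# with the qualifier HOUR TO MINUTE, day and second are reset.
print(truncate({"day": 1, "hour": 2, "minute": 3, "second": 4},
               "hour", "minute"))
# {'day': 0, 'hour': 2, 'minute': 3, 'second': 0}
```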

@SparkQA commented Nov 1, 2019

Test build #113085 has finished for PR 26358 at commit d7bc01c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented Nov 2, 2019

Test build #113125 has finished for PR 26358 at commit e860772.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 3, 2019

jenkins, retest this, please

@SparkQA commented Nov 3, 2019

Test build #113158 has finished for PR 26358 at commit e860772.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk (Member, Author) commented Nov 3, 2019

@srowen @dongjoon-hyun @cloud-fan Please have a look at this.

@cloud-fan (Contributor)

> the reset is not performed if spark.sql.dialect = PostgreSQL.

Do you mean pgsql has a different behavior?

@MaxGekk (Member, Author) commented Nov 4, 2019

> Do you mean pgsql has a different behavior?

It ignores from but takes to into account. Not sure whether it is a bug or a feature.

maxim=# select interval '5 12:40:30.999999999' hour to minute;
    interval
-----------------
 5 days 12:40:00
(1 row)

@cloud-fan (Contributor)
Does the SQL spec define the behavior? I tried Oracle, and it fails if the from/to field doesn't match the actual value.

@SparkQA commented Nov 4, 2019

Test build #113197 has finished for PR 26358 at commit d40e3af.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class GangliaReporter extends ScheduledReporter
  • public static class Builder
  • case class AlterTableDropPartitionStatement(

@srowen (Member) commented Nov 4, 2019

Pardon the dumb question, but what is 'hour to minute' supposed to mean? Select just the hours and minutes out of the interval? The pgsql implementation doesn't seem to do that, whether by bug or by design. Just trying to understand what we think the semantics are, and whether they need to be different in Spark.

@MaxGekk (Member, Author) commented Nov 4, 2019

> what is the 'hour to minute' supposed to mean? select just the hours and minutes out of the interval?

Correct, it means to extract an interval with hour and minute fields from a string.

@MaxGekk (Member, Author) commented Nov 4, 2019

I haven't found a precise description of the acceptable input string, but it seems the format should be defined by the interval qualifier. For example, if it is HOUR TO MINUTE, the input string must be in the format hh:mi, not any day-time string as we currently assume:

private val dayTimePattern =
"^([+|-])?((\\d+) )?((\\d+):)?(\\d+):(\\d+)(\\.(\\d+))?$".r
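To illustrate the point, the pattern can be transliterated to Python (a sketch; the sign class is simplified to `[+-]`) and probed with strings that a HOUR TO MINUTE qualifier arguably should reject:

```python
import re

# Python transliteration of the Scala dayTimePattern quoted above:
# optional sign, optional days, optional hours, then minutes:seconds
# with an optional fraction. The pattern carries no knowledge of the
# interval qualifier, so it accepts any day-time shape.
day_time_pattern = re.compile(r"^([+-])?((\d+) )?((\d+):)?(\d+):(\d+)(\.(\d+))?$")

# Matches even though HOUR TO MINUTE only calls for hh:mi.
print(bool(day_time_pattern.match("1 2:03:04")))  # True
print(bool(day_time_pattern.match("2:03")))       # True
```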

@srowen (Member) commented Nov 4, 2019

Hm, none of those examples show 'extracting' a subset of the interval. It seems like a way to specify what the unit-less string means. If so, the pgsql example you have above ought to fail, as the string doesn't match the units.

It seems like the current behavior works like pgsql? If so, is that maybe an OK stance on the semantics in these 'incorrect' cases?

@MaxGekk (Member, Author) commented Nov 4, 2019

> if so, is that maybe an OK stance on the semantics in these 'incorrect' cases?

I am not sure that fully following other DBMSs in both bugs and features is the right decision. From the user's point of view, if I specify hour to minute, I would expect only hours and minutes in the result column, not seconds, days, or anything else.

@cloud-fan (Contributor)
Since this is only for interval literals, I think it doesn't hurt if we fail on an invalid string format. The Oracle behavior looks more reasonable to me; we can try more databases to see whether that's a common behavior.

@MaxGekk (Member, Author) commented Nov 8, 2019

For example, MySQL is strict as proposed in the PR:

mysql> SELECT DATE_SUB("1998-01-01 00:00:00", INTERVAL "1 1:1" HOUR_MINUTE);
+---------------------------------------------------------------+
| DATE_SUB("1998-01-01 00:00:00", INTERVAL "1 1:1" HOUR_MINUTE) |
+---------------------------------------------------------------+
| NULL                                                          |
+---------------------------------------------------------------+
1 row in set, 1 warning (0.00 sec)

mysql> SELECT DATE_SUB("1998-01-01 00:00:00", INTERVAL "1:1" HOUR_MINUTE);
+-------------------------------------------------------------+
| DATE_SUB("1998-01-01 00:00:00", INTERVAL "1:1" HOUR_MINUTE) |
+-------------------------------------------------------------+
| 1997-12-31 22:59:00                                         |
+-------------------------------------------------------------+
1 row in set (0.00 sec)

@cloud-fan @srowen What can I do to move this PR forward, besides resolving the conflicts?

@srowen (Member) commented Nov 8, 2019

OK, so it would be reasonable to fail / return null, to match Oracle? There doesn't seem to be consistent behavior out there. I guess I prefer to fail this if it doesn't make sense semantically.

…y-time-string

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
#	sql/core/src/test/resources/sql-tests/results/literals.sql.out
@SparkQA commented Nov 8, 2019

Test build #113472 has finished for PR 26358 at commit a78dcc3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class DateTimeConstants
  • public final class CalendarInterval implements Serializable, Comparable<CalendarInterval>
  • .doc(\"Comma-separated list of class names implementing \" +
  • sealed abstract class PluginContainer
  • class PluginMetricsSource(
  • case class PluginMessage(pluginName: String, message: AnyRef)
  • abstract class IntervalNumOperation(
  • case class MultiplyInterval(interval: Expression, num: Expression)
  • case class DivideInterval(interval: Expression, num: Expression)
  • case class AlterTableAddPartitionStatement(
  • case class AlterTableSerDePropertiesStatement(
  • case class ShowCurrentNamespaceStatement() extends ParsedStatement
  • case class ShowCurrentNamespace(catalogManager: CatalogManager) extends Command
  • case class LocalShuffleReaderExec(child: SparkPlan) extends UnaryExecNode
  • case class ShowCurrentNamespaceExec(
  • class ContinuousRecordEndpoint(buckets: Seq[Seq[UnsafeRow]], lock: Object)

@cloud-fan (Contributor)
So let's keep the existing behavior when the pgsql dialect is set, and fail for invalid bounds otherwise.
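The behavior agreed on here can be sketched as follows. This is an illustrative Python sketch under assumed names; `check_bounds` and the dialect strings are hypothetical, not Spark's actual API:

```python
UNITS = ["day", "hour", "minute", "second"]

def check_bounds(values, from_unit, to_unit, dialect="Spark"):
    """Sketch of the agreed rule: keep lenient parsing under the
    PostgreSQL dialect; otherwise reject non-zero values that fall
    outside the [from_unit, to_unit] range of the qualifier."""
    if dialect == "PostgreSQL":
        return values  # lenient, matching pgsql's observed behavior
    lo, hi = UNITS.index(from_unit), UNITS.index(to_unit)
    for unit, value in values.items():
        if value != 0 and not (lo <= UNITS.index(unit) <= hi):
            raise ValueError(
                f"interval field '{unit}' is outside the range "
                f"{from_unit} to {to_unit}")
    return values

parsed = {"day": 1, "hour": 2, "minute": 3, "second": 4}
print(check_bounds(parsed, "hour", "minute", dialect="PostgreSQL"))  # lenient
try:
    check_bounds(parsed, "hour", "minute")  # default dialect: fails
except ValueError as e:
    print(e)
```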

…y-time-string

# Conflicts:
#	sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
#	sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
Review thread on the line: withSQLConf(SQLConf.DIALECT.key -> "Spark") {
@cloud-fan (Contributor)

can we have a test case to show the error behavior?

@MaxGekk (Member, Author)

Just to be clear, this PR only truncates results according to the specified bounds; in particular, it now takes the left bound into account. Could you explain what you mean by error behavior?

@cloud-fan (Contributor)

I think the conclusion is to fail for invalid bounds?

@MaxGekk (Member, Author)

Do you mean when a string doesn't match the specified bounds, like in literals.sql: https://github.com/apache/spark/pull/26358/files#diff-4f9e28af8e9fcb40a8a99b4e49f3b9b2R424?

@cloud-fan (Contributor)

yea, like what Oracle does.

@MaxGekk (Member, Author)

Here is the PR with strict parsing: #26473

@SparkQA commented Nov 11, 2019

Test build #113576 has finished for PR 26358 at commit 296de1f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.
