[SPARK-29369][SQL] Support string intervals without the `interval` prefix by MaxGekk · Pull Request #26079 · apache/spark

MaxGekk · 2019-10-10T07:42:56Z

What changes were proposed in this pull request?

In this PR, I propose moving interval parsing to CalendarInterval.fromCaseInsensitiveString(), which throws an IllegalArgumentException for invalid strings, and reusing it from CalendarInterval.fromString(). The latter only catches the IllegalArgumentException and returns NULL for invalid interval strings. This makes it possible to support interval strings without the interval prefix when casting strings to intervals and in the interval type constructor, because both use fromString() to parse string intervals.

For example:

spark-sql> select cast('1 year 10 days' as interval);
interval 1 years 1 weeks 3 days
spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS';
interval 1 years 1 weeks 3 days
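
The split between a throwing parser and a NULL-returning wrapper described above can be sketched roughly as follows. This is a simplified illustration, not the code from the patch: the class name IntervalParsingSketch, the days-only toy pattern, and the Long return type are assumptions made to keep the example short. Only the delegation between the two methods mirrors the description.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical, simplified stand-in for CalendarInterval; it only understands days.
final class IntervalParsingSketch {
  // Toy pattern: an optional "interval" prefix followed by "<n> day(s)".
  private static final Pattern DAYS =
      Pattern.compile("(?:interval\\s+)?(-?\\d+)\\s+days?", Pattern.CASE_INSENSITIVE);

  // Strict variant: throws IllegalArgumentException for invalid input.
  static long fromCaseInsensitiveString(String s) {
    if (s == null) {
      throw new IllegalArgumentException("Interval string cannot be null");
    }
    Matcher m = DAYS.matcher(s.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("Invalid interval: " + s);
    }
    return Long.parseLong(m.group(1));
  }

  // Lenient variant: same parsing, but invalid input yields null instead of an exception.
  static Long fromString(String s) {
    try {
      return fromCaseInsensitiveString(s);
    } catch (IllegalArgumentException e) {
      return null;
    }
  }

  public static void main(String[] args) {
    System.out.println(fromString("10 days"));          // 10
    System.out.println(fromString("INTERVAL 10 DAYS")); // 10
    System.out.println(fromString("ten days"));         // null
  }
}
```

Callers that want NULL semantics, such as a cast, would use the lenient method, while callers that want an error message would use the strict one.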

Why are the changes needed?

To maintain feature parity with PostgreSQL which supports interval strings without prefix:

# select interval '2 months 1 microsecond';
        interval        
------------------------
 2 mons 00:00:00.000001

and to improve Spark SQL UX.

Does this PR introduce any user-facing change?

Yes. Previously, parsing an interval string without the interval prefix returned NULL:

spark-sql> select interval '2 months 1 microsecond';
NULL

After:

spark-sql> select interval '2 months 1 microsecond';
interval 2 months 1 microseconds

How was this patch tested?

Added new tests to CalendarIntervalSuite.java
A test for casting strings to intervals in CastSuite
Test for interval type constructor from strings in ExpressionParserSuite

MaxGekk · 2019-10-10T07:45:52Z

@cloud-fan @dongjoon-hyun @zsxwing May I ask you to take a look at this PR.

cloud-fan · 2019-10-10T11:25:35Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

   *
   * @throws IllegalArgumentException if the string is not a valid internal.
   */
  public static CalendarInterval fromCaseInsensitiveString(String s) {


Is it the only place we parse interval string? I thought we parse it with antlr parser.

The ANTLR parser does this as well, but it parses SQL elements such as:

spark-sql> select interval 10 days 1 second;
interval 1 weeks 3 days 1 seconds

This is the only place where we parse string values:

spark-sql> select interval 'interval 10 days 1 second';
interval 1 weeks 3 days 1 seconds

This looks duplicated. Shall we add a parseInterval method to the ParserInterface interface and call the parser here?

Maybe something is duplicated and could be reused, but that is too heavy a refactoring for this PR.

For instance, AstBuilder.visitInterval receives interval units that are already split, whereas CalendarInterval.fromString() uses a regular expression to parse and split them:

spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

Lines 50 to 53 in b103449

private static Pattern p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
  unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
  unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
  Pattern.CASE_INSENSITIVE);

If you don't mind, I would try to do that in a separate PR.

This PR introduced code duplication #8034 for your code #7355 5 years ago.

And your regexp is not tolerant to the order of interval units, see:

spark-sql> select interval 'interval 1 microsecond 2 months';
NULL
spark-sql> select interval 1 microsecond 2 months;
interval 2 months 1 microseconds
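
The order sensitivity noted above follows from the way the pattern concatenates one optional group per unit in a fixed sequence. A minimal sketch that rebuilds the pattern quoted in this thread and checks both orderings (the class name UnitOrderSketch and the matches helper are assumptions for illustration):

```java
import java.util.regex.Pattern;

final class UnitOrderSketch {
  // Same shape as the helper shown in the thread: one optional "<number> <unit>[s]" group.
  static String unitRegex(String unit) {
    return "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?";
  }

  // Units are concatenated in a fixed order, so the pattern only accepts that order.
  static final Pattern P = Pattern.compile(
      "interval" + unitRegex("year") + unitRegex("month") + unitRegex("week") +
      unitRegex("day") + unitRegex("hour") + unitRegex("minute") + unitRegex("second") +
      unitRegex("millisecond") + unitRegex("microsecond"),
      Pattern.CASE_INSENSITIVE);

  static boolean matches(String s) {
    return P.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(matches("interval 2 months 1 microsecond")); // true
    System.out.println(matches("interval 1 microsecond 2 months")); // false
  }
}
```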

Let's keep them separate for now. I will try to write flexible, common code for parsing string intervals in the near future, which could also handle other features found in #26055.

cloud-fan · 2019-10-10T12:59:59Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

+    }
+    String prefix = "interval";
+    String intervalStr = trimmed;
+    if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {


what does this condition mean?

I added comments about this.

MaxGekk · 2019-10-10T13:20:18Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

+    // Checks the given interval string does not start with the `interval` prefix
+    if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {
+      // Prepend `interval` if it does not present because
+      // the regular expression strictly require it.


I have not figured out how to modify the regular expression to make the interval prefix optional.

This probably needs the branch reset feature, https://www.regular-expressions.info/branchreset.html, which Java's regexps don't have.

How about something like

String intervalStr = trimmed.toLowerCase();
if (intervalStr.startsWith("interval")) {
  intervalStr = intervalStr.drop(8)
}
// parse the interval string assuming there is no leading "interval"

String intervalStr = trimmed.toLowerCase();

Your code is more expensive because it lowercases the whole input string.

// parse the interval string assuming there is no leading "interval"

There is a problem with the current regexp if you delete the "interval" anchor. Without this anchor, it cannot match valid inputs:

scala> import java.util.regex._
import java.util.regex._
scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile(unitRegex("year") + unitRegex("month") +
     | unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     | unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     | Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res7: Boolean = false

If we added it back:

scala> val p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
     | unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     | unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     | Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("interval 1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,25 lastmatch=]
scala> m.matches()
res8: Boolean = true

it can match now. That's why I had to add the interval prefix instead of removing it.
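
The approach of adding the prefix rather than removing it can be sketched as below. This is an illustration under assumptions, not the patch itself: the class name PrefixNormalizationSketch and the helpers withPrefix and isValid are hypothetical, and the pattern is cut down to two units for brevity.

```java
import java.util.regex.Pattern;

final class PrefixNormalizationSketch {
  private static final String PREFIX = "interval";

  // Reduced to two units for brevity; the pattern in the thread lists nine.
  private static final Pattern P = Pattern.compile(
      PREFIX + "(?:\\s+(-?\\d+)\\s+months?)?(?:\\s+(-?\\d+)\\s+seconds?)?",
      Pattern.CASE_INSENSITIVE);

  // Prepends "interval " when the trimmed input lacks the prefix.
  // regionMatches(true, ...) compares only the first PREFIX.length() characters,
  // ignoring case, without creating a lowercased copy of the whole string.
  static String withPrefix(String s) {
    String trimmed = s.trim();
    if (!trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
      return PREFIX + " " + trimmed;
    }
    return trimmed;
  }

  static boolean isValid(String s) {
    return P.matcher(withPrefix(s)).matches();
  }

  public static void main(String[] args) {
    System.out.println(withPrefix("  1 month 1 second ")); // interval 1 month 1 second
    System.out.println(isValid("1 month 1 second"));       // true
    System.out.println(isValid("INTERVAL 1 month"));       // true
    System.out.println(isValid("1 second 1 month"));       // false
  }
}
```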

Can you just start the regex with (interval)?? then the first matching group is either null or "interval", and the rest should match the same way?

As far as I remember I tried this regex, and it didn't work. Have you tried it?

I have checked, it doesn't work:

scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile("(interval)?" + unitRegex("year") + unitRegex("month") +
     | unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     | unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     | Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res0: Boolean = false

I tried simply "(interval)?(.+)".r and it worked as expected on inputs like "abc" and "interval abc". It's a toy example, and I'm not sure whether it interacts unexpectedly with the rest of the matching. No big deal, just leave it.
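
One likely reason an optional prefix alone does not help, which the thread does not spell out, is that every group produced by unitRegex begins with a mandatory \s+. An input such as "1 month 1 second" has no leading whitespace for the first unit to consume, so the match fails whether or not the prefix is optional. The sketch below checks this with a shortened two-unit pattern; the class name OptionalPrefixSketch and the reduced unit list are assumptions for illustration.

```java
import java.util.regex.Pattern;

final class OptionalPrefixSketch {
  // Same shape as the helper in the thread: note the leading, mandatory \s+.
  static String unitRegex(String unit) {
    return "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?";
  }

  // Optional prefix followed by two of the nine units, to keep the example short.
  static final Pattern P = Pattern.compile(
      "(interval)?" + unitRegex("month") + unitRegex("second"),
      Pattern.CASE_INSENSITIVE);

  static boolean matches(String s) {
    return P.matcher(s).matches();
  }

  public static void main(String[] args) {
    System.out.println(matches("interval 1 month 1 second")); // true
    System.out.println(matches("1 month 1 second"));          // false: no leading whitespace
    System.out.println(matches(" 1 month 1 second"));         // true: leading space satisfies \s+
  }
}
```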

MaxGekk · 2019-10-12T02:54:06Z

jenkins, retest this, please

SparkQA · 2019-10-12T04:59:51Z

Test build #111945 has finished for PR 26079 at commit 813897b.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-10-12T11:26:16Z

Test build #111966 has finished for PR 26079 at commit e2c1352.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

MaxGekk · 2019-10-13T19:29:46Z

@cloud-fan @dongjoon-hyun @srowen @HyukjinKwon May I ask you to have a look at the PR.

cloud-fan · 2019-10-14T14:56:41Z

common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java

+    }
+    String prefix = "interval";
+    String intervalStr = trimmed;
+    // Checks the given interval string does not start with the `interval` prefix


Why not just call trimmed.toLowerCase.startsWith("interval")? For perf reasons?

Yes, I don't want to lowercase the entire string and allocate memory for a new one only to compare a small prefix.

cloud-fan · 2019-10-14T15:34:34Z

thanks, merging to master!

MaxGekk added 5 commits October 9, 2019 23:51

Make fromString tolerant to prefix

6d0564c

Use Arrays.asList

60d7d34

Add test for missing prefix

9253cec

Test for cast

eecfbf4

Test for the interval constructor

e254ee5

cloud-fan reviewed Oct 10, 2019

Add comments

813897b

MaxGekk commented Oct 10, 2019

dongjoon-hyun added the SQL label Oct 10, 2019

Regenerate literals.sql.out

e2c1352

cloud-fan reviewed Oct 14, 2019

cloud-fan closed this in da576a7 Oct 14, 2019

MaxGekk deleted the interval-str-without-prefix branch October 15, 2019 19:56

5 participants
