Skip to content

Comments

[SPARK-29369][SQL] Support string intervals without the interval prefix#26079

Closed
MaxGekk wants to merge 7 commits intoapache:masterfrom
MaxGekk:interval-str-without-prefix
Closed

[SPARK-29369][SQL] Support string intervals without the interval prefix#26079
MaxGekk wants to merge 7 commits intoapache:masterfrom
MaxGekk:interval-str-without-prefix

Conversation

@MaxGekk
Copy link
Member

@MaxGekk MaxGekk commented Oct 10, 2019

What changes were proposed in this pull request?

In the PR, I propose to move interval parsing to CalendarInterval.fromCaseInsensitiveString() which throws an IllegalArgumentException for invalid strings, and reuse it from CalendarInterval.fromString(). The former one handles IllegalArgumentException only and returns NULL for invalid interval strings. This will allow to support interval strings without the interval prefix in casting strings to intervals and in interval type constructor because they use fromString() for parsing string intervals.

For example:

spark-sql> select cast('1 year 10 days' as interval);
interval 1 years 1 weeks 3 days
spark-sql> SELECT INTERVAL '1 YEAR 10 DAYS';
interval 1 years 1 weeks 3 days

Why are the changes needed?

To maintain feature parity with PostgreSQL which supports interval strings without prefix:

# select interval '2 months 1 microsecond';
        interval        
------------------------
 2 mons 00:00:00.000001

and to improve Spark SQL UX.

Does this PR introduce any user-facing change?

Yes, previously parsing of interval strings without interval gives NULL:

spark-sql> select interval '2 months 1 microsecond';
NULL

After:

spark-sql> select interval '2 months 1 microsecond';
interval 2 months 1 microseconds

How was this patch tested?

  • Added new tests to CalendarIntervalSuite.java
  • A test for casting strings to intervals in CastSuite
  • Test for interval type constructor from strings in ExpressionParserSuite

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 10, 2019

@cloud-fan @dongjoon-hyun @zsxwing May I ask you to take a look at this PR.

*
* @throws IllegalArgumentException if the string is not a valid internal.
*/
public static CalendarInterval fromCaseInsensitiveString(String s) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it the only place we parse interval string? I thought we parse it with antlr parser.

Copy link
Member Author

@MaxGekk MaxGekk Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

antlr parser does this as well but it parses sql elements like

spark-sql> select interval 10 days 1 second;
interval 1 weeks 3 days 1 seconds

here is only the place where we parse string values:

spark-sql> select interval 'interval 10 days 1 second';
interval 1 weeks 3 days 1 seconds

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks duplicated. Shall we add a parseInterval method to the ParserInterface interface and call the parser here?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe something has been duplicated, and can be reused but this is heavy refactoring for this PR.

For instance, AstBuilder.visitInterval gets already split interval units but CalendarInterval.fromString() uses regular expression to parse & split:

private static Pattern p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
Pattern.CASE_INSENSITIVE);

If you don't mind, I would try to do that in a separate PR.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR introduced code duplication #8034 for your code #7355 5 years ago.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

And your regexp is not tolerant to the order of interval units, see:

spark-sql> select interval 'interval 1 microsecond 2 months';
NULL
spark-sql> select interval 1 microsecond 2 months;
interval 2 months 1 microseconds

Copy link
Member Author

@MaxGekk MaxGekk Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's keep them separate so far. And I will try to write flexible and common code in the near future for parsing string intervals that could handle other features found in #26055

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM

}
String prefix = "interval";
String intervalStr = trimmed;
if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what does this condition mean?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added comments about this.

// Checks the given interval string does not start with the `interval` prefix
if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) {
// Prepend `interval` if it does not present because
// the regular expression strictly require it.
Copy link
Member Author

@MaxGekk MaxGekk Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have not figured out how to modify the regular expression to make the interval prefix optional.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Probably, this needs this feature https://www.regular-expressions.info/branchreset.html which Java's regexps doesn't have.

Copy link
Contributor

@cloud-fan cloud-fan Oct 10, 2019

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How about something like

String intervalStr = trimmed.toLowerCase();
if (intervalStr.startsWith("interval")) {
  intervalStr = intervalStr.drop(8)
}
// parse the interval string assuming there is no leading "interval"

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

String intervalStr = trimmed.toLowerCase();

Your code is more expensive because you lower case whole input string.

// parse the interval string assuming there is no leading "interval"

Here there is a problem with current regexp when you delete the anchor "interval". Without this anchor, it cannot match to valid inputs:

scala> import java.util.regex._
import java.util.regex._

scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String

scala> val p = Pattern.compile(unitRegex("year") + unitRegex("month") +
     |     unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     |     unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     |     Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?

scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]

scala> m.matches()
res7: Boolean = false

If we added it back:

scala> val p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
     |     unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     |     unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     |     Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?

scala> val m = p.matcher("interval 1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,25 lastmatch=]

scala> m.matches()
res8: Boolean = true

it can match now. That's why I had to add the interval prefix instead of removing it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you just start the regex with (interval)?? then the first matching group is either null or "interval", and the rest should match the same way?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As far as I remember I tried this regex, and it didn't work. Have you tried it?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have checked, it doesn't work:

scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String

scala> val p = Pattern.compile("(interval)?" + unitRegex("year") + unitRegex("month") +
     |     unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
     |     unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
     |     Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?

scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]

scala> m.matches()
res0: Boolean = false

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried simply "(interval)?(.+)".r and it worked as expected on inputs like "abc" and "interval abc". It's a toy example and not sure if it interacts unexpectedly with the rest of the matching. no big deal, just leave it.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 12, 2019

jenkins, retest this, please

@SparkQA
Copy link

SparkQA commented Oct 12, 2019

Test build #111945 has finished for PR 26079 at commit 813897b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Oct 12, 2019

Test build #111966 has finished for PR 26079 at commit e2c1352.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Copy link
Member Author

MaxGekk commented Oct 13, 2019

@cloud-fan @dongjoon-hyun @srowen @HyukjinKwon May I ask you to have a look at the PR.

}
String prefix = "interval";
String intervalStr = trimmed;
// Checks the given interval string does not start with the `interval` prefix
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not just call trimmed.toLowerCase.startsWith("interval")? For perf reasons?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, don't want to lower case the entire string and allocate memory for new one to only compare small prefix.

@cloud-fan cloud-fan closed this in da576a7 Oct 14, 2019
@cloud-fan
Copy link
Contributor

thanks, merging to master!

@MaxGekk MaxGekk deleted the interval-str-without-prefix branch October 15, 2019 19:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants