[SPARK-29369][SQL] Support string intervals without the interval prefix #26079
MaxGekk wants to merge 7 commits into apache:master
@cloud-fan @dongjoon-hyun @zsxwing May I ask you to take a look at this PR?
| * | ||
| * @throws IllegalArgumentException if the string is not a valid internal. | ||
| */ | ||
| public static CalendarInterval fromCaseInsensitiveString(String s) { |
Is this the only place where we parse interval strings? I thought we parsed them with the ANTLR parser.
The ANTLR parser does this as well, but it parses SQL elements like
spark-sql> select interval 10 days 1 second;
interval 1 weeks 3 days 1 seconds
This is the only place where we parse string values:
spark-sql> select interval 'interval 10 days 1 second';
interval 1 weeks 3 days 1 seconds
This looks duplicated. Shall we add a parseInterval method to the ParserInterface interface and call the parser here?
Maybe something is duplicated and can be reused, but that would be heavy refactoring for this PR.
For instance, AstBuilder.visitInterval receives interval units that are already split, whereas CalendarInterval.fromString() uses a regular expression to parse and split them.
If you don't mind, I would try to do that in a separate PR.
And your regexp does not tolerate a different order of interval units, see:
spark-sql> select interval 'interval 1 microsecond 2 months';
NULL
spark-sql> select interval 1 microsecond 2 months;
interval 2 months 1 microseconds
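The order sensitivity shown above can be reproduced outside Spark. The following is a minimal sketch, assuming the pattern is the concatenation of optional unit groups quoted elsewhere in this thread; the class and method names are hypothetical and this is not the code from CalendarInterval.

```java
import java.util.regex.Pattern;

final class UnitOrderSketch {
    // Units in the fixed order in which the concatenated pattern expects them.
    private static final String[] UNITS = {
        "year", "month", "week", "day", "hour",
        "minute", "second", "millisecond", "microsecond"
    };

    // One optional group per unit: whitespace, signed number, whitespace, unit name, optional "s".
    static String unitRegex(String unit) {
        return "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?";
    }

    // Builds "interval" followed by every unit group, in the order listed in UNITS.
    static Pattern fixedOrderPattern() {
        StringBuilder sb = new StringBuilder("interval");
        for (String unit : UNITS) {
            sb.append(unitRegex(unit));
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean matches(String s) {
        return fixedOrderPattern().matcher(s).matches();
    }

    public static void main(String[] args) {
        // Units in the expected order (months before microseconds).
        System.out.println(matches("interval 2 months 1 microsecond"));
        // Same units, reversed order.
        System.out.println(matches("interval 1 microsecond 2 months"));
    }
}
```

Because each unit group appears exactly once and in a fixed position, a string that lists a smaller unit before a larger one leaves trailing text that no remaining group can consume, so the whole match fails.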
Let's keep them separate for now. I will try to write flexible, common code in the near future for parsing string intervals that can also handle the other features found in #26055.
| } | ||
| String prefix = "interval"; | ||
| String intervalStr = trimmed; | ||
| if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) { |
What does this condition mean?
I added comments about this.
| // Checks the given interval string does not start with the `interval` prefix | ||
| if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) { | ||
| // Prepend `interval` if it does not present because | ||
| // the regular expression strictly require it. |
I have not figured out how to modify the regular expression to make the interval prefix optional.
This probably needs the branch reset feature (https://www.regular-expressions.info/branchreset.html), which Java's regexps don't have.
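The prefix handling discussed in this thread can be sketched in isolation. This is a minimal illustration, assuming the input has already been trimmed and that the downstream regular expression requires the leading `interval`; the class and the helper `withIntervalPrefix` are hypothetical names, not part of Spark.

```java
final class IntervalPrefixSketch {
    private static final String PREFIX = "interval";

    // Hypothetical helper: returns the trimmed input, with "interval " prepended
    // when the input does not already start with the prefix (ignoring case).
    static String withIntervalPrefix(String s) {
        String trimmed = s.trim();
        // regionMatches(ignoreCase = true, ...) compares only the first
        // PREFIX.length() characters and returns false for shorter inputs.
        if (!trimmed.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
            return PREFIX + " " + trimmed;
        }
        return trimmed;
    }

    public static void main(String[] args) {
        System.out.println(withIntervalPrefix("1 month 1 second"));
        System.out.println(withIntervalPrefix("INTERVAL 10 days"));
    }
}
```

Prepending the prefix keeps the existing regular expression untouched, at the cost of one extra string allocation for inputs that lack the prefix.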
How about something like
String intervalStr = trimmed.toLowerCase();
if (intervalStr.startsWith("interval")) {
  intervalStr = intervalStr.substring("interval".length());
}
// parse the interval string assuming there is no leading "interval"
String intervalStr = trimmed.toLowerCase();
Your code is more expensive because it lower-cases the whole input string.
// parse the interval string assuming there is no leading "interval"
There is a problem with the current regexp when you delete the "interval" anchor. Without this anchor, it cannot match valid inputs:
scala> import java.util.regex._
import java.util.regex._
scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile(unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res7: Boolean = false
If we add it back:
scala> val p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("interval 1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,25 lastmatch=]
scala> m.matches()
res8: Boolean = true
Now it matches. That's why I had to add the interval prefix instead of removing it.
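One plausible reading of the two REPL results above is that every unit group begins with `\s+`, so the first unit in the input must be preceded by whitespace. The sketch below, written in Java with hypothetical class and method names, checks that reading by feeding the prefix-less pattern an input with and without a leading space. It is an illustration only, not the code in this PR.

```java
import java.util.regex.Pattern;

final class LeadingWhitespaceSketch {
    private static final String[] UNITS = {
        "year", "month", "week", "day", "hour",
        "minute", "second", "millisecond", "microsecond"
    };

    // Same unit group as in the REPL session: it starts with mandatory whitespace.
    static String unitRegex(String unit) {
        return "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?";
    }

    // Builds the given literal prefix followed by all unit groups.
    static Pattern unitsPattern(String prefix) {
        StringBuilder sb = new StringBuilder(prefix);
        for (String unit : UNITS) {
            sb.append(unitRegex(unit));
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE);
    }

    static boolean matches(String prefix, String input) {
        return unitsPattern(prefix).matcher(input).matches();
    }

    public static void main(String[] args) {
        // No prefix, no leading whitespace: the first group cannot start.
        System.out.println(matches("", "1 month 1 second"));
        // No prefix, but a leading space satisfies the first \s+.
        System.out.println(matches("", " 1 month 1 second"));
        // With the prefix, the space after "interval" plays the same role.
        System.out.println(matches("interval", "interval 1 month 1 second"));
    }
}
```

Under this reading, the `interval` anchor itself is not what makes the match succeed; the whitespace that follows it is.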
Could the regex just start with (interval)? so that the first matching group is either null or "interval", and the rest matches the same way?
As far as I remember, I tried this regex and it didn't work. Have you tried it?
I have checked; it doesn't work:
scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile("(interval)?" + unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res0: Boolean = false
I tried simply "(interval)?(.+)".r and it worked as expected on inputs like "abc" and "interval abc". It's a toy example, and I'm not sure whether it interacts unexpectedly with the rest of the matching. No big deal, just leave it.
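The difference between the toy pattern and the full pattern can be seen side by side. The sketch below is a Java illustration with hypothetical names: it assumes the unit groups from the REPL session above and compares them, behind an optional `(interval)?` group, with the toy `(interval)?(.+)` pattern.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class OptionalPrefixSketch {
    private static final String[] UNITS = {
        "year", "month", "week", "day", "hour",
        "minute", "second", "millisecond", "microsecond"
    };

    static String unitRegex(String unit) {
        return "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?";
    }

    // Full pattern with the prefix wrapped in an optional capturing group.
    static boolean fullMatches(String input) {
        StringBuilder sb = new StringBuilder("(interval)?");
        for (String unit : UNITS) {
            sb.append(unitRegex(unit));
        }
        return Pattern.compile(sb.toString(), Pattern.CASE_INSENSITIVE)
            .matcher(input).matches();
    }

    // Toy pattern: returns {group 1, group 2}, or null when the input does not match.
    static String[] toyGroups(String input) {
        Matcher m = Pattern.compile("(interval)?(.+)").matcher(input);
        return m.matches() ? new String[] {m.group(1), m.group(2)} : null;
    }

    public static void main(String[] args) {
        System.out.println(fullMatches("1 month 1 second"));
        System.out.println(fullMatches("interval 1 month 1 second"));
        System.out.println(java.util.Arrays.toString(toyGroups("abc")));
        System.out.println(java.util.Arrays.toString(toyGroups("interval abc")));
    }
}
```

The toy pattern succeeds because `(.+)` accepts any first character, whereas each unit group in the full pattern still demands leading whitespace even when the optional prefix is skipped.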
jenkins, retest this, please
Test build #111945 has finished for PR 26079 at commit
Test build #111966 has finished for PR 26079 at commit
@cloud-fan @dongjoon-hyun @srowen @HyukjinKwon May I ask you to have a look at the PR?
| } | ||
| String prefix = "interval"; | ||
| String intervalStr = trimmed; | ||
| // Checks the given interval string does not start with the `interval` prefix |
Why not just call trimmed.toLowerCase.startsWith("interval")? For perf reasons?
Yes, I don't want to lower-case the entire string and allocate memory for a new one only to compare a small prefix.
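The two prefix checks compared in this exchange can be put next to each other. This is a small sketch with hypothetical method names. It assumes `Locale.ROOT` for the lower-casing variant so that the result does not depend on the default locale, and it makes no claim about measured performance.

```java
import java.util.Locale;

final class PrefixCheckSketch {
    private static final String PREFIX = "interval";

    // Compares only the first PREFIX.length() characters, ignoring case,
    // without creating a new string.
    static boolean startsWithPrefixNoCopy(String s) {
        return s.regionMatches(true, 0, PREFIX, 0, PREFIX.length());
    }

    // Lower-cases the input first (which may allocate a new string),
    // then checks the prefix.
    static boolean startsWithPrefixLowerCased(String s) {
        return s.toLowerCase(Locale.ROOT).startsWith(PREFIX);
    }

    public static void main(String[] args) {
        String[] samples = {"INTERVAL 1 day", "Interval 1 day", "1 day", "int", ""};
        for (String s : samples) {
            System.out.println("\"" + s + "\": "
                + startsWithPrefixNoCopy(s) + " / " + startsWithPrefixLowerCased(s));
        }
    }
}
```

On these ASCII samples both checks agree; the practical difference is that the first one inspects at most eight characters regardless of how long the input is.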
Thanks, merging to master!
What changes were proposed in this pull request?
In the PR, I propose to move interval parsing to CalendarInterval.fromCaseInsensitiveString(), which throws an IllegalArgumentException for invalid strings, and to reuse it from CalendarInterval.fromString(). The latter only catches IllegalArgumentException and returns NULL for invalid interval strings. This allows interval strings without the interval prefix to be supported when casting strings to intervals and in the interval type constructor, because both use fromString() for parsing string intervals. For example:
Why are the changes needed?
To maintain feature parity with PostgreSQL which supports interval strings without prefix:
and to improve Spark SQL UX.
Does this PR introduce any user-facing change?
Yes. Previously, parsing interval strings without the interval prefix gave NULL:
After:
How was this patch tested?
CalendarIntervalSuite.java, CastSuite, ExpressionParserSuite
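The relationship between the throwing and the NULL-returning entry points described in the PR summary can be sketched as follows. Everything here is hypothetical and heavily simplified: the grammar accepts only an optional `interval` prefix followed by a number of days, and the names `daysOrThrow` and `daysOrNull` merely stand in for the roles of fromCaseInsensitiveString() and fromString(). It is not Spark's implementation.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class IntervalEntryPointsSketch {
    // Hypothetical toy grammar: optional "interval" prefix, then "<n> day" or "<n> days".
    private static final Pattern DAYS =
        Pattern.compile("(?:interval\\s+)?(-?\\d+)\\s+days?", Pattern.CASE_INSENSITIVE);

    // Strict entry point: throws IllegalArgumentException for invalid input.
    static long daysOrThrow(String s) {
        if (s == null) {
            throw new IllegalArgumentException("Interval string cannot be null");
        }
        Matcher m = DAYS.matcher(s.trim());
        if (!m.matches()) {
            throw new IllegalArgumentException("Invalid interval: " + s);
        }
        // NumberFormatException (on overflow) is a subclass of IllegalArgumentException.
        return Long.parseLong(m.group(1));
    }

    // Lenient entry point: reuses the strict one and maps failures to null.
    static Long daysOrNull(String s) {
        try {
            return daysOrThrow(s);
        } catch (IllegalArgumentException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(daysOrNull("10 days"));
        System.out.println(daysOrNull("interval 1 day"));
        System.out.println(daysOrNull("not an interval"));
    }
}
```

Keeping the parsing logic in the strict method and wrapping it in the lenient one means callers that want NULL semantics and callers that want an error message share a single parser.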