-
Notifications
You must be signed in to change notification settings - Fork 29.1k
[SPARK-29369][SQL] Support string intervals without the interval prefix
#26079
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
6d0564c
60d7d34
9253cec
eecfbf4
e254ee5
813897b
e2c1352
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -18,7 +18,6 @@ | |
| package org.apache.spark.unsafe.types; | ||
|
|
||
| import java.io.Serializable; | ||
| import java.util.Locale; | ||
| import java.util.regex.Matcher; | ||
| import java.util.regex.Pattern; | ||
|
|
||
|
|
@@ -73,45 +72,53 @@ private static long toLong(String s) { | |
| * This method is case-insensitive. | ||
| */ | ||
| public static CalendarInterval fromString(String s) { | ||
| if (s == null) { | ||
| return null; | ||
| } | ||
| s = s.trim(); | ||
| Matcher m = p.matcher(s); | ||
| if (!m.matches() || s.compareToIgnoreCase("interval") == 0) { | ||
| try { | ||
| return fromCaseInsensitiveString(s); | ||
| } catch (IllegalArgumentException e) { | ||
| return null; | ||
| } else { | ||
| long months = toLong(m.group(1)) * 12 + toLong(m.group(2)); | ||
| long microseconds = toLong(m.group(3)) * MICROS_PER_WEEK; | ||
| microseconds += toLong(m.group(4)) * MICROS_PER_DAY; | ||
| microseconds += toLong(m.group(5)) * MICROS_PER_HOUR; | ||
| microseconds += toLong(m.group(6)) * MICROS_PER_MINUTE; | ||
| microseconds += toLong(m.group(7)) * MICROS_PER_SECOND; | ||
| microseconds += toLong(m.group(8)) * MICROS_PER_MILLI; | ||
| microseconds += toLong(m.group(9)); | ||
| return new CalendarInterval((int) months, microseconds); | ||
| } | ||
| } | ||
|
|
||
| /** | ||
| * Convert a string to CalendarInterval. Unlike fromString, this method can handle | ||
| * Convert a string to CalendarInterval. This method can handle | ||
| * strings without the `interval` prefix and throws IllegalArgumentException | ||
| * when the input string is not a valid interval. | ||
| * | ||
| * @throws IllegalArgumentException if the string is not a valid internal. | ||
| */ | ||
| public static CalendarInterval fromCaseInsensitiveString(String s) { | ||
| if (s == null || s.trim().isEmpty()) { | ||
| throw new IllegalArgumentException("Interval cannot be null or blank."); | ||
| if (s == null) { | ||
| throw new IllegalArgumentException("Interval cannot be null"); | ||
| } | ||
| String sInLowerCase = s.trim().toLowerCase(Locale.ROOT); | ||
| String interval = | ||
| sInLowerCase.startsWith("interval ") ? sInLowerCase : "interval " + sInLowerCase; | ||
| CalendarInterval cal = fromString(interval); | ||
| if (cal == null) { | ||
| String trimmed = s.trim(); | ||
| if (trimmed.isEmpty()) { | ||
| throw new IllegalArgumentException("Interval cannot be blank"); | ||
| } | ||
| String prefix = "interval"; | ||
| String intervalStr = trimmed; | ||
| // Checks the given interval string does not start with the `interval` prefix | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why not just call
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Yes, don't want to lower case the entire string and allocate memory for new one to only compare small prefix. |
||
| if (!intervalStr.regionMatches(true, 0, prefix, 0, prefix.length())) { | ||
|
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. what does this condition mean?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I added comments about this. |
||
| // Prepend `interval` if it does not present because | ||
| // the regular expression strictly require it. | ||
|
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have not figured out how to modify the regular expression to make the
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Probably, this needs this feature https://www.regular-expressions.info/branchreset.html which Java's regexps doesn't have.
Contributor
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. How about something like
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more.
Your code is more expensive because you lower case whole input string.
Here there is a problem with current regexp when you delete the anchor scala> import java.util.regex._
import java.util.regex._
scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile(unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res7: Boolean = falseIf we added it back: scala> val p = Pattern.compile("interval" + unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("interval 1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=interval(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,25 lastmatch=]
scala> m.matches()
res8: Boolean = trueit can match now. That's why I had to add the
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Can you just start the regex with
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. As far as I remember I tried this regex, and it didn't work. Have you tried it?
Member
Author
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I have checked, it doesn't work: scala> def unitRegex(unit: String) = "(?:\\s+(-?\\d+)\\s+" + unit + "s?)?"
unitRegex: (unit: String)String
scala> val p = Pattern.compile("(interval)?" + unitRegex("year") + unitRegex("month") +
| unitRegex("week") + unitRegex("day") + unitRegex("hour") + unitRegex("minute") +
| unitRegex("second") + unitRegex("millisecond") + unitRegex("microsecond"),
| Pattern.CASE_INSENSITIVE)
p: java.util.regex.Pattern = (interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)?
scala> val m = p.matcher("1 month 1 second")
m: java.util.regex.Matcher = java.util.regex.Matcher[pattern=(interval)?(?:\s+(-?\d+)\s+years?)?(?:\s+(-?\d+)\s+months?)?(?:\s+(-?\d+)\s+weeks?)?(?:\s+(-?\d+)\s+days?)?(?:\s+(-?\d+)\s+hours?)?(?:\s+(-?\d+)\s+minutes?)?(?:\s+(-?\d+)\s+seconds?)?(?:\s+(-?\d+)\s+milliseconds?)?(?:\s+(-?\d+)\s+microseconds?)? region=0,16 lastmatch=]
scala> m.matches()
res0: Boolean = false
Member
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. I tried simply |
||
| intervalStr = prefix + " " + trimmed; | ||
| } else if (intervalStr.length() == prefix.length()) { | ||
| throw new IllegalArgumentException("Interval string must have time units"); | ||
| } | ||
|
|
||
| Matcher m = p.matcher(intervalStr); | ||
| if (!m.matches()) { | ||
| throw new IllegalArgumentException("Invalid interval: " + s); | ||
| } | ||
| return cal; | ||
|
|
||
| long months = toLong(m.group(1)) * 12 + toLong(m.group(2)); | ||
| long microseconds = toLong(m.group(3)) * MICROS_PER_WEEK; | ||
| microseconds += toLong(m.group(4)) * MICROS_PER_DAY; | ||
| microseconds += toLong(m.group(5)) * MICROS_PER_HOUR; | ||
| microseconds += toLong(m.group(6)) * MICROS_PER_MINUTE; | ||
| microseconds += toLong(m.group(7)) * MICROS_PER_SECOND; | ||
| microseconds += toLong(m.group(8)) * MICROS_PER_MILLI; | ||
| microseconds += toLong(m.group(9)); | ||
| return new CalendarInterval((int) months, microseconds); | ||
| } | ||
|
|
||
| public static long toLongWithRange(String fieldName, | ||
|
|
||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it the only place we parse interval string? I thought we parse it with antlr parser.
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
antlr parser does this as well but it parses sql elements like
here is only the place where we parse string values:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This looks duplicated. Shall we add a
parseIntervalmethod to theParserInterfaceinterface and call the parser here?There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe something has been duplicated, and can be reused but this is heavy refactoring for this PR.
For instance,
AstBuilder.visitIntervalgets already split interval units butCalendarInterval.fromString()uses regular expression to parse & split:spark/common/unsafe/src/main/java/org/apache/spark/unsafe/types/CalendarInterval.java
Lines 50 to 53 in b103449
If you don't mind, I would try to do that in a separate PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This PR introduced code duplication #8034 for your code #7355 5 years ago.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And your regexp is not tolerant to the order of interval units, see:
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Let's keep them separate so far. And I will try to write flexible and common code in the near future for parsing string intervals that could handle other features found in #26055
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
SGTM