-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-29605][SQL] Optimize string to interval casting #26256
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
Test build #112677 has finished for PR 26256 at commit
|
|
jenkins, retest this, please |
|
Test build #112684 has finished for PR 26256 at commit
|
|
@cloud-fan @srowen @dongjoon-hyun @maropu May I ask you to review this PR. |
| private final val minuteStr = UTF8String.fromString("minute") | ||
| private final val secondStr = UTF8String.fromString("second") | ||
| private final val millisStr = UTF8String.fromString("millisecond") | ||
| private final val microsStr = UTF8String.fromString("microsecond") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I know you added final for performance reasons, but do we have actual performance diffs of the benchmarks below with/without this final? (Just out of curiosity...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I re-ran IntervalBenchmark without finals, there is no difference, actually.
| private final val millisStr = UTF8String.fromString("millisecond") | ||
| private final val microsStr = UTF8String.fromString("microsecond") | ||
|
|
||
| def stringToInterval(input: UTF8String): Option[CalendarInterval] = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use Option here? If this method has the same return type with safeFromString, we could make diff in Cast.scala much less.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: safeFromString -> safeAndFastFromString? Also, could you leave some comments here like This is the fast version of safeFromString...?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why did you use Option here?
The same reason as stringToDate and stringToTimestamp return Option[...]. I believe all 3 functions should follow similar convention - None for errors and Some[] for valid values.
Returning null instead of None looks slightly ugly in Scala but this should produce less garbage. Let me think of ...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: safeFromString -> safeAndFastFromString?
I would remain the same name for consistency with stringToDate and stringToTimestamp, or rename it to safeFromUTF8String.
|
This might be different from |
yes, it does |
|
Yes it does allow in Spark now, but I'm asking if it's also allowed in others. I checked pgsql Can you check more? |
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala
Show resolved
Hide resolved
I can check but the purpose of this PR is optimize existing functionality. I don't think changing behavior here is right choice. |
|
If we need to change behavior, I'd suggest we wait for it to happen before writing a new implementation to optimize it. |
|
I've open #26283 to change the behavior. |
…nterval # Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
|
Test build #112865 has finished for PR 26256 at commit
|
…nterval # Conflicts: # sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/IntervalUtils.scala # sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/util/IntervalUtilsSuite.scala
|
Test build #113124 has finished for PR 26256 at commit
|
|
jenkins, retest this, please |
| case _ if '0' <= b && b <= '9' && fractionScale > 0 => | ||
| fraction = Math.addExact(fraction, Math.multiplyExact(fractionScale, (b - '0'))) | ||
| fractionScale /= 10 | ||
| case ' ' => |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
how about 1. hour and 1. second?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
spark-sql> select cast('1. seconds' as interval);
interval 1 seconds
spark-sql> select cast('1. days' as interval);
interval 1 days|
Test build #113313 has finished for PR 26256 at commit
|
|
Test build #113312 has finished for PR 26256 at commit
|
| case ' ' => | ||
| state = BEGIN_UNIT_NAME | ||
| case '.' => | ||
| fractionScale = 100000 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the antlr version(IntervalUtils.parseNanos) supports up to 9 digits in the fraction part. Shall we follow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
seems the behavior is:
- if it's less than 9 digits, take the first 6 and parse to microseconds
- if it's more than 9 digits, fail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Changed the behavior
| val input = "+1 year -1 day" | ||
| val result = new CalendarInterval(12, -1, 0) | ||
| assert(fromString(input) == result) | ||
| test("string to interval: special cases") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we also test 1. second, 1. hour and 1.111111111 seconds here?
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except 2 comments.
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except 2 comments.
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except 2 comments.
cloud-fan
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM except 2 comments.
|
Test build #113324 has finished for PR 26256 at commit
|
|
Test build #113325 has finished for PR 26256 at commit
|
| case FRACTIONAL_PART => | ||
| b match { | ||
| case _ if '0' <= b && b <= '9' && fractionScale > 0 => | ||
| fraction = Math.addExact(fraction, Math.multiplyExact(fractionScale, (b - '0'))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The fractional part cannot overflow. The max number is 999999999 fits to Int. I will remove the exact functions.
|
jenkins, retest this, please |
|
Test build #113339 has finished for PR 26256 at commit
|
|
Test build #113341 has finished for PR 26256 at commit
|
|
thanks, merging to master! |
… signs and values ### What changes were proposed in this pull request? With the latest string to literal optimization #26256, some interval strings can not be cast when there are some spaces between signs and unit values. After state `PARSE_SIGN`, it directly goes to `PARSE_UNIT_VALUE` when takes a space character as the end. So when there are some white spaces come before the real unit value, it fails to parse, we should add a new state like `TRIM_VALUE` to trim all these spaces. How to re-produce, which aim the revisions since #26256 is merged ```sql select cast(v as interval) from values ('+ 1 second') t(v); select cast(v as interval) from values ('- 1 second') t(v); ``` ### Why are the changes needed? bug fix ### Does this PR introduce any user-facing change? no ### How was this patch tested? 1. ut 2. new benchmark test Closes #26449 from yaooqinn/SPARK-29605. Authored-by: Kent Yao <[email protected]> Signed-off-by: Wenchen Fan <[email protected]>
What changes were proposed in this pull request?
In the PR, I propose new function
stringToInterval()inIntervalUtilsfor convertingUTF8StringtoCalendarInterval. The function is used in casting aSTRINGcolumn to anINTERVALcolumn.Why are the changes needed?
The proposed implementation is ~10 times faster. For example, parsing 9 interval units on JDK 8:
Before:
After:
Does this PR introduce any user-facing change?
No
How was this patch tested?
stringToIntervalinIntervalUtilsSuite