-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-25202] [SQL] Implements split with limit sql function #22227
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
not |
|
@gatorsmile @ueshin can you trigger this test? |
|
Can you add tests in |
| * @note Pattern is a string representation of the regular expression. | ||
| * | ||
| * @group string_funcs | ||
| * @since 1.5.0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
1.5.0 -> 2.4.0
|
ok to test. |
| } | ||
|
|
||
| /** | ||
| * Splits str around pattern (pattern is a regular expression) up to `limit-1` times. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Drop up to limit-1 times in the first line? That's because the behaviour depends on values described below.
| */ | ||
| @ExpressionDescription( | ||
| usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.", | ||
| usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you refine the description and the format along with the others, e.g., RLike
spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala
Line 78 in ceb3f41
| * pattern - a string expression. The pattern is a string which is matched literally, with |
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1); | ||
| ["one","two","three",""] | ||
| | > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2); | ||
| | ["one","twoBthreeC"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: remove |.
| """) | ||
| case class StringSplit(str: Expression, pattern: Expression) | ||
| extends BinaryExpression with ImplicitCastInputTypes { | ||
| case class StringSplit(str: Expression, pattern: Expression, limit: Expression) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We still need to support 2 arguments. Please add a constructor def this(str: Expression, pattern: Expression).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For test coverage, better to add tests in string-functions.sql for the two cases: two arguments and three arguments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@maropu which tests use string-functions.sql? would like to add tests here but not sure how to explicitly kick off the test as there are no *Suites which use this file it seems.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
^ ignore this! found it @maropu
| * @since 1.5.0 | ||
| */ | ||
| def split(str: Column, pattern: String, limit: Int): Column = withExpr { | ||
| StringSplit(str.expr, lit(pattern).expr, lit(limit).expr) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: better to directly use Literal
|
@phegstrom Thanks for your contribution! |
| "non-positive then the pattern will be applied as many times as " + | ||
| "possible and the array can have any length. If n is zero then the " + | ||
| "pattern will be applied as many times as possible, the array can " + | ||
| "have any length, and trailing empty strings will be discarded.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm, is it possible to make this usage more compact? I think the usage here should be concise.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya i'll take a crack at it -- the usage is a bit funky given the different behavior based on what limit is, I wanted to err on the side of verbose
| examples = """ | ||
| Examples: | ||
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]'); | ||
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is better to keep original example for default value.
|
|
||
| override def nullSafeEval(string: Any, regex: Any): Any = { | ||
| val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1) | ||
| override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we still need to do some check on limit. According to Presto document, limit must be a positive number. -1 is only used when no limit parameter is given (default value).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya the underlying implementation of this method is Java.lang.String, correct? This method does allow non-positive values for limit, not sure what Presto is using.
|
Test build #95237 has finished for PR 22227 at commit
|
| @ExpressionDescription( | ||
| usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.", | ||
| usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." + | ||
| "The `limit` parameter controls the number of times the pattern is applied and " + |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we be more concise? e.g. presto's doc is just
"Splits string on delimiter and returns an array of size at most limit. The last element in the array always contain everything left in the string. limit must be a positive number."
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
you should say if limit is ignored if it is a non-positive number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@rxin the underlying implementation of this method is Java.lang.String, correct? This method does allow non-positive values for limit, not sure what Presto is using. The text I've put here corresponds with the definition (rather long) from Java.lang.String.
| -- split function | ||
| select split('aa1cc2ee', '[1-9]+', 2); | ||
| select split('aa1cc2ee', '[1-9]+'); | ||
|
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you move these tests to the end of this file in order to decrease unnecessary changes in the golden file.
| "less than 0, then the pattern will be applied as many times as " + | ||
| "possible and the array can have any length. If n is zero then the " + | ||
| "pattern will be applied as many times as possible, the array can " + | ||
| "have any length, and trailing empty strings will be discarded.", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for #22227 (comment). The doc should better be concise.
Can we just move those limit specific description into the arguments at limit - a..? This looks a bit messy.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will do!
|
Test build #95311 has finished for PR 22227 at commit
|
|
@maropu @HyukjinKwon let me know what you think, took care of your comments |
| * n, and the array's last entry will contain all input beyond the last matched delimiter. | ||
| * If n is non-positive then the pattern will be applied as many times as possible and the | ||
| * array can have any length. If n is zero then the pattern will be applied as many times as | ||
| * possible, the array can have any length, and trailing empty strings will be discarded. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you copy SQL's doc here? You could describe them via @param here as well.
|
Seems okay |
|
Test build #95388 has finished for PR 22227 at commit
|
| | > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0); | ||
| ["one","two","three"] | ||
| | > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2); | ||
| ["one","twoBthreeC"] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Add the netative case?
|
|
||
| /** | ||
| * Splits str around pat (pattern is a regular expression). | ||
| * Splits str around pattern (pattern is a regular expression). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
pattern? regex? we should use a consisntent word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to switch to regex, makes more sense given that with the use of pattern we always have to define it as a regex
| Examples: | ||
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]'); | ||
| ["one","two","three",""] | ||
| | > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
drop |
|
Test build #96075 has finished for PR 22227 at commit
|
| ) | ||
| expect_equal( | ||
| collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1], | ||
| list(list("a", "[email protected] 1\\b")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
let's add a test for limit = 0 or limit = -1 too - while it's the default value, is any of the test cases changes behavior for limit = -1?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
added for limit = 0 to catch the "change behavior" case
|
Test build #96112 has finished for PR 22227 at commit
|
|
long thread, are we all good with this? |
|
LGTM except for the R/python parts (I'm not familiar with these parts and I'll leave them to @felixcheung and @HyukjinKwon). |
|
ok to test |
felixcheung
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ok R LGTM
|
Test build #96481 has finished for PR 22227 at commit
|
| #' head(select(df, split_string(df$Class, "\\d"))) | ||
| #' head(select(df, split_string(df$Class, "\\d", 2))) | ||
| #' # This is equivalent to the following SQL expression | ||
| #' head(selectExpr(df, "split(Class, '\\\\d')"))} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmm i think L3418 shall be followed by L3420?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good point - also the example should run in the order documented.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes will make that change @viirya @felixcheung
| #' Equivalent to \code{split} SQL function. | ||
| #' | ||
| #' @rdname column_string_functions | ||
| #' @param limit determines the length of the returned array. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
shall we mention this is an optional param?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
going to include this in the @details section, as other functions like ltrim handle optionality of one of its params there.
| * and the resulting array's last entry will contain all input beyond the last | ||
| * matched regex.</li> | ||
| * <li>limit less than or equal to 0: `regex` will be applied as many times as | ||
| * possible, and the resulting array can be of any size.</li> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we don't need </li>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was asked to do <li> earlier in this PR conversation. @HyukjinKwon -- thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I mean we may not need ending tag </li>.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, I'll look into that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@viirya throughout this repository, the </li> has always been included. For consistency, I think we should just keep it as is. Let me know what you think
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok. Then it's fine. Thanks for looking at it.
|
Test build #96826 has finished for PR 22227 at commit
|
|
@phegstrom, can you close and reopen this PR to retrigger the AppVeyor test above? Looks it's failed due to time limitation. |
|
@HyukjinKwon are things passing? |
|
retest this please |
|
Test build #96966 has finished for PR 22227 at commit
|
|
retest this please |
|
Test build #96981 has finished for PR 22227 at commit
|
|
Merged to master. |
## What changes were proposed in this pull request? Adds support for the setting limit in the sql split function ## How was this patch tested? 1. Updated unit tests 2. Tested using Scala spark shell Please review http://spark.apache.org/contributing.html before opening a pull request. Closes apache#22227 from phegstrom/master. Authored-by: Parker Hegstrom <[email protected]> Signed-off-by: hyukjinkwon <[email protected]>
| */ | ||
| def split(str: Column, pattern: String): Column = withExpr { | ||
| StringSplit(str.expr, lit(pattern).expr) | ||
| def split(str: Column, regex: String): Column = withExpr { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this an API breaking change?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes it is for source compatibility in scala
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yea Scala is sensitive to parameter name, as the caller can do: split(str = ..., pattern = ...)
so this is binary-compatible but not source-compatible. @HyukjinKwon can you help revert this line?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay, but for the record such changes already have been made so far not only in SQL but SS sides if I am not remembering wrongly because users are expected to likely edit their source when they compile against Spark 3.0, and it doesn't break existing compiled apps. I am not sure why this one is special but sure it's easy to keep the compat with a minimal change.
…t split in Scala API ### What changes were proposed in this pull request? To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes. ### Why are the changes needed? For source compatibility. ### Does this PR introduce any user-facing change? No (it will prevent potential user-facing change from the original PR) ### How was this patch tested? Unittest was changed (in order for us to detect that source compatibility easily). Closes #27756 from HyukjinKwon/SPARK-25202. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
…t split in Scala API ### What changes were proposed in this pull request? To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes. ### Why are the changes needed? For source compatibility. ### Does this PR introduce any user-facing change? No (it will prevent potential user-facing change from the original PR) ### How was this patch tested? Unittest was changed (in order for us to detect that source compatibility easily). Closes #27756 from HyukjinKwon/SPARK-25202. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]> (cherry picked from commit 3956e95) Signed-off-by: HyukjinKwon <[email protected]>
…t split in Scala API ### What changes were proposed in this pull request? To address the concern pointed out in apache#22227. This will make `split` source-compatible by removing minimal cosmetic changes. ### Why are the changes needed? For source compatibility. ### Does this PR introduce any user-facing change? No (it will prevent potential user-facing change from the original PR) ### How was this patch tested? Unittest was changed (in order for us to detect that source compatibility easily). Closes apache#27756 from HyukjinKwon/SPARK-25202. Authored-by: HyukjinKwon <[email protected]> Signed-off-by: HyukjinKwon <[email protected]>
What changes were proposed in this pull request?
Adds support for the setting limit in the sql split function
How was this patch tested?
Please review http://spark.apache.org/contributing.html before opening a pull request.