[SPARK-25202] [SQL] Implements split with limit sql function #22227

phegstrom · 2018-08-24T20:55:11Z

What changes were proposed in this pull request?

Adds support for setting a limit in the SQL split function.

How was this patch tested?

Updated unit tests
Tested using Scala spark shell

Please review http://spark.apache.org/contributing.html before opening a pull request.

maropu · 2018-08-25T00:21:57Z

not [CORE] but [SQL] in the title.

maropu · 2018-08-25T00:22:13Z

@gatorsmile @ueshin can you trigger this test?

maropu · 2018-08-25T00:29:53Z

Can you add tests in StringFunctionsSuite, too? Also, it would be better to add tests in sql-tests/inputs/string-functions.sql to cover the parser path.

maropu · 2018-08-25T00:37:59Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @note Pattern is a string representation of the regular expression.
+   *
+   * @group string_funcs
+   * @since 1.5.0


1.5.0 -> 2.4.0

ueshin · 2018-08-25T00:56:07Z

ok to test.

maropu · 2018-08-25T00:58:32Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+  }
+
+  /**
+   * Splits str around pattern (pattern is a regular expression) up to `limit-1` times.


Drop "up to `limit-1` times" from the first line? The behaviour depends on the value of `limit`, as described below.

maropu · 2018-08-25T01:03:38Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

 */
 @ExpressionDescription(
-  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
+  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +


Can you refine the description and the format to match the others, e.g., RLike?

spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

Line 78 in ceb3f41

* pattern - a string expression. The pattern is a string which is matched literally, with

ueshin · 2018-08-25T00:57:56Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
       ["one","two","three",""]
+|      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
+ |       ["one","twoBthreeC"]


nit: remove |.

ueshin · 2018-08-25T00:59:32Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

  """)
-case class StringSplit(str: Expression, pattern: Expression)
-  extends BinaryExpression with ImplicitCastInputTypes {
+case class StringSplit(str: Expression, pattern: Expression, limit: Expression)


We still need to support 2 arguments. Please add a constructor def this(str: Expression, pattern: Expression).

For test coverage, better to add tests in string-functions.sql for the two cases: two arguments and three arguments.

@maropu which tests use string-functions.sql? would like to add tests here but not sure how to explicitly kick off the test as there are no *Suites which use this file it seems.

^ ignore this! found it @maropu
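The two-argument constructor requested above can be sketched without any Spark dependency. This is only an illustration: `Expr`, `Lit`, and `StringSplitSketch` are hypothetical stand-ins for Catalyst's expression classes, and the default of -1 is an assumption taken from the value the pre-existing `nullSafeEval` passes to `split`.

```scala
// Hypothetical stand-ins for Catalyst's Expression and Literal (not Spark classes).
sealed trait Expr
final case class Lit(value: Any) extends Expr

// Three-argument primary constructor, mirroring the new StringSplit signature.
final case class StringSplitSketch(str: Expr, regex: Expr, limit: Expr) {
  // Auxiliary constructor keeps the old two-argument form working.
  // Assumption: -1 is the default, matching the previous hard-coded limit.
  def this(str: Expr, regex: Expr) = this(str, regex, Lit(-1))
}

object StringSplitSketchDemo {
  // The two-argument form fills in the default limit.
  val twoArgs: StringSplitSketch = new StringSplitSketch(Lit("aa1cc2ee"), Lit("[1-9]+"))
  // The three-argument form carries the caller's limit.
  val threeArgs: StringSplitSketch = StringSplitSketch(Lit("aa1cc2ee"), Lit("[1-9]+"), Lit(2))
}
```

An auxiliary constructor on a case class is invoked with `new`, since the generated `apply` only covers the primary constructor.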

maropu · 2018-08-25T01:11:32Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * @since 1.5.0
+   */
+  def split(str: Column, pattern: String, limit: Int): Column = withExpr {
+    StringSplit(str.expr, lit(pattern).expr, lit(limit).expr)


nit: better to directly use Literal

ueshin · 2018-08-25T01:16:37Z

@phegstrom Thanks for your contribution!
Btw, it seems the email address in your commits is not connected to your GitHub account. Could you connect the address to your account, or use another address that is connected to it? Otherwise, your contribution will not be connected to you.

viirya · 2018-08-25T01:21:58Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+    "non-positive then the pattern will be applied as many times as " +
+    "possible and the array can have any length. If n is zero then the " +
+    "pattern will be applied as many times as possible, the array can " +
+    "have any length, and trailing empty strings will be discarded.",


hmm, is it possible to make this usage more compact? I think the usage here should be concise.

@viirya i'll take a crack at it -- the usage is a bit funky given the different behavior based on what limit is, I wanted to err on the side of verbose

viirya · 2018-08-25T01:22:48Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

  examples = """
    Examples:
-      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
+      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);


I think it is better to keep the original example for the default value.

viirya · 2018-08-25T01:25:51Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala


-  override def nullSafeEval(string: Any, regex: Any): Any = {
-    val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
+  override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {


I think we still need to do some checks on limit. According to the Presto documentation, limit must be a positive number. -1 is only used when no limit parameter is given (default value).

@viirya the underlying implementation of this method is java.lang.String, correct? That method does allow non-positive values for limit; not sure what Presto is using.
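To make the point about `java.lang.String` concrete, here is a small sketch that calls `String.split(regex, limit)` directly on the example string used in this PR's docs. It does not go through Spark's `UTF8String`, so treating it as equivalent to the SQL function is an assumption; `SplitLimitDemo` and `splitWithLimit` are hypothetical names.

```scala
object SplitLimitDemo {
  // Thin wrapper over java.lang.String.split(regex, limit).
  def splitWithLimit(input: String, regex: String, limit: Int): List[String] =
    input.split(regex, limit).toList

  def main(args: Array[String]): Unit = {
    val s = "oneAtwoBthreeC"
    // Negative limit: split as often as possible, keep trailing empty strings.
    println(splitWithLimit(s, "[ABC]", -1)) // List(one, two, three, )
    // Zero limit: split as often as possible, drop trailing empty strings.
    println(splitWithLimit(s, "[ABC]", 0))  // List(one, two, three)
    // Positive limit n: at most n elements; the last holds the remainder.
    println(splitWithLimit(s, "[ABC]", 2))  // List(one, twoBthreeC)
  }
}
```

The only difference between a negative and a zero limit is the trailing empty string, which is why the two cases need separate tests.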

SparkQA · 2018-08-25T02:40:05Z

Test build #95237 has finished for PR 22227 at commit ceb3f41.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

rxin · 2018-08-26T06:05:01Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

 @ExpressionDescription(
-  usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
+  usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
+    "The `limit` parameter controls the number of times the pattern is applied and " +


can we be more concise? e.g. presto's doc is just

"Splits string on delimiter and returns an array of size at most limit. The last element in the array always contain everything left in the string. limit must be a positive number."

You should also say whether limit is ignored when it is a non-positive number.

@rxin the underlying implementation of this method is java.lang.String, correct? That method does allow non-positive values for limit; not sure what Presto is using. The text I've put here corresponds to the (rather long) definition from java.lang.String.

maropu · 2018-08-28T00:32:03Z

sql/core/src/test/resources/sql-tests/inputs/string-functions.sql

+-- split function
+select split('aa1cc2ee', '[1-9]+', 2);
+select split('aa1cc2ee', '[1-9]+');
+


Can you move these tests to the end of this file, to reduce unnecessary changes in the golden file?

HyukjinKwon · 2018-08-28T03:41:39Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+    "less than 0, then the pattern will be applied as many times as " +
+    "possible and the array can have any length. If n is zero then the " +
+    "pattern will be applied as many times as possible, the array can " +
+    "have any length, and trailing empty strings will be discarded.",


+1 for #22227 (comment). The doc should be concise.

Can we just move the limit-specific description into the arguments section, at limit - a..? This looks a bit messy.

SparkQA · 2018-08-28T03:43:27Z

Test build #95311 has finished for PR 22227 at commit 4e10733.

This patch fails from timeout after a configured wait of `400m`.
This patch merges cleanly.
This patch adds no public classes.

phegstrom · 2018-08-28T21:53:22Z

@maropu @HyukjinKwon let me know what you think, took care of your comments

HyukjinKwon · 2018-08-29T01:23:14Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   * n, and the array's last entry will contain all input beyond the last matched delimiter.
+   * If n is non-positive then the pattern will be applied as many times as possible and the
+   * array can have any length. If n is zero then the pattern will be applied as many times as
+   * possible, the array can have any length, and trailing empty strings will be discarded.


Can you copy SQL's doc here? You could describe them via @param here as well.

HyukjinKwon · 2018-08-29T01:23:49Z

Seems okay

SparkQA · 2018-08-29T01:45:31Z

Test build #95388 has finished for PR 22227 at commit e8c8c8c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

maropu · 2018-08-29T03:57:20Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

+|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
+       ["one","two","three"]
+|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
+       ["one","twoBthreeC"]


Add the negative case?

maropu · 2018-08-29T04:00:07Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala


 /**
- * Splits str around pat (pattern is a regular expression).
+ * Splits str around pattern (pattern is a regular expression).


pattern? regex? We should use a consistent word.

going to switch to regex, makes more sense given that with the use of pattern we always have to define it as a regex

maropu · 2018-08-29T04:12:52Z

sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/regexpExpressions.scala

    Examples:
      > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
       ["one","two","three",""]
+|     > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);


SparkQA · 2018-09-14T15:39:50Z

Test build #96075 has finished for PR 22227 at commit b5994ad.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-09-16T04:18:43Z

R/pkg/tests/fulltests/test_sparkSQL.R

  )
+  expect_equal(
+    collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
+    list(list("a", "[email protected]   1\\b"))


let's add a test for limit = 0 or limit = -1 too - while -1 is the default value, do any of the test cases change behavior for limit = -1?

added for limit = 0 to catch the "change behavior" case

SparkQA · 2018-09-17T03:39:16Z

Test build #96112 has finished for PR 22227 at commit 5c8f487.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

felixcheung · 2018-09-20T04:29:25Z

long thread, are we all good with this?

maropu · 2018-09-21T14:46:59Z

LGTM except for the R/python parts (I'm not familiar with these parts and I'll leave them to @felixcheung and @HyukjinKwon).

HyukjinKwon · 2018-09-23T02:39:44Z

ok to test

felixcheung

ok R LGTM

SparkQA · 2018-09-23T07:05:02Z

Test build #96481 has finished for PR 22227 at commit 5c8f487.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

viirya · 2018-09-23T07:34:13Z

R/pkg/R/functions.R

 #' head(select(df, split_string(df$Class, "\\d")))
+#' head(select(df, split_string(df$Class, "\\d", 2)))
 #' # This is equivalent to the following SQL expression
 #' head(selectExpr(df, "split(Class, '\\\\d')"))}


hmm i think L3418 shall be followed by L3420?

good point - also the example should run in the order documented.

yes will make that change @viirya @felixcheung

viirya · 2018-09-23T07:38:07Z

R/pkg/R/functions.R

 #' Equivalent to \code{split} SQL function.
 #'
 #' @rdname column_string_functions
+#' @param limit determines the length of the returned array.


shall we mention this is an optional param?

going to include this in the @details section, as other functions like ltrim handle optionality of one of its params there.

viirya · 2018-09-23T07:47:19Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

+   *          and the resulting array's last entry will contain all input beyond the last
+   *          matched regex.</li>
+   *          <li>limit less than or equal to 0: `regex` will be applied as many times as
+   *          possible, and the resulting array can be of any size.</li>


I think we don't need </li>.

I was asked to do <li> earlier in this PR conversation. @HyukjinKwon -- thoughts here?

I mean we may not need ending tag </li>.

ah, I'll look into that

@viirya throughout this repository, the </li> has always been included. For consistency, I think we should just keep it as is. Let me know what you think

Ok. Then it's fine. Thanks for looking at it.

SparkQA · 2018-10-01T23:21:58Z

Test build #96826 has finished for PR 22227 at commit 34ba74f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-01T23:46:57Z

@phegstrom, can you close and reopen this PR to retrigger the AppVeyor test above? Looks like it failed due to the time limit.

phegstrom · 2018-10-03T10:17:19Z

@HyukjinKwon are things passing?

HyukjinKwon · 2018-10-05T03:40:05Z

retest this please

SparkQA · 2018-10-05T07:05:02Z

Test build #96966 has finished for PR 22227 at commit 34ba74f.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-05T08:27:11Z

retest this please

SparkQA · 2018-10-05T13:21:53Z

Test build #96981 has finished for PR 22227 at commit 34ba74f.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-10-06T06:30:39Z

Merged to master.

## What changes were proposed in this pull request?

Adds support for the setting limit in the sql split function

## How was this patch tested?

1. Updated unit tests
2. Tested using Scala spark shell

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#22227 from phegstrom/master.

Authored-by: Parker Hegstrom <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>

gatorsmile · 2020-03-02T01:55:33Z

sql/core/src/main/scala/org/apache/spark/sql/functions.scala

   */
-  def split(str: Column, pattern: String): Column = withExpr {
-    StringSplit(str.expr, lit(pattern).expr)
+  def split(str: Column, regex: String): Column = withExpr {


Is this an API breaking change?

yes it is for source compatibility in scala

yea Scala is sensitive to parameter name, as the caller can do: split(str = ..., pattern = ...)

so this is binary-compatible but not source-compatible. @HyukjinKwon can you help revert this line?

Okay, but for the record, if I remember correctly, such changes have already been made on both the SQL and SS sides, because users are expected to edit their source when they compile against Spark 3.0, and it doesn't break existing compiled apps. I am not sure why this one is special, but sure, it's easy to keep compatibility with a minimal change.
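The source-versus-binary distinction discussed above can be shown with a tiny sketch. `SplitApiSketch` is a hypothetical stand-in that works on plain strings rather than Spark's `Column`; only the parameter name matters here.

```scala
object SplitApiSketch {
  // Hypothetical stand-in for the public Scala function; the second
  // parameter keeps its original name `pattern`.
  def split(str: String, pattern: String): Seq[String] =
    str.split(pattern, -1).toSeq
}

object NamedArgCaller {
  // A call site that names its arguments. It compiles only while the
  // parameter is called `pattern`. After a rename to `regex` the compiler
  // would reject this line, even though already-compiled callers keep
  // linking, because parameter names are not part of the JVM method descriptor.
  val parts: Seq[String] = SplitApiSketch.split(str = "a,b", pattern = ",")
}
```

Positional callers such as `split("a,b", ",")` are unaffected by a rename, which is why the break only shows up for users who recompile code that uses named arguments.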

…t split in Scala API

### What changes were proposed in this pull request?

To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>

…t split in Scala API

### What changes were proposed in this pull request?

To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3956e95)
Signed-off-by: HyukjinKwon <[email protected]>


Parker Hegstrom added 2 commits August 24, 2018 15:19

implement split with limit (15362be)

linting (ceb3f41)

maropu reviewed Aug 25, 2018

ueshin reviewed Aug 25, 2018

maropu reviewed Aug 25, 2018

viirya reviewed Aug 25, 2018

rxin reviewed Aug 26, 2018

phegstrom changed the title ~~[SPARK-25202] [Core] Implements split with limit sql function~~ [SPARK-25202] [SQL] Implements split with limit sql function Aug 27, 2018

Parker Hegstrom added 2 commits August 27, 2018 16:27

most comments (e564a68)

sql function tests (4e10733)

maropu reviewed Aug 28, 2018

HyukjinKwon reviewed Aug 28, 2018

fixing test file, comments (5135cb2)

adding another example (e8c8c8c)

HyukjinKwon reviewed Aug 29, 2018

maropu reviewed Aug 29, 2018

felixcheung reviewed Sep 16, 2018

felix comments for R tests (5c8f487)

felixcheung reviewed Sep 23, 2018

viirya reviewed Sep 23, 2018

viirya comments (34ba74f)

phegstrom closed this Oct 2, 2018

phegstrom reopened this Oct 2, 2018

HyukjinKwon approved these changes Oct 3, 2018

asfgit closed this in 17781d7 Oct 6, 2018

gatorsmile reviewed Mar 2, 2020

HyukjinKwon mentioned this pull request Mar 2, 2020

[SPARK-25202][SQL][FOLLOW-UP] Keep the old parameter name 'pattern' at split in Scala API #27756 (Closed)
