
@phegstrom

What changes were proposed in this pull request?

Adds support for setting the `limit` parameter in the SQL `split` function.

How was this patch tested?

  1. Updated unit tests
  2. Tested using Scala spark shell

Please review http://spark.apache.org/contributing.html before opening a pull request.

@maropu
Member

maropu commented Aug 25, 2018

not [CORE] but [SQL] in the title.

@maropu
Member

maropu commented Aug 25, 2018

@gatorsmile @ueshin can you trigger this test?

@maropu
Member

maropu commented Aug 25, 2018

Can you add tests in StringFunctionsSuite, too? It would also be better to add tests in sql-tests/inputs/string-functions.sql, so they are exercised via the parser.

* @note Pattern is a string representation of the regular expression.
*
* @group string_funcs
* @since 1.5.0
Member

1.5.0 -> 2.4.0

@ueshin
Member

ueshin commented Aug 25, 2018

ok to test.

}

/**
* Splits str around pattern (pattern is a regular expression) up to `limit-1` times.
Member

Drop `up to limit-1 times` from the first line? The behaviour depends on the values of `limit` described below.

*/
@ExpressionDescription(
usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
Member

Can you refine the description and the format along with the others, e.g., RLike

* pattern - a string expression. The pattern is a string which is matched literally, with

> SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
["one","two","three",""]
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
| ["one","twoBthreeC"]
Member

nit: remove |.

""")
case class StringSplit(str: Expression, pattern: Expression)
extends BinaryExpression with ImplicitCastInputTypes {
case class StringSplit(str: Expression, pattern: Expression, limit: Expression)
Member

We still need to support 2 arguments. Please add a constructor def this(str: Expression, pattern: Expression).
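The actual fix in Spark is a Scala auxiliary constructor (`def this(str: Expression, pattern: Expression)`). A minimal Java-flavored sketch of the same backward-compatibility pattern, using a hypothetical simplified class rather than Spark's real expression types:

```java
// Hypothetical simplification (not Spark's actual Scala code): the reviewer's
// suggestion, expressed in Java terms, is an overload that delegates to the
// full constructor with a default limit of -1, so existing two-argument call
// sites keep working unchanged.
public class StringSplitSketch {
    final String str;
    final String regex;
    final int limit;

    StringSplitSketch(String str, String regex, int limit) {
        this.str = str;
        this.regex = regex;
        this.limit = limit;
    }

    // Old two-argument form, preserved for compatibility.
    StringSplitSketch(String str, String regex) {
        this(str, regex, -1); // -1 means "no limit" in String.split semantics
    }

    public static void main(String[] args) {
        StringSplitSketch s = new StringSplitSketch("a,b,c", ",");
        System.out.println(s.limit); // prints -1
    }
}
```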

Member

For test coverage, better to add tests in string-functions.sql for the two cases: two arguments and three arguments.

Author

@maropu which tests use string-functions.sql? I would like to add tests here, but I'm not sure how to explicitly kick off the test, as there seem to be no *Suites that use this file.

Author

^ ignore this! found it @maropu

* @since 1.5.0
*/
def split(str: Column, pattern: String, limit: Int): Column = withExpr {
StringSplit(str.expr, lit(pattern).expr, lit(limit).expr)
Member

nit: better to directly use Literal

@ueshin
Member

ueshin commented Aug 25, 2018

@phegstrom Thanks for your contribution!
Btw, it seems the email address in your commits is not connected to your GitHub account. Could you connect the address to your account, or use another address that is connected? Otherwise, your contribution will not be attributed to you.

"non-positive then the pattern will be applied as many times as " +
"possible and the array can have any length. If n is zero then the " +
"pattern will be applied as many times as possible, the array can " +
"have any length, and trailing empty strings will be discarded.",
Member

hmm, is it possible to make this usage more compact? I think the usage here should be concise.

Author

@phegstrom phegstrom Aug 27, 2018

@viirya I'll take a crack at it. The usage is a bit awkward given the different behavior depending on what limit is; I wanted to err on the side of being verbose.

examples = """
Examples:
> SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
> SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', -1);
Member

I think it is better to keep original example for default value.


override def nullSafeEval(string: Any, regex: Any): Any = {
val strings = string.asInstanceOf[UTF8String].split(regex.asInstanceOf[UTF8String], -1)
override def nullSafeEval(string: Any, regex: Any, limit: Any): Any = {
Member

I think we still need to do some check on limit. According to Presto document, limit must be a positive number. -1 is only used when no limit parameter is given (default value).

Author

@viirya the underlying implementation of this method is java.lang.String, correct? That method does allow non-positive values for limit; I'm not sure what Presto is using.
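For reference, the behavior being debated matches `java.lang.String.split(String regex, int limit)`, which does accept non-positive limits. A small standalone sketch (the demo class name is just for illustration):

```java
import java.util.Arrays;

public class SplitLimitDemo {
    public static void main(String[] args) {
        String input = "oneAtwoBthreeC";

        // limit > 0: the regex is applied at most limit-1 times,
        // so the resulting array has at most `limit` entries.
        System.out.println(Arrays.toString(input.split("[ABC]", 2)));
        // prints [one, twoBthreeC]

        // limit < 0: the regex is applied as many times as possible;
        // trailing empty strings are kept.
        System.out.println(Arrays.toString(input.split("[ABC]", -1)));
        // prints [one, two, three, ]

        // limit == 0: like a negative limit, but trailing empty strings
        // are discarded.
        System.out.println(Arrays.toString(input.split("[ABC]", 0)));
        // prints [one, two, three]
    }
}
```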

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95237 has finished for PR 22227 at commit ceb3f41.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ExpressionDescription(
usage = "_FUNC_(str, regex) - Splits `str` around occurrences that match `regex`.",
usage = "_FUNC_(str, regex, limit) - Splits `str` around occurrences that match `regex`." +
"The `limit` parameter controls the number of times the pattern is applied and " +
Contributor

can we be more concise? e.g. presto's doc is just

"Splits string on delimiter and returns an array of size at most limit. The last element in the array always contains everything left in the string. limit must be a positive number."

Contributor

You should say whether limit is ignored when it is a non-positive number.

Author

@rxin the underlying implementation of this method is java.lang.String, correct? That method does allow non-positive values for limit; I'm not sure what Presto is using. The text I've put here corresponds to the (rather long) definition from java.lang.String.

@phegstrom phegstrom changed the title [SPARK-25202] [Core] Implements split with limit sql function [SPARK-25202] [SQL] Implements split with limit sql function Aug 27, 2018
-- split function
select split('aa1cc2ee', '[1-9]+', 2);
select split('aa1cc2ee', '[1-9]+');

Member

Can you move these tests to the end of this file, in order to decrease unnecessary changes in the golden file?

"less than 0, then the pattern will be applied as many times as " +
"possible and the array can have any length. If n is zero then the " +
"pattern will be applied as many times as possible, the array can " +
"have any length, and trailing empty strings will be discarded.",
Member

+1 for #22227 (comment). The doc should be more concise.

Can we just move the limit-specific description into the arguments section at `limit - a..`? This looks a bit messy.

Author

will do!

@SparkQA

SparkQA commented Aug 28, 2018

Test build #95311 has finished for PR 22227 at commit 4e10733.

  • This patch fails from timeout after a configured wait of `400m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

@phegstrom
Author

@maropu @HyukjinKwon let me know what you think; I took care of your comments.

* n, and the array's last entry will contain all input beyond the last matched delimiter.
* If n is non-positive then the pattern will be applied as many times as possible and the
* array can have any length. If n is zero then the pattern will be applied as many times as
* possible, the array can have any length, and trailing empty strings will be discarded.
Member

Can you copy SQL's doc here? You could describe them via @param here as well.

@HyukjinKwon
Member

Seems okay

@SparkQA

SparkQA commented Aug 29, 2018

Test build #95388 has finished for PR 22227 at commit e8c8c8c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
["one","two","three"]
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 2);
["one","twoBthreeC"]
Member

Add the negative case?


/**
* Splits str around pat (pattern is a regular expression).
* Splits str around pattern (pattern is a regular expression).
Member

pattern? regex? We should use a consistent word.

Author

Going to switch to regex; it makes more sense, given that when we use pattern we always have to define it as a regex.

Examples:
> SELECT _FUNC_('oneAtwoBthreeC', '[ABC]');
["one","two","three",""]
| > SELECT _FUNC_('oneAtwoBthreeC', '[ABC]', 0);
Member

drop |

@SparkQA

SparkQA commented Sep 14, 2018

Test build #96075 has finished for PR 22227 at commit b5994ad.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

)
expect_equal(
collect(select(df4, split_string(df4$a, "\\.", 2)))[1, 1],
list(list("a", "[email protected] 1\\b"))
Member

Let's add a test for limit = 0 or limit = -1 too. While -1 is the default value, do any of the test cases change behavior for limit = -1?

Author

Added a test for limit = 0 to catch the "change behavior" case.
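As a side note, the R test quoted earlier splits on `"\\."` because the dot is a regex metacharacter and must be escaped to match a literal dot. A quick Java sketch of the same escaping plus limit behavior (the demo class name is hypothetical):

```java
public class EscapedDelimiterDemo {
    public static void main(String[] args) {
        // "." matches any character in a regex, so split on a literal dot
        // requires the escaped pattern "\\.". With limit = 2, only the
        // first delimiter is consumed.
        String[] parts = "a.b.c".split("\\.", 2);
        System.out.println(java.util.Arrays.toString(parts)); // prints [a, b.c]
    }
}
```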

@SparkQA

SparkQA commented Sep 17, 2018

Test build #96112 has finished for PR 22227 at commit 5c8f487.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@felixcheung
Member

long thread, are we all good with this?

@maropu
Member

maropu commented Sep 21, 2018

LGTM except for the R/python parts (I'm not familiar with these parts and I'll leave them to @felixcheung and @HyukjinKwon).

@HyukjinKwon
Member

ok to test

Member

@felixcheung felixcheung left a comment

ok R LGTM

@SparkQA

SparkQA commented Sep 23, 2018

Test build #96481 has finished for PR 22227 at commit 5c8f487.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

#' head(select(df, split_string(df$Class, "\\d")))
#' head(select(df, split_string(df$Class, "\\d", 2)))
#' # This is equivalent to the following SQL expression
#' head(selectExpr(df, "split(Class, '\\\\d')"))}
Member

Hmm, I think L3418 should be followed by L3420?

Member

good point - also the example should run in the order documented.

Author

yes will make that change @viirya @felixcheung

#' Equivalent to \code{split} SQL function.
#'
#' @rdname column_string_functions
#' @param limit determines the length of the returned array.
Member

shall we mention this is an optional param?

Author

Going to include this in the @details section, as other functions like ltrim document the optionality of one of their params there.

* and the resulting array's last entry will contain all input beyond the last
* matched regex.</li>
* <li>limit less than or equal to 0: `regex` will be applied as many times as
* possible, and the resulting array can be of any size.</li>
Member

I think we don't need </li>.

Author

I was asked to do <li> earlier in this PR conversation. @HyukjinKwon -- thoughts here?

Member

I mean we may not need the closing tag </li>.

Author

ah, I'll look into that

Author

@phegstrom phegstrom Oct 1, 2018

@viirya throughout this repository, the </li> has always been included. For consistency, I think we should just keep it as is. Let me know what you think

Member

Ok. Then it's fine. Thanks for looking at it.

@SparkQA

SparkQA commented Oct 1, 2018

Test build #96826 has finished for PR 22227 at commit 34ba74f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

@phegstrom, can you close and reopen this PR to retrigger the AppVeyor test above? Looks like it failed due to the time limit.

@phegstrom phegstrom closed this Oct 2, 2018
@phegstrom phegstrom reopened this Oct 2, 2018
@phegstrom
Author

@HyukjinKwon are things passing?

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96966 has finished for PR 22227 at commit 34ba74f.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

retest this please

@SparkQA

SparkQA commented Oct 5, 2018

Test build #96981 has finished for PR 22227 at commit 34ba74f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in 17781d7 Oct 6, 2018
jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
## What changes were proposed in this pull request?

Adds support for the setting limit in the sql split function

## How was this patch tested?

1. Updated unit tests
2. Tested using Scala spark shell

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#22227 from phegstrom/master.

Authored-by: Parker Hegstrom <[email protected]>
Signed-off-by: hyukjinkwon <[email protected]>
*/
def split(str: Column, pattern: String): Column = withExpr {
StringSplit(str.expr, lit(pattern).expr)
def split(str: Column, regex: String): Column = withExpr {
Member

Is this an API breaking change?

Contributor

Yes, it breaks source compatibility in Scala.

Contributor

@cloud-fan cloud-fan Mar 2, 2020

Yeah, Scala is sensitive to parameter names, as a caller can write: split(str = ..., pattern = ...)

So this is binary-compatible but not source-compatible. @HyukjinKwon can you help revert this line?

Member

Okay, but for the record, such changes have already been made not only on the SQL side but on the SS side too, if I remember correctly, because users are expected to edit their source anyway when they compile against Spark 3.0, and it doesn't break existing compiled apps. I am not sure why this one is special, but sure, it's easy to keep the compatibility with a minimal change.

HyukjinKwon added a commit that referenced this pull request Mar 3, 2020
…t split in Scala API

### What changes were proposed in this pull request?

To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
HyukjinKwon added a commit that referenced this pull request Mar 3, 2020
…t split in Scala API

### What changes were proposed in this pull request?

To address the concern pointed out in #22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes #27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>
(cherry picked from commit 3956e95)
Signed-off-by: HyukjinKwon <[email protected]>
sjincho pushed a commit to sjincho/spark that referenced this pull request Apr 15, 2020
…t split in Scala API

### What changes were proposed in this pull request?

To address the concern pointed out in apache#22227. This will make `split` source-compatible by removing minimal cosmetic changes.

### Why are the changes needed?

For source compatibility.

### Does this PR introduce any user-facing change?

No (it will prevent potential user-facing change from the original PR)

### How was this patch tested?

Unittest was changed (in order for us to detect that source compatibility easily).

Closes apache#27756 from HyukjinKwon/SPARK-25202.

Authored-by: HyukjinKwon <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>