Skip to content

Conversation

@asmello
Copy link

@asmello asmello commented Jan 24, 2021

What changes were proposed in this pull request?

This PR implements SPARK-34214, by exposing the already existing regexp_extract_all SQL function in the PySpark API.

Why are the changes needed?

Please refer to SPARK-24884 for why this function is useful. This PR merely exposes it to the PySpark API, for added consistency and greater availability for users.

Does this PR introduce any user-facing change?

Yes, a new function is made available in the PySpark API. The associated docstring is included. Also I tweaked the description of the original regexp_extract function to highlight how its behaviour differs from that of regexp_extract_all.

How was this patch tested?

I tested it locally in a pyspark console session. I didn't find any tests for regexp_extract, so I didn't add any for the new function, but happy to do so if desired.

@AmplabJenkins
Copy link

Can one of the admins verify this patch?

r"""Extract all matches of the given group in a regex, from the specified string column.
If the regex did not match, or the specified group did not match, an empty array is returned.
.. versionadded:: 3.1.0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's target 3.2.0

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I thought this was the version where the Scala function was introduced. Changed.

@HyukjinKwon
Copy link
Member

Actually, why is regexp_extract_all added into functions.scala @cloud-fan and @beliefer? This function more looks for SQL compliance instead of language friendly APIs. Can we instead remove the Scala function version? It's also described here: https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59

@beliefer
Copy link
Contributor

@HyukjinKwon According the description, it seems good for remove the Scala function version. Although I think is regexp_extract_all is very useful, if we just reference the description.

@asmello
Copy link
Author

asmello commented Jan 25, 2021

Do you wish to remove support for regexp_extract_all completely, @HyukjinKwon? I think that would be a regression from the user's perspective, and as this function fills an obvious gap in native functionality that can only be implemented with a UDF otherwise (i.e. there's no general workaround using only native functions). I don't get why deprecating would be deemed preferable to making it more widely available, but perhaps I misunderstand.

@HyukjinKwon
Copy link
Member

No, I meant to keep regexp_extract_all as SQL expression only but remove in DSL (functions.scala). This was added in Spark 3.1.0 which isn't released yet.

@HyukjinKwon
Copy link
Member

@beliefer, can you make a PR to remove this in functions.scala? We should also use Column signature, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59.

Copy link
Member

@HyukjinKwon HyukjinKwon left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's drop this. I think we shouldn't expose this as DSL at the first place. This is more for SQL compliance, and you can use it via expr anyway.

@beliefer
Copy link
Contributor

@beliefer, can you make a PR to remove this in functions.scala? We should also use Column signature, see https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/functions.scala#L41-L59.

OK

@beliefer
Copy link
Contributor

@HyukjinKwon I created #31346

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants