Conversation

@zero323 (Member) commented Jul 18, 2019

What changes were proposed in this pull request?

This adds a simple check for the count argument (see the sketch after this list):

  • If it is a Column, we apply _to_java_column before invoking the JVM counterpart.
  • Otherwise we proceed as before.
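
A minimal sketch of the resulting wrapper, assuming the usual pyspark.sql.functions boilerplate (an illustration of the check, not necessarily the exact committed diff):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def array_repeat(col, count):
    """Collection function: creates an array containing col repeated count times."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.array_repeat(
        _to_java_column(col),
        # New check: a Column is converted to its JVM counterpart;
        # a plain Python int is passed through as before.
        _to_java_column(count) if isinstance(count, Column) else count
    ))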

How was this patch tested?

Manual testing.

@SparkQA commented Jul 18, 2019

Test build #107848 has finished for PR 25193 at commit 4e28f29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

@zero323, although it's pretty straightforward, let's file a JIRA and add a simple test. (BTW, I am happy to see you back in the Spark community :D)

@zero323 zero323 changed the title Add support for count: Column in array_repeat [SPARK-28439][PYTHON][SQL] Add support for count: Column in array_repeat Jul 18, 2019
@zero323 (Member, Author) commented Jul 18, 2019

although it's pretty straightforward, let's file a JIRA

Sorry, my bad. I don't know why I omitted the JIRA info. Fixed.

and add a simple test.

I'll try to add one later today.
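
For reference, such a test might look like this (a sketch assuming a ReusedSQLTestCase-style suite with self.spark available; the test actually committed may differ):

from pyspark.sql.functions import array_repeat, lit

def test_array_repeat(self):
    df = self.spark.range(1)
    # A Column count (lit(3)) should behave exactly like a plain int count.
    self.assertEqual(
        df.select(array_repeat("id", 3)).first(),
        df.select(array_repeat("id", lit(3))).first(),
    )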

@SparkQA commented Jul 18, 2019

Test build #107857 has finished for PR 25193 at commit 6fedb9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM.

This will work for an int-typed column.

>>> sql("CREATE TABLE t(a STRING, b int)")
>>> sql("INSERT INTO t VALUES('ab', 4)")
>>> sql("SELECT a, b FROM t").show()
+---+---+
|  a|  b|
+---+---+
| ab|  4|
+---+---+

>>> sql("SELECT array_repeat(a, b) FROM t").show()
+------------------+
|array_repeat(a, b)|
+------------------+
|  [ab, ab, ab, ab]|
+------------------+

Please note that it will fail if we create the DataFrame like the following, due to the type-inference difference between Python and Scala (PySpark infers a Python int as long, while Scala's Int maps to integer).

>>> df = spark.createDataFrame([('ab',1)], ['data','c'])
>>> df.printSchema()
root
 |-- data: string (nullable = true)
 |-- c: long (nullable = true)

scala> val df = Seq(("ab", 3)).toDF("data", "c")
df: org.apache.spark.sql.DataFrame = [data: string, c: int]
scala> df.printSchema()
root
 |-- data: string (nullable = true)
 |-- c: integer (nullable = false)

@dongjoon-hyun (Member) commented:

Merged to master. Thank you, @zero323 and @HyukjinKwon.

@nucflash commented:

If the column passed as count is not int, array_repeat will choke. In the line

_to_java_column(count) if isinstance(count, Column) else count

we may need to cast explicitly to int:

_to_java_column(count.cast('int')) if isinstance(count, Column) else count

@zero323 (Member, Author) commented Apr 17, 2020

If the column passed as count is not int, array_repeat will choke. In the line

That seems like expected behavior here, but if it were to be modified, it should be done in the underlying expression, so that the behavior stays consistent across APIs (including SQL).

@nucflash commented Apr 23, 2020

I see your point, @zero323. I ran into this issue when I passed a column that was the result of a count aggregation and was therefore typed as bigint instead of int. Explicitly casting columns returned from native functions feels a bit awkward, as it hints that the internals are not compatible with each other. But the issue may not lie in array_repeat but in count(); I don't know which is best for it to return, int or bigint.
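
For illustration, a sketch of the scenario described (df stands for any DataFrame with a data column; the names are made up):

>>> from pyspark.sql.functions import array_repeat, col, count
>>> agg = df.groupBy('data').agg(count('data').alias('n'))           # n is typed bigint
>>> agg.select(array_repeat('data', col('n'))).show()                # fails: count must be int
>>> agg.select(array_repeat('data', col('n').cast('int'))).show()    # works after an explicit cast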

@zero323 (Member, Author) commented Apr 23, 2020

@nucflash In general, count should return long, as the size of the result can potentially exceed integer range.

On the other hand, the result of array_repeat cannot, in any practical way, exceed the size of an integer, so allowing long would be misleading at best, and so would automatic downcasting.
