Conversation

@zero323 (Member) commented Jul 18, 2019

What changes were proposed in this pull request?

This adds a simple check for the count argument (see the sketch after this list):

  • If it is a Column, we apply _to_java_column before invoking the JVM counterpart.
  • Otherwise we proceed as before.
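
A minimal sketch of the resulting wrapper, assuming the usual pyspark.sql.functions boilerplate (an illustration of the check, not necessarily the exact committed diff):

from pyspark import SparkContext
from pyspark.sql.column import Column, _to_java_column

def array_repeat(col, count):
    """Collection function: creates an array containing col repeated count times."""
    sc = SparkContext._active_spark_context
    return Column(sc._jvm.functions.array_repeat(
        _to_java_column(col),
        # New check: a Column is converted to its JVM counterpart;
        # a plain Python int is passed through as before.
        _to_java_column(count) if isinstance(count, Column) else count
    ))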

How was this patch tested?

Manual testing.

@SparkQA commented Jul 18, 2019

Test build #107848 has finished for PR 25193 at commit 4e28f29.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon (Member) commented:

@zero323, although it's pretty straightforward, let's file a JIRA and add a simple test. (BTW, I am happy to see you back in the Spark community :D)

@zero323 zero323 changed the title Add support for count: Column in array_repeat [SPARK-28439][PYTHON][SQL] Add support for count: Column in array_repeat Jul 18, 2019
@zero323 (Member, Author) commented Jul 18, 2019

although it's pretty straightforward, let's file a JIRA

Sorry, my bad. I don't know why I omitted the JIRA info. Fixed.

and add a simple test.

I'll try to add one later today.
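
For reference, such a test might look like this (a sketch assuming a ReusedSQLTestCase-style suite with self.spark available; the test actually committed may differ):

from pyspark.sql.functions import array_repeat, lit

def test_array_repeat(self):
    df = self.spark.range(1)
    # A Column count (lit(3)) should behave exactly like a plain int count.
    self.assertEqual(
        df.select(array_repeat("id", 3)).first(),
        df.select(array_repeat("id", lit(3))).first(),
    )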

@SparkQA commented Jul 18, 2019

Test build #107857 has finished for PR 25193 at commit 6fedb9d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dongjoon-hyun (Member) left a comment:

+1, LGTM.

This will work for an int-typed column.

>>> sql("CREATE TABLE t(a STRING, b int)")
>>> sql("INSERT INTO t VALUES('ab', 4)")
>>> sql("SELECT a, b FROM t").show()
+---+---+
|  a|  b|
+---+---+
| ab|  4|
+---+---+

>>> sql("SELECT array_repeat(a, b) FROM t").show()
+------------------+
|array_repeat(a, b)|
+------------------+
|  [ab, ab, ab, ab]|
+------------------+

Please note that it will fail if we create the DataFrame like the following, due to the type-inference difference between Python and Scala (PySpark infers a Python int as long, while Scala's Int maps to integer).

>>> df = spark.createDataFrame([('ab',1)], ['data','c'])
>>> df.printSchema()
root
 |-- data: string (nullable = true)
 |-- c: long (nullable = true)

scala> val df = Seq(("ab", 3)).toDF("data", "c")
df: org.apache.spark.sql.DataFrame = [data: string, c: int]
scala> df.printSchema()
root
 |-- data: string (nullable = true)
 |-- c: integer (nullable = false)

@dongjoon-hyun (Member) commented:

Merged to master. Thank you, @zero323 and @HyukjinKwon.

@nucflash commented:

If the column passed as count is not int, array_repeat will choke. In the line

_to_java_column(count) if isinstance(count, Column) else count

we may need to cast explicitly to int:

_to_java_column(count.cast('int')) if isinstance(count, Column) else count

@zero323 (Member, Author) commented Apr 17, 2020

If the column passed as count is not int, array_repeat will choke. In the line

That seems like expected behavior here, but if it were to be modified, it should be done in the underlying expression, so that the behavior stays consistent across APIs (including SQL).

@nucflash commented Apr 23, 2020

I see your point, @zero323. I ran into this issue when I passed a column that was the result of a count aggregation and was therefore typed as bigint instead of int. Explicitly casting columns returned from native functions feels a bit awkward, as it hints that the internals are not compatible with each other. But the issue may not lie in array_repeat but in count(); I don't know which is best for it to return, int or bigint.
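
For illustration, a sketch of the scenario described (df stands for any DataFrame with a data column; the names are made up):

>>> from pyspark.sql.functions import array_repeat, col, count
>>> agg = df.groupBy('data').agg(count('data').alias('n'))           # n is typed bigint
>>> agg.select(array_repeat('data', col('n'))).show()                # fails: count must be int
>>> agg.select(array_repeat('data', col('n').cast('int'))).show()    # works after an explicit cast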

@zero323 (Member, Author) commented Apr 23, 2020

@nucflash In general, count should return long, as the size of the result can potentially exceed integer range.

On the other hand, the result of array_repeat cannot, in any practical way, exceed the size of an integer, so allowing long would be misleading at best, and so would automatic downcasting.
