@zhengruifeng zhengruifeng commented Oct 25, 2022

What changes were proposed in this pull request?

Make PandasMode copy keys before inserting into Map
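
Why copying matters can be sketched in plain Python. This is an analogy only, not the Scala code from the patch: it assumes the aggregate receives key objects whose underlying storage is reused from row to row, which this PR does not spell out. `ReusedKey`, `count_without_copy`, and `count_with_copy` are hypothetical names introduced for illustration.

```python
class ReusedKey:
    """Hypothetical mutable key that hashes and compares by its current value."""

    def __init__(self, value=None):
        self.value = value

    def copy(self):
        return ReusedKey(self.value)

    def __eq__(self, other):
        return isinstance(other, ReusedKey) and self.value == other.value

    def __hash__(self):
        return hash(self.value)


def count_without_copy(values):
    """Buggy variant: the shared, reused key object itself is stored in the map."""
    buffer = ReusedKey()
    counts = {}
    for v in values:
        buffer.value = v  # overwritten in place, as a reused buffer would be
        counts[buffer] = counts.get(buffer, 0) + 1
    return {k.value: c for k, c in counts.items()}


def count_with_copy(values):
    """Fixed variant: a new key is copied before its first insertion."""
    buffer = ReusedKey()
    counts = {}
    for v in values:
        buffer.value = v
        if buffer in counts:
            counts[buffer] += 1  # the stored key stays the earlier copy
        else:
            counts[buffer.copy()] = 1
    return {k.value: c for k, c in counts.items()}


values = ['3', '3', '3', '3', '4']
print(count_with_copy(values))
print(count_without_copy(values))
```

In the variant without the copy, every entry in the map refers to the same object, so all stored keys silently take on the last value seen and the counts no longer describe the input.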

Why are the changes needed?

This fixes a correctness issue similar to #38383. It is a separate PR because it is specific to the Pandas API.

In [24]: def f(index, iterator): return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']

In [25]: rdd = sc.parallelize([1, ], 4).mapPartitionsWithIndex(f)

In [26]: df = spark.createDataFrame(rdd, schema='string')

In [27]: psdf = df.pandas_api()

In [28]: psdf.mode()
Out[28]: 
  value
0     4

In [29]: psdf._to_pandas().mode()
Out[29]: 
  value
0     3
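
The pandas result can be checked by hand without Spark. The sketch below replays `f` for the four partition indices from the session above and counts the values with `collections.Counter`. The names `all_values`, `counts`, `mode_value`, and `mode_count` are introduced here for illustration only.

```python
from collections import Counter


def f(index, iterator):
    # Same function as in the session above; the iterator is ignored.
    return ['3', '3', '3', '3', '4'] if index == 3 else ['0', '1', '2', '3', '4']


# parallelize(..., 4) creates partitions 0..3, and f runs once per partition.
all_values = [v for index in range(4) for v in f(index, iter([]))]
counts = Counter(all_values)
mode_value, mode_count = counts.most_common(1)[0]
print(mode_value, mode_count)  # prints: 3 7
```

The value '3' occurs 7 times and '4' only 4 times, so the `3` returned by pandas is the correct mode and the `4` returned by `psdf.mode()` is wrong.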

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added a unit test.

@itholic itholic left a comment

No strong feelings about my nit comment; LGTM.

Comment on lines +6050 to +6054
rdd = self.spark.sparkContext.parallelize(
[
1,
],
4,
nit: can we just write this in one or two lines, something like:

rdd = self.spark.sparkContext.parallelize([1], 4).mapPartitionsWithIndex(f)

I suspect it was adjusted by the black script, though 😂

Contributor Author

It was just reformatted by the script 😅

@HyukjinKwon
Member

Merged to master.

@zhengruifeng zhengruifeng deleted the ps_mode_fix branch October 26, 2022 02:00
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…into Map

Closes apache#38385 from zhengruifeng/ps_mode_fix.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>