@zhengruifeng
Contributor

What changes were proposed in this pull request?

Implement the numeric_only and min_count parameters in GroupBy.sum.

Why are the changes needed?

To improve API coverage: pandas' GroupBy.sum already accepts both parameters.

Does this PR introduce any user-facing change?

Yes, GroupBy.sum gains two new parameters, numeric_only and min_count:

In [2]: df = ps.DataFrame({"A": [1, 2, 1, 2], "B": [True, False, False, True], "C": [3, 4, 3, 4], "D": ["a", "a", "b", "a"]})

In [3]: df.groupby("A").sum(numeric_only=False).sort_index()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:975: PandasAPIOnSparkAdviceWarning: GroupBy.sum() can only support numeric and bool columns even ifnumeric_only=False, skip unsupported columns: ['D']
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
                                                                                
   B  C
A      
1  1  6
2  1  8

In [4]: df._to_pandas().groupby("A").sum(numeric_only=False).sort_index()
Out[4]: 
   B  C   D
A          
1  1  6  ab
2  1  8  aa

In [5]: df.groupby("D").sum(min_count=3).sort_index()
Out[5]: 
     A    B     C
D                
a  5.0  2.0  11.0
b  NaN  NaN   NaN

In [6]: df._to_pandas().groupby("D").sum(min_count=3).sort_index()
Out[6]: 
     A    B     C
D                
a  5.0  2.0  11.0
b  NaN  NaN   NaN
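
The min_count behaviour in the two sessions above can be sketched in plain Python. This is only an illustration of the semantics, not the pandas-on-Spark implementation: grouped_sum is a hypothetical helper, and it assumes that min_count counts the non-null values in each group.

```python
import math
from collections import defaultdict


def grouped_sum(rows, key, value, min_count=0):
    """Sum `value` per `key`; groups with fewer than `min_count`
    non-null values yield NaN. Hypothetical helper, for illustration only."""
    if not isinstance(min_count, int):
        raise TypeError("min_count must be integer")
    totals = defaultdict(float)
    counts = defaultdict(int)
    for row in rows:
        k = row[key]
        counts[k] += 0  # register the group even if all its values are null
        v = row[value]
        if v is not None:
            totals[k] += v
            counts[k] += 1
    return {
        k: (totals[k] if counts[k] >= min_count else math.nan)
        for k in counts
    }


# Column "A" grouped by column "D", mirroring the DataFrame used above.
rows = [
    {"A": 1, "D": "a"},
    {"A": 2, "D": "a"},
    {"A": 1, "D": "b"},
    {"A": 2, "D": "a"},
]
result = grouped_sum(rows, key="D", value="A", min_count=3)
```

Group "a" has three values and keeps its sum, while group "b" has only one value and becomes NaN, which matches column A in the output above.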


How was this patch tested?

Added unit tests.

    raise TypeError("min_count must be integer")

if numeric_only is not None and not numeric_only:
    unsupported = [
Contributor Author

For a non-numeric column (for example, a str column), the result of sum depends on the order of the rows, so it is not easy to support for now.
For now, warn users that such columns will be skipped:
PandasAPIOnSparkAdviceWarning: GroupBy.sum() can only support numeric and bool columns even ifnumeric_only=False, skip unsupported columns: ['D']
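
The order sensitivity mentioned above can be seen in a small plain-Python sketch. It assumes that each permutation stands for the same rows arriving in a different order, as can happen when data is spread across partitions. Nothing here is Spark API, and the values are made up for illustration.

```python
import operator
from functools import reduce
from itertools import permutations

numbers = [3, 4, 3]
strings = ["a", "b", "a"]

# Integer addition is commutative: every arrival order gives the same sum.
numeric_results = {reduce(operator.add, p) for p in permutations(numbers)}

# A string "sum" is concatenation: the result depends on the arrival order.
string_results = {reduce(operator.add, p) for p in permutations(strings)}
```

Here numeric_results contains a single value, while string_results contains several different strings, which is why matching the pandas result for str columns would require a guaranteed row order.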

@HyukjinKwon
Member

Merged to master.

@zhengruifeng
Contributor Author

@HyukjinKwon thank you for reviewing!

@zhengruifeng deleted the ps_groupby_sum_numonly_mc branch on October 2, 2022 at 07:46