[SPARK-40399][PS] Make pearson correlation in DataFrame.corr support missing values and min_periods
#37845
Conversation
```python
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)

def test_corr(self):
```
The existing test mixes df.corr and ser.corr, which makes it hard to reuse, so I added a new one.
I will delete it after both df.corr and ser.corr are refactored.
Can we note this at the top of test_dataframe_corr so as not to forget?
sure
Hm, does another library or method in Spark do this? It feels weird to have a method that computes "mostly a correlation", ignoring data.
@srowen The existing implementation calls the correlation in MLlib (via `pyspark.pandas.ml` and `VectorUDT`). But pandas-API-on-Spark is supposed to follow pandas' behavior.
also cc @HyukjinKwon @itholic @xinrong-meng
Where data is missing, the correlation is really just undefined. Why not return NaN? This is mixing in some arbitrary logic to skip some data; I don't see the point of that. If the caller wants to do that, the caller can just do that first.
But pandas-API-on-Spark should follow the behavior of pandas, which ignores the missing values and computes the correlation based on the remaining data.
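For illustration, a minimal sketch of that pandas behavior (the toy data is made up, not from the PR):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"dogs": [1.0, np.nan, 3.0, 4.0], "cats": [2.0, 5.0, np.nan, 8.0]})
# for each column pair, pandas drops the rows where either value is NaN
# and computes the pearson correlation on the remaining rows
print(pdf.corr(method="pearson"))
```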
OK if it matches pandas behavior. Do we need this min_periods option; is that from pandas?
Yes, pandas' `DataFrame.corr` supports a `min_periods` parameter.
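A small hedged illustration of `min_periods` in pandas (toy data again):

```python
import numpy as np
import pandas as pd

pdf = pd.DataFrame({"dogs": [1.0, np.nan, 3.0, 4.0], "cats": [2.0, 5.0, np.nan, 8.0]})
# only 2 rows have both dogs and cats present; requiring at least 3
# complete observations turns the off-diagonal entries into NaN
print(pdf.corr(min_periods=3))
```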
itholic
left a comment
Looks pretty good otherwise
```python
if min_periods is not None and not isinstance(min_periods, int):
    raise TypeError(f"Invalid min_periods type {type(min_periods).__name__}")
```
Seems like pandas allows float:

```python
>>> pdf.corr('pearson', min_periods=1.4)
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000
```

But I'm not sure if it's intended behavior or not, since it raises `TypeError: an integer is required` when the type is str, as below:

```python
>>> pdf.corr('pearson', min_periods='a')
Traceback (most recent call last):
...
TypeError: an integer is required
```
I am also not sure, but the type of min_periods is expected to be int in pandas as well.
I think pdf.corr('pearson', min_periods=1.4) only works in pandas because the validation is missing there.
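With the validation quoted above, a non-int `min_periods` should now fail fast in pandas-API-on-Spark; a hedged illustration (toy data, assuming this change is applied):

```python
import pyspark.pandas as ps

psdf = ps.DataFrame({"dogs": [1.0, 2.0], "cats": [3.0, 4.0]})
# with the isinstance check above, a float min_periods raises immediately:
psdf.corr("pearson", min_periods=1.4)  # TypeError: Invalid min_periods type float
```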
python/pyspark/pandas/frame.py
Outdated
```python
tmp_index_1_col = verify_temp_column_name(sdf, "__tmp_index_1_col__")
tmp_index_2_col = verify_temp_column_name(sdf, "__tmp_index_2_col__")
tmp_value_1_col = verify_temp_column_name(sdf, "__tmp_value_1_col__")
tmp_value_2_col = verify_temp_column_name(sdf, "__tmp_value_2_col__")
```
nit: how about tmp_index_x_col_name instead of tmp_index_x_col, to explicitly indicate it's the name of the column rather than the column itself?
ok, let me update them
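For context, `verify_temp_column_name` (from `pyspark.pandas.utils`) checks that a proposed temporary column name does not collide with the DataFrame's existing columns before returning it, if I read the utility correctly. A minimal sketch of the rename the reviewer suggests, with `sdf` standing in for the Spark DataFrame already in scope:

```python
# sketch of the suggested *_col_name naming (sdf is assumed to be the
# internal Spark DataFrame used inside DataFrame.corr):
from pyspark.pandas.utils import verify_temp_column_name

tmp_index_1_col_name = verify_temp_column_name(sdf, "__tmp_index_1_col__")
tmp_index_2_col_name = verify_temp_column_name(sdf, "__tmp_index_2_col__")
tmp_value_1_col_name = verify_temp_column_name(sdf, "__tmp_value_1_col__")
tmp_value_2_col_name = verify_temp_column_name(sdf, "__tmp_value_2_col__")
```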
```python
# +---+---+----+

pair_scols: List[Column] = []
for i in range(0, num_scols):
```
nit: we can omit the 0 since it's the default?

```diff
- for i in range(0, num_scols):
+ for i in range(num_scols):
```

Either looks okay, though.
Just because other places use range(0, x) instead of range(x).
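The `pair_scols` loop above hints at the overall approach: build one struct per column pair, explode, and aggregate with plain Spark SQL functions instead of `VectorUDT`. A rough, self-contained sketch of that idea, not the PR's exact code (the field names `x`/`y` and the toy data are made up):

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()
sdf = spark.createDataFrame(
    [(1.0, 2.0), (None, 5.0), (3.0, None), (4.0, 8.0)], ["dogs", "cats"]
)

cols = sdf.columns
pair_scols = []
for i in range(len(cols)):
    for j in range(i, len(cols)):
        # one struct per (column-pair, row): pair indices plus both values
        pair_scols.append(
            F.struct(
                F.lit(i).alias("i"),
                F.lit(j).alias("j"),
                F.col(cols[i]).alias("x"),
                F.col(cols[j]).alias("y"),
            )
        )

# explode to one row per (pair, row); dropping rows where either side is
# null mirrors pandas' pairwise NaN handling
exploded = sdf.select(F.explode(F.array(*pair_scols)).alias("p")).select("p.*")
stats = (
    exploded.where(F.col("x").isNotNull() & F.col("y").isNotNull())
    .groupBy("i", "j")
    .agg(F.corr("x", "y").alias("corr"), F.count(F.lit(1)).alias("count"))
)
# the count column is what a min_periods threshold can be checked against
stats.show()
```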
python/pyspark/pandas/frame.py
Outdated
```python
# | 1| 2| 1.0| null|
# | 2| 2| null| null|
# +-------------------+-------------------+-------------------+-------------------+
tmp_tuple_col = verify_temp_column_name(sdf, "__tmp_tuple_col__")
```
ditto? I think tmp_tuple_col_name is a bit more explicit.
python/pyspark/pandas/frame.py
Outdated
```python
tmp_corr_col = verify_temp_column_name(sdf, "__tmp_pearson_corr_col__")
tmp_count_col = verify_temp_column_name(sdf, "__tmp_count_col__")
```
ditto
python/pyspark/pandas/frame.py
Outdated
```python
# | 1|[{0, -1.0}, {1, 1...|
# | 2|[{0, null}, {1, n...|
# +-------------------+--------------------+
tmp_array_col = verify_temp_column_name(sdf, "__tmp_array_col__")
```
ditto
```python
self.assert_eq(psdf.corr(), pdf.corr(), check_exact=False)
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)
```
Can we also test for chained operations?

e.g.

```python
self.assert_eq((psdf + 1).corr(), (pdf + 1).corr(), check_exact=False)
```
nice, let me add it
qq: btw, can we say it's a "new API" in the PR description? Maybe "new parameter support" or something like that instead?
| """ | ||
| Compute pairwise correlation of columns, excluding NA/null values. | ||
| .. versionadded:: 3.3.0 |
Nice
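A hedged sketch of how the documented signature might look with the new parameter (wording is illustrative, not necessarily the PR's exact text):

```python
from typing import Optional

def corr(self, method: str = "pearson", min_periods: Optional[int] = None) -> "DataFrame":
    """
    Compute pairwise correlation of columns, excluding NA/null values.

    .. versionadded:: 3.3.0

    Parameters
    ----------
    method : {'pearson', 'spearman'}
        Correlation method.
    min_periods : int, optional
        Minimum number of observations required per pair of columns
        to have a valid result.
    """
```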
itholic
left a comment
Looks good!
Merged into master, thanks @srowen @itholic @HyukjinKwon for reviews!
### What changes were proposed in this pull request?
Remove `pyspark.pandas.ml`.

### Why are the changes needed?
`pyspark.pandas.ml` is no longer needed, since we implemented correlations based on Spark SQL:
1. pearson correlation implemented in #37845
2. spearman correlation implemented in #37874

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
Updated suites.

Closes #37968 from zhengruifeng/ps_del_ml.
Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request?
Refactor `pearson` correlation in `DataFrame.corr` to:
- support missing values;
- support `min_periods`;
- avoid `VectorUDT`.

before / after: (example outputs not captured)

### Why are the changes needed?
For API coverage, and to support common cases containing missing values.

### Does this PR introduce _any_ user-facing change?
Yes, API change: a new parameter is supported.

### How was this patch tested?
Added UT.
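Putting it together, a hedged end-to-end usage sketch of the new behavior (toy data; assumes a Spark build that contains this change):

```python
import numpy as np
import pandas as pd
import pyspark.pandas as ps

pdf = pd.DataFrame({"dogs": [1.0, np.nan, 3.0, 4.0], "cats": [2.0, 5.0, np.nan, 8.0]})
psdf = ps.from_pandas(pdf)

# missing values are now skipped pairwise, matching pandas
print(psdf.corr())
print(pdf.corr())

# min_periods masks pairs with too few complete observations
print(psdf.corr(min_periods=3))
print(pdf.corr(min_periods=3))
```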