Conversation

@zhengruifeng (Contributor) commented Sep 9, 2022

What changes were proposed in this pull request?

Refactor pearson correlation in DataFrame.corr to:

  1. support missing values;
  2. add the parameter min_periods;
  3. enable Arrow execution, since it no longer depends on VectorUDT;
  4. support lazy evaluation.

before

In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df
                                                                                
   0    1
0  1  2.0
1  3  NaN

In [4]: df.corr()
22/09/09 16:53:18 ERROR Executor: Exception in task 9.0 in stage 5.0 (TID 24)
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (VectorAssembler$$Lambda$2660/0x0000000801215840: (struct<0_double_VectorAssembler_0915f96ec689:double,1:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)

after

In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df.corr()
                                                                                
     0   1
0  1.0 NaN
1  NaN NaN

In [4]: df.to_pandas().corr()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[4]: 
     0   1
0  1.0 NaN
1  NaN NaN
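
As an illustration of the new min_periods parameter (a hypothetical continuation of the session above, assuming pandas' pairwise-complete counting, under which even the diagonal turns NaN once a pair has fewer complete rows than min_periods):

In [5]: df.corr(min_periods=3)
    0   1
0 NaN NaN
1 NaN NaN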

Why are the changes needed?

For API coverage, and to support the common case of data containing missing values.

Does this PR introduce any user-facing change?

Yes, an API change: the new parameter min_periods is supported.

How was this patch tested?

Added unit tests.

support multi index

refine tests

refine tests

simplify

simplify

simplify
@zhengruifeng zhengruifeng changed the title [SPARK-40399][PS] DataFrame.corr Pearson support missing values and min_periods [SPARK-40399][PS] Make Pearson correlation in DataFrame.corr support missing values and min_periods Sep 9, 2022
@zhengruifeng zhengruifeng changed the title [SPARK-40399][PS] Make Pearson correlation in DataFrame.corr support missing values and min_periods [SPARK-40399][PS] Make pearson correlation in DataFrame.corr support missing values and min_periods Sep 9, 2022
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)

def test_corr(self):
@zhengruifeng (Contributor, Author):

The existing test mixes df.corr and ser.corr and is not easy to reuse, so I added a new one.
I will delete it after both df.corr and ser.corr are refactored.

Contributor:

Can we add this as a comment at the top of test_dataframe_corr so as not to forget?

@zhengruifeng (Contributor, Author):

sure

@srowen (Member) commented Sep 9, 2022:

Hm, does another library or method in Spark do this? It feels weird to have a method that computes "mostly a correlation" ignoring data

@zhengruifeng (Contributor, Author):

@srowen The existing implementation calls Correlation.corr on the ML side; it accepts a vector column and can also handle NaN.

But Pandas-API-on-Spark internally uses null to represent missing values, which causes an error in Correlation.corr. Moreover, in order to support the new parameter, lazy evaluation, and new scenarios (groupBy/expanding/rolling/corrwith in the future), I think we need a new implementation of correlation.

@zhengruifeng (Contributor, Author):

also cc @HyukjinKwon @itholic @xinrong-meng

@srowen (Member) commented Sep 12, 2022:

Where data is missing, the correlation is really just undefined. Why not return NaN? This is mixing in some arbitrary logic to skip some data; I don't see the point of that. If the caller wants to do that, the caller can just do that first.

@zhengruifeng (Contributor, Author):

> Where data is missing, the correlation is really just undefined. Why not return NaN?

Correlation.corr behaves like this: when a column contains NaN, its correlations with other columns are NaN.

But Pandas-API-on-Spark should follow the behavior of pandas, which ignores the missing values and computes the correlation based on the remaining data.
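
For illustration only, pandas' pairwise-complete semantics roughly correspond to this sketch in plain pandas/numpy (pairwise_pearson is a hypothetical helper, not the Spark SQL implementation in this PR):

import numpy as np
import pandas as pd

def pairwise_pearson(df: pd.DataFrame, min_periods: int = 1) -> pd.DataFrame:
    """Pearson correlation over pairwise-complete rows, NaN below min_periods."""
    cols = df.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            x, y = df[a], df[b]
            mask = x.notna() & y.notna()   # rows complete for this pair
            n = int(mask.sum())
            if n >= max(min_periods, 2):   # need at least 2 points for corr
                r = np.corrcoef(x[mask], y[mask])[0, 1]
            else:
                r = np.nan
            out.loc[a, b] = out.loc[b, a] = r
    return out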

@srowen (Member) commented Sep 12, 2022:

Ok if it matches pandas behavior. Do we need this min_periods option, is that from pandas?

@zhengruifeng (Contributor, Author):

> Do we need this min_periods option, is that from pandas?

Yes, pandas' DataFrame.corr has this option.
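
For reference, an illustrative pandas session (column 1 has only two non-null rows, so every pair involving it, including its own diagonal, falls below min_periods=3):

>>> import pandas as pd
>>> pdf = pd.DataFrame([[1, 2], [3, None], [5, 6]])
>>> pdf.corr(min_periods=3)
     0   1
0  1.0 NaN
1  NaN NaN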

@itholic (Contributor) left a comment:

Looks pretty good otherwise

Comment on lines +1472 to +1473
if min_periods is not None and not isinstance(min_periods, int):
raise TypeError(f"Invalid min_periods type {type(min_periods).__name__}")
Contributor:

Seems like pandas allows float:

>>> pdf.corr('pearson', min_periods=1.4)
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000

But I'm not sure whether it's intended behavior, since pandas raises TypeError: an integer is required when the type is str, as below:

>>> pdf.corr('pearson', min_periods='a')
Traceback (most recent call last):
...
TypeError: an integer is required
>>>

@zhengruifeng (Contributor, Author):

I am also not sure, but the type of min_periods is expected to be int in pandas as well.
I think pdf.corr('pearson', min_periods=1.4) only works in pandas because a validation is missing there.
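
With the check above in place, a float would be rejected up front (a hypothetical session; psdf is assumed to be a pandas-on-Spark DataFrame):

>>> psdf.corr("pearson", min_periods=1.4)
Traceback (most recent call last):
...
TypeError: Invalid min_periods type float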

Comment on lines 1492 to 1495
tmp_index_1_col = verify_temp_column_name(sdf, "__tmp_index_1_col__")
tmp_index_2_col = verify_temp_column_name(sdf, "__tmp_index_2_col__")
tmp_value_1_col = verify_temp_column_name(sdf, "__tmp_value_1_col__")
tmp_value_2_col = verify_temp_column_name(sdf, "__tmp_value_2_col__")
Contributor:

nit: how about tmp_index_x_col_name instead of tmp_index_x_col, to explicitly indicate that it's the name of the column rather than the column itself?

@zhengruifeng (Contributor, Author):

ok, let me update them

# +---+---+----+

pair_scols: List[Column] = []
for i in range(0, num_scols):
Contributor:

nit: we can omit the 0 since it's the default?

-  for i in range(0, num_scols):
+  for i in range(num_scols):

Either looks okay, though.

@zhengruifeng (Contributor, Author):

just because other places use range(0, x) instead of range(x)

# | 1| 2| 1.0| null|
# | 2| 2| null| null|
# +-------------------+-------------------+-------------------+-------------------+
tmp_tuple_col = verify_temp_column_name(sdf, "__tmp_tuple_col__")
Contributor:

ditto? I think tmp_tuple_col_name is a bit more explicit.

Comment on lines 1551 to 1552
tmp_corr_col = verify_temp_column_name(sdf, "__tmp_pearson_corr_col__")
tmp_count_col = verify_temp_column_name(sdf, "__tmp_count_col__")
Contributor:

ditto

# | 1|[{0, -1.0}, {1, 1...|
# | 2|[{0, null}, {1, n...|
# +-------------------+--------------------+
tmp_array_col = verify_temp_column_name(sdf, "__tmp_array_col__")
Contributor:

ditto


self.assert_eq(psdf.corr(), pdf.corr(), check_exact=False)
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)
Contributor:

Can we also test for chained operations?
e.g.

self.assert_eq((psdf + 1).corr(), (pdf + 1).corr(), check_exact=False)

@zhengruifeng (Contributor, Author):

nice, let me add it

@itholic (Contributor) commented Sep 13, 2022:

qq: btw, can we say it's a "new API" in the PR description?

Maybe "new parameter support" or something like that instead?

"""
Compute pairwise correlation of columns, excluding NA/null values.
.. versionadded:: 3.3.0
Member:
Nice

@itholic (Contributor) left a comment:

Looks good!

@zhengruifeng (Contributor, Author):

Merged into master, thanks @srowen @itholic @HyukjinKwon for reviews!

@zhengruifeng zhengruifeng deleted the ps_df_corr_missing_value branch September 13, 2022 06:46
HyukjinKwon pushed a commit that referenced this pull request Sep 22, 2022
### What changes were proposed in this pull request?
Remove `pyspark.pandas.ml`

### Why are the changes needed?
`pyspark.pandas.ml` is no longer needed, since we implemented correlations based on Spark SQL:

1. pearson correlation implemented in #37845
2. spearman correlation implemented in #37874

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated suites

Closes #37968 from zhengruifeng/ps_del_ml.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>