Conversation

@zhengruifeng (Contributor) commented Sep 9, 2022

What changes were proposed in this pull request?

Refactor pearson correlation in DataFrame.corr to:

  1. support missing values;
  2. add the parameter min_periods;
  3. enable Arrow execution, since it no longer depends on VectorUDT;
  4. support lazy evaluation.

before

In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df
                                                                                
   0    1
0  1  2.0
1  3  NaN

In [4]: df.corr()
22/09/09 16:53:18 ERROR Executor: Exception in task 9.0 in stage 5.0 (TID 24)
org.apache.spark.SparkException: [FAILED_EXECUTE_UDF] Failed to execute user defined function (VectorAssembler$$Lambda$2660/0x0000000801215840: (struct<0_double_VectorAssembler_0915f96ec689:double,1:double>) => struct<type:tinyint,size:int,indices:array<int>,values:array<double>>)

after

In [1]: import pyspark.pandas as ps

In [2]: df = ps.DataFrame([[1,2], [3,None]])

In [3]: df.corr()
                                                                                
     0   1
0  1.0 NaN
1  NaN NaN

In [4]: df.to_pandas().corr()
/Users/ruifeng.zheng/Dev/spark/python/pyspark/pandas/utils.py:976: PandasAPIOnSparkAdviceWarning: `to_pandas` loads all data into the driver's memory. It should only be used if the resulting pandas DataFrame is expected to be small.
  warnings.warn(message, PandasAPIOnSparkAdviceWarning)
Out[4]: 
     0   1
0  1.0 NaN
1  NaN NaN
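
As an illustration of the new min_periods parameter (a hypothetical continuation of the session above, assuming pandas' pairwise-complete counting, under which even the diagonal turns NaN once a pair has fewer complete rows than min_periods):

In [5]: df.corr(min_periods=3)
    0   1
0 NaN NaN
1 NaN NaN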

Why are the changes needed?

For API coverage, and to support the common case of data containing missing values.

Does this PR introduce any user-facing change?

Yes, an API change: the new parameter min_periods is supported.

How was this patch tested?

Added unit tests.

support multi index

refine tests

refine tests

simplify

simplify

simplify
@zhengruifeng zhengruifeng changed the title [SPARK-40399][PS] DataFrame.corr Pearson support missing values and min_periods [SPARK-40399][PS] Make Pearson correlation in DataFrame.corr support missing values and min_periods Sep 9, 2022
@zhengruifeng zhengruifeng changed the title [SPARK-40399][PS] Make Pearson correlation in DataFrame.corr support missing values and min_periods [SPARK-40399][PS] Make pearson correlation in DataFrame.corr support missing values and min_periods Sep 9, 2022
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)

def test_corr(self):
@zhengruifeng (Contributor, Author):

The existing test mixes df.corr and ser.corr and is not easy to reuse, so I added a new one.
I will delete it after both df.corr and ser.corr are refactored.

Contributor:

Can we add this as a comment at the top of test_dataframe_corr so as not to forget?

@zhengruifeng (Contributor, Author):

sure

@srowen (Member) commented Sep 9, 2022:

Hm, does another library or method in Spark do this? It feels weird to have a method that computes "mostly a correlation" ignoring data

@zhengruifeng (Contributor, Author):

@srowen The existing implementation calls Correlation.corr on the ML side; it accepts a vector column and can also handle NaN.

But Pandas-API-on-Spark internally uses null to represent missing values, which causes an error in Correlation.corr. Moreover, in order to support the new parameter, lazy evaluation, and new scenarios (groupBy/expanding/rolling/corrwith in the future), I think we need a new implementation of correlation.

@zhengruifeng (Contributor, Author):

also cc @HyukjinKwon @itholic @xinrong-meng

@srowen (Member) commented Sep 12, 2022:

Where data is missing, the correlation is really just undefined. Why not return NaN? This is mixing in some arbitrary logic to skip some data; I don't see the point of that. If the caller wants to do that, the caller can just do that first.

@zhengruifeng (Contributor, Author):

> Where data is missing, the correlation is really just undefined. Why not return NaN?

Correlation.corr behaves like this: when a column contains NaN, its correlations with other columns are NaN.

But Pandas-API-on-Spark should follow the behavior of pandas, which ignores the missing values and computes the correlation based on the remaining data.
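
For illustration only, pandas' pairwise-complete semantics roughly correspond to this sketch in plain pandas/numpy (pairwise_pearson is a hypothetical helper, not the Spark SQL implementation in this PR):

import numpy as np
import pandas as pd

def pairwise_pearson(df: pd.DataFrame, min_periods: int = 1) -> pd.DataFrame:
    """Pearson correlation over pairwise-complete rows, NaN below min_periods."""
    cols = df.columns
    out = pd.DataFrame(np.nan, index=cols, columns=cols)
    for i, a in enumerate(cols):
        for b in cols[i:]:
            x, y = df[a], df[b]
            mask = x.notna() & y.notna()   # rows complete for this pair
            n = int(mask.sum())
            if n >= max(min_periods, 2):   # need at least 2 points for corr
                r = np.corrcoef(x[mask], y[mask])[0, 1]
            else:
                r = np.nan
            out.loc[a, b] = out.loc[b, a] = r
    return out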

@srowen (Member) commented Sep 12, 2022:

Ok if it matches pandas behavior. Do we need this min_periods option, is that from pandas?

@zhengruifeng (Contributor, Author):

> Do we need this min_periods option, is that from pandas?

Yes, pandas' DataFrame.corr has this option.
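
For reference, an illustrative pandas session (column 1 has only two non-null rows, so every pair involving it, including its own diagonal, falls below min_periods=3):

>>> import pandas as pd
>>> pdf = pd.DataFrame([[1, 2], [3, None], [5, 6]])
>>> pdf.corr(min_periods=3)
     0   1
0  1.0 NaN
1  NaN NaN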

@itholic (Contributor) left a comment:

Looks pretty good otherwise

Comment on lines +1472 to +1473
if min_periods is not None and not isinstance(min_periods, int):
raise TypeError(f"Invalid min_periods type {type(min_periods).__name__}")
Contributor:

Seems like pandas allows float:

>>> pdf.corr('pearson', min_periods=1.4)
          dogs      cats
dogs  1.000000 -0.851064
cats -0.851064  1.000000

But I'm not sure whether it's intended behavior, since pandas raises TypeError: an integer is required when the type is str, as below:

>>> pdf.corr('pearson', min_periods='a')
Traceback (most recent call last):
...
TypeError: an integer is required
>>>

@zhengruifeng (Contributor, Author):

I am also not sure, but the type of min_periods is expected to be int in pandas as well.
I think pdf.corr('pearson', min_periods=1.4) only works in pandas because a validation is missing there.
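
With the check above in place, a float would be rejected up front (a hypothetical session; psdf is assumed to be a pandas-on-Spark DataFrame):

>>> psdf.corr("pearson", min_periods=1.4)
Traceback (most recent call last):
...
TypeError: Invalid min_periods type float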

Comment on lines 1492 to 1495
tmp_index_1_col = verify_temp_column_name(sdf, "__tmp_index_1_col__")
tmp_index_2_col = verify_temp_column_name(sdf, "__tmp_index_2_col__")
tmp_value_1_col = verify_temp_column_name(sdf, "__tmp_value_1_col__")
tmp_value_2_col = verify_temp_column_name(sdf, "__tmp_value_2_col__")
Contributor:

nit: how about tmp_index_x_col_name instead of tmp_index_x_col, to explicitly indicate that it's the name of the column rather than the column itself?

@zhengruifeng (Contributor, Author):

ok, let me update them

# +---+---+----+

pair_scols: List[Column] = []
for i in range(0, num_scols):
Contributor:

nit: we can omit the 0 since it's the default?

-  for i in range(0, num_scols):
+  for i in range(num_scols):

Either looks okay, though.

@zhengruifeng (Contributor, Author):

just because other places use range(0, x) instead of range(x)

# | 1| 2| 1.0| null|
# | 2| 2| null| null|
# +-------------------+-------------------+-------------------+-------------------+
tmp_tuple_col = verify_temp_column_name(sdf, "__tmp_tuple_col__")
Contributor:

ditto? I think tmp_tuple_col_name is a bit more explicit.

Comment on lines 1551 to 1552
tmp_corr_col = verify_temp_column_name(sdf, "__tmp_pearson_corr_col__")
tmp_count_col = verify_temp_column_name(sdf, "__tmp_count_col__")
Contributor:

ditto

# | 1|[{0, -1.0}, {1, 1...|
# | 2|[{0, null}, {1, n...|
# +-------------------+--------------------+
tmp_array_col = verify_temp_column_name(sdf, "__tmp_array_col__")
Contributor:

ditto


self.assert_eq(psdf.corr(), pdf.corr(), check_exact=False)
self.assert_eq(psdf.corr(min_periods=1), pdf.corr(min_periods=1), check_exact=False)
self.assert_eq(psdf.corr(min_periods=3), pdf.corr(min_periods=3), check_exact=False)
Contributor:

Can we also test for chained operations?
e.g.

self.assert_eq((psdf + 1).corr(), (pdf + 1).corr(), check_exact=False)

@zhengruifeng (Contributor, Author):

nice, let me add it

@itholic (Contributor) commented Sep 13, 2022:

qq: btw, can we say it's a "new API" in the PR description?

Maybe "new parameter support" or something like that instead?

"""
Compute pairwise correlation of columns, excluding NA/null values.
.. versionadded:: 3.3.0
Member:
Nice

@itholic (Contributor) left a comment:

Looks good!

@zhengruifeng (Contributor, Author):

Merged into master, thanks @srowen @itholic @HyukjinKwon for reviews!

@zhengruifeng zhengruifeng deleted the ps_df_corr_missing_value branch September 13, 2022 06:46
HyukjinKwon pushed a commit that referenced this pull request Sep 22, 2022
### What changes were proposed in this pull request?
Remove `pyspark.pandas.ml`

### Why are the changes needed?
`pyspark.pandas.ml` is no longer needed, since we implemented correlations based on Spark SQL:

1. pearson correlation implemented in #37845
2. spearman correlation implemented in #37874

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
updated suites

Closes #37968 from zhengruifeng/ps_del_ml.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>