[SPARK-20076][ML][PySpark] Add Python interface for ml.stats.Correlation#17494
[SPARK-20076][ML][PySpark] Add Python interface for ml.stats.Correlation#17494viirya wants to merge 7 commits intoapache:masterfrom
Conversation
|
cc @jkbradley |
|
cc @thunterdb |
|
cc @holdenk |
|
Test build #75430 has finished for PR 17494 at commit
|
9b81ce9 to
47f0257
Compare
47f0257 to
e129e06
Compare
|
Test build #75432 has finished for PR 17494 at commit
|
MLnick
left a comment
There was a problem hiding this comment.
Minor comment, LGTM otherwise.
| ... [Vectors.dense([9, 0, 0, 1])]] | ||
| >>> dataset = spark.createDataFrame(dataset, ["features"]) | ||
| >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0] | ||
| >>> print(str(pearsonCorr).replace('nan', 'NaN')) |
There was a problem hiding this comment.
Any reason for this replacement?
There was a problem hiding this comment.
The test is mainly modified from mllib's old Correlation. I can't think why it does the replacement except for better representation of the 'NaN' values.
| * val Row(coeff: Matrix) = Correlation.corr(data, "value").head | ||
| * // coeff now contains the Pearson correlation matrix. | ||
| * }}} | ||
| * |
There was a problem hiding this comment.
While we're here - below it says "cache the input RDD" but we that should be "the input Dataset"
There was a problem hiding this comment.
Also since we are here as well, there is a reference to input RDD up above in the docstring.
|
Test build #75434 has finished for PR 17494 at commit
|
|
ping @MLnick |
| [ 0.05564149, 1. , NaN, 0.91359586], | ||
| [ NaN, NaN, 1. , NaN], | ||
| [ 0.40047142, 0.91359586, NaN, 1. ]]) | ||
| >>> spearmanCorr = Correlation.corr(dataset, 'features', method="spearman").collect()[0][0] |
There was a problem hiding this comment.
Super minor nit - but let's use single ' everywhere here rather than have a mix of single & double - as in "spearman" here -> spearman' and above
| >>> from pyspark.ml.stat import Correlation | ||
| >>> dataset = [[Vectors.dense([1, 0, 0, -2])], | ||
| ... [Vectors.dense([4, 5, 0, 3])], | ||
| ... [Vectors.dense([6, 7, 0, 8])], |
There was a problem hiding this comment.
another minor nit - seems an extra space here
| .. note:: Experimental | ||
|
|
||
| Compute the correlation matrix for the input dataset of Vectors using the specified method. | ||
| Methods currently supported: `pearson` (default), `spearman`. |
There was a problem hiding this comment.
So the Scala documentation had a warning about caching being suggested when using Spearman, would it make sense to copy this warning over as well?
ec6c003 to
8936880
Compare
8936880 to
fbcc1fe
Compare
|
Test build #75495 has finished for PR 17494 at commit
|
|
Test build #75496 has finished for PR 17494 at commit
|
|
@jkbradley @MLnick @holdenk If there is no more questions about this change, maybe we can make it into 2.2 in time? |
| Compute the correlation matrix for the input dataset of Vectors using the specified method. | ||
| Methods currently supported: `pearson` (default), `spearman`. | ||
|
|
||
| @note For Spearman, a rank correlation, we need to create an RDD[Double] for each column |
There was a problem hiding this comment.
I don't think @note will work for PyDoc?
| Compute the correlation matrix for the input dataset of Vectors using the specified method. | ||
| Methods currently supported: `pearson` (default), `spearman`. | ||
|
|
||
| Notice: For Spearman, a rank correlation, we need to create an RDD[Double] for each column |
There was a problem hiding this comment.
I think we should use a .. note::?
There was a problem hiding this comment.
Ok. Not quite familiar with PyDoc...
| Compute the correlation matrix for the input dataset of Vectors using the specified method. | ||
| Methods currently supported: `pearson` (default), `spearman`. | ||
|
|
||
| .. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column |
There was a problem hiding this comment.
Sorry, I picked up that the doc gen will fail here - there needs to be 2 spaces before the start of each subsequent line, like this:
.. note:: For Spearman, a rank correlation, we need to create an RDD[Double] for each column
and sort it in order to retrieve the ranks and then join the columns back into an RDD[Vector],
which is fairly costly. Cache the input Dataset before calling corr with `method = 'spearman'`
to avoid recomputing the common lineage.
There was a problem hiding this comment.
ah. ok. fixed. see if this time it's ok.
|
Test build #75568 has finished for PR 17494 at commit
|
|
LGTM pending Jenkins confirming. |
|
Test build #75564 has finished for PR 17494 at commit
|
|
Thanks @MLnick |
|
Test build #75569 has finished for PR 17494 at commit
|
| >>> dataset = spark.createDataFrame(dataset, ['features']) | ||
| >>> pearsonCorr = Correlation.corr(dataset, 'features', 'pearson').collect()[0][0] | ||
| >>> print(str(pearsonCorr).replace('nan', 'NaN')) | ||
| DenseMatrix([[ 1. , 0.05564149, NaN, 0.40047142], |
There was a problem hiding this comment.
So maybe I'm being overly cautious - but doctests with floats have bit me in the past - would it be good to use the ... syntax here or is this going to be ok? (Just asking).
There was a problem hiding this comment.
Fair point - it may lead to flaky tests I guess at some point.
There was a problem hiding this comment.
Although we have many tests in pyspark now with floats like this, it is a fair point, I agreed.
1a7b0f9 to
601d9eb
Compare
|
Test build #75574 has finished for PR 17494 at commit
|
|
LGTM if others are ok too |
|
Thanks @jkbradley |
|
LGTM as well |
|
Thanks @holdenk |
|
Merged to master. Thanks! |
What changes were proposed in this pull request?
The Dataframes-based support for the correlation statistics is added in #17108. This patch adds the Python interface for it.
How was this patch tested?
Python unit test.
Please review http://spark.apache.org/contributing.html before opening a pull request.