-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-20040][ML][python] pyspark wrapper for ChiSquareTest #17421
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
add to whitelist |
|
Test build #75192 has finished for PR 17421 at commit
|
|
Just remembered: you'll also need to update python/docs/pyspark.ml.rst for doc gen |
|
RAT tests are for checking that the Apache license appears at the top of each file |
|
Test build #75195 has finished for PR 17421 at commit
|
|
Test build #75198 has finished for PR 17421 at commit
|
b71caef to
32a0b0c
Compare
jkbradley
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the PR! I made a first review pass.
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a little arbitrary, but to follow other examples, write this as: test_chisquaretest
python/pyspark/ml/stat.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mark as Experimental (Search for other example of this)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, we put the triple-quotes on their own line elsewhere in pyspark
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use DenseVector, not _convert_to_vector. (use public APIs wherever possible)
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It can also be nicer to write this in a per-row format, rather than zipping labels and vectors which are defined separately. See other examples of createDataFrame in this file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the doc test
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(Noting that this can be updated once the Spark SQL bug is fixed)
python/pyspark/ml/stat.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just saw you changed this from the Scala doc b/c I left "RDD" there. Would you mind correcting the Scala doc too?
python/pyspark/ml/stat.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Copy param text from the Scala doc, unless there's a need to customize it for Python
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same for the return value text
|
Test build #75199 has finished for PR 17421 at commit
|
|
Test build #3612 has finished for PR 17421 at commit
|
|
Test build #3615 has finished for PR 17421 at commit
|
|
add to whitelist |
|
LGTM pending tests |
e00fc49 to
60d268c
Compare
60d268c to
3e7163c
Compare
|
Test build #75269 has finished for PR 17421 at commit
|
|
Test build #75270 has finished for PR 17421 at commit
|
| "pyspark.ml.linalg.__init__", | ||
| "pyspark.ml.recommendation", | ||
| "pyspark.ml.regression", | ||
| "pyspark.ml.stat", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We just took it out in 314cf51 , but since this is adding back in ml.stat we also need to update setup.py (you might need to update your branch from the latest master to see this).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@holdenk thanks for catching that, should be fixed now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, do we need to update setup.py? This is creating a module, not a package, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sub-modules aren't automatically packaged so we do need to explicitly add it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @jkbradley, I reverted setup.py.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@holdenk If we need to add pyspark.ml.stat to setup.py, then why are we not adding the other analogous modules: pyspark.ml.{classification, clustering, regression,...}?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh yah sorry, its anything which is a new sub-directory and when I was reading this PR yesterday I thought this was a new directory, but looking it today that isn't the case, sorry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, no problem, I just wanted to check.
114baf0 to
3e7163c
Compare
holdenk
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Quick read through, thanks for working on this :)
python/pyspark/ml/stat.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this test is using the temp path?
python/pyspark/ml/tests.py
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for cleaning up the numpy imports :) +1
| "pyspark.ml.linalg.__init__", | ||
| "pyspark.ml.recommendation", | ||
| "pyspark.ml.regression", | ||
| "pyspark.ml.stat", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sub-modules aren't automatically packaged so we do need to explicitly add it.
|
Test build #75278 has finished for PR 17421 at commit
|
|
Test build #75280 has finished for PR 17421 at commit
|
|
Test build #75281 has finished for PR 17421 at commit
|
|
Test build #3617 has finished for PR 17421 at commit
|
1ce5966 to
e79f968
Compare
|
LGTM pending tests |
|
Test build #75329 has finished for PR 17421 at commit
|
|
Test build #75330 has finished for PR 17421 at commit
|
|
Merging with master |
What changes were proposed in this pull request?
A pyspark wrapper for spark.ml.stat.ChiSquareTest
How was this patch tested?
unit tests
doctests