[SPARK-28140][MLLIB][PYTHON] Accept DataFrames in RowMatrix and IndexedRowMatrix constructors #24953
|
add to whitelist |
srowen
left a comment
Does this exist in Scala, even?
| import org.apache.spark.mllib.tree.loss.Losses | ||
| import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel, | ||
| RandomForestModel} | ||
| import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel, RandomForestModel} |
Nit: you need to revert the import changes here and above
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
henrydavidge
left a comment
@srowen In Scala you would use the one liner in the implementation of createRowMatrix. The issue is that from Python this conversion isn't possible without using a Python UDF, which can blow up the execution time.
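For readers following the thread, the Scala-side conversion described here can be sketched roughly as follows. This is a minimal sketch, not the PR's actual code: the object and method names are hypothetical, and it assumes the DataFrame has exactly one column holding mllib vectors.

```scala
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.sql.{DataFrame, Row}

object RowMatrixConversion {
  // Hypothetical helper; the name is illustrative and not taken from the PR.
  // Assumes df has exactly one column of org.apache.spark.mllib.linalg.Vector values.
  def toRowMatrix(df: DataFrame): RowMatrix =
    // Unwrap each single-column Row into its vector. This runs on the JVM,
    // so no Python UDF is involved.
    new RowMatrix(df.rdd.map { case Row(v: Vector) => v })
}
```

Because the unwrapping happens in Scala, a Python caller only has to hand over the DataFrame reference instead of deserializing every row in a Python worker.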
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala
| import org.apache.spark.mllib.linalg._ | ||
| import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary} | ||
| import org.apache.spark.rdd.RDD | ||
| import org.apache.spark.sql.{Dataset, Row} |
Likewise I think the changes in this file need to be reverted.
🤦‍♂️ oops
|
Thanks for the initial look @srowen. I fixed the accidental import changes. @jkbradley Looks like the incantation to enable tests didn't work |
|
Test build #4815 has finished for PR 24953 at commit
|
srowen
left a comment
Just some import order issues:
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:56:0: org.apache.spark.sql. is in wrong order relative to org.apache.spark.sql.types.LongType.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: inv should come before MatrixSingularException.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: axpy should come before SparseVector.
|
Test build #4818 has finished for PR 24953 at commit
|
|
Merged to master |
|
Thanks @srowen ! |
What changes were proposed in this pull request?
In both cases, the input DataFrame schema must contain only the information that's required for the matrix object, so a vector column in the case of RowMatrix and long and vector columns for IndexedRowMatrix.
How was this patch tested?
Unit tests that verify:
- RowMatrix and IndexedRowMatrix can be created from DataFrames
- an IllegalArgumentException is thrown when the input schema does not match
Please review https://spark.apache.org/contributing.html before opening a pull request.
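The schema requirement for IndexedRowMatrix described above could be enforced along these lines. This is a sketch under stated assumptions rather than the merged implementation: the object name, method name, and error message are hypothetical, and the column order (long index first, vector second) is assumed.

```scala
import org.apache.spark.mllib.linalg.{Vector, VectorUDT}
import org.apache.spark.mllib.linalg.distributed.{IndexedRow, IndexedRowMatrix}
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.LongType

object IndexedRowMatrixConversion {
  // Hypothetical helper; name and message are illustrative, not from the PR.
  def toIndexedRowMatrix(df: DataFrame): IndexedRowMatrix = {
    val fields = df.schema.fields
    // require throws IllegalArgumentException when the condition is false.
    require(
      fields.length == 2 &&
        fields(0).dataType == LongType &&
        fields(1).dataType.isInstanceOf[VectorUDT],
      "DataFrame must have a long column followed by a vector column")
    // The schema has been validated, so every Row is an (index, vector) pair.
    new IndexedRowMatrix(df.rdd.map { case Row(i: Long, v: Vector) => IndexedRow(i, v) })
  }
}
```

Checking the schema up front means a malformed input fails immediately with IllegalArgumentException on the driver, rather than with a MatchError inside a running task.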