[SPARK-28140][MLLIB][PYTHON] Accept DataFrames in RowMatrix and IndexedRowMatrix constructors #24953

henrydavidge · 2019-06-24T18:40:04Z

What changes were proposed in this pull request?

In both cases, the input DataFrame schema must contain only the columns required for the matrix object: a single vector column for RowMatrix, and a long column and a vector column for IndexedRowMatrix.

How was this patch tested?

Unit tests that verify:

RowMatrix and IndexedRowMatrix can be created from DataFrames
If the schema does not match expectations, we throw an IllegalArgumentException
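For readers following along, here is a minimal sketch of how the new constructors might be called from PySpark. It assumes a Spark build that includes this change and an existing SparkSession passed in as `spark`. The function name `matrices_from_dataframes` and the column names are hypothetical and chosen only for illustration. The vectors are `pyspark.mllib.linalg` vectors, on the assumption that the check expects the MLlib vector type.

```python
def matrices_from_dataframes(spark):
    """Build a RowMatrix and an IndexedRowMatrix directly from DataFrames.

    `spark` is an existing SparkSession. The function name and the column
    names are illustrative only; they do not come from the pull request.
    """
    # Imported inside the function so the sketch can be loaded without PySpark.
    from pyspark.mllib.linalg import Vectors
    from pyspark.mllib.linalg.distributed import IndexedRowMatrix, RowMatrix

    # RowMatrix: the DataFrame holds exactly one column, of vector type.
    vectors_df = spark.createDataFrame(
        [(Vectors.dense([1.0, 2.0]),), (Vectors.dense([3.0, 4.0]),)],
        ["features"],
    )
    row_matrix = RowMatrix(vectors_df)

    # IndexedRowMatrix: a long column and a vector column.
    # createDataFrame infers Python ints as LongType.
    indexed_df = spark.createDataFrame(
        [(0, Vectors.dense([1.0, 2.0])), (1, Vectors.dense([3.0, 4.0]))],
        ["index", "vector"],
    )
    indexed_matrix = IndexedRowMatrix(indexed_df)

    return row_matrix, indexed_matrix
```

Passing a DataFrame with extra columns or with the wrong column types should fail the schema check described above rather than silently dropping data.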

jkbradley · 2019-06-24T18:44:12Z

add to whitelist

srowen

Does this exist in Scala, even?

srowen · 2019-06-24T19:04:39Z

mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala

 import org.apache.spark.mllib.tree.loss.Losses
-import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel,
-  RandomForestModel}
+import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel, RandomForestModel}


Nit: you need to revert the import changes here and above

henrydavidge

@srowen In Scala you would use the one-liner in the implementation of createRowMatrix. The issue is that from Python this conversion isn't possible without a Python UDF, which can blow up the execution time.

srowen · 2019-07-01T13:37:27Z

mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala

 import org.apache.spark.mllib.linalg._
 import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary}
 import org.apache.spark.rdd.RDD
+import org.apache.spark.sql.{Dataset, Row}


Likewise I think the changes in this file need to be reverted.

🤦‍♂ oops

henrydavidge · 2019-07-08T04:30:50Z

Thanks for the initial look @srowen. I fixed the accidental import changes.

@jkbradley Looks like the incantation to enable tests didn't work

SparkQA · 2019-07-08T16:00:35Z

Test build #4815 has finished for PR 24953 at commit d620b43.

This patch fails Scala style tests.
This patch merges cleanly.
This patch adds no public classes.

srowen

Just some import order issues:

[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:56:0: org.apache.spark.sql. is in wrong order relative to org.apache.spark.sql.types.LongType.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: inv should come before MatrixSingularException.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: axpy should come before SparseVector.

SparkQA · 2019-07-09T16:24:24Z

Test build #4818 has finished for PR 24953 at commit 4a40143.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2019-07-09T21:39:48Z

Merged to master

henrydavidge · 2019-07-11T18:42:54Z

Thanks @srowen !

henrydavidge added 3 commits June 21, 2019 16:22

Accept DataFrames in RowMatrix and IndexedRowMatrix constructors

09dda1b

improve docs a little

9c55664

style

7f941e9

srowen reviewed Jun 24, 2019

henrydavidge commented Jun 24, 2019

dongjoon-hyun added MLLIB PYSPARK labels Jun 26, 2019

srowen requested changes Jul 1, 2019

undo import changes

d620b43

srowen requested changes Jul 8, 2019

import order

4a40143

srowen closed this in a32c92c Jul 9, 2019

