Skip to content

Conversation

@henrydavidge
Copy link
Contributor

What changes were proposed in this pull request?

In both cases, the input DataFrame schema must contain only the information that's required for the matrix object, so a vector column in the case of RowMatrix and long and vector columns for IndexedRowMatrix.

How was this patch tested?

Unit tests that verify:

  • RowMatrix and IndexedRowMatrix can be created from DataFrames
  • If the schema does not match expectations, we throw an IllegalArgumentException

Please review https://spark.apache.org/contributing.html before opening a pull request.

@jkbradley
Copy link
Member

add to whitelist

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this exist in Scala, even?

import org.apache.spark.mllib.tree.loss.Losses
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel,
RandomForestModel}
import org.apache.spark.mllib.tree.model.{DecisionTreeModel, GradientBoostedTreesModel, RandomForestModel}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: you need to revert the import changes here and above

Copy link
Contributor Author

@henrydavidge henrydavidge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@srowen In Scala you would use the one liner in the implementation of createRowMatrix. The issue is that from Python this conversion isn't possible without using a Python UDF, which can blow up the execution time.

import org.apache.spark.mllib.linalg._
import org.apache.spark.mllib.stat.{MultivariateOnlineSummarizer, MultivariateStatisticalSummary}
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Dataset, Row}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Likewise I think the changes in this file need to be reverted.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤦‍♂ oops

@henrydavidge
Copy link
Contributor Author

Thanks for the initial look @srowen. I fixed the accidental import changes.

@jkbradley Looks like the incantation to enable tests didn't work

@SparkQA
Copy link

SparkQA commented Jul 8, 2019

Test build #4815 has finished for PR 24953 at commit d620b43.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Copy link
Member

@srowen srowen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some import order issues:

[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala:56:0: org.apache.spark.sql. is in wrong order relative to org.apache.spark.sql.types.LongType.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: inv should come before MatrixSingularException.
[error] /home/jenkins/workspace/NewSparkPullRequestBuilder/mllib/src/main/scala/org/apache/spark/mllib/linalg/distributed/RowMatrix.scala:24:21: axpy should come before SparseVector.

@SparkQA
Copy link

SparkQA commented Jul 9, 2019

Test build #4818 has finished for PR 24953 at commit 4a40143.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Copy link
Member

srowen commented Jul 9, 2019

Merged to master

@srowen srowen closed this in a32c92c Jul 9, 2019
@henrydavidge
Copy link
Contributor Author

Thanks @srowen !

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants