[SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training #15893

sethah · 2016-11-15T23:11:41Z

What changes were proposed in this pull request?

This is a follow-up to some of the discussion in #15593. During LogisticRegression training, we store the coefficients combined with the intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use a matrix where possible, which makes the code more readable and greatly simplifies the indexing.
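To illustrate the indexing difference described above, here is a minimal sketch in plain Scala. It is not the code from this patch: the sizes, the per-class layout with the intercept stored last, and the `RowMajorMatrix` helper are all assumptions made for illustration.

```scala
object CoefficientLayoutSketch {
  // Hypothetical sizes, chosen only for illustration.
  val numClasses = 3
  val numFeatures = 2
  val numFeaturesPlusIntercept: Int = numFeatures + 1

  // Assumed flat layout: one contiguous block per class, intercept last in each block.
  val flat: Array[Double] = Array(
    0.1, 0.2, 1.0,
    0.3, 0.4, 2.0,
    0.5, 0.6, 3.0)

  // With a flat vector, every access repeats the offset arithmetic by hand.
  def flatCoefficient(classIndex: Int, featureIndex: Int): Double =
    flat(classIndex * numFeaturesPlusIntercept + featureIndex)

  def flatIntercept(classIndex: Int): Double =
    flat(classIndex * numFeaturesPlusIntercept + numFeatures)

  // Hypothetical helper: a row-major view that hides the offset arithmetic.
  final class RowMajorMatrix(val numRows: Int, val numCols: Int, values: Array[Double]) {
    require(values.length == numRows * numCols, "values must hold numRows * numCols entries")
    def apply(i: Int, j: Int): Double = values(i * numCols + j)
  }

  val matrix = new RowMajorMatrix(numClasses, numFeaturesPlusIntercept, flat)

  def main(args: Array[String]): Unit = {
    // Both access paths read the same underlying entry.
    println(flatCoefficient(1, 0) == matrix(1, 0))
    println(flatIntercept(2) == matrix(2, numFeatures))
  }
}
```

The matrix view keeps the same backing array, so the gain is only in readability: the offset computation lives in one place instead of at every call site.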

Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit MutableInnerProductModule[DenseMatrix[Double], Double] which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the regParamL1Fun because OWLQN in Breeze requires a MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]] (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.

How was this patch tested?

This is an internal code refactoring; the existing unit tests pass, which indicates that the change did not break anything. This patch adds no functionality.

sethah · 2016-11-15T23:12:11Z

cc @MLnick @dbtsai

SparkQA · 2016-11-16T01:34:22Z

Test build #68679 has finished for PR 15893 at commit 28f67fb.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

dbtsai · 2016-11-19T01:20:32Z

Only a minor naming comment. LGTM. My internet connection cannot access SSH to merge the code, so I will merge later tonight. Thanks.

dbtsai · 2016-11-19T01:19:25Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

+          allCoefficients)
+        val denseCoefficientMatrix = new DenseMatrix(numCoefficientSets, numFeatures,
+          new Array[Double](numCoefficientSets * numFeatures), isTransposed = true)
+        val interceptVec = if ($(fitIntercept) || !isMultinomial) {


Should we consistently use interceptVector?
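For context on the snippet under review, the following sketch shows what `isTransposed = true` means for `org.apache.spark.ml.linalg.DenseMatrix`: the backing array is read in row-major order. The sizes and values are hypothetical, and the example assumes the `spark-mllib-local` artifact is on the classpath.

```scala
import org.apache.spark.ml.linalg.DenseMatrix

object TransposedMatrixSketch {
  // Hypothetical sizes and values, for illustration only.
  val numCoefficientSets = 2
  val numFeatures = 3

  // With isTransposed = true the array is interpreted row by row:
  // row 0 = (1, 2, 3), row 1 = (4, 5, 6).
  val values: Array[Double] = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0)

  val m = new DenseMatrix(numCoefficientSets, numFeatures, values, isTransposed = true)

  def main(args: Array[String]): Unit = {
    println(m(0, 1)) // second feature of the first coefficient set
    println(m(1, 0)) // first feature of the second coefficient set
    // toArray always returns column-major order, regardless of isTransposed.
    println(m.toArray.mkString(", "))
  }
}
```

Storing each coefficient set as a contiguous row is what lets the matrix wrap a flat per-set array without copying.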

dbtsai · 2016-11-20T01:47:09Z

LGTM. Since this has no impact on performance and makes the codebase cleaner, I merged this PR into master and branch 2.1. Thanks.

Closes #15893 from sethah/logreg_refactor. (cherry picked from commit 856e004) Signed-off-by: DB Tsai

MLnick · 2016-11-21T03:31:16Z

👍

sethah added 2 commits November 15, 2016 14:11

refactor logistic regression to use matrix abstraction where possible

fef9951

clean up some comments and naming

28f67fb

dbtsai reviewed Nov 19, 2016

asfgit closed this in 856e004 Nov 20, 2016
