[SPARK-18456][ML][FOLLOWUP] Use matrix abstraction for coefficients in LogisticRegression training #15893
Test build #68679 has finished for PR 15893 at commit
Only minor naming. LGTM. My internet connection cannot access ssh to merge the code; I will merge later tonight. Thanks.
allCoefficients)
val denseCoefficientMatrix = new DenseMatrix(numCoefficientSets, numFeatures,
  new Array[Double](numCoefficientSets * numFeatures), isTransposed = true)
val interceptVec = if ($(fitIntercept) || !isMultinomial) {
Should we consistently use interceptVector?
LGTM. Since this has no impact on performance and makes the codebase cleaner, I merged this PR into master and branch 2.1. Thanks.
…n LogisticRegression training

## What changes were proposed in this pull request?

This is a follow up to some of the discussion [here](#15593). During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.

Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.

## How was this patch tested?

This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.

Author: sethah <[email protected]>

Closes #15893 from sethah/logreg_refactor.

(cherry picked from commit 856e004)
Signed-off-by: DB Tsai <[email protected]>
👍
What changes were proposed in this pull request?
This is a follow up to some of the discussion here. During LogisticRegression training, we store the coefficients combined with intercepts as a flat vector, but a more natural abstraction is a matrix. Here, we refactor the code to use matrix where possible, which makes the code more readable and greatly simplifies the indexing.
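To illustrate why a matrix view simplifies the indexing, here is a minimal sketch in plain Scala. It is not the code from this PR and does not use Spark's `DenseMatrix`; the `RowMajorMatrix` class, the helper names, and the flat layout (coefficients stored feature by feature, with one intercept per class appended at the end) are assumptions made for the example.

```scala
// Illustrative sketch only: plain Scala, not Spark's DenseMatrix or the PR's code.
object CoefficientLayoutSketch {

  /** Hypothetical minimal row-major matrix view over a flat array. */
  final class RowMajorMatrix(val numRows: Int, val numCols: Int, val values: Array[Double]) {
    require(values.length == numRows * numCols, "values must have numRows * numCols entries")
    def apply(i: Int, j: Int): Double = values(i * numCols + j)
    def update(i: Int, j: Int, v: Double): Unit = values(i * numCols + j) = v
  }

  // Assumed flat layout: index = featureIndex * numClasses + classIndex,
  // followed by numClasses intercepts at the end of the array.
  def flatCoefficient(flat: Array[Double], numClasses: Int,
      classIndex: Int, featureIndex: Int): Double =
    flat(featureIndex * numClasses + classIndex)

  def flatIntercept(flat: Array[Double], numClasses: Int,
      numFeatures: Int, classIndex: Int): Double =
    flat(numClasses * numFeatures + classIndex)

  /** Split the assumed flat layout into a (numClasses x numFeatures) matrix plus intercepts. */
  def split(flat: Array[Double], numClasses: Int,
      numFeatures: Int): (RowMajorMatrix, Array[Double]) = {
    val matrix = new RowMajorMatrix(numClasses, numFeatures,
      new Array[Double](numClasses * numFeatures))
    for (k <- 0 until numClasses; j <- 0 until numFeatures) {
      matrix(k, j) = flat(j * numClasses + k)
    }
    val offset = numClasses * numFeatures
    val intercepts = flat.slice(offset, offset + numClasses)
    (matrix, intercepts)
  }

  def main(args: Array[String]): Unit = {
    val numClasses = 3
    val numFeatures = 2
    // Six coefficients (feature by feature) followed by three intercepts.
    val flat = Array(1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 0.1, 0.2, 0.3)
    val (coefMatrix, intercepts) = split(flat, numClasses, numFeatures)
    // The same coefficient addressed through the flat vector and through the matrix.
    println(flatCoefficient(flat, numClasses, classIndex = 1, featureIndex = 1))
    println(coefMatrix(1, 1))
    println(intercepts.mkString(", "))
  }
}
```

With the matrix and intercept array split out, call sites address a coefficient by `(classIndex, featureIndex)` and no longer repeat the offset arithmetic, which is the readability gain the description refers to.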
Note: We do not use a Breeze matrix for the cost function as was mentioned in the linked PR. This is because LBFGS/OWLQN require an implicit `MutableInnerProductModule[DenseMatrix[Double], Double]` which is not natively defined in Breeze. We would need to extend Breeze in Spark to define it ourselves. Also, we do not modify the `regParamL1Fun` because OWLQN in Breeze requires a `MutableEnumeratedCoordinateField[(Int, Int), DenseVector[Double]]` (since we still use a dense vector for coefficients). Here again we would have to extend Breeze inside Spark.
How was this patch tested?
This is internal code refactoring - the current unit tests passing show us that the change did not break anything. No added functionality in this patch.