[SPARK-3181] [ML] Implement RobustRegression with huber loss. #14326
Conversation
Test build #62747 has finished for PR 14326 at commit
    featuresStd: Array[Double],
    m: Double) extends Serializable {

  private val coefficients: Array[Double] = parameters.toArray.slice(2, parameters.size)
This aggregator will serialize the featuresStd and the coefficients between aggregation steps, which is not necessary. You can mark them as @transient or simply pass them to the add function as LogisticRegression does. You can see #14109 for more details.
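For illustration, a minimal sketch of the broadcast/@transient pattern being suggested (the class and field names here are assumptions for the sketch, not the PR's actual code):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector

// Sketch: receive the large read-only arrays through broadcast variables and
// rebuild derived state lazily, so it is not serialized with the aggregator
// between aggregation steps (similar to what LogisticRegression does).
private class HuberAggregator(
    bcParameters: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]],
    fitIntercept: Boolean,
    m: Double) extends Serializable {

  // Recomputed on each executor from the broadcast value instead of being
  // shipped with every serialized HuberAggregator instance.
  @transient private lazy val coefficients: Array[Double] =
    bcParameters.value.toArray.slice(2, bcParameters.value.size)

  def add(instance: Instance): this.type = {
    val featuresStd = bcFeaturesStd.value
    // ... accumulate the huber loss and gradient for `instance` here ...
    this
  }
}
```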
Very good suggestion. Thanks!
Test build #62813 has finished for PR 14326 at commit
I'm making my way through the first pass now.
Test build #63261 has finished for PR 14326 at commit
Could we instead implement a more general Robust Linear Model M-estimator type, as is done in statsmodels RLM (see RLM.py)? The Huber loss would then be one of the M-estimators, maybe the default as in statsmodels. I think IterativelyReweightedLeastSquares was intended to aid in developing a robust M-estimator framework.
Test build #72507 has finished for PR 14326 at commit
 * 95% statistical efficiency for normally distributed data.
 */
@Since("2.1.0")
final val m = new DoubleParam(this, "m", "The shape parameter to control the amount of "
Change @Since
 * space "\sigma > 0".
 */
val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) {
  override protected def determineStepSize(
Update to LBFGS-B, as the bug is already fixed in scalanlp/breeze#633.
@WeichenXu123 Thanks for your comment. I will update my PR ASAP.
I'll close this PR and open a new one. Feel free to review and comment. Thanks.
Please go to #19020 for review and comments. Thanks.
What changes were proposed in this pull request?
The current implementation is a straightforward port of Python scikit-learn HuberRegressor, so it produces the same results. The code is intended for discussion, so please overlook trivial issues for now, since I think we may have slightly different ideas for the Spark implementation.
Here I list some major issues that should be discussed:
Objective function.
We use Eq.(6) in A robust hybrid of lasso and ridge regression as the objective function.


But the convention differs from other Spark ML code such as LinearRegression in two aspects:
1. The loss is the total loss rather than the mean loss. We use lossSum/weightSum as the mean loss in LinearRegression.
2. We do not multiply the loss function and the L2 regularization by 1/2. This is not a problem, since the result is unaffected if the whole formula is multiplied by a constant factor.
So should we switch to a modified objective function like the following, which would be consistent with other Spark ML code?
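A hedged sketch of what that modified objective might look like, simply applying the two changes above (mean loss over weightSum and the 1/2 factors) to the formula given earlier:

```latex
\min_{\beta,\ \sigma > 0}\ \frac{1}{2\,\mathrm{weightSum}}\sum_{i=1}^{n} w_i\left(\sigma + H_m\!\left(\frac{y_i - x_i^T\beta}{\sigma}\right)\sigma\right) + \frac{\lambda}{2}\lVert\beta\rVert_2^2
```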
Implement a new class RobustRegression or a new loss function for LinearRegression?
Both LinearRegression and RobustRegression accomplish the same goal, but the output of fit will be different: LinearRegressionModel and RobustRegressionModel. The former only contains coefficients and intercept; the latter contains coefficients, intercept, and scale/sigma (and possibly even the outlier samples, similar to sklearn HuberRegressor.outliers_). Merging the two models into one would also raise a save/load compatibility issue. One trick would be to drop scale/sigma and make the fit with this huber cost function still output a LinearRegressionModel, but I don't think that is an appropriate way since it would lose some model attributes. So I implemented RobustRegression as a new class, and we can port this loss function to LinearRegression later if needed.
Bugs of breeze LBFGS-B and workaround.
The estimated parameter \sigma must be > 0, which makes this a bound-constrained optimization problem that should be solved with LBFGS-B, but there is a bug in breeze LBFGS-B. We worked out a workaround using a modified LBFGS. Since we know the huber loss function is convex in the space \sigma > 0 and the bound \sigma = 0 is unreachable, the solution will not lie on the bound. We still optimize the loss function by LBFGS but limit the step size during the line search of each iteration, verifying that the step size generated by the line search keeps us in the space \sigma > 0.
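As an illustration only (a sketch under the assumptions above, not the PR's actual code), the admissible step size can be bounded so that \sigma stays strictly positive after the step x(k+1) = x(k) + alpha * dir:

```scala
// Sketch: compute an upper bound on the line-search step size alpha such that
// sigma + alpha * dirSigma > 0, where sigma is the first element of the
// parameter vector and dirSigma is the first element of the search direction.
def maxStepKeepingSigmaPositive(
    sigma: Double,          // current sigma, assumed > 0
    dirSigma: Double,       // sigma component of the search direction
    shrink: Double = 0.95   // stay strictly inside the bound sigma > 0
  ): Double = {
  if (dirSigma >= 0.0) {
    // Steps along this direction can only increase sigma: no bound needed.
    Double.PositiveInfinity
  } else {
    // sigma + alpha * dirSigma > 0  <=>  alpha < -sigma / dirSigma
    shrink * (-sigma / dirSigma)
  }
}
```

This bound can then serve as the maximum step size handed to the line search described next.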
LBFGS.determineStepSizeto limit the step size. We should make sure that\sigma > 0after take step operation:x(k+1) = x(k) + alpha * dir(\sigmais the first element of parameter vector in my implementation). We useBacktrackingLineSearchto do line search since it can be set the upper bound of the returned step size. Meanwhile,BacktrackingLineSearchstill checks the strong wolfe conditions.How was this patch tested?
Unit tests.
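For illustration, a hypothetical usage sketch based on the attributes described above; the class path, setter names, and the scale accessor are assumptions, not confirmed API:

```scala
import org.apache.spark.sql.SparkSession
// Hypothetical import path for the proposed estimator.
import org.apache.spark.ml.regression.RobustRegression

val spark = SparkSession.builder().appName("RobustRegressionExample").getOrCreate()
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

// m is the shape parameter defined in the diff above; the setter names are
// assumed to follow the usual Spark ML convention.
val robust = new RobustRegression()
  .setM(1.35)
  .setMaxIter(100)
  .setRegParam(0.1)

val model = robust.fit(training)
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale (sigma): ${model.scale}")  // assumed accessor for the estimated sigma
```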