
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented Jul 23, 2016

What changes were proposed in this pull request?

The current implementation is a straightforward port of the Python scikit-learn HuberRegressor, so it produces the same results.
The code is posted for discussion, so please overlook trivial issues for now, since we may have slightly different ideas for the Spark implementation.

Here I listed the major issues that should be discussed:

Objective function.

We use Eq.(6) in A robust hybrid of lasso and ridge regression as the objective function.
$$\min_{\beta,\,\sigma>0}\; \sum_{i=1}^n \left( \sigma + H_m\!\left(\frac{x_i^T \beta - y_i}{\sigma}\right) \sigma \right) + \lambda \lVert\beta\rVert_2^2, \qquad H_m(z) = \begin{cases} z^2 & |z| \le m \\ 2m|z| - m^2 & |z| > m \end{cases}$$
But this convention differs from other Spark ML code such as LinearRegression in two respects:
1, The loss is the total loss rather than the mean loss. LinearRegression uses lossSum/weightSum as the mean loss.
2, The loss function and the L2 regularization are not multiplied by 1/2. This is not a problem, since multiplying the whole formula by a constant factor does not affect the result.
So should we switch to a modified objective function like the following, which would be consistent with other Spark ML code?
$$\min_{\beta,\,\sigma>0}\; \frac{1}{2\sum_i w_i} \sum_{i=1}^n w_i \left( \sigma + H_m\!\left(\frac{x_i^T \beta - y_i}{\sigma}\right) \sigma \right) + \frac{\lambda}{2} \lVert\beta\rVert_2^2$$
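
For concreteness, here is a minimal sketch of the per-example loss term under the proposed convention (the function and its names are illustrative, not code from this PR):

```scala
// Per-example Huber loss with concomitant scale sigma, following the
// modified convention above (including the 1/2 factor; the averaging
// over weightSum would happen in the aggregator). Illustrative sketch.
def huberLossTerm(residual: Double, sigma: Double, m: Double): Double = {
  val z = residual / sigma
  // Huber function H_m(z): quadratic near zero, linear in the tails.
  val h = if (math.abs(z) <= m) z * z else 2.0 * m * math.abs(z) - m * m
  0.5 * (sigma + h * sigma)
}
```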

Implement a new class RobustRegression or a new loss function for LinearRegression?

Both LinearRegression and RobustRegression accomplish the same goal, but the outputs of fit will differ: LinearRegressionModel and RobustRegressionModel. The former only contains coefficients and intercept; the latter contains coefficients, intercept, and scale/sigma (and possibly even the outlier samples, similar to sklearn's HuberRegressor.outliers_). Combining the two models into one would also raise save/load compatibility issues. One workaround is to drop scale/sigma and have fit with this Huber cost function still output a LinearRegressionModel, but I don't think that's appropriate, since it would lose some model attributes. So I implemented RobustRegression as a new class, and we can port this loss function to LinearRegression later if needed. A rough sketch of the attribute difference follows.
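
To make the difference concrete, a simplified sketch (the signatures are illustrative, not the PR's actual code):

```scala
import org.apache.spark.ml.linalg.Vector

// Simplified sketch of the attribute difference between the two models.
class LinearRegressionModel(
    val coefficients: Vector,
    val intercept: Double)

class RobustRegressionModel(
    val coefficients: Vector,
    val intercept: Double,
    val scale: Double) // the extra concomitant scale estimate (sigma)
```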

Bug in breeze LBFGS-B and the work-around.

The estimated parameter \sigma must be > 0, which makes this a bound-constrained optimization problem that should be solved with LBFGS-B, but there is a bug in breeze's LBFGS-B. We worked around it with a modified LBFGS.
Since we know the Huber loss function is convex in the space \sigma > 0 and the bound \sigma = 0 is unreachable, the minimizer will not lie on the bound. We still optimize the loss function with LBFGS, but limit the step size during the line search of each iteration, verifying that the step generated by the line search stays in the space \sigma > 0.
We override LBFGS.determineStepSize to limit the step size, making sure that \sigma > 0 after the step operation x(k+1) = x(k) + alpha * dir (\sigma is the first element of the parameter vector in my implementation). We use BacktrackingLineSearch for the line search, since it allows setting an upper bound on the returned step size while still checking the strong Wolfe conditions. A sketch of the override is below.
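
A minimal sketch of the override, assuming breeze's internal line-search API (State, LineSearch.functionFromSearchDirection, and BacktrackingLineSearch's constructor arguments vary across breeze versions, so treat this as illustrative rather than the PR's exact code):

```scala
import breeze.linalg.{norm, DenseVector => BDV}
import breeze.optimize.{BacktrackingLineSearch, DiffFunction, LBFGS, LineSearch}

val optimizer = new LBFGS[BDV[Double]](maxIter = 100, m = 10, tolerance = 1e-6) {
  override protected def determineStepSize(
      state: State,
      f: DiffFunction[BDV[Double]],
      dir: BDV[Double]): Double = {
    val x = state.x
    // Restrict f to the search direction: alpha => f(x + alpha * dir).
    val ff = LineSearch.functionFromSearchDirection(f, x, dir)
    // Largest alpha keeping sigma = x(0) + alpha * dir(0) strictly positive.
    val maxAlpha =
      if (dir(0) < 0.0) -x(0) / dir(0) else Double.PositiveInfinity
    // Backtracking line search capped at maxAlpha; it still enforces
    // the strong Wolfe conditions on the accepted step.
    val search = new BacktrackingLineSearch(state.value, maxAlpha = maxAlpha)
    search.minimize(ff, if (state.iter == 0) 1.0 / norm(dir) else 1.0)
  }
}
```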

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Jul 23, 2016

Test build #62747 has finished for PR 14326 at commit 8fd0ca1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RobustRegression @Since("2.1.0") (@Since("2.1.0") override val uid: String)

featuresStd: Array[Double],
m: Double) extends Serializable {

private val coefficients: Array[Double] = parameters.toArray.slice(2, parameters.size)
Contributor


This aggregator will serialize the featuresStd and the coefficients between aggregation steps, which is not necessary. You can mark them as @transient or simply pass them to the add function as LogisticRegression does. You can see #14109 for more details.
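
A rough sketch of the suggestion, assuming the aggregator receives broadcast variables (the names are illustrative, not this PR's code):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

// Sketch: avoid serializing the arrays between aggregation steps by
// marking them @transient lazy and re-materializing them on executors
// from broadcast variables, as LogisticRegression's aggregator does.
private class HuberAggregator(
    bcParameters: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]],
    m: Double) extends Serializable {

  @transient private lazy val coefficients: Array[Double] =
    bcParameters.value.toArray.slice(2, bcParameters.value.size)

  @transient private lazy val featuresStd: Array[Double] =
    bcFeaturesStd.value
}
```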

Contributor Author


Very good suggestion. Thanks!

@SparkQA

SparkQA commented Jul 25, 2016

Test build #62813 has finished for PR 14326 at commit 9e2adae.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

cc @dbtsai @MechCoder

@dbtsai
Member

dbtsai commented Aug 5, 2016

I'm working through the first pass now.

@SparkQA

SparkQA commented Aug 5, 2016

Test build #63261 has finished for PR 14326 at commit 3883154.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@talebzeghmi

Could we instead implement a more general robust linear model M-estimator type, as is done in statsmodels RLM (see RLM.py)? The Huber loss would then be one of the M-estimators, perhaps the default, as it is in statsmodels.

I think IterativelyReweightedLeastSquares was written with the intent of aiding the development of a robust M-estimator framework; a sketch of what the weight functions might look like is below.
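
For illustration, an IRLS-based M-estimator applies a weight function to each standardized residual at every iteration. Two of the classic choices, with their standard tuning constants (hypothetical helper names, not Spark or statsmodels API):

```scala
// Huber weights: fully trust small residuals, downweight large ones.
// c = 1.345 gives ~95% efficiency under normal errors.
def huberWeight(r: Double, c: Double = 1.345): Double =
  if (math.abs(r) <= c) 1.0 else c / math.abs(r)

// Tukey bisquare weights: smoothly zero out gross outliers.
// c = 4.685 gives ~95% efficiency under normal errors.
def tukeyBisquareWeight(r: Double, c: Double = 4.685): Double = {
  if (math.abs(r) <= c) {
    val u = r / c
    val t = 1.0 - u * u
    t * t
  } else 0.0
}
```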

@SparkQA

SparkQA commented Feb 7, 2017

Test build #72507 has finished for PR 14326 at commit 3883154.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* 95% statistical efficiency for normally distributed data.
*/
@Since("2.1.0")
final val m = new DoubleParam(this, "m", "The shape parameter to control the amount of " +
Contributor


Change @Since

* space "\sigma > 0".
*/
val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) {
  override protected def determineStepSize(
Contributor


Update to LBFGS-B, since the bug has already been fixed in scalanlp/breeze#633. A sketch of what that would look like is below.
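
For reference, a rough sketch of expressing the constraint directly with breeze's LBFGSB box bounds (the dimension and the tiny positive lower bound are illustrative; the open bound \sigma > 0 has to be approximated):

```scala
import breeze.linalg.DenseVector
import breeze.optimize.LBFGSB

val numParams = 12 // illustrative: sigma, intercept, and coefficients
val lower = DenseVector.fill(numParams)(Double.NegativeInfinity)
lower(0) = 1e-10 // sigma (the first element) must stay positive
val upper = DenseVector.fill(numParams)(Double.PositiveInfinity)
val optimizer = new LBFGSB(lower, upper)
```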

@yanboliang
Contributor Author

@WeichenXu123 Thanks for your comment. I will update my PR ASAP.

@yanboliang
Contributor Author

I'll close this PR and open a new one. Feel free to review and comment. Thanks.

@yanboliang yanboliang closed this Aug 22, 2017
@yanboliang yanboliang deleted the spark-3181 branch August 22, 2017 07:33
@yanboliang
Contributor Author

Please go to #19020 for reviewing and comments. Thanks.

