[SPARK-3181] [ML] Implement RobustRegression with huber loss. #14326
Conversation
Test build #62747 has finished for PR 14326 at commit
    featuresStd: Array[Double],
    m: Double) extends Serializable {

  private val coefficients: Array[Double] = parameters.toArray.slice(2, parameters.size)
This aggregator will serialize the featuresStd and the coefficients between aggregation steps, which is not necessary. You can mark them as @transient or simply pass them to the add function as LogisticRegression does. You can see #14109 for more details.
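For illustration, a minimal sketch of the broadcast/@transient pattern being suggested (the class and field names here are assumptions for the sketch, not the PR's actual code):

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.feature.Instance
import org.apache.spark.ml.linalg.Vector

// Sketch: receive the large read-only arrays through broadcast variables and
// rebuild derived state lazily, so it is not serialized with the aggregator
// between aggregation steps (similar to what LogisticRegression does).
private class HuberAggregator(
    bcParameters: Broadcast[Vector],
    bcFeaturesStd: Broadcast[Array[Double]],
    fitIntercept: Boolean,
    m: Double) extends Serializable {

  // Recomputed on each executor from the broadcast value instead of being
  // shipped with every serialized HuberAggregator instance.
  @transient private lazy val coefficients: Array[Double] =
    bcParameters.value.toArray.slice(2, bcParameters.value.size)

  def add(instance: Instance): this.type = {
    val featuresStd = bcFeaturesStd.value
    // ... accumulate the huber loss and gradient for `instance` here ...
    this
  }
}
```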
Very good suggestion. Thanks!
Test build #62813 has finished for PR 14326 at commit
I'm making my way through the first pass now.
Test build #63261 has finished for PR 14326 at commit
Could we instead implement a more general Robust Linear Model M-estimator type, as is done in statsmodels RLM (see RLM.py)? The Huber loss would then be one of the M-estimators, maybe the default as in statsmodels. I think IterativelyReweightedLeastSquares was intended to aid in developing a robust M-estimator framework.
Test build #72507 has finished for PR 14326 at commit
 * 95% statistical efficiency for normally distributed data.
 */
@Since("2.1.0")
final val m = new DoubleParam(this, "m", "The shape parameter to control the amount of "
Change @Since
 * space "\sigma > 0".
 */
val optimizer = new BreezeLBFGS[BDV[Double]]($(maxIter), 10, $(tol)) {
  override protected def determineStepSize(
Update to LBFGS-B, as the bug is already fixed in scalanlp/breeze#633.
@WeichenXu123 Thanks for your comment. I will update my PR ASAP.
I'll close this PR and open a new one. Feel free to review and comment. Thanks.
Please go to #19020 for review and comments. Thanks.
What changes were proposed in this pull request?
The current implementation is a straightforward port of Python scikit-learn HuberRegressor, so it produces the same results. The code is intended for discussion, so please overlook trivial issues for now, since I think we may have slightly different ideas for the Spark implementation.
Here I list some major issues that should be discussed:
Objective function.
We use Eq.(6) in A robust hybrid of lasso and ridge regression as the objective function.


But the convention differs from other Spark ML code such as LinearRegression in two aspects:
1. The loss is the total loss rather than the mean loss. We use lossSum/weightSum as the mean loss in LinearRegression.
2. We do not multiply the loss function and the L2 regularization by 1/2. This is not a problem, since the result is unaffected if the whole formula is multiplied by a constant factor.
So should we switch to a modified objective function like the following, which would be consistent with other Spark ML code?
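A hedged sketch of what that modified objective might look like, simply applying the two changes above (mean loss over weightSum and the 1/2 factors) to the formula given earlier:

```latex
\min_{\beta,\ \sigma > 0}\ \frac{1}{2\,\mathrm{weightSum}}\sum_{i=1}^{n} w_i\left(\sigma + H_m\!\left(\frac{y_i - x_i^T\beta}{\sigma}\right)\sigma\right) + \frac{\lambda}{2}\lVert\beta\rVert_2^2
```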
Implement a new class RobustRegression or a new loss function for LinearRegression?
Both LinearRegression and RobustRegression accomplish the same goal, but the output of fit will be different: LinearRegressionModel and RobustRegressionModel. The former only contains coefficients and intercept; the latter contains coefficients, intercept, and scale/sigma (and possibly even the outlier samples, similar to sklearn HuberRegressor.outliers_). Merging the two models into one would also raise a save/load compatibility issue. One trick would be to drop scale/sigma and make the fit with this huber cost function still output a LinearRegressionModel, but I don't think that is an appropriate way since it would lose some model attributes. So I implemented RobustRegression as a new class, and we can port this loss function to LinearRegression later if needed.
Bugs of breeze LBFGS-B and workaround.
The estimated parameter \sigma must be > 0, which makes this a bound-constrained optimization problem that should be solved with LBFGS-B, but there is a bug in breeze LBFGS-B. We worked out a workaround using a modified LBFGS. Since we know the huber loss function is convex in the space \sigma > 0 and the bound \sigma = 0 is unreachable, the solution will not lie on the bound. We still optimize the loss function by LBFGS but limit the step size during the line search of each iteration, verifying that the step size generated by the line search keeps us in the space \sigma > 0.
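As an illustration only (a sketch under the assumptions above, not the PR's actual code), the admissible step size can be bounded so that \sigma stays strictly positive after the step x(k+1) = x(k) + alpha * dir:

```scala
// Sketch: compute an upper bound on the line-search step size alpha such that
// sigma + alpha * dirSigma > 0, where sigma is the first element of the
// parameter vector and dirSigma is the first element of the search direction.
def maxStepKeepingSigmaPositive(
    sigma: Double,          // current sigma, assumed > 0
    dirSigma: Double,       // sigma component of the search direction
    shrink: Double = 0.95   // stay strictly inside the bound sigma > 0
  ): Double = {
  if (dirSigma >= 0.0) {
    // Steps along this direction can only increase sigma: no bound needed.
    Double.PositiveInfinity
  } else {
    // sigma + alpha * dirSigma > 0  <=>  alpha < -sigma / dirSigma
    shrink * (-sigma / dirSigma)
  }
}
```

This bound can then serve as the maximum step size handed to the line search described next.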
LBFGS.determineStepSizeto limit the step size. We should make sure that\sigma > 0after take step operation:x(k+1) = x(k) + alpha * dir(\sigmais the first element of parameter vector in my implementation). We useBacktrackingLineSearchto do line search since it can be set the upper bound of the returned step size. Meanwhile,BacktrackingLineSearchstill checks the strong wolfe conditions.How was this patch tested?
Unit tests.
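For illustration, a hypothetical usage sketch based on the attributes described above; the class path, setter names, and the scale accessor are assumptions, not confirmed API:

```scala
import org.apache.spark.sql.SparkSession
// Hypothetical import path for the proposed estimator.
import org.apache.spark.ml.regression.RobustRegression

val spark = SparkSession.builder().appName("RobustRegressionExample").getOrCreate()
val training = spark.read.format("libsvm")
  .load("data/mllib/sample_linear_regression_data.txt")

// m is the shape parameter defined in the diff above; the setter names are
// assumed to follow the usual Spark ML convention.
val robust = new RobustRegression()
  .setM(1.35)
  .setMaxIter(100)
  .setRegParam(0.1)

val model = robust.fit(training)
println(s"Coefficients: ${model.coefficients}")
println(s"Intercept: ${model.intercept}")
println(s"Scale (sigma): ${model.scale}")  // assumed accessor for the estimated sigma
```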