
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented Aug 22, 2017

What changes were proposed in this pull request?

MLlib LinearRegression supports huber loss in addition to the leastSquares loss. The huber loss objective function (Eq.(6) and Eq.(8) in A robust hybrid of lasso and ridge regression) is:

    min over (w, σ):  Σ_i [ σ + H_M((x_i^T w − y_i) / σ) · σ ] + regularization term,
    where H_M(z) = z² if |z| < M, and 2M|z| − M² otherwise.

This objective is jointly convex as a function of (w, σ) ∈ R^d × (0, ∞), so we can use L-BFGS-B to solve it.

The current implementation is a straightforward port of Python scikit-learn's HuberRegressor. There are some differences:

  • We use the mean loss (lossSum/weightSum), while sklearn uses the total loss (lossSum).
  • We multiply the loss function and the L2 regularization by 1/2. Multiplying the whole formula by a constant factor does not affect the result; we just keep it consistent with the leastSquares loss.

So when fitting without regularization, MLlib and sklearn produce the same output. When fitting with regularization, MLlib's regParam should be set to sklearn's alpha divided by the number of instances to match sklearn's output; see the sketch below.
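For illustration, a hedged sketch of matching a sklearn HuberRegressor(epsilon=1.35, alpha=0.1) fit (the dataset name df and the alpha value here are assumptions for the example, not part of this PR):

```scala
import org.apache.spark.ml.regression.LinearRegression

// MLlib uses the mean loss while sklearn uses the total loss, so divide
// sklearn's alpha by the number of training instances.
val numInstances = df.count()
val huber = new LinearRegression()
  .setLoss("huber")
  .setEpsilon(1.35)
  .setElasticNetParam(0.0)          // pure L2 penalty, mirroring sklearn
  .setRegParam(0.1 / numInstances)  // sklearn alpha / n
val model = huber.fit(df)
```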

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80975 has finished for PR 19020 at commit 9142471.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80977 has finished for PR 19020 at commit 5e0a868.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80981 has finished for PR 19020 at commit 00484b4.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

this can be factored out and only instantiated once, no?

Contributor Author

I don't think so. I tried to factor it out, but because LeastSquaresAggregator and HuberAggregator each pass in their own type, the compiler complains. Maybe @sethah can give some suggestions?
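For reference, a minimal hedged sketch (made-up names, not the PR's actual classes) of the typing issue: each aggregator is parameterized by its own concrete type, so a single factored-out construction site loses that type:

```scala
// Each aggregator passes its own concrete type as the type parameter (F-bounded).
abstract class LossAggregator[Agg <: LossAggregator[Agg]] {
  def add(instance: Double): Agg
  def merge(other: Agg): Agg
}

class LeastSquaresAgg extends LossAggregator[LeastSquaresAgg] {
  def add(instance: Double): LeastSquaresAgg = this
  def merge(other: LeastSquaresAgg): LeastSquaresAgg = this
}

class HuberAgg extends LossAggregator[HuberAgg] {
  def add(instance: Double): HuberAgg = this
  def merge(other: HuberAgg): HuberAgg = this
}

// A value shared by both branches can only be typed as LossAggregator[_],
// which hides the concrete type that add/merge need, so the compiler complains.
```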

Contributor

Oh right, ok.

@MLnick
Contributor

MLnick commented Aug 22, 2017

So did we decide not to expose the equivalents of scale_ and outliers_ from sklearn?

@yanboliang
Contributor Author

@MLnick Yeah, I think we reached an agreement in the JIRA discussion.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80983 has finished for PR 19020 at commit 5985f7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Great work! I left some minor comments.

Contributor

Use coefficients(numFeatures)

Contributor Author

coefficients doesn't contain the intercept element, so we can't get the intercept from the coefficients array.

Contributor

Is it OK to include the intercept in coefficients, or to create a class variable for the intercept?
Either way, maybe we should avoid reading values from the Broadcast in each add.

Contributor Author

I created a class variable for intercept.
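For illustration, a minimal hedged sketch (the class name, the parameter layout [coefficients..., intercept, sigma], and the members are assumptions, not this PR's code) of reading the broadcast values once instead of calling bcParameters.value inside every add:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

class HuberAggregatorSketch(fitIntercept: Boolean, bcParameters: Broadcast[Vector])
  extends Serializable {

  private val dim: Int = bcParameters.value.size

  // Materialized once per executor, then reused by every add call.
  @transient private lazy val coefficients: Array[Double] =
    bcParameters.value.toArray.slice(0, dim - 2)
  @transient private lazy val intercept: Double =
    if (fitIntercept) bcParameters.value(dim - 2) else 0.0
  @transient private lazy val sigma: Double = bcParameters.value(dim - 1)

  def add(features: Vector, label: Double): this.type = {
    var margin = intercept
    features.foreachActive { (i, v) => margin += coefficients(i) * v }
    val linearLoss = label - margin
    // ... loss and gradient accumulation using sigma and linearLoss would go here ...
    this
  }
}
```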

Contributor

The epsilon is the parameter M from the paper. Using a consistent name would be better.

Contributor

Use ~== relTol instead of ===

Contributor

Use ===
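For context, a hedged sketch of the two assertion styles being asked for (as they would appear inside a Spark test suite, using the TestingUtils syntax; the values are made up):

```scala
import org.apache.spark.ml.util.TestingUtils._

// Approximate comparison for floating-point results that can vary slightly.
assert(2.7000001 ~== 2.7 relTol 1E-5)

// Exact comparison where the value is deterministic.
assert(Seq(1, 2, 3) === Seq(1, 2, 3))
```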

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81310 has finished for PR 19020 at commit 0220df7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81308 has finished for PR 19020 at commit d95a382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81315 has finished for PR 19020 at commit 4836810.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@MLnick @WeichenXu123 Thanks for your comments; also cc @jkbradley @hhbyyh @sethah, would you mind taking a look? Thanks.

@WeichenXu123
Contributor

Looks good. cc @jkbradley Thanks!

@felixcheung
Member

big vote for python and R :)

@jkbradley
Member

I'll check this out now

Member

@jkbradley jkbradley left a comment

This seems very useful to add--thanks! I have a few questions:

  • Echoing @WeichenXu123 's comment: Why use "epsilon" as the Param name?
  • I'd like us to provide the estimated scaling factor (sigma from the paper) in the Model. That seems useful for model interpretation and debugging.

We should update the persistence test by updating allParamSettings.

Member

This description is misleadingly general since this claim only applies to normally distributed data. How about referencing the part of the paper which talks about this so that people can look up what is meant here?

Contributor Author

Done.

Member

style: It'd be nice to put parentheses around (linearLoss / sigma) for clarity

Contributor Author

Done.

Member

Mark as expertParam (same for set/get)

Contributor Author

Done.

Member

Let's keep exact specifications of the losses being used. This is one of my big annoyances with many ML libraries: It's hard to tell exactly what loss is being used, which makes it hard to compare/validate results across different ML libraries.

It'd also be nice to make it clear what we mean by "huber," in particular that we estimate the scale parameter from data.

Contributor Author

I agree, and I added math formulas for both the squaredError and huber loss functions.
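For reference, a hedged LaTeX sketch of the two per-instance losses as described in this PR (the exact docstring wording may differ; the 1/2 factor follows the description above):

```latex
\[
L_{\text{squaredError}}(w; x_i, y_i) = \tfrac{1}{2}\,(w^{T} x_i - y_i)^{2}
\]
\[
L_{\text{huber}}(w, \sigma; x_i, y_i) =
  \tfrac{1}{2}\left(\sigma + H_M\!\left(\tfrac{w^{T} x_i - y_i}{\sigma}\right)\sigma\right),
\qquad
H_M(z) =
  \begin{cases}
    z^{2} & \text{if } |z| < M \\
    2M|z| - M^{2} & \text{otherwise}
  \end{cases}
\]
```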

Member

Log epsilon (M) as well

Contributor Author

Done.

Member

How about calling this "squaredError", since the loss is "squared error," not "least squares"?

Member

Do you know if these integration tests are in the L1 penalty regime or in the L2 regime? It'd be nice to make sure we're testing both.

Contributor Author

Yes, the test data is composed of two parts, inlierData and outlierData, and I have checked that both regimes are tested. Thanks.

@sethah
Contributor

sethah commented Sep 20, 2017

I disagree that this should be combined with Linear Regression. IMO, this belongs as its own algorithm. The fact that there would be code duplication in that case is indicative that we don't have good abstractions and code sharing in place, not that we should combine different algorithms using case expressions internally.

@yanboliang
Contributor Author

yanboliang commented Sep 22, 2017

@jkbradley Thanks for your comments, I have addressed all your inline comments. Please see replies to your other questions below:

Echoing @WeichenXu123 's comment: Why use "epsilon" as the Param name?

We have two candidate names: epsilon or m; neither of them is very descriptive. I referred to sklearn's HuberRegressor and kept the name consistent with it.

I'd like us to provide the estimated scaling factor (sigma from the paper) in the Model. That seems useful for model interpretation and debugging.

I'm hesitant to add it to LinearRegression in case it confuses users who just try different losses, but I'm OK with adding it (and will do so after collecting all comments). What should sigma be if users fit with squaredError: a default value, or should we throw an exception? I'd prefer 1.0 as the default value; what do you think? Thanks.

@yanboliang
Contributor Author

yanboliang commented Sep 22, 2017

@sethah On the issue of whether huber linear regression should share a codebase with LinearRegression, we already had a discussion on JIRA. In the end @dbtsai and I reached an agreement to combine them in a single class. Also, in that paper, huber regression is a robust hybrid of lasso and ridge regression, so I think we can regard it as one case of linear regression and combine them.
@jkbradley @MLnick @WeichenXu123 @hhbyyh What's your opinion? Thanks.

@SparkQA

SparkQA commented Sep 22, 2017

Test build #82075 has finished for PR 19020 at commit aa7f454.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 22, 2017

Test build #82086 has started for PR 19020 at commit aa7f454.

@yanboliang
Contributor Author

Jenkins, test this please.

@sethah
Contributor

sethah commented Sep 25, 2017

@yanboliang Yeah, I saw the discussion and it seems to me the reason was: there would be too much code duplication. Sure, it's true that there would be code duplication, but to me that's a reason to work on the internals so that there is less code duplication, rather than just to continue patching around a design that doesn't work very well. We can combine them, I just don't think we should. I know I'm late to the discussion, so there's already been a lot of work. But these things can't really be undone due to backwards compatibility. We could work on creating better interfaces for plugging in loss/prediction/optimizer, which I think is the best way to approach it. Linear and logistic regression seem like they are just becoming giant, monolithic pieces of code.

I guess the argument against it will be lack of developer bandwidth. If that's the case, OK, but then I'd argue to just leave Huber regression to be implemented by an external package. If we don't have the bandwidth to do it in a robust, well-designed way, then I don't think doing it the easy way is a good solution either. My first vote is to implement it as a separate estimator; my second vote would be to leave it for a Spark package.

@SparkQA

SparkQA commented Sep 25, 2017

Test build #82151 has finished for PR 19020 at commit aa7f454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

We have two candidate names: epsilon or m

I see; that seems fine then, though I worry that we use "epsilon" in MLlib (tests) for "a very small positive number." Can we document it more clearly, including the comment that it matches sklearn and is "M" from the paper?

provide the estimated scaling factor (sigma from the paper)

I'd say:

  • Either we provide it as 1 for regular linear regression (since that is technically correct)
  • Or we take this as an indication that @sethah 's comment about separating the classes is better.

Re: @sethah 's comment about separating classes, I'll comment in the JIRA since that's a bigger discussion.

@WeichenXu123
Contributor

I also vote to combine them into one estimator; here are my two cents:
1. Regression with huber loss is one kind of linear regression, and it makes sense to switch between different loss functions.
2. Combining them into one estimator is more visible to users, making it easy to try linear regression with different loss functions.
3. It reduces a lot of code duplication.
Thanks!

@WeichenXu123
Contributor

LGTM. thanks!

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84790 has finished for PR 19020 at commit 2c404ff.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84797 has finished for PR 19020 at commit 2c404ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84817 has finished for PR 19020 at commit 4304b6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@hhbyyh hhbyyh left a comment

LGTM.

One thing I noticed is that we did not really compare the loss with other libraries (like sklearn), which is also missing for the other linear algorithms. Do you think it would be a good idea to add that?

val linearLoss = label - margin

if (math.abs(linearLoss) <= sigma * epsilon) {
  lossSum += 0.5 * weight * (sigma + math.pow(linearLoss, 2.0) / sigma)
Contributor

extra space after +
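For context, a hedged, self-contained sketch (not the PR's exact code) of the per-instance huber loss with both branches, following the objective in the PR description:

```scala
def huberLoss(linearLoss: Double, sigma: Double, epsilon: Double, weight: Double): Double = {
  if (math.abs(linearLoss) <= sigma * epsilon) {
    // Quadratic (inlier) region of H_M: sigma + linearLoss^2 / sigma
    0.5 * weight * (sigma + math.pow(linearLoss, 2.0) / sigma)
  } else {
    // Linear (outlier) region: H_M(z) = 2 * M * |z| - M^2
    0.5 * weight * (sigma + 2.0 * epsilon * math.abs(linearLoss) - sigma * epsilon * epsilon)
  }
}
```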

with LinearRegressionParams with MLWritable {

  def this(uid: String, coefficients: Vector, intercept: Double) =
    this(uid, coefficients, intercept, 1.0)
Contributor

Is it better to set the default scale to an impossible value like 0 or -1?

Contributor Author

@yanboliang yanboliang Dec 14, 2017

scale denotes the factor by which |y - X'w - c| is scaled down, so I think it makes sense to set it to 1.0 for least squares regression.

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84883 has finished for PR 19020 at commit d4369ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Merged into master, thanks everyone for the reviews.

@asfgit asfgit closed this in 1e44dd0 Dec 14, 2017
@yanboliang yanboliang deleted the spark-3181 branch December 14, 2017 05:23