
Conversation

@yanboliang
Contributor

@yanboliang yanboliang commented Aug 22, 2017

What changes were proposed in this pull request?

MLlib LinearRegression supports huber loss in addition to the leastSquares loss. The huber loss objective function (Eq.(6) and Eq.(8) in A robust hybrid of lasso and ridge regression) is:

    min over (w, σ):  Σ_i [ σ + H_M((x_i^T w − y_i) / σ) · σ ] + regularization term,
    where H_M(z) = z² if |z| < M, and 2M|z| − M² otherwise.

This objective is jointly convex as a function of (w, σ) ∈ R^d × (0, ∞), so we can use L-BFGS-B to solve it.

The current implementation is a straightforward port of Python scikit-learn's HuberRegressor. There are some differences:

  • We use the mean loss (lossSum/weightSum), while sklearn uses the total loss (lossSum).
  • We multiply the loss function and the L2 regularization by 1/2. Multiplying the whole formula by a constant factor does not affect the result; we just keep it consistent with the leastSquares loss.

So when fitting without regularization, MLlib and sklearn produce the same output. When fitting with regularization, MLlib's regParam should be set to sklearn's alpha divided by the number of instances to match sklearn's output; see the sketch below.
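For illustration, a hedged sketch of matching a sklearn HuberRegressor(epsilon=1.35, alpha=0.1) fit (the dataset name df and the alpha value here are assumptions for the example, not part of this PR):

```scala
import org.apache.spark.ml.regression.LinearRegression

// MLlib uses the mean loss while sklearn uses the total loss, so divide
// sklearn's alpha by the number of training instances.
val numInstances = df.count()
val huber = new LinearRegression()
  .setLoss("huber")
  .setEpsilon(1.35)
  .setElasticNetParam(0.0)          // pure L2 penalty, mirroring sklearn
  .setRegParam(0.1 / numInstances)  // sklearn alpha / n
val model = huber.fit(df)
```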

How was this patch tested?

Unit tests.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80975 has finished for PR 19020 at commit 9142471.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80977 has finished for PR 19020 at commit 5e0a868.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80981 has finished for PR 19020 at commit 00484b4.

  • This patch fails to generate documentation.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

this can be factored out and only instantiated once, no?

Contributor Author

I don't think so. I tried to factor it out, but because LeastSquaresAggregator and HuberAggregator each pass in their own type, the compiler complains. Maybe @sethah can give some suggestions?
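For reference, a minimal hedged sketch (made-up names, not the PR's actual classes) of the typing issue: each aggregator is parameterized by its own concrete type, so a single factored-out construction site loses that type:

```scala
// Each aggregator passes its own concrete type as the type parameter (F-bounded).
abstract class LossAggregator[Agg <: LossAggregator[Agg]] {
  def add(instance: Double): Agg
  def merge(other: Agg): Agg
}

class LeastSquaresAgg extends LossAggregator[LeastSquaresAgg] {
  def add(instance: Double): LeastSquaresAgg = this
  def merge(other: LeastSquaresAgg): LeastSquaresAgg = this
}

class HuberAgg extends LossAggregator[HuberAgg] {
  def add(instance: Double): HuberAgg = this
  def merge(other: HuberAgg): HuberAgg = this
}

// A value shared by both branches can only be typed as LossAggregator[_],
// which hides the concrete type that add/merge need, so the compiler complains.
```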

Contributor

Oh right, ok.

@MLnick
Contributor

MLnick commented Aug 22, 2017

So did we decide not to expose the equivalents of scale_ and outliers_ from sklearn?

@yanboliang
Contributor Author

@MLnick Yeah, I think we reached an agreement in the JIRA discussion.

@SparkQA

SparkQA commented Aug 22, 2017

Test build #80983 has finished for PR 19020 at commit 5985f7e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@WeichenXu123 WeichenXu123 left a comment

Great work! I left some minor comments.

Contributor

Use coefficients(numFeatures)

Contributor Author

coefficients doesn't contain the intercept element, so we can't get the intercept from the coefficients array.

Contributor

Is it OK to include the intercept in coefficients, or to create a class variable for the intercept?
Either way, maybe we should avoid reading values from the Broadcast in each add.

Contributor Author

I created a class variable for intercept.
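For illustration, a minimal hedged sketch (the class name, the parameter layout [coefficients..., intercept, sigma], and the members are assumptions, not this PR's code) of reading the broadcast values once instead of calling bcParameters.value inside every add:

```scala
import org.apache.spark.broadcast.Broadcast
import org.apache.spark.ml.linalg.Vector

class HuberAggregatorSketch(fitIntercept: Boolean, bcParameters: Broadcast[Vector])
  extends Serializable {

  private val dim: Int = bcParameters.value.size

  // Materialized once per executor, then reused by every add call.
  @transient private lazy val coefficients: Array[Double] =
    bcParameters.value.toArray.slice(0, dim - 2)
  @transient private lazy val intercept: Double =
    if (fitIntercept) bcParameters.value(dim - 2) else 0.0
  @transient private lazy val sigma: Double = bcParameters.value(dim - 1)

  def add(features: Vector, label: Double): this.type = {
    var margin = intercept
    features.foreachActive { (i, v) => margin += coefficients(i) * v }
    val linearLoss = label - margin
    // ... loss and gradient accumulation using sigma and linearLoss would go here ...
    this
  }
}
```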

Contributor

The epsilon is the parameter M from the paper. Using a consistent name would be better.

Contributor

Use ~== relTol instead of ===

Contributor

Use ===
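For context, a hedged sketch of the two assertion styles being asked for (as they would appear inside a Spark test suite, using the TestingUtils syntax; the values are made up):

```scala
import org.apache.spark.ml.util.TestingUtils._

// Approximate comparison for floating-point results that can vary slightly.
assert(2.7000001 ~== 2.7 relTol 1E-5)

// Exact comparison where the value is deterministic.
assert(Seq(1, 2, 3) === Seq(1, 2, 3))
```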

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81310 has finished for PR 19020 at commit 0220df7.

  • This patch fails to build.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81308 has finished for PR 19020 at commit d95a382.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2017

Test build #81315 has finished for PR 19020 at commit 4836810.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

@MLnick @WeichenXu123 Thanks for your comments; also cc @jkbradley @hhbyyh @sethah, would you mind taking a look? Thanks.

@WeichenXu123
Contributor

Looks good. cc @jkbradley Thanks!

@felixcheung
Member

big vote for python and R :)

@jkbradley
Member

I'll check this out now

Member

@jkbradley jkbradley left a comment

This seems very useful to add--thanks! I have a few questions:

  • Echoing @WeichenXu123 's comment: Why use "epsilon" as the Param name?
  • I'd like us to provide the estimated scaling factor (sigma from the paper) in the Model. That seems useful for model interpretation and debugging.

We should update the persistence test by updating allParamSettings.

Member

This description is misleadingly general since this claim only applies to normally distributed data. How about referencing the part of the paper which talks about this so that people can look up what is meant here?

Contributor Author

Done.

Member

style: It'd be nice to put parentheses around (linearLoss / sigma) for clarity

Contributor Author

Done.

Member

Mark as expertParam (same for set/get)

Contributor Author

Done.

Member

Let's keep exact specifications of the losses being used. This is one of my big annoyances with many ML libraries: It's hard to tell exactly what loss is being used, which makes it hard to compare/validate results across different ML libraries.

It'd also be nice to make it clear what we mean by "huber," in particular that we estimate the scale parameter from data.

Contributor Author

I agree, and I added math formulas for both the squaredError and huber loss functions.
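For reference, a hedged LaTeX sketch of the two per-instance losses as described in this PR (the exact docstring wording may differ; the 1/2 factor follows the description above):

```latex
\[
L_{\text{squaredError}}(w; x_i, y_i) = \tfrac{1}{2}\,(w^{T} x_i - y_i)^{2}
\]
\[
L_{\text{huber}}(w, \sigma; x_i, y_i) =
  \tfrac{1}{2}\left(\sigma + H_M\!\left(\tfrac{w^{T} x_i - y_i}{\sigma}\right)\sigma\right),
\qquad
H_M(z) =
  \begin{cases}
    z^{2} & \text{if } |z| < M \\
    2M|z| - M^{2} & \text{otherwise}
  \end{cases}
\]
```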

Member

Log epsilon (M) as well

Contributor Author

Done.

Member

How about calling this "squaredError", since the loss is "squared error," not "least squares"?

Member

Do you know if these integration tests are in the L1 penalty regime or in the L2 regime? It'd be nice to make sure we're testing both.

Contributor Author

Yes, the test data is composed of two parts, inlierData and outlierData, and I have checked that both regimes are tested. Thanks.

@sethah
Contributor

sethah commented Sep 20, 2017

I disagree that this should be combined with Linear Regression. IMO, this belongs as its own algorithm. The fact that there would be code duplication in that case is indicative that we don't have good abstractions and code sharing in place, not that we should combine different algorithms using case expressions internally.

@yanboliang
Contributor Author

yanboliang commented Sep 22, 2017

@jkbradley Thanks for your comments, I have addressed all your inline comments. Please see replies to your other questions below:

Echoing @WeichenXu123 's comment: Why use "epsilon" as the Param name?

We have two candidate names: epsilon or m; neither of them is very descriptive. I referred to sklearn's HuberRegressor and kept the name consistent with it.

I'd like us to provide the estimated scaling factor (sigma from the paper) in the Model. That seems useful for model interpretation and debugging.

I'm hesitant to add it to LinearRegression in case it confuses users who just try different losses, but I'm OK with adding it (and will do so after collecting all comments). What should sigma be if users fit with squaredError: a default value, or should we throw an exception? I'd prefer 1.0 as the default value; what do you think? Thanks.

@yanboliang
Contributor Author

yanboliang commented Sep 22, 2017

@sethah On the issue of whether huber linear regression should share a codebase with LinearRegression, we already had a discussion on JIRA. In the end @dbtsai and I reached an agreement to combine them in a single class. Also, in that paper, huber regression is a robust hybrid of lasso and ridge regression, so I think we can regard it as one case of linear regression and combine them.
@jkbradley @MLnick @WeichenXu123 @hhbyyh What's your opinion? Thanks.

@SparkQA

SparkQA commented Sep 22, 2017

Test build #82075 has finished for PR 19020 at commit aa7f454.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Sep 22, 2017

Test build #82086 has started for PR 19020 at commit aa7f454.

@yanboliang
Contributor Author

Jenkins, test this please.

@sethah
Contributor

sethah commented Sep 25, 2017

@yanboliang Yeah, I saw the discussion and it seems to me the reason was: there would be too much code duplication. Sure, it's true that there would be code duplication, but to me that's a reason to work on the internals so that there is less code duplication, rather than just to continue patching around a design that doesn't work very well. We can combine them, I just don't think we should. I know I'm late to the discussion, so there's already been a lot of work. But these things can't really be undone due to backwards compatibility. We could work on creating better interfaces for plugging in loss/prediction/optimizer, which I think is the best way to approach it. Linear and logistic regression seem like they are just becoming giant, monolithic pieces of code.

I guess the argument against it will be lack of developer bandwidth. If that's the case, OK, but then I'd argue to just leave Huber regression to be implemented by an external package. If we don't have the bandwidth to do it in a robust, well-designed way, then I don't think doing it the easy way is a good solution either. My first vote is to implement it as a separate estimator; my second vote would be to leave it for a Spark package.

@SparkQA

SparkQA commented Sep 25, 2017

Test build #82151 has finished for PR 19020 at commit aa7f454.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jkbradley
Member

We have two candidate names: epsilon or m

I see; that seems fine then, though I worry that we use "epsilon" in MLlib (tests) for "a very small positive number." Can we document it more clearly, including the comment that it matches sklearn and is "M" from the paper?

provide the estimated scaling factor (sigma from the paper)

I'd say:

  • Either we provide it as 1 for regular linear regression (since that is technically correct)
  • Or we take this as an indication that @sethah 's comment about separating the classes is better.

Re: @sethah 's comment about separating classes, I'll comment in the JIRA since that's a bigger discussion.

@WeichenXu123
Contributor

I also vote to combine them into one estimator; here are my two cents:
1. Regression with huber loss is one kind of linear regression, and it makes sense to switch between different loss functions.
2. Combining them into one estimator is more visible to users, making it easy to try linear regression with different loss functions.
3. It reduces a lot of code duplication.
Thanks!

@WeichenXu123
Contributor

LGTM. thanks!

@SparkQA

SparkQA commented Dec 12, 2017

Test build #84790 has finished for PR 19020 at commit 2c404ff.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Jenkins, test this please.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84797 has finished for PR 19020 at commit 2c404ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Dec 13, 2017

Test build #84817 has finished for PR 19020 at commit 4304b6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

Contributor

@hhbyyh hhbyyh left a comment

LGTM.

One thing I noticed is that we did not really compare the loss with other libraries (like sklearn), which is also missing for the other linear algorithms. Do you think it would be a good idea to add that?

val linearLoss = label - margin

if (math.abs(linearLoss) <= sigma * epsilon) {
  lossSum += 0.5 * weight * (sigma + math.pow(linearLoss, 2.0) / sigma)
Contributor

extra space after +
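For context, a hedged, self-contained sketch (not the PR's exact code) of the per-instance huber loss with both branches, following the objective in the PR description:

```scala
def huberLoss(linearLoss: Double, sigma: Double, epsilon: Double, weight: Double): Double = {
  if (math.abs(linearLoss) <= sigma * epsilon) {
    // Quadratic (inlier) region of H_M: sigma + linearLoss^2 / sigma
    0.5 * weight * (sigma + math.pow(linearLoss, 2.0) / sigma)
  } else {
    // Linear (outlier) region: H_M(z) = 2 * M * |z| - M^2
    0.5 * weight * (sigma + 2.0 * epsilon * math.abs(linearLoss) - sigma * epsilon * epsilon)
  }
}
```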

with LinearRegressionParams with MLWritable {

  def this(uid: String, coefficients: Vector, intercept: Double) =
    this(uid, coefficients, intercept, 1.0)
Contributor

Is it better to set the default scale to an impossible value like 0 or -1?

Contributor Author

@yanboliang yanboliang Dec 14, 2017

scale denotes the factor by which |y - X'w - c| is scaled down, so I think it makes sense to set it to 1.0 for least squares regression.

@SparkQA

SparkQA commented Dec 14, 2017

Test build #84883 has finished for PR 19020 at commit d4369ff.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Contributor Author

Merged into master, thanks everyone for the reviews.

@asfgit asfgit closed this in 1e44dd0 Dec 14, 2017
@yanboliang yanboliang deleted the spark-3181 branch December 14, 2017 05:23