[SPARK-14709][ML] spark.ml API for linear SVM #15211
Conversation
@yanboliang this is a quick prototype for the ML SVM. I'll make another pass tomorrow to refine the code and add more unit tests. It mimics much of the behavior of LR, though I'm not sure whether we need to match LR's level of complexity.
Test build #65816 has finished for PR 15211 at commit
Test build #65898 has finished for PR 15211 at commit
     */
    @Since("2.1.0")
    @Experimental
    class SVM @Since("2.1.0") (
What about SVMClassifier? We can also train regression models with SVM.
I changed it to LinearSVC, so we can add other SVM classifiers in the future.
    @Experimental
    class SVM @Since("2.1.0") (
        @Since("2.1.0") override val uid: String)
      extends Predictor[Vector, SVM, SVMModel]
Should this be under the Classifier or ProbabilisticClassifier framework?
Thanks, I changed it to Classifier. AFAIK, the SVM raw prediction cannot be interpreted as a probability.
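For context on what this choice means at the API surface, here is a minimal usage sketch (it assumes the class as it eventually shipped, LinearSVC, and uses an illustrative libsvm-format file path): a Classifier-based model produces rawPrediction and prediction columns but no probability column, consistent with the point that the hinge margin is not a probability.

    import org.apache.spark.ml.classification.LinearSVC
    import org.apache.spark.sql.SparkSession

    object LinearSVCSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]").appName("linear-svc-sketch").getOrCreate()

        // Illustrative data path; any binary-labeled libsvm-format file works.
        val data = spark.read.format("libsvm").load("data/mllib/sample_libsvm_data.txt")

        val model = new LinearSVC().setMaxIter(10).setRegParam(0.1).fit(data)

        // Classifier => rawPrediction (the margin) and prediction, but no probability column.
        model.transform(data).select("label", "rawPrediction", "prediction").show(5)

        spark.stop()
      }
    }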
    override protected def train(dataset: Dataset[_]): SVMModel = {
      val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
      val instances: RDD[Instance] =
        dataset.select(col($(labelCol)).cast(DoubleType), w, col($(featuresCol))).rdd.map {
labelCol no longer needs to be cast to DoubleType here, because it is already cast in Predictor.fit(); see #15414.
Yes, thanks.
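A sketch of what the simplified extraction inside train() could look like once the explicit cast is dropped (this assumes, per #15414, that Predictor.fit() already casts labelCol to DoubleType; names and imports follow the hunk above, not necessarily the final code):

    // No col(...).cast(DoubleType) needed for the label anymore:
    val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) lit(1.0) else col($(weightCol))
    val instances: RDD[Instance] =
      dataset.select(col($(labelCol)), w, col($(featuresCol))).rdd.map {
        case Row(label: Double, weight: Double, features: Vector) =>
          Instance(label, weight, features)
      }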
      with SVMParams with DefaultParamsWritable {

      @Since("2.1.0")
      def this() = this(Identifiable.randomUID("svm"))
If we rename this to SVMClassifier, the uid should be "svc".
    }

    val instr = Instrumentation.create(this, instances)
    instr.logParams(regParam, standardization, threshold, maxIter, tol, fitIntercept)
To keep in line with other algorithms, labelCol, weightCol, and featuresCol should be logged here as well.
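That is, something along these lines (a sketch of the suggestion, not necessarily the final param list):

    instr.logParams(regParam, standardization, threshold, maxIter, tol, fitIntercept,
      labelCol, weightCol, featuresCol)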
        n
      case None => histogram.length
    }
require(numClasses == 2, "...") ?
    val localCoefficientsArray = coefficientsArray
    val localGradientSumArray = gradientSumArray

    numClasses match {
What about moving the numClasses check outside of add()?
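Combined with the require(numClasses == 2, "...") suggestion above, a sketch of the hoisting (the wording and placement are illustrative, not the final implementation):

    // In train(), right after numClasses is derived from the label histogram:
    require(numClasses == 2,
      s"LinearSVC only supports binary classification, but $numClasses classes were found.")

    // With the check done once up front, the aggregator's add() can handle the
    // binary case directly instead of matching on numClasses for every instance.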
Thanks for the suggestions @yanboliang @zhengruifeng. I'm thinking of renaming the class to LinearSVMClassifier. As in other linear SVM implementations, this will only be a binary classifier; multiclass classification will be supported via one-vs-rest.
Test build #70155 has finished for PR 15211 at commit
Sent an update to address some comments.
Test build #70232 has finished for PR 15211 at commit
Removed WIP. This is ready for review. Thanks.
Test build #71105 has finished for PR 15211 at commit
Sent an update to include an R unit test. However, I ran into a problem: there is a constant scaling difference between LinearSVC and R's e1071 (which is essentially LIBSVM). It may be caused by some parameter setting. Posting it anyway in case there are suggestions. Sorry @zhengruifeng, I'll address your comment in the next update.
jkbradley left a comment
For comparing with R, I'm wondering if the main issue is that it's hard to calculate the appropriate C given a regParam setting. Would it be easier to use this R package instead? https://cran.r-project.org/web/packages/svmpath/
Also, the test with sample weights takes 40 seconds. Does it still pass if you increase the 'tol' Param to make the test faster?
    Use the following R code to load the data and train the model using the e1071 package.
    library(e1071)
    data <- read.csv("/home/yuhao/workspace/github/hhbyyh/Test/SVM/svm/part-00000", header=FALSE)
How about basing the data location at target/tmp/LinearSVC/binaryDataset to match the data export?
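A sketch of an export step that would write the generated data to the suggested location (the dataset name, delimiter, and column layout are illustrative assumptions, chosen to match what the R snippet reads):

    // Dump the generated binary dataset under the build directory so the R
    // snippet can read it from a stable, relative location.
    binaryDataset.rdd
      .map(row => row.getDouble(0) + "," +
        row.getAs[org.apache.spark.ml.linalg.Vector](1).toArray.mkString(","))
      .repartition(1)
      .saveAsTextFile("target/tmp/LinearSVC/binaryDataset")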
     */
    val coefficientsR = Vectors.dense(-7.310475, -14.89742, -22.21019, -29.83495)
    val interceptR = -7.440296
    assert(model1.intercept / interceptR ~== -0.9 relTol 2E-2)
This is a strange way to write the comparison. Was this a temporary thing to make the tests pass?
Thanks @jkbradley. I reviewed the gradient and loss function. The corresponding regularization lambda should satisfy lambda * C * N (data size) = 2. I corrected the unit test against R's e1071 and added another unit test against scikit-learn.
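As a quick consistency check of that relationship (the dataset size N = 10,000 is inferred from these numbers rather than stated anywhere in the thread):

    \lambda = \frac{2}{C \cdot N} = \frac{2}{10 \times 10000} = 2 \times 10^{-5}

which matches cost = 10 in the e1071 call and setRegParam(0.00002) in the Spark test below.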
Test build #71618 has finished for PR 15211 at commit
Test build #71619 has finished for PR 15211 at commit
jkbradley left a comment
@hhbyyh This looks great. Combining the tests to reduce test time is the only remaining issue, I believe.
      assert(model1.coefficients ~== coefficientsR relTol 1E-2)
    }

    test("linearSVC comparison with scikit-learn") {
Let's combine these to avoid retraining since this is the same as the R test. (And training takes 35 sec)
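One way the combined test could be structured so the model is fit only once and checked against both references (a sketch; the regParam value comes from this thread, while the maxIter/tol settings and the coefficientsR/interceptR/~== helpers are assumed to be those defined in the test file):

    test("linearSVC comparison with R e1071 and scikit-learn") {
      val trainer = new LinearSVC()
        .setRegParam(0.00002) // equivalent to cost = 10 in the e1071/LIBSVM run, via regParam = 2 / (C * N)
        .setMaxIter(200)
        .setTol(1e-4)
      val model = trainer.fit(binaryDataset)

      // Both reference implementations converge to (nearly) the same solution,
      // so one fitted model can be compared against both sets of expected values.
      assert(model.coefficients ~== coefficientsR relTol 1E-3)
      assert(model.intercept ~== interceptR relTol 1E-3)
    }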
    features <- as.matrix(data.frame(data$V2, data$V3, data$V4, data$V5))
    svm_model <- svm(features, label, type='C', kernel='linear', cost=10, scale=F, tolerance=1e-4)
    summary(svm_model)
    w <- -t(svm_model$coefs) %*% svm_model$SV
Remove "-" to make values match lines coefficientsR, interceptR below
    > w
            data.V2   data.V3   data.V4   data.V5
    [1,] -7.310475 -14.89742 -22.21019 -29.83495
    > -svm_model$rho
same here
"-" here may be necessary as b = -model$rho #Offset
    test("linearSVC comparison with scikit-learn") {
      val trainer1 = new LinearSVC()
        .setRegParam(0.00002)
Add a comment that this matches C above.
Test build #71620 has finished for PR 15211 at commit
I see, I'll work on combining the tests now. Also, I'm wondering if we should consider using
Test build #71623 has finished for PR 15211 at commit
Test build #71625 has finished for PR 15211 at commit
jkbradley left a comment
I'd like to keep regParam. I think it's about as common in literature and practice as specifying the constraint C.
    > w
            data.V2   data.V3   data.V4   data.V5
    [1,] -7.310338 -14.89741 -22.21005 -29.83508
w and the intercept are still negative, which isn't what we wanted, right?
I misunderstood your last comment. Changing them to positive now.
    [1] -7.440177
     */
    val coefficientsR = Vectors.dense(7.310475, 14.89742, 22.21019, 29.83495)
Why are these values changed?
The updated weights are the correct ones, and they are stable. I forgot how I generated the original weights... maybe I used a different random seed for data generation.
Test build #71742 has finished for PR 15211 at commit
LGTM. One more step towards feature parity for the DataFrame-based API!
I'll create follow-up JIRAs (linked from this PR's JIRA). @hhbyyh Can I assign one or more to you?
Thanks @jkbradley for driving the review process.
What changes were proposed in this pull request?
jira: https://issues.apache.org/jira/browse/SPARK-14709
Provide an API for the SVM algorithm on DataFrames. As discussed in the JIRA, the initial implementation uses OWL-QN with the hinge loss function.
The API should mimic existing spark.ml.classification APIs.
Currently only binary classification is supported. Multiclass support can be added in this or a following release.
How was this patch tested?
New unit tests and a simple manual test.
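Since only binary classification is supported for now, here is a hedged sketch of how multiclass could be handled in the meantime with the existing OneVsRest meta-estimator, as mentioned earlier in the thread (assumes a Spark build that includes this PR's LinearSVC; the data path is illustrative):

    import org.apache.spark.ml.classification.{LinearSVC, OneVsRest}
    import org.apache.spark.sql.SparkSession

    object LinearSVCOneVsRestSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[2]").appName("linear-svc-ovr").getOrCreate()

        // Illustrative multiclass dataset in libsvm format.
        val data = spark.read.format("libsvm")
          .load("data/mllib/sample_multiclass_classification_data.txt")

        // LinearSVC itself is binary-only; OneVsRest trains one binary model per class.
        val svc = new LinearSVC().setMaxIter(20).setRegParam(0.1)
        val ovrModel = new OneVsRest().setClassifier(svc).fit(data)

        ovrModel.transform(data).select("label", "prediction").show(5)

        spark.stop()
      }
    }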