
Conversation

@zhengruifeng (Contributor) commented Oct 12, 2020

What changes were proposed in this pull request?

Use a lazy array instead of vars for the auxiliary variables in binary LOR.

Why are the changes needed?

In #29255, I made a mistake:
the private vars _threshold and _rawThreshold are initialized with the default value of threshold. That is because:
1. the param threshold is set to its default value first;
2. _threshold and _rawThreshold are initialized based on that default value;
3. the param threshold is then updated with the value from the estimator, via the copyValues method:

      if (map.contains(param) && to.hasParam(param.name)) {
        to.set(param.name, map(param))
      }

We can update _threshold and _rawThreshold in setThreshold and setThresholds, but we cannot update them in set/copyValues, so their stale values are kept until setThreshold or setThresholds is called.
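
A minimal sketch of the two initialization orders (illustrative Scala only, not the actual Spark code; the class and field names here are made up):

class BinaryModel {
  var threshold: Double = 0.5  // the param default, set first

  // Eager (the buggy pattern): evaluated at construction time,
  // before copyValues has copied the estimator's threshold.
  val eagerRawThreshold: Double = math.log(threshold / (1.0 - threshold))

  // Lazy (the fix): evaluated on first access, i.e. after copyValues has run.
  lazy val lazyRawThreshold: Double = math.log(threshold / (1.0 - threshold))
}

val m = new BinaryModel
m.threshold = 0.1             // simulates copyValues from the estimator
m.eagerRawThreshold           // 0.0, stale: log(0.5 / 0.5)
m.lazyRawThreshold            // -2.1972..., correct: log(0.1 / 0.9)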

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in the REPL.

@zhengruifeng (Contributor, Author) commented Oct 12, 2020

Test script, run at commit 27eab00:

import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel



val df = spark.read.option("numFeatures", "2000").format("libsvm")
  .load("/data1/Datasets/epsilon/epsilon_normalized.t")
  .withColumn("aftcensor", (col("label") + 1) / 2)
  .withColumn("aftlabel", (col("label") + 2) / 2)
  .withColumn("label", (col("label") + 1) / 2)
  .limit(100)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val vec = df.select("features").head.getAs[Vector](0)

val lor = new LogisticRegression().setMaxIter(1).setThreshold(0.1)

val lorm = lor.fit(df)

lorm.getThreshold

lorm.predict(vec)

Results:
// master
scala> val lorm = lor.fit(df)
20/10/12 15:47:23 WARN LogisticRegressionModel: _threshold=0.5, _rawThreshold=0.0
lorm: org.apache.spark.ml.classification.LogisticRegressionModel = LogisticRegressionModel: uid=logreg_4c79066a563d, numClasses=2, numFeatures=2000

scala> lorm.getThreshold
res9: Double = 0.1

scala> lorm.predict(vec)
20/10/12 15:47:29 WARN LogisticRegressionModel: _threshold=0.5, _rawThreshold=0.0
res10: Double = 0.0

The _threshold and _rawThreshold here are incorrect: they still reflect the default threshold of 0.5 rather than the explicitly set 0.1.

// this PR
scala> lorm.predict(vec)
20/10/12 16:01:09 WARN LogisticRegressionModel: _threshold=0.1, _rawThreshold=-2.197224577336219
res3: Double = 0.0

Here _rawThreshold = log(0.1 / 0.9) ≈ -2.1972, the log-odds of the explicitly set threshold, as expected.

@zhengruifeng changed the title from "[SPARK-32455][ML][Follow-Up] LogisticRegressionModel prediction optimization" to "[SPARK-32455][ML][Follow-Up] LogisticRegressionModel prediction optimization - fix incorrect initialization" on Oct 12, 2020
@zhengruifeng (Contributor, Author) commented

@srowen @huaxingao

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34282/

SparkQA commented Oct 12, 2020

Test build #129677 has finished for PR 30013 at commit 49690eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34282/

@huaxingao (Contributor) commented

In #29255, the values of _threshold and _rawThreshold are not correct if threshold is explicitly set, right? Does that mean the results for predict and transform are not correct? I am just trying to figure out why the tests didn't catch the problem last time. Do we need to somehow improve the test coverage?

@zhengruifeng (Contributor, Author) commented Oct 13, 2020

@huaxingao In #29255, if threshold is set on the binary LOR estimator, then in the model the values of _threshold and _rawThreshold are wrong (they are initialized from the default threshold). If setThreshold or setThresholds is called on the model, _threshold and _rawThreshold are updated to the correct values.
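
To restate with the repro script from above (lor, df, and vec as defined there; this sketch describes the behavior on master, before this PR):

val lor = new LogisticRegression().setMaxIter(1).setThreshold(0.1)
val model = lor.fit(df)

model.getThreshold       // 0.1 -- the param itself is copied correctly
model.predict(vec)       // but prediction still uses the stale _threshold = 0.5

model.setThreshold(0.1)  // calling the setter refreshes _threshold/_rawThreshold
model.predict(vec)       // now uses the correct values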

Do we need to somehow improve the test coverage?

Yes, I think we need to figure out why the test suites failed to catch this, and update them if necessary.

@zhengruifeng (Contributor, Author) commented

@huaxingao I found that in the suite's test("thresholds prediction"), threshold is always explicitly set on the model before the transform verification. I will try to add a case for this:

val binaryModel = blr.fit(smallBinaryDataset)

binaryModel.setThreshold(1.0)
testTransformer[(Double, Vector)](smallBinaryDataset.toDF(), binaryModel, "prediction") {
@zhengruifeng (Contributor, Author) commented on the snippet above:

master fails in this modified case.
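
A sketch of such a case (illustrative only; it assumes the suite's smallBinaryDataset and testTransformer helper, and the names blr2/model2 and the assertion are made up for this sketch): verify predictions right after fit, without calling setThreshold on the model, so the threshold copied from the estimator must already be in effect.

import org.apache.spark.sql.Row

val blr2 = new LogisticRegression().setThreshold(1.0)  // set on the estimator only
val model2 = blr2.fit(smallBinaryDataset)
// No model2.setThreshold(...) here -- the copied threshold must be honored.
testTransformer[(Double, Vector)](smallBinaryDataset.toDF(), model2, "prediction") {
  case Row(pred: Double) =>
    // With threshold = 1.0, every instance should be predicted as class 0.
    assert(pred === 0.0)
}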

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34324/

SparkQA commented Oct 13, 2020

Test build #129718 has finished for PR 30013 at commit 7a1749c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34324/

@zhengruifeng (Contributor, Author) commented

Merged to master, thanks all!

@zhengruifeng deleted the lor_threshold_init branch on October 13, 2020 05:10