
Conversation

@zhengruifeng (Contributor) commented Oct 12, 2020

What changes were proposed in this pull request?

Use a lazy array instead of vars for the auxiliary variables in binary LOR.

Why are the changes needed?

In #29255, I made a mistake:
the private vars _threshold and _rawThreshold are initialized with the default value of threshold. That is because:
1. the param threshold is set to its default value first;
2. _threshold and _rawThreshold are initialized based on that default value;
3. the param threshold is then updated with the value from the estimator, via the copyValues method:

      if (map.contains(param) && to.hasParam(param.name)) {
        to.set(param.name, map(param))
      }

We can update _threshold and _rawThreshold in setThreshold and setThresholds, but we cannot update them in set/copyValues, so their stale values are kept until setThreshold or setThresholds is called.
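
A minimal sketch of the two initialization orders (illustrative Scala only, not the actual Spark code; the class and field names here are made up):

class BinaryModel {
  var threshold: Double = 0.5  // the param default, set first

  // Eager (the buggy pattern): evaluated at construction time,
  // before copyValues has copied the estimator's threshold.
  val eagerRawThreshold: Double = math.log(threshold / (1.0 - threshold))

  // Lazy (the fix): evaluated on first access, i.e. after copyValues has run.
  lazy val lazyRawThreshold: Double = math.log(threshold / (1.0 - threshold))
}

val m = new BinaryModel
m.threshold = 0.1             // simulates copyValues from the estimator
m.eagerRawThreshold           // 0.0, stale: log(0.5 / 0.5)
m.lazyRawThreshold            // -2.1972..., correct: log(0.1 / 0.9)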

Does this PR introduce any user-facing change?

No

How was this patch tested?

Tested in the REPL.

@zhengruifeng (Contributor, Author) commented Oct 12, 2020

Test script, run at commit 27eab00:

import scala.util.Random
import org.apache.spark.ml.linalg._
import org.apache.spark.ml.classification._
import org.apache.spark.ml.regression._
import org.apache.spark.sql.functions._
import org.apache.spark.storage.StorageLevel



val df = spark.read.option("numFeatures", "2000").format("libsvm")
  .load("/data1/Datasets/epsilon/epsilon_normalized.t")
  .withColumn("aftcensor", (col("label") + 1) / 2)
  .withColumn("aftlabel", (col("label") + 2) / 2)
  .withColumn("label", (col("label") + 1) / 2)
  .limit(100)
df.persist(StorageLevel.MEMORY_AND_DISK)
df.count

val vec = df.select("features").head.getAs[Vector](0)

val lor = new LogisticRegression().setMaxIter(1).setThreshold(0.1)

val lorm = lor.fit(df)

lorm.getThreshold

lorm.predict(vec)

Results:
// master
scala> val lorm = lor.fit(df)
20/10/12 15:47:23 WARN LogisticRegressionModel: _threshold=0.5, _rawThreshold=0.0
lorm: org.apache.spark.ml.classification.LogisticRegressionModel = LogisticRegressionModel: uid=logreg_4c79066a563d, numClasses=2, numFeatures=2000

scala> lorm.getThreshold
res9: Double = 0.1

scala> lorm.predict(vec)
20/10/12 15:47:29 WARN LogisticRegressionModel: _threshold=0.5, _rawThreshold=0.0
res10: Double = 0.0

The _threshold and _rawThreshold here are incorrect: they still reflect the default threshold of 0.5 rather than the explicitly set 0.1.

// this PR
scala> lorm.predict(vec)
20/10/12 16:01:09 WARN LogisticRegressionModel: _threshold=0.1, _rawThreshold=-2.197224577336219
res3: Double = 0.0

Here _rawThreshold = log(0.1 / 0.9) ≈ -2.1972, the log-odds of the explicitly set threshold, as expected.

@zhengruifeng changed the title from "[SPARK-32455][ML][Follow-Up] LogisticRegressionModel prediction optimization" to "[SPARK-32455][ML][Follow-Up] LogisticRegressionModel prediction optimization - fix incorrect initialization" on Oct 12, 2020
@zhengruifeng (Contributor, Author) commented

@srowen @huaxingao

SparkQA commented Oct 12, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34282/

SparkQA commented Oct 12, 2020

Test build #129677 has finished for PR 30013 at commit 49690eb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 12, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34282/

@huaxingao (Contributor) commented

In #29255, the values of _threshold and _rawThreshold are not correct if threshold is explicitly set, right? Does that mean the results for predict and transform are not correct? I am just trying to figure out why the tests didn't catch the problem last time. Do we need to somehow improve the test coverage?

@zhengruifeng (Contributor, Author) commented Oct 13, 2020

@huaxingao In #29255, if threshold is set on the binary LOR estimator, then in the model the values of _threshold and _rawThreshold are wrong (they are initialized from the default threshold). If setThreshold or setThresholds is called on the model, _threshold and _rawThreshold are updated to the correct values.
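
To restate with the repro script from above (lor, df, and vec as defined there; this sketch describes the behavior on master, before this PR):

val lor = new LogisticRegression().setMaxIter(1).setThreshold(0.1)
val model = lor.fit(df)

model.getThreshold       // 0.1 -- the param itself is copied correctly
model.predict(vec)       // but prediction still uses the stale _threshold = 0.5

model.setThreshold(0.1)  // calling the setter refreshes _threshold/_rawThreshold
model.predict(vec)       // now uses the correct values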

Do we need to somehow improve the test coverage?

Yes, I think we need to figure out why the test suites failed to catch this, and update them if necessary.

@zhengruifeng (Contributor, Author) commented

@huaxingao I found that in the suite's test("thresholds prediction"), threshold is always explicitly set on the model before the transform verification. I will try to add a case for this:

val binaryModel = blr.fit(smallBinaryDataset)

binaryModel.setThreshold(1.0)
testTransformer[(Double, Vector)](smallBinaryDataset.toDF(), binaryModel, "prediction") {
@zhengruifeng (Contributor, Author) commented on the snippet above:

master fails in this modified case.
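
A sketch of such a case (illustrative only; it assumes the suite's smallBinaryDataset and testTransformer helper, and the names blr2/model2 and the assertion are made up for this sketch): verify predictions right after fit, without calling setThreshold on the model, so the threshold copied from the estimator must already be in effect.

import org.apache.spark.sql.Row

val blr2 = new LogisticRegression().setThreshold(1.0)  // set on the estimator only
val model2 = blr2.fit(smallBinaryDataset)
// No model2.setThreshold(...) here -- the copied threshold must be honored.
testTransformer[(Double, Vector)](smallBinaryDataset.toDF(), model2, "prediction") {
  case Row(pred: Double) =>
    // With threshold = 1.0, every instance should be predicted as class 0.
    assert(pred === 0.0)
}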

SparkQA commented Oct 13, 2020

Kubernetes integration test starting
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34324/

SparkQA commented Oct 13, 2020

Test build #129718 has finished for PR 30013 at commit 7a1749c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Oct 13, 2020

Kubernetes integration test status success
URL: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder-K8s/34324/

@zhengruifeng (Contributor, Author) commented

Merged to master, thanks all!

@zhengruifeng deleted the lor_threshold_init branch on October 13, 2020 05:10