[SPARK-31925][ML] Summary.totalIterations greater than maxIters #28786

huaxingao · 2020-06-10T18:08:12Z

What changes were proposed in this pull request?

In LogisticRegression and LinearRegression, if maxIter is set to n, model.summary.totalIterations returns n+1 if the training procedure does not stop early. This is because objectiveHistory.length is used as totalIterations, but objectiveHistory also contains the initial state, so objectiveHistory.length is 1 larger than the number of training iterations.
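To illustrate the off-by-one, here is a minimal pure-Python sketch. It does not use Spark; the function `fit` and the halving "optimizer step" are hypothetical stand-ins, and the only assumption carried over from the description is that the history stores the initial objective before any iteration runs.

```python
def fit(max_iter, initial_loss=1.0):
    """Toy optimizer: record the initial objective, then one value per iteration."""
    objective_history = [initial_loss]  # initial state, before any iteration
    loss = initial_loss
    for _ in range(max_iter):
        loss *= 0.5  # hypothetical stand-in for one optimizer step
        objective_history.append(loss)
    return objective_history


history = fit(max_iter=5)
old_total_iterations = len(history)      # also counts the initial state
new_total_iterations = len(history) - 1  # counts only the optimizer steps
```

With max_iter=5 the history holds six values, so counting its length reports one more iteration than was requested, while subtracting the initial state gives the requested count.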

Why are the changes needed?

correctness

Does this PR introduce any user-facing change?

No

How was this patch tested?

add new tests and also modify existing tests

SparkQA · 2020-06-10T18:19:16Z

Test build #123774 has finished for PR 28786 at commit 91529d0.

This patch fails MiMa tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-06-10T18:35:39Z

mllib/src/main/scala/org/apache/spark/ml/classification/LogisticRegression.scala

      }
      if (instances.getStorageLevel != StorageLevel.NONE) instances.unpersist()
-      return createModel(dataset, numClasses, coefMatrix, interceptVec, Array.empty)
+      return createModel(dataset, numClasses, coefMatrix, interceptVec, Array(0.0))


When training is not needed, LinearRegression sets objectiveHistory to Array(0.0). https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L511
I think LogisticRegression should have the same behavior.

huaxingao · 2020-06-10T18:35:57Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

      }
    }
-    assert(model2.summary.totalIterations === 1)
+    assert(model2.summary.totalIterations === 0)


An initial model is set in this case. The initial state is already good, so no training is needed. I think totalIterations should be 0 instead of 1.

huaxingao · 2020-06-10T18:36:09Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

      }
    }
-    assert(model4.summary.totalIterations === 1)
+    assert(model4.summary.totalIterations === 0)


Same as L2576. I think totalIterations should be 0 instead of 1, since there is no training.

huaxingao · 2020-06-10T18:36:23Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

    assert(allZeroInterceptModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
    assert(allZeroInterceptModel.intercept === Double.NegativeInfinity)
    assert(allZeroInterceptModel.summary.totalIterations === 0)
+    assert(allZeroInterceptModel.summary.objectiveHistory(0) ~== 0.0 absTol 1e-4)


Since I changed objectiveHistory from Array.empty to Array(0.0) for the no-training case, summary.objectiveHistory(0) should be 0.0 here.
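As a rough sketch of why the no-training case needs a one-element history, the pure-Python helper below derives the iteration count from the history length. The name `total_iterations` and the error handling are assumptions for illustration, not the actual Spark implementation.

```python
def total_iterations(objective_history):
    """Number of optimizer steps, assuming the first entry is the initial state."""
    if not objective_history:
        # An empty history has no initial state, so the count would be -1.
        raise ValueError("objective history must contain the initial state")
    return len(objective_history) - 1


no_training = total_iterations([0.0])            # only the initial state
two_steps = total_iterations([0.69, 0.52, 0.41])  # initial state + 2 steps
```

Under this convention a history of [0.0] yields 0 iterations, whereas an empty history would have no valid count.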

huaxingao · 2020-06-10T18:37:15Z

mllib/src/test/scala/org/apache/spark/ml/classification/LogisticRegressionSuite.scala

        assert(pred === 0.0)
    }
-    assert(modelZeroLabel.summary.totalIterations > 0)
+    assert(modelZeroLabel.summary.totalIterations === 0)


There is no training here, so I think totalIterations should be 0 instead of 1.

huaxingao · 2020-06-10T18:40:04Z

mllib/src/test/scala/org/apache/spark/ml/regression/LinearRegressionSuite.scala

+    Seq("auto", "normal").foreach { solver =>
+      val trainer = new LinearRegression().setSolver(solver)
+      val model = trainer.fit(datasetWithDenseFeature)
+      assert(model.summary.totalIterations === 0)


Before my change, summary.totalIterations was 1 for "normal". I think it should be 0, since the Normal Equation is not an iterative method. totalIterations should also be 0 for "auto", since "auto" uses the Normal Equation in this test (a small dataset).

zhengruifeng · 2020-06-12

I think summary.totalIterations being 1 is reasonable, since it needs one pass over the dataset.

srowen · 2020-06-12

If no iterative optimizer was run, I think 0 makes sense?

huaxingao · 2020-06-10T18:40:55Z

python/pyspark/ml/tests/test_training_summary.py

        s = model.summary
        # test that api is callable and returns expected types
-        self.assertGreater(s.totalIterations, 0)
+        self.assertEqual(s.totalIterations, 0)


solver="normal" in this case, so I think totalIterations should be 0.

SparkQA · 2020-06-10T21:16:03Z

Test build #123780 has finished for PR 28786 at commit 949a0b6.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2020-06-10T22:55:31Z

Test build #123784 has finished for PR 28786 at commit b8a0d72.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-06-11T16:06:55Z

cc @srowen @zhengruifeng

srowen

I think it's fine if the objective history has one additional initial state element, as it does now, but I agree it feels funny to return n+1 as total iterations when n iterations were requested.

Looks fine. The only other thing I can think of is documenting, wherever the history is returned, that it will contain one more element (the initial state) than the number of iterations.

SparkQA · 2020-06-11T22:25:35Z

Test build #123866 has finished for PR 28786 at commit 74d7b79.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

zhengruifeng · 2020-06-12T01:45:54Z

I just checked this in ml.clustering: numIter in the summary of KMeans/BiKMeans/GMM will be exactly maxIter.

srowen · 2020-06-14T21:48:19Z

@zhengruifeng if you don't strongly object to #28786 (comment) I think this one can be merged.

Just needs a rebase at your convenience @huaxingao

zhengruifeng · 2020-06-15T01:49:28Z

@srowen I don't feel strongly about it.
LGTM

SparkQA · 2020-06-15T02:06:02Z

Test build #124018 has finished for PR 28786 at commit 4c4d52b.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

huaxingao · 2020-06-15T02:18:43Z

retest this please

SparkQA · 2020-06-15T03:41:01Z

Test build #124025 has finished for PR 28786 at commit 4c4d52b.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

srowen · 2020-06-15T13:49:21Z

Merged to master. It could go in 3.0.1 too; I don't feel strongly about it.

huaxingao · 2020-06-15T15:46:32Z

Thanks! @srowen @zhengruifeng

probot-autolabeler bot added ML PYTHON labels Jun 10, 2020

srowen reviewed Jun 11, 2020

huaxingao added 4 commits June 14, 2020 17:57

f2209f7  [SPARK-31925][ML] Summary.totalIterations greater than maxIters
1031591  fix
dcf7122  fix
4c4d52b  modify objectiveHistory docstring

huaxingao force-pushed the summary_iter branch from 74d7b79 to 4c4d52b Compare June 15, 2020 01:02

srowen closed this in f83cb3c Jun 15, 2020

huaxingao deleted the summary_iter branch June 15, 2020 15:46
