@huaxingao
Contributor

What changes were proposed in this pull request?

In LogisticRegression and LinearRegression, if maxIter is set to n, model.summary.totalIterations returns n+1 when training does not terminate early. This is because objectiveHistory.length is used as totalIterations, but objectiveHistory also contains the initial state, so objectiveHistory.length is one larger than the number of training iterations.
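
The off-by-one described above can be illustrated with a minimal sketch. This is not Spark's implementation: `minimize` and `total_iterations` are hypothetical helper names, and plain gradient descent on a quadratic stands in for the real optimizer. The only point is that a history which records the initial state has one more entry than the number of iterations run.

```python
def minimize(grad, objective, x0, max_iter, step=0.1, tol=0.0):
    """Toy gradient descent that records the objective at every state.

    Hypothetical stand-in for an iterative optimizer; the history
    starts with the objective at the initial state x0.
    """
    x = x0
    history = [objective(x)]          # initial state, before any iteration
    for _ in range(max_iter):
        x = x - step * grad(x)
        history.append(objective(x))  # one entry per completed iteration
        if abs(history[-2] - history[-1]) <= tol:
            break                     # converged early
    return x, history


def total_iterations(history):
    """Iterations actually run: history length minus the initial state."""
    return len(history) - 1


# Minimize f(x) = (x - 3)^2 starting from x = 0 with max_iter = 5.
_, history = minimize(lambda x: 2 * (x - 3), lambda x: (x - 3) ** 2, 0.0, 5)
print(len(history))               # 6: initial state plus 5 iterations
print(total_iterations(history))  # 5
```

Using the raw history length as the iteration count would report 6 here even though only 5 iterations were requested and run.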

Why are the changes needed?

correctness

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added new tests and modified existing tests.

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123774 has finished for PR 28786 at commit 91529d0.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

}
if (instances.getStorageLevel != StorageLevel.NONE) instances.unpersist()
- return createModel(dataset, numClasses, coefMatrix, interceptVec, Array.empty)
+ return createModel(dataset, numClasses, coefMatrix, interceptVec, Array(0.0))
Contributor Author

When training is not needed, LinearRegression sets objectiveHistory to Array(0.0). https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/regression/LinearRegression.scala#L511
I think LogisticRegression should have the same behavior.
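
One way to see why a one-element history is convenient in the no-training case, sketched under the assumption that total iterations is derived as the history length minus one. The helper name `total_iterations` is hypothetical and is not Spark's API.

```python
def total_iterations(objective_history):
    """Hypothetical helper: iterations = history length minus the initial state."""
    if not objective_history:
        raise ValueError("objective history must contain at least the initial state")
    return len(objective_history) - 1


# No-training case with a single placeholder entry, like Array(0.0).
print(total_iterations([0.0]))  # 0

# An empty history, like Array.empty, has no initial state to subtract.
try:
    total_iterations([])
except ValueError as err:
    print("rejected:", err)
```

With a placeholder entry, the no-training case follows the same rule as the trained case and reports 0 iterations, rather than needing special handling for an empty array.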

}
}
- assert(model2.summary.totalIterations === 1)
+ assert(model2.summary.totalIterations === 0)
Contributor Author

@huaxingao huaxingao Jun 10, 2020

InitialModel is set in this case. The initial state is already good, so no training is needed. I think totalIterations should be 0 instead of 1.

}
}
- assert(model4.summary.totalIterations === 1)
+ assert(model4.summary.totalIterations === 0)
Contributor Author

Same as L2576. I think totalIterations should be 0 instead of 1, since no training takes place.

assert(allZeroInterceptModel.coefficients ~== Vectors.dense(0.0) absTol 1E-3)
assert(allZeroInterceptModel.intercept === Double.NegativeInfinity)
assert(allZeroInterceptModel.summary.totalIterations === 0)
assert(allZeroInterceptModel.summary.objectiveHistory(0) ~== 0.0 absTol 1e-4)
Contributor Author

@huaxingao huaxingao Jun 10, 2020

Since I changed objectiveHistory from Array.empty to Array(0.0) for the no-training case, summary.objectiveHistory(0) should be 0.0 here.

assert(pred === 0.0)
}
- assert(modelZeroLabel.summary.totalIterations > 0)
+ assert(modelZeroLabel.summary.totalIterations === 0)
Contributor Author

No training happens here, so I think totalIterations should be 0 instead of 1.

Seq("auto", "normal").foreach { solver =>
val trainer = new LinearRegression().setSolver(solver)
val model = trainer.fit(datasetWithDenseFeature)
assert(model.summary.totalIterations === 0)
Contributor Author

@huaxingao huaxingao Jun 10, 2020

Before my change, summary.totalIterations was 1 for "normal". I think it should be 0, since the Normal Equation is not an iterative method. totalIterations should also be 0 for "auto", since "auto" uses the Normal Equation in this test (a small dataset).

Contributor

I think summary.totalIterations being 1 is reasonable, since it needs one pass over the dataset.

Member

If no iterative optimizer was run, I think 0 makes sense?
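
The distinction debated in this thread can be sketched with a closed-form fit. This is only an illustration, not Spark's solver: `fit_normal_equation` is a hypothetical name, the model has a single feature plus intercept, and the one-element placeholder history is an assumption made so that history length minus one gives the iteration count.

```python
def fit_normal_equation(xs, ys):
    """Closed-form simple linear regression (one feature plus intercept).

    Makes a single pass over the data to accumulate sums, then solves
    directly; no iterative optimizer is involved.
    """
    n = len(xs)
    sx = sy = sxx = sxy = 0.0
    for x, y in zip(xs, ys):          # the one pass over the dataset
        sx += x
        sy += y
        sxx += x * x
        sxy += x * y
    slope = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    intercept = (sy - slope * sx) / n
    objective_history = [0.0]         # assumed placeholder: initial state only
    return slope, intercept, objective_history


slope, intercept, history = fit_normal_equation([0, 1, 2, 3], [1, 3, 5, 7])
print(slope, intercept)   # 2.0 1.0
print(len(history) - 1)   # 0 optimizer iterations
```

The data is scanned once, which is the argument for reporting 1, but no optimizer step is ever taken, which is the argument for reporting 0.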

s = model.summary
# test that api is callable and returns expected types
- self.assertGreater(s.totalIterations, 0)
+ self.assertEqual(s.totalIterations, 0)
Contributor Author

@huaxingao huaxingao Jun 10, 2020

solver="normal" in this case, so I think totalIterations should be 0.

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123780 has finished for PR 28786 at commit 949a0b6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Jun 10, 2020

Test build #123784 has finished for PR 28786 at commit b8a0d72.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

cc @srowen @zhengruifeng

Member

@srowen srowen left a comment

I think it's fine if the objective history has one additional initial state element, as it does now, but I agree it feels funny to return n+1 as total iterations when n iterations were requested.

Looks fine. The only other thing I can think of is documenting, wherever the history is returned, that it contains one more element (the initial state) than the number of iterations.

@SparkQA

SparkQA commented Jun 11, 2020

Test build #123866 has finished for PR 28786 at commit 74d7b79.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@zhengruifeng
Contributor

I just checked this in ml.clustering: numIter in the summaries of KMeans/BiKMeans/GMM will be exactly maxIter.

@srowen
Member

srowen commented Jun 14, 2020

@zhengruifeng if you don't strongly object to #28786 (comment) I think this one can be merged.

Just needs a rebase at your convenience @huaxingao

@zhengruifeng
Contributor

@srowen I don't feel strongly about it.
LGTM

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124018 has finished for PR 28786 at commit 4c4d52b.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@huaxingao
Contributor Author

retest this please

@SparkQA

SparkQA commented Jun 15, 2020

Test build #124025 has finished for PR 28786 at commit 4c4d52b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen srowen closed this in f83cb3c Jun 15, 2020
@srowen
Member

srowen commented Jun 15, 2020

Merged to master. It could go in 3.0.1 too; I don't feel strongly about it.

@huaxingao
Contributor Author

Thanks! @srowen @zhengruifeng

@huaxingao huaxingao deleted the summary_iter branch June 15, 2020 15:46