Conversation

@ankuriitg
Contributor

@ankuriitg ankuriitg commented Oct 1, 2018

…ethod in GeneralizedLinearRegressionTrainingSummary

What changes were proposed in this pull request?

After the change in SPARK-25118 (not yet submitted), which enables spark-shell to run with the default
log level, test_glr_summary started failing with a StackOverflowError.

Cause: ClosureCleaner calls logDebug on various objects, and when it is called
for GeneralizedLinearRegressionTrainingSummary, it starts a Spark job that runs
into an infinite loop and fails with the exception below.

Fix: Remove the toString method and move the existing logic to a new method,
summarize. This follows the general convention: other summary objects do not
have a toString method either.
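
A minimal sketch of the shape of the change (a hypothetical, simplified class, not the actual Spark source; the real summary computes many more fields):

```scala
// Hypothetical stand-in for GeneralizedLinearRegressionTrainingSummary,
// showing only the shape of the fix: the string-building logic moves from
// toString (invoked implicitly, e.g. by debug logging) to an explicit
// summarize method.
class TrainingSummarySketch {
  // In the real class these are lazy vals whose first access runs a Spark job.
  lazy val nullDeviance: Double = { println("(expensive job runs here)"); 42.0 }
  lazy val deviance: Double = { println("(expensive job runs here)"); 7.0 }

  // Callers now pay the cost only when they explicitly ask for the summary.
  def summarize: String =
    s"Null deviance: $nullDeviance\nResidual deviance: $deviance"

  // toString is intentionally not overridden, so logging this object prints
  // the default Object.toString and triggers no job.
}

object SummarizeDemo extends App {
  val summary = new TrainingSummarySketch
  println(summary.toString)  // cheap: default toString, no job
  println(summary.summarize) // lazy fields computed here, once
}
```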

How was this patch tested?

Ran the Python "mllib" and Scala/JUnit tests for this module

@ankuriitg ankuriitg changed the title [SPARK-25586][MLlib][Core] Replace toString method with a summarize m… [SPARK-25586][MLlib][Core] Replace toString method with summarize m… Oct 1, 2018
@ankuriitg
Contributor Author

@vanzin @squito @bersprockets

@squito
Contributor

squito commented Oct 1, 2018

Jenkins, ok to test

@ankuriitg
Contributor Author

@actuaryzhang @yanboliang

@srowen
Member

srowen commented Oct 1, 2018

This just removes toString entirely and adds a new method. I don't think we can do that. What's the minimum change to toString that means it doesn't execute a Spark job? Can the same info be preserved and presented otherwise?

Alternatively, we go back to the other JIRA and figure out why it was causing an infinite loop in the first place.

@ankuriitg
Contributor Author

A few variables (I think there are 3) printed in toString cause a Spark job to be started; the main reason is that those variables are lazily evaluated. I can remove those variables from toString altogether.

An alternative fix would be to make those lazily evaluated variables eagerly evaluated, but that may cause other issues that I am not aware of.
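
For context, a toy illustration (hypothetical class and field names, not the real summary) of how a lazy val backed by an RDD action silently starts a job the first time anything reads it, including a toString that interpolates it:

```scala
import org.apache.spark.sql.SparkSession

object LazyValJobDemo extends App {
  val spark = SparkSession.builder().master("local[2]").appName("demo").getOrCreate()

  class Summary {
    // First access runs a distributed job; lazy then caches the result.
    lazy val meanResidual: Double =
      spark.sparkContext.parallelize(1 to 100).map(_.toDouble).mean()

    // Interpolating meanResidual forces the lazy val, so merely logging or
    // printing this object triggers the job as a side effect.
    override def toString: String = s"Summary(meanResidual=$meanResidual)"
  }

  val s = new Summary
  println(s) // first access: a Spark job runs; later accesses reuse the value
  spark.stop()
}
```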

@SparkQA

SparkQA commented Oct 1, 2018

Test build #96830 has finished for PR 22604 at commit 8c162a7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ankuriitg
Contributor Author

If I remove the following code, then the test succeeds and no Spark jobs are started. Shall I do this instead?

```scala
sb.append("\n")
val nd = s"Null deviance: ${round(nullDeviance)} on $degreesOfFreedom degrees of freedom"
val rd = s"Residual deviance: ${round(deviance)} on $residualDegreeOfFreedom degrees of " +
  "freedom"
val l = math.max(nd.length, rd.length)
sb.append(StringUtils.leftPad(nd, l))
sb.append("\n")
sb.append(StringUtils.leftPad(rd, l))

if (family.name != "tweedie") {
  sb.append("\n")
  sb.append(s"AIC: " + round(aic))
}
```

@srowen
Member

srowen commented Oct 2, 2018

OK, backing up here: yes, it's important to specify what problem started this. It's exhibited in the error logs from your PR for SPARK-25118 (which is not yet submitted; it might be worth clarifying that in your description).

	at org.apache.spark.rdd.RDD.map(RDD.scala:371)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance$lzycompute(GeneralizedLinearRegression.scala:1342)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance(GeneralizedLinearRegression.scala:1315)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.toString(GeneralizedLinearRegression.scala:1556)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at scala.StringContext.standardInterpolator(StringContext.scala:125)
	at scala.StringContext.s(StringContext.scala:95)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
	at org.apache.spark.internal.Logging$class.logDebug(Logging.scala:62)
	at org.apache.spark.util.ClosureCleaner$.logDebug(ClosureCleaner.scala:35)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
	at org.apache.spark.SparkContext.clean(SparkContext.scala:2342)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:372)
	at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:371)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
	at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
	at org.apache.spark.rdd.RDD.withScope(RDD.scala:364)
	at org.apache.spark.rdd.RDD.map(RDD.scala:371)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance$lzycompute(GeneralizedLinearRegression.scala:1342)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionSummary.nullDeviance(GeneralizedLinearRegression.scala:1315)
	at org.apache.spark.ml.regression.GeneralizedLinearRegressionTrainingSummary.toString(GeneralizedLinearRegression.scala:1556)
	at java.lang.String.valueOf(String.java:2994)
	at java.lang.StringBuilder.append(StringBuilder.java:131)
	at scala.StringContext.standardInterpolator(StringContext.scala:125)
	at scala.StringContext.s(StringContext.scala:95)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12$$anonfun$apply$6.apply(ClosureCleaner.scala:289)
	at org.apache.spark.internal.Logging$class.logDebug(Logging.scala:62)
	at org.apache.spark.util.ClosureCleaner$.logDebug(ClosureCleaner.scala:35)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$$anonfun$org$apache$spark$util$ClosureCleaner$$clean$12.apply(ClosureCleaner.scala:289)
	at scala.collection.immutable.List.foreach(List.scala:392)
	at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:289)
	at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)

Yes, inside toString, computing nullDeviance and deviance runs a Spark job. The problem occurs at the debug log level, because in the ClosureCleaner:

      if (log.isDebugEnabled) {
        ...
        logDebug(s" + outer objects: ${outerObjects.size}")
        outerObjects.foreach { o => logDebug(s"     $o") }
      }

and that calls toString again, hence the infinite recursion.

While I don't think it's ideal to run distributed jobs in toString, they're only computed once, lazily, so it's actually reasonable.

We could remove nullDeviance etc., but that's useful info, and I wonder whether similar problems are lurking elsewhere. That said, this code has existed since about Spark 1.4 without issue, I guess.

But then, the value of this debug statement goes away almost entirely, I think, in Scala 2.12, where closures aren't implemented with outer classes.

I think I favor just removing this last debug statement in ClosureCleaner, or at most logging the objects' classes only. That seems like less of a loss, and less of a behavior change, than removing useful info from toString.

Any other thoughts out there -- anyone care a lot about the debug info from the closure cleaner about outer class(es)?
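
The class-only option might look something like this in ClosureCleaner (a sketch of the idea, not a committed patch):

```scala
// Sketch: log only the class name of each outer object. getClass.getName
// never invokes the object's own toString, so no lazy, job-triggering
// fields are touched while logging.
if (log.isDebugEnabled) {
  // ...
  logDebug(s" + outer objects: ${outerObjects.size}")
  outerObjects.foreach { o => logDebug(s"     ${o.getClass.getName}") }
}
```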

@ankuriitg
Contributor Author

Makes sense; let me update the description regarding SPARK-25118.

@ankuriitg
Contributor Author

I have opened another PR for this, with the recommended changes: #22616

I had to change a few other logDebug statements as well. Please let me know if that works.

@srowen
Member

srowen commented Oct 2, 2018

OK, this PR should be closed then.

@ankuriitg ankuriitg closed this Oct 2, 2018