Skip to content
Closed
Show file tree
Hide file tree
Changes from 12 commits
Commits
Show all changes
23 commits
Select commit Hold shift + click to select a range
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,7 @@ import java.io.ByteArrayInputStream
import java.util.{Map => JavaMap}

import scala.collection.JavaConverters._
import scala.collection.immutable.ListMap
import scala.collection.mutable
import scala.collection.mutable.ArrayBuffer
import scala.language.existentials
Expand Down Expand Up @@ -77,6 +78,20 @@ case class SubExprEliminationState(isNull: String, value: String)
*/
case class SubExprCodes(codes: Seq[String], states: Map[Expression, SubExprEliminationState])

/**
* The main information about a new added function.
*
* @param functionName String representing the name of the function
* @param subclassName Optional value which is empty if the function is added to
* the outer class, otherwise it contains the name of the
* inner class in which the function has been added.
* @param subclassInstance Optional value which is empty if the function is added to
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I saw three ways to represent the same concept. subclass, inner class, nested classes.

How about renaming all of them to inner classes? Could you go over all the code changes in this PR to make them consistent?

* the outer class, otherwise it contains the name of the
* instance of the inner class in the outer class.
*/
private[codegen] case class NewFunction(functionName: String, subclassName: Option[String],
subclassInstance: Option[String])
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The indent issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may I ask you to kindly explain me which is the right indentation in this case? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think that 2-space indentation is expected.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the link @kiszk , but it says that for parameters a 4 space indent is required: look at the example with class Foo.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IIUC, 4-space indentation is used when it is followed by 2-space indentation. Here is an example.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In this example, there is 2 space indentation because there is a extends and according to the link you provided me, the extends requires a 2 space indentation, while the parameters a 4 space one.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see. This is just my understanding.
Let us wait for explanation fro @gatorsmile.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See the link: https://github.com/databricks/scala-style-guide#spacing-and-indentation

If it does not fit one line, we do it like

private[codegen] case class NewFunction(
    functionName: String,
    subclassName: Option[String],
    subclassInstance: Option[String])


/**
* A context for codegen, tracking a list of objects that could be passed into generated Java
* function.
Expand Down Expand Up @@ -277,13 +292,25 @@ class CodegenContext {
funcName: String,
funcCode: String,
inlineToOuterClass: Boolean = false): String = {
val newFunction = addNewFunctionInternal(funcName, funcCode, inlineToOuterClass)
newFunction match {
case NewFunction(functionName, None, None) => functionName
case NewFunction(functionName, Some(_), Some(subclassInstance)) =>
subclassInstance + "." + functionName
}
}

private[this] def addNewFunctionInternal(
funcName: String,
funcCode: String,
inlineToOuterClass: Boolean): NewFunction = {
// The number of named constants that can exist in the class is limited by the Constant Pool
// limit, 65,536. We cannot know how many constants will be inserted for a class, so we use a
// threshold of 1600k bytes to determine when a function should be inlined to a private, nested
// threshold of 1000k bytes to determine when a function should be inlined to a private, nested
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Any reason to change this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it for the possibly to be add method grouping all split methods? If yes, we can add the method into an inner class (see #19480 (comment)), so we don't need to change this threshold.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, without this change, I got the Constant Pool limit exception of the {{NestedClass}}. Actually I tried to address this issue in this PR (#19447), but without the other changes in thee current PR I wasn't able to create a UT for it.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you try #19480 (comment), so we may avoid changing this threshold?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can't change that for the reason I explained there.

// sub-class.
val (className, classInstance) = if (inlineToOuterClass) {
outerClassName -> ""
} else if (currClassSize > 1600000) {
} else if (currClassSize > 1000000) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the reason we chose this magic number in the original PR? cc @bdrillard @kiszk

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile it's a byte threshold, similar to the 1024 byte threshold set in splitExpressions. We can't know exactly how much code will contribute to the constant pool, that is, there's no easy static analysis we can perform on a block of code to say "this code will contribute n entries to the constant pool", we only know the size of the code is strongly correlated to entries in the constant pool. We're trying to keep the number of generated classes as low as possible while also grouping enough of the code to avoid the constant pool error.

In short, I tested different types of schemas with many columns to find what the value could be set to empirically.

There's no particular harm in setting the value lower as is done here if it helps us avoid a known constant pool error case. Doing so would effectively reduce the number of expressions each nested class holds, and so also increase the number of nested classes in total.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, actually during several tries, I found that setting the value lower can somehow reduce the chance to hit constant pool limit exception in nested classes.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, this is the reason why I created this PR: this is an estimation, which might be good 99% of the times, but for those other 1% use cases it would be good to be able to tune it to prevent this error IMHO.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Create a variable in object CodeGenerator?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can do that for this PR. But what do you think about introducing an internal configuration for this as I proposed in the other PR? If this should not be done, I can close the other PR: I think that the discussion here and there showed the reasons for its creation. If they are not enough, then I will close the PR.

val className = freshName("NestedClass")
val classInstance = freshName("nestedClassInstance")

Expand All @@ -294,17 +321,23 @@ class CodegenContext {
currClass()
}

classSize(className) += funcCode.length
classFunctions(className) += funcName -> funcCode
addNewFunctionToClass(funcName, funcCode, className)

if (className == outerClassName) {
funcName
NewFunction(funcName, None, None)
} else {

s"$classInstance.$funcName"
NewFunction(funcName, Some(className), Some(classInstance))
}
}

private[this] def addNewFunctionToClass(
funcName: String,
funcCode: String,
className: String) = {
classSize(className) += funcCode.length
classFunctions(className) += funcName -> funcCode
}

/**
* Declares all function code. If the added functions are too many, split them into nested
* sub-classes to avoid hitting Java compiler constant pool limitation.
Expand Down Expand Up @@ -798,10 +831,46 @@ class CodegenContext {
| ${makeSplitFunction(body)}
|}
""".stripMargin
addNewFunction(name, code)
addNewFunctionInternal(name, code, inlineToOuterClass = false)
}

foldFunctions(functions.map(name => s"$name(${arguments.map(_._2).mkString(", ")})"))
// Here we store all the methods which have been added to the outer class.
val outerClassFunctions = functions
.filter(_.subclassName.isEmpty)
.map(_.functionName)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So the calls to outerclass functions should not be an issue, even they could be many?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

exactly, because since they are defined there, they add no entry to the constant pool


// Here we handle all the methods which have been added to the nested subclasses and
// not to the outer class.
// Since they can be many, their direct invocation in the outer class adds many entries
// to the outer class' constant pool. This can cause the constant pool to past JVM limit.
// To avoid this problem, we group them and we call only the grouping methods in the
// outer class.
val innerClassFunctions = functions
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This might not be very intuitive for later readers. We'd better to add comments to explain it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will, thanks for the suggestion

.filter(_.subclassName.isDefined)
.foldLeft(ListMap.empty[(String, String), Seq[String]]) { case (acc, f) =>
val key = (f.subclassName.get, f.subclassInstance.get)
acc.updated(key, acc.getOrElse(key, Seq.empty[String]) ++ Seq(f.functionName))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The order of function calls should be important and be kept. Does the map guarantee that updated map still preserves the insertion order?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

only up to 4 keys, thanks for the nice catch, I will fix this.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

To improve the readability, let us avoid using foldLeft here. You can use a mutable map.

}
.flatMap { case ((subclassName, subclassInstance), subclassFunctions) =>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Normally, we do not like such a long chain of function calls. If possible please split it. Thanks!

if (subclassFunctions.size > CodeGenerator.MERGE_SPLIT_METHODS_THRESHOLD) {
// Adding a new function to each subclass which contains
// the invocation of all the ones which have been added to
// that subclass
val code = s"""
|private $returnType $func($argString) {
| ${makeSplitFunction(foldFunctions(subclassFunctions.map(name =>
s"$name(${arguments.map(_._2).mkString(", ")})")))}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing |

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we are inside the string interpolator here, so | is not missing

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of inlining a function, why not creating a variable before val code?

|}
""".stripMargin
addNewFunctionToClass(func, code, subclassName)
Seq(s"$subclassInstance.$func")
} else {
subclassFunctions.map(f => s"$subclassInstance.$f")
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Build a separate private function for generating innerClassFunctions? Now, the function splitExpressions is pretty large after this PR.


foldFunctions((outerClassFunctions ++ innerClassFunctions).map(
name => s"$name(${arguments.map(_._2).mkString(", ")})"))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The same suggestion. Do not inline a complex expression.

}
}

Expand Down Expand Up @@ -1010,6 +1079,10 @@ object CodeGenerator extends Logging {
// This is the value of HugeMethodLimit in the OpenJDK JVM settings
val DEFAULT_JVM_HUGE_METHOD_LIMIT = 8000

// This is the threshold over which the methods in a inner class are grouped in a single
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit. in a inner class -> in an inner class

// method which is going to be called by the outer class instead of the many small ones
val MERGE_SPLIT_METHODS_THRESHOLD = 3
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's add a short comment for this too.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another magic number?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@gatorsmile it comes from this discussion.


/**
* Compile the Java source code into a Java class, using Janino.
*
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -201,6 +201,23 @@ class CodeGenerationSuite extends SparkFunSuite with ExpressionEvalHelper {
}
}

test("SPARK-22226: group splitted expressions into one method per nested class") {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides the unit test, can you provide an end-to-end case that can trigger this issue too?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have a use case where I faced this problem. And I tried this patch on it. Unfortunately this contains a very complex business logic and I have not been able to reproduce it in a simple one. But if needed, I can try again.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of copying your customer codes, can you making a fake one?

Copy link
Member

@viirya viirya Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 I can reproduce the issue by following test case. You can check it:

  test("SPARK-22226: too much splitted expressions should not exceed constant pool limit") {
    withSQLConf(
      (SQLConf.WHOLESTAGE_CODEGEN_ENABLED.key, "false")) {
      val colNumber = 1000
      val baseDF = spark.range(10).toDF()
      val newCols = (1 to colNumber).map { colIndex =>
        expr(s"id + $colIndex").as(s"_$colIndex")
      }
      val input = baseDF.select(newCols: _*)
      val aggs = (1 to colNumber).flatMap { colIndex =>
        val colName = s"_$colIndex"
        Seq(expr(s"stddev($colName)"),
          expr(s"stddev_samp($colName)"),
          expr(s"stddev_pop($colName)"),
          expr(s"variance($colName)"),
          expr(s"var_samp($colName)"),
          expr(s"var_pop($colName)"),
          expr(s"skewness($colName)"),
          expr(s"kurtosis($colName)"))
      }
      input.agg(aggs.head, aggs.tail: _*).collect()
    }
  }
[info]   Cause: org.codehaus.janino.JaninoRuntimeException: failed to compile: org.codehaus.janino.JaninoRuntimeExc
eption: Constant pool for class org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificMutableProjection 
has grown past JVM limit of 0xFFFF
[info]   at org.apache.spark.sql.catalyst.expressions.codegen.CodeGenerator$.org$apache$spark$sql$catalyst$expressi
ons$codegen$CodeGenerator$$doCompile(CodeGenerator.scala:1079)

Copy link
Contributor Author

@mgaido91 mgaido91 Oct 13, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you very much for your help @viirya ! In my use cases it seemed to be connected to the dropDuplicates method and I focused on it, but thanks to your suggestion now I realize that dropDuplicates by itself is not enough, it needs also some functions applied to columns to generate the issue! Thank you so much. Where should I add this test case? I am adding it to DataFrameAggregateSuite since this is related to aggregating some functions, is it ok? Thanks.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I was adding it to DataFrameAggregateSuite.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya I have a good and a bad news... Thanks to your suggestion I have been able to understand and reproduce the issue. Moreover, I found also another issue which is fixed by this problem and I am adding a UT for that too: in some cases, we might have a

Code of method apply(...) grows beyond 64 KB

And with this PR the problem is fixed.

The bad thing is that the UT you provided still fails, but with a different error: actually it is always a Constant Pool limit exceeded exception, but it is in a NestedClass. From my analysis, this is caused by another problem, ie. that we might reference too many fields of the superclass in the NestedClasses. This might be addressed maybe trying to tune the magic number which I brought to 1000k in this PR, but I am pretty sure that it will be also addressed by the ongoing PR for SPARK-18016, since he is trying to reduce the number of variables. Thus I consider this out of scope for this PR.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 Do you meant "2: compact primitive declarations into arrays" in SPARK-18016?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 Thanks for trying it. Yeah, those expressions like skewness are very complicated, so they're likely to cause the issue you encountered.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@viirya exactly, I meant that. Thank you for your suggestion. You have been very helpful to me.

val length = 10000
val expressions = Seq.fill(length) {
ToUTCTimestamp(
Literal.create(Timestamp.valueOf("2017-10-10 00:00:00"), TimestampType),
Literal.create("PST", StringType))
}
val plan = GenerateMutableProjection.generate(expressions)
val actual = plan(new GenericInternalRow(length)).toSeq(expressions.map(_.dataType))
val expected = Seq.fill(length)(
DateTimeUtils.fromJavaTimestamp(Timestamp.valueOf("2017-10-10 07:00:00")))

if (actual != expected) {
fail(s"Incorrect Evaluation: expressions: $expressions, actual: $actual, expected: $expected")
}
}

test("test generated safe and unsafe projection") {
val schema = new StructType(Array(
StructField("a", StringType, true),
Expand Down
12 changes: 12 additions & 0 deletions sql/core/src/test/scala/org/apache/spark/sql/DataFrameSuite.scala
Original file line number Diff line number Diff line change
Expand Up @@ -2103,4 +2103,16 @@ class DataFrameSuite extends QueryTest with SharedSQLContext {
testData2.select(lit(7), 'a, 'b).orderBy(lit(1), lit(2), lit(3)),
Seq(Row(7, 1, 1), Row(7, 1, 2), Row(7, 2, 1), Row(7, 2, 2), Row(7, 3, 1), Row(7, 3, 2)))
}

test("SPARK-22226: splitExpressions should not generate codes beyond 64KB") {
val colNumber = 10000
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mgaido91 .
I'm wondering if this is the new maximum number of columns tested in Spark?
After your patch, what will be the minimum value of colNumber to cause failures in Spark?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this would be the maximum number currently, but #19518 will generate a great improvement for that. Actually I don't have an exact answer because this depends on many factors, like which transformations are performed, which are the datatypes, .... So I am not able to give an answer on that, sorry.

val input = spark.range(2).rdd.map(_ => Row(1 to colNumber: _*))
val df = sqlContext.createDataFrame(input, StructType(
(1 to colNumber).map(colIndex => StructField(s"_$colIndex", IntegerType, false))))
val newCols = (1 to colNumber).flatMap { colIndex =>
Seq(expr(s"if(1000 < _$colIndex, 1000, _$colIndex)"),
expr(s"sqrt(_$colIndex)"))
}
df.select(newCols: _*).collect()
}
}