[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type.#18989
ueshin wants to merge 3 commits into apache:master
cc @cloud-fan @kiszk

Test build #80825 has finished for PR 18989 at commit

Jenkins, retest this please.

Test build #80830 has finished for PR 18989 at commit

   * Returns concrete column vector class names for each column to be used in a columnar batch
   * if this format supports returning columnar batch.
   */
  def vectorTypes(
Do we need to keep sparkSession and dataSchema?
I thought sparkSession would be needed because we might want to change the vector type based on some configuration. What do you think about that?
dataSchema might not be needed.
I see.
Whether the type changes may depend on each overriding function. Can we pass sparkSession as an Option?
If dataSchema is not used now, would it be simpler to drop dataSchema?

Test build #80955 has finished for PR 18989 at commit

LGTM

Jenkins, retest this please

Test build #81149 has finished for PR 18989 at commit

Good to know it works well after changing class hierarchy of

ctx.addMutableState(columnVectorClz, name, s"$name = null;")
s"$name = $batch.column($i);"
val columnVectorClzs = vectorTypes.getOrElse(
  Seq.fill(colVars.size)("org.apache.spark.sql.execution.vectorized.ColumnVector"))
nit: classOf[ColumnVector].getName?
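
The fallback in the snippet above, together with the `classOf` suggestion, can be sketched in isolation as follows. This is a minimal sketch, not the code that was merged: `FallbackSketch`, its nested `ColumnVector` stand-in, and `resolve` are hypothetical names introduced for illustration, and the real class lives in `org.apache.spark.sql.execution.vectorized`.

```scala
object FallbackSketch {
  // Hypothetical stand-in for Spark's ColumnVector, so the sketch has no Spark dependency.
  abstract class ColumnVector

  // Hypothetical helper: one class name per column.
  // Uses the names supplied by the format when present; otherwise every column
  // falls back to the abstract base class. classOf[...].getName replaces the
  // hard-coded string, so a rename or package move is caught by the compiler.
  def resolve(vectorTypes: Option[Seq[String]], numColumns: Int): Seq[String] =
    vectorTypes.getOrElse(Seq.fill(numColumns)(classOf[ColumnVector].getName))
}
```

Using `classOf` keeps the default name tied to the actual class rather than to a string literal that can silently go stale.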
}

override def vectorTypes(
    sparkSession: Option[SparkSession],
do we really need the session parameter?
I see, let's remove this for now.
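
For readers following the signature discussion, the shape of the hook once the session parameter is dropped might look roughly like the sketch below. Everything here is an assumption for illustration: `Schema`, `Field`, `Format`, `RowFormat`, `OnHeapFormat`, and the parameter names `requiredSchema` and `partitionSchema` are simplified stand-ins, not Spark's real types or the signature that was merged.

```scala
object VectorTypesSketch {
  // Hypothetical, simplified stand-ins for a schema; not Spark's StructType.
  final case class Field(name: String)
  final case class Schema(fields: Seq[Field])

  trait Format {
    // Default: the format does not know a concrete vector class,
    // so the caller is expected to fall back to the base class.
    def vectorTypes(requiredSchema: Schema, partitionSchema: Schema): Option[Seq[String]] = None
  }

  // A format that leaves the default in place.
  object RowFormat extends Format

  // A format that reports one concrete class name per output column
  // (required columns followed by partition columns).
  object OnHeapFormat extends Format {
    override def vectorTypes(requiredSchema: Schema, partitionSchema: Schema): Option[Seq[String]] =
      Some(Seq.fill(requiredSchema.fields.size + partitionSchema.fields.size)(
        "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"))
  }
}
```

Returning `Option` lets formats that cannot name a concrete class opt out without any extra parameters; a configuration-dependent choice could be added to the signature later if it turns out to be needed.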

LGTM

Test build #81209 has finished for PR 18989 at commit

thanks, merging to master!

What changes were proposed in this pull request?
As mentioned at #18680 (comment), when we have more ColumnVector implementations, it might (or might not) have huge performance implications because it might disable inlining or force virtual dispatches.
As for the read path, one of the major paths is the one generated by ColumnBatchScan. Currently it refers to ColumnVector, so the penalty will grow as we add more classes, but we can know the concrete type from its usage, e.g. the vectorized Parquet reader uses OnHeapColumnVector. We can use the concrete type in the generated code directly to avoid the penalty.
How was this patch tested?
Existing tests.
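
To illustrate the idea in the description, the sketch below builds the Java source strings that a generated scan could use for one column: a field declared with the concrete class and an assignment that casts the batch column to it. This is a hedged illustration only; `ScanCodegenSketch` and `columnCode` are hypothetical names, and the exact strings emitted by the real code generator may differ.

```scala
object ScanCodegenSketch {
  // Hypothetical helper: returns (field declaration, per-batch assignment) as Java source.
  // Declaring the field with the concrete class means calls on it in the generated
  // code target a single known type rather than the abstract ColumnVector.
  def columnCode(clz: String, name: String, batch: String, i: Int): (String, String) = {
    val declaration = s"private $clz $name = null;"
    // The batch accessor returns the base type, so the assignment needs a cast.
    val assignment = s"$name = ($clz) $batch.column($i);"
    (declaration, assignment)
  }
}
```

With the base class name passed as `clz`, the same helper reproduces the previous behavior, which is why the fallback to the base class is a safe default.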