[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type. by ueshin · Pull Request #18989 · apache/spark

ueshin · 2017-08-18T05:41:41Z

What changes were proposed in this pull request?

As mentioned in #18680 (comment), adding more ColumnVector implementations might (or might not) have significant performance implications, because it can prevent inlining or force virtual dispatch.

On the read path, one of the major code paths is the one generated by ColumnBatchScan. It currently refers to ColumnVector, so the penalty grows as more classes are added. However, the concrete type is known from the usage; for example, the vectorized Parquet reader uses OnHeapColumnVector. We can use the concrete type directly in the generated code to avoid the penalty.
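To illustrate the idea, here is a minimal sketch of how a code generator could emit the field declaration and assignment with a caller-supplied class name. The object `ColumnBindingSketch` and its helpers are hypothetical stand-ins, not Spark's codegen API; only the two class names come from Spark.

```scala
// Hypothetical sketch: builds Java source strings the way a code generator might.
// Nothing here is Spark's real CodegenContext API.
object ColumnBindingSketch {
  val AbstractType = "org.apache.spark.sql.execution.vectorized.ColumnVector"
  val ConcreteType = "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"

  /** Java field declaration for one column, typed with the given class name. */
  def declare(clz: String, name: String): String = s"private $clz $name;"

  /** Java statement that binds column `i` of `batch`, cast to the given class. */
  def assign(clz: String, name: String, batch: String, i: Int): String =
    s"$name = ($clz) $batch.column($i);"
}
```

With `ConcreteType`, every later call on the field in the generated Java has a single known receiver class, which is what gives the JIT a chance to inline it; with `AbstractType`, the receiver can be any subclass.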

How was this patch tested?

Existing tests.

ueshin · 2017-08-18T05:43:38Z

cc @cloud-fan @kiszk

SparkQA · 2017-08-18T07:04:49Z

Test build #80825 has finished for PR 18989 at commit 6f19db7.

This patch fails due to an unknown error code, -9.
This patch merges cleanly.
This patch adds no public classes.

ueshin · 2017-08-18T07:11:07Z

Jenkins, retest this please.

SparkQA · 2017-08-18T09:42:16Z

Test build #80830 has finished for PR 18989 at commit 6f19db7.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-18T11:20:26Z

+   * Returns concrete column vector class names for each column to be used in a columnar batch
+   * if this format supports returning columnar batch.
+   */
+  def vectorTypes(


Do we need to keep sparkSession and dataSchema?

I thought sparkSession would be needed because we might want to change the vector type based on some configuration. What do you think about that?
dataSchema might not be needed.

I see.
Whether the type changes may depend on each overriding implementation. Can we pass sparkSession as an Option?
If dataSchema is not used now, would it be simpler to drop dataSchema?

SparkQA · 2017-08-22T06:35:59Z

Test build #80955 has finished for PR 18989 at commit 9effea9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-22T16:30:17Z

LGTM

kiszk · 2017-08-26T04:06:51Z

Jenkins, retest this please

SparkQA · 2017-08-26T06:38:57Z

Test build #81149 has finished for PR 18989 at commit 9effea9.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

kiszk · 2017-08-26T07:24:29Z

Good to know it works well after changing the class hierarchy of ColumnVector.

cloud-fan · 2017-08-29T08:19:35Z

-      ctx.addMutableState(columnVectorClz, name, s"$name = null;")
-      s"$name = $batch.column($i);"
+    val columnVectorClzs = vectorTypes.getOrElse(
+      Seq.fill(colVars.size)("org.apache.spark.sql.execution.vectorized.ColumnVector"))


nit: classOf[ColumnVector].getName?
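A small sketch of what the nit suggests: derive the fallback name from the class itself so it cannot drift from a hard-coded string. The `ColumnVector` and `OnHeapColumnVector` classes below are empty stand-ins defined only to keep the example self-contained, and `VectorTypeDefaults.resolve` is a hypothetical helper.

```scala
// Stand-in classes (assumption): the real ones live in
// org.apache.spark.sql.execution.vectorized.
abstract class ColumnVector
final class OnHeapColumnVector extends ColumnVector

object VectorTypeDefaults {
  /** Uses the supplied class names, or falls back to the abstract type for every column. */
  def resolve(vectorTypes: Option[Seq[String]], numCols: Int): Seq[String] =
    vectorTypes.getOrElse(Seq.fill(numCols)(classOf[ColumnVector].getName))
}
```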

cloud-fan · 2017-08-29T08:21:02Z

  }

+  override def vectorTypes(
+      sparkSession: Option[SparkSession],


do we really need the session parameter?

I see, let's remove this for now.
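For illustration only, a hook without the session parameter might look like the sketch below. The trait, the object names, and the parameters are assumptions, and schemas are reduced to lists of column names; the signature actually merged is not shown in this conversation.

```scala
// Hypothetical stand-in for a file format; not Spark's FileFormat trait.
trait FileFormatSketch {
  /** Concrete vector class name per output column, or None to keep the abstract type. */
  def vectorTypes(
      requiredColumns: Seq[String],
      partitionColumns: Seq[String]): Option[Seq[String]] = None
}

// A format whose reader is assumed to always produce on-heap vectors.
object ParquetLikeFormat extends FileFormatSketch {
  private val onHeap = "org.apache.spark.sql.execution.vectorized.OnHeapColumnVector"

  override def vectorTypes(
      requiredColumns: Seq[String],
      partitionColumns: Seq[String]): Option[Seq[String]] =
    Some(Seq.fill(requiredColumns.length + partitionColumns.length)(onHeap))
}

// A format with no columnar reader keeps the default.
object TextLikeFormat extends FileFormatSketch
```

Returning an `Option` lets formats that do not support columnar batches opt out without any extra flag.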

cloud-fan · 2017-08-29T08:21:08Z

LGTM

SparkQA · 2017-08-29T11:36:34Z

Test build #81209 has finished for PR 18989 at commit fbcf95c.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2017-08-29T12:16:57Z

thanks, merging to master!

Modify DataSourceScanExec to use concrete ColumnVector type.

6f19db7

kiszk reviewed Aug 18, 2017


Address a comment.

9effea9

cloud-fan reviewed Aug 29, 2017


Address comments.

fbcf95c

asfgit closed this in 32fa0b8 Aug 29, 2017
