Skip to content

[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type.#18989

Closed
ueshin wants to merge 3 commits intoapache:masterfrom
ueshin:issues/SPARK-21781
Closed

[SPARK-21781][SQL] Modify DataSourceScanExec to use concrete ColumnVector type.#18989
ueshin wants to merge 3 commits intoapache:masterfrom
ueshin:issues/SPARK-21781

Conversation

@ueshin
Copy link
Copy Markdown
Member

@ueshin ueshin commented Aug 18, 2017

What changes were proposed in this pull request?

As mentioned at #18680 (comment), when we have more ColumnVector implementations, it might (or might not) have huge performance implications because it might disable inlining, or force virtual dispatches.

As for read path, one of the major paths is the one generated by ColumnBatchScan. Currently it refers ColumnVector so the penalty will be bigger as we have more classes, but we can know the concrete type from its usage, e.g. vectorized Parquet reader uses OnHeapColumnVector. We can use the concrete type in the generated code directly to avoid the penalty.

How was this patch tested?

Existing tests.

@ueshin
Copy link
Copy Markdown
Member Author

ueshin commented Aug 18, 2017

cc @cloud-fan @kiszk

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 18, 2017

Test build #80825 has finished for PR 18989 at commit 6f19db7.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@ueshin
Copy link
Copy Markdown
Member Author

ueshin commented Aug 18, 2017

Jenkins, retest this please.

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 18, 2017

Test build #80830 has finished for PR 18989 at commit 6f19db7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

* Returns concrete column vector class names for each column to be used in a columnar batch
* if this format supports returning columnar batch.
*/
def vectorTypes(
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do we need to keep sparkSession and dataSchema?

Copy link
Copy Markdown
Member Author

@ueshin ueshin Aug 21, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought sparkSession would be needed because we might want to change the vector type based on some configuration. What do you think about that?
dataSchema might not be needed.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.
To change the type may depend on each overriding function. Can we pass sparkSession as Option?
If dataSchema is not used now, is it make simple to drop dataSchema?

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 22, 2017

Test build #80955 has finished for PR 18989 at commit 9effea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Copy Markdown
Member

kiszk commented Aug 22, 2017

LGTM

@kiszk
Copy link
Copy Markdown
Member

kiszk commented Aug 26, 2017

Jenkins, retest this please

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 26, 2017

Test build #81149 has finished for PR 18989 at commit 9effea9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@kiszk
Copy link
Copy Markdown
Member

kiszk commented Aug 26, 2017

Good to know it works well after changing class hierarchy of ColumnVector

ctx.addMutableState(columnVectorClz, name, s"$name = null;")
s"$name = $batch.column($i);"
val columnVectorClzs = vectorTypes.getOrElse(
Seq.fill(colVars.size)("org.apache.spark.sql.execution.vectorized.ColumnVector"))
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: classOf[ColumnVector].getName?

}

override def vectorTypes(
sparkSession: Option[SparkSession],
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need the session parameter?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see, let's remove this for now.

@cloud-fan
Copy link
Copy Markdown
Contributor

LGTM

@SparkQA
Copy link
Copy Markdown

SparkQA commented Aug 29, 2017

Test build #81209 has finished for PR 18989 at commit fbcf95c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Copy Markdown
Contributor

thanks, merging to master!

@asfgit asfgit closed this in 32fa0b8 Aug 29, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants