[SPARK-23188][SQL] Make vectorized columnar reader batch size configurable #20361
Conversation
Test build #86522 has finished for PR 20361 at commit

Test build #86542 has finished for PR 20361 at commit

retest this please

Test build #86547 has finished for PR 20361 at commit
   // Vectorized parquet reader used for testing and benchmark.
   public VectorizedParquetRecordReader(boolean useOffHeap) {
-    this(null, useOffHeap);
+    this(null, useOffHeap, 4096);
How about changing benchmark and test programs to pass capacity and remove this constructor?
These programs also have access to SQLConf.
It's good to avoid hardcoding the default value again in the code. If there are only a few places that need to be changed, let's do it.
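To illustrate the suggestion, here is a minimal Scala sketch of how a test or benchmark could pass the capacity explicitly through the three-argument constructor instead of relying on a test-only overload. The conf key and the use of SQLConf.get are assumptions for illustration, not code from this PR.

```scala
import org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
import org.apache.spark.sql.internal.SQLConf

// Sketch only: read the batch size from SQLConf (key name assumed, see the naming
// discussion below) and fall back to the current default of 4096.
val sqlConf = SQLConf.get
val capacity =
  sqlConf.getConfString("spark.sql.parquet.columnarReaderBatchSize", "4096").toInt
val useOffHeap = false  // a benchmark could read the off-heap setting from SQLConf too

// Pass the capacity explicitly; the first argument is the timestamp conversion
// timezone, left as null just like the old test-only overload did.
val reader = new VectorizedParquetRecordReader(null, useOffHeap, capacity)
```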
     .booleanConf
     .createWithDefault(true)

+  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
I'd prefer spark.sql.parquet.columnarReaderBatchSize, which is clearer.
Still a question: is it possible to use the estimated memory size instead of the number of rows?
I'd say it's very hard. To satisfy a size-in-bytes limit, we would need to load data record by record and stop once we hit the limit. But for performance reasons we want to load the data in batches, which requires knowing the batch size ahead of time.
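To make the trade-off concrete: with a row-count knob, the per-batch memory can still be estimated from the schema, so a memory budget can be translated into a row count up front. A back-of-envelope Scala sketch with purely illustrative numbers (not from this PR):

```scala
// Back-of-envelope estimate: batch memory ≈ rows × sum of per-column field widths.
// Illustrative only; actual ColumnVector memory also includes nulls, offsets, and overhead.
val rowsPerBatch = 4096
val intColumns = 50        // 4 bytes each
val doubleColumns = 20     // 8 bytes each
val stringColumns = 10
val avgStringBytes = 24    // rough average payload per string field

val bytesPerRow = intColumns * 4 + doubleColumns * 8 + stringColumns * avgStringBytes
val approxBatchBytes = rowsPerBatch.toLong * bytesPerRow
println(f"~${approxBatchBytes / 1024.0 / 1024.0}%.1f MB per batch")  // ≈ 2.3 MB here
```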
     .booleanConf
     .createWithDefault(true)

+  val ORC_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.orc.batchSize")
ditto
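Putting the two review suggestions together, the renamed entries might look roughly like the following in SQLConf, using the existing buildConf pattern; the doc strings and the 4096 default are illustrative, not necessarily what was merged. As discussed above, the size is measured in rows rather than bytes.

```scala
  val PARQUET_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.parquet.columnarReaderBatchSize")
      .doc("The number of rows to include in a Parquet vectorized reader batch. " +
        "Larger batches amortize per-batch overhead; smaller batches reduce memory pressure.")
      .intConf
      .createWithDefault(4096)

  val ORC_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.orc.columnarReaderBatchSize")
      .doc("The number of rows to include in an ORC vectorized reader batch. " +
        "Larger batches amortize per-batch overhead; smaller batches reduce memory pressure.")
      .intConf
      .createWithDefault(4096)
```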
-  // TODO: make this configurable.
-  private static final int CAPACITY = 4 * 1024;

+  // The default size of vectorized batch.
maybe we can remove the comment. It's just the capacity, not a default value.
How about rephrasing it to "The capacity of vectorized batch"?
LGTM, pending Jenkins

Test build #86901 has finished for PR 20361 at commit

thanks, merging to master!

Hi, All.
This is not a bug fix, so it does not qualify for merging to Spark 2.3.
What changes were proposed in this pull request?
This PR includes the following changes:
- Make the batch size of VectorizedParquetRecordReader configurable;
- Make the batch size of OrcColumnarBatchReader configurable.

How was this patch tested?
N/A
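For completeness, a short usage sketch: once the entries exist they can be set like any other SQL conf. The key names follow the columnarReaderBatchSize naming suggested in the review (assumed here), and the values and path are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("columnar-batch-size-demo").getOrCreate()

// Conf names as suggested in the review (assumed); values are illustrative.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "1024")
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", "1024")

// Subsequent Parquet/ORC scans will build vectorized batches of at most 1024 rows.
val df = spark.read.parquet("/path/to/data")  // hypothetical path
```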