Conversation

@jiangxb1987 (Contributor)

What changes were proposed in this pull request?

This PR includes the following changes:

  • Make the capacity of VectorizedParquetRecordReader configurable (see the usage sketch after this list);
  • Make the capacity of OrcColumnarBatchReader configurable;
  • Update the error message shown when the requested capacity of a writable columnar vector cannot be fulfilled.
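
A quick sketch of how a user would set the new knobs once this is merged. The conf names follow the naming discussion later in this thread; the values are examples only, not recommendations:

    import org.apache.spark.sql.SparkSession

    // Hypothetical usage sketch: tune the vectorized reader batch sizes at
    // session creation time. Smaller batches trade throughput for memory.
    val spark = SparkSession.builder()
      .appName("columnar-batch-size-demo")
      .master("local[*]")
      .config("spark.sql.parquet.columnarReaderBatchSize", "2048")
      .config("spark.sql.orc.columnarReaderBatchSize", "2048")
      .getOrCreate()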

How was this patch tested?

N/A

@jiangxb1987 changed the title from [SPARK-23188] [SQL] Make vectorized columnar reader batch size configurable to [SPARK-23188][SQL] Make vectorized columnar reader batch size configurable on Jan 23, 2018
@SparkQA commented on Jan 23, 2018

Test build #86522 has finished for PR 20361 at commit 927c6b4.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented on Jan 23, 2018

Test build #86542 has finished for PR 20361 at commit 38debd7.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@jiangxb1987 (Contributor, Author)

retest this please

@SparkQA commented on Jan 24, 2018

Test build #86547 has finished for PR 20361 at commit 38debd7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

  // Vectorized parquet reader used for testing and benchmark.
  public VectorizedParquetRecordReader(boolean useOffHeap) {
-   this(null, useOffHeap);
+   this(null, useOffHeap, 4096);
  }
Member

How about changing benchmark and test programs to pass capacity and remove this constructor?
These programs also have SQLConf.
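
A hypothetical sketch of that suggestion: read the capacity from SQLConf at the call site and drop the hardcoded constructor. The accessor name is an assumption about what this patch adds:

    import org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
    import org.apache.spark.sql.internal.SQLConf

    // Build the test/benchmark reader from the session's SQLConf instead of
    // relying on a constructor that hardcodes 4096.
    def makeTestReader(conf: SQLConf, useOffHeap: Boolean): VectorizedParquetRecordReader = {
      new VectorizedParquetRecordReader(null, useOffHeap, conf.parquetVectorizedReaderBatchSize)
    }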

Contributor

It's good to avoid hardcoding the default value again in the code. If only a few places need to be changed, let's do it.

@jiangxb1987 (Contributor, Author) commented on Jan 29, 2018

    .booleanConf
    .createWithDefault(true)

+ val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
Contributor

I'd prefer spark.sql.parquet.columnarReaderBatchSize to be more clear.
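
For reference, a sketch of the declaration under the suggested name, following the buildConf pattern visible in the snippet above (the doc string and default are assumptions, not a verbatim copy of the patch):

    val PARQUET_VECTORIZED_READER_BATCH_SIZE =
      buildConf("spark.sql.parquet.columnarReaderBatchSize")
        .doc("The number of rows to include in a parquet vectorized reader batch.")
        .intConf
        .createWithDefault(4096)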

Member

Still a question: is it possible to use the estimated memory size instead of the number of rows?

Contributor

I'd say it's very hard. If we needed to satisfy a sizeInBytes limitation, we would have to load data record by record and stop loading once we hit the limit. But for performance reasons we want to load the data in batches, which requires knowing the batch size ahead of time.
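
A simplified sketch of why the row count has to be known up front (loosely modeled on the vectorized readers; all names are illustrative):

    // The column vectors are preallocated for `capacity` rows before any data
    // is read; each readBatch call then decodes a whole batch at once. A
    // byte-based limit would force record-by-record loading instead.
    def readAll(capacity: Int, totalRowCount: Long)(readBatch: Int => Unit): Unit = {
      var rowsReturned = 0L
      while (rowsReturned < totalRowCount) {
        val num = math.min(capacity.toLong, totalRowCount - rowsReturned).toInt
        readBatch(num)
        rowsReturned += num
      }
    }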

    .booleanConf
    .createWithDefault(true)

+ val ORC_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.orc.batchSize")
Contributor

ditto

- // TODO: make this configurable.
- private static final int CAPACITY = 4 * 1024;
+ // The default size of vectorized batch.
Contributor

Maybe we can remove the comment; it's just the capacity, not a default value.

Contributor Author

How about rephrasing it to "The capacity of vectorized batch"?

@cloud-fan (Contributor)

LGTM, pending Jenkins.

@SparkQA commented on Feb 1, 2018

Test build #86901 has finished for PR 20361 at commit 5ad935f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan (Contributor)

thanks, merging to master!

@asfgit closed this in cc41245 on Feb 1, 2018
@dongjoon-hyun (Member)

Hi, all.
Can we have this in Spark 2.3, too?

@gatorsmile (Member)

Not a bug fix, so this does not qualify for merging into Spark 2.3.
