[SPARK-23188][SQL] Make vectorized columnar reader batch size configurable #20361
Conversation
Test build #86522 has finished for PR 20361 at commit

Test build #86542 has finished for PR 20361 at commit

retest this please

Test build #86547 has finished for PR 20361 at commit
   // Vectorized parquet reader used for testing and benchmark.
   public VectorizedParquetRecordReader(boolean useOffHeap) {
-    this(null, useOffHeap);
+    this(null, useOffHeap, 4096);
How about changing benchmark and test programs to pass capacity and remove this constructor?
These programs also have access to SQLConf.
It's good to avoid hardcoding the default value again in the code. If there are only a few places that need to be changed, let's do it.
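To illustrate the suggestion, here is a minimal Scala sketch of how a test or benchmark could pass the capacity explicitly through the three-argument constructor instead of relying on a test-only overload. The conf key and the use of SQLConf.get are assumptions for illustration, not code from this PR.

```scala
import org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader
import org.apache.spark.sql.internal.SQLConf

// Sketch only: read the batch size from SQLConf (key name assumed, see the naming
// discussion below) and fall back to the current default of 4096.
val sqlConf = SQLConf.get
val capacity =
  sqlConf.getConfString("spark.sql.parquet.columnarReaderBatchSize", "4096").toInt
val useOffHeap = false  // a benchmark could read the off-heap setting from SQLConf too

// Pass the capacity explicitly; the first argument is the timestamp conversion
// timezone, left as null just like the old test-only overload did.
val reader = new VectorizedParquetRecordReader(null, useOffHeap, capacity)
```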
     .booleanConf
     .createWithDefault(true)

+  val PARQUET_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.parquet.batchSize")
I'd prefer spark.sql.parquet.columnarReaderBatchSize, which is clearer.
Still a question: is it possible to use the estimated memory size instead of the number of rows?
I'd say it's very hard. To satisfy a size-in-bytes limit, we would need to load data record by record and stop once we hit the limit. But for performance reasons we want to load the data in batches, which requires knowing the batch size ahead of time.
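To make the trade-off concrete: with a row-count knob, the per-batch memory can still be estimated from the schema, so a memory budget can be translated into a row count up front. A back-of-envelope Scala sketch with purely illustrative numbers (not from this PR):

```scala
// Back-of-envelope estimate: batch memory ≈ rows × sum of per-column field widths.
// Illustrative only; actual ColumnVector memory also includes nulls, offsets, and overhead.
val rowsPerBatch = 4096
val intColumns = 50        // 4 bytes each
val doubleColumns = 20     // 8 bytes each
val stringColumns = 10
val avgStringBytes = 24    // rough average payload per string field

val bytesPerRow = intColumns * 4 + doubleColumns * 8 + stringColumns * avgStringBytes
val approxBatchBytes = rowsPerBatch.toLong * bytesPerRow
println(f"~${approxBatchBytes / 1024.0 / 1024.0}%.1f MB per batch")  // ≈ 2.3 MB here
```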
     .booleanConf
     .createWithDefault(true)

+  val ORC_VECTORIZED_READER_BATCH_SIZE = buildConf("spark.sql.orc.batchSize")
ditto
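Putting the two review suggestions together, the renamed entries might look roughly like the following in SQLConf, using the existing buildConf pattern; the doc strings and the 4096 default are illustrative, not necessarily what was merged. As discussed above, the size is measured in rows rather than bytes.

```scala
  val PARQUET_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.parquet.columnarReaderBatchSize")
      .doc("The number of rows to include in a Parquet vectorized reader batch. " +
        "Larger batches amortize per-batch overhead; smaller batches reduce memory pressure.")
      .intConf
      .createWithDefault(4096)

  val ORC_VECTORIZED_READER_BATCH_SIZE =
    buildConf("spark.sql.orc.columnarReaderBatchSize")
      .doc("The number of rows to include in an ORC vectorized reader batch. " +
        "Larger batches amortize per-batch overhead; smaller batches reduce memory pressure.")
      .intConf
      .createWithDefault(4096)
```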
-  // TODO: make this configurable.
-  private static final int CAPACITY = 4 * 1024;

+  // The default size of vectorized batch.
maybe we can remove the comment. It's just the capacity, not a default value.
How about rephrasing it to "The capacity of vectorized batch"?
LGTM, pending Jenkins

Test build #86901 has finished for PR 20361 at commit

thanks, merging to master!

Hi, All.
This is not a bug fix, so it does not qualify for merging to Spark 2.3.
What changes were proposed in this pull request?
This PR includes the following changes:
- Make the batch size of VectorizedParquetRecordReader configurable;
- Make the batch size of OrcColumnarBatchReader configurable.

How was this patch tested?
N/A
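For completeness, a short usage sketch: once the entries exist they can be set like any other SQL conf. The key names follow the columnarReaderBatchSize naming suggested in the review (assumed here), and the values and path are illustrative.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("columnar-batch-size-demo").getOrCreate()

// Conf names as suggested in the review (assumed); values are illustrative.
spark.conf.set("spark.sql.parquet.columnarReaderBatchSize", "1024")
spark.conf.set("spark.sql.orc.columnarReaderBatchSize", "1024")

// Subsequent Parquet/ORC scans will build vectorized batches of at most 1024 rows.
val df = spark.read.parquet("/path/to/data")  // hypothetical path
```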