-
Notifications
You must be signed in to change notification settings - Fork 29k
[SPARK-41108][CONNECT] Control the max size of arrow batch #38612
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
| while (rowIter.hasNext && (rowCount == 0 || estimatedBatchSize < maxBatchSize)) { | ||
| val row = rowIter.next() | ||
| arrowWriter.write(row) | ||
| estimatedBatchSize += row.asInstanceOf[UnsafeRow].getSizeInBytes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
refer to how the size is computed in BroadcastExchange
but not 100% sure, should I use this instead?
row match {
case unsafe: UnsafeRow => estimatedBatchSize += unsafe.getSizeInBytes
case _ => estimatedBatchSize += SizeEstimator.estimate(row)
}
cc @HyukjinKwon
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The size of message should be based on Arrow but we are only able to know the size of the batch when Arrow batch is created.
So I am fine with the current approach. I do believe that UnsafeRow has bigger size than ArrowBatch in general.
One nit would be we should probably set the lower size in maxBatchSize to be conservative. For example, maxBatchSize * 0.7
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
will update maxBatchSize * 0.7
|
Let me actually merge and refactor this out. I am working on it actually. |
|
Merged to master. |
…rters codes ### What changes were proposed in this pull request? This PR is a followup of both #38468 and #38612 that proposes to deduplicate codes in `ArrowConverters` by creating two classes `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`. In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`. While I am here, - I addressed my own comment at #38612 (comment) - Kept the support of both max rows and size. Max row size check was removed in #38612 ### Why are the changes needed? For better readability and maintenance. ### Does this PR introduce _any_ user-facing change? No, both codes are not related. ### How was this patch tested? This is refactoring so existing CI should cover. Closes #38618 from HyukjinKwon/SPARK-41108-followup-1. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
### What changes were proposed in this pull request? Control the max size of arrow batch ### Why are the changes needed? as per the suggestion apache#38468 (comment) ### Does this PR introduce _any_ user-facing change? No ### How was this patch tested? existing tests Closes apache#38612 from zhengruifeng/connect_arrow_batchsize. Authored-by: Ruifeng Zheng <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
…rters codes ### What changes were proposed in this pull request? This PR is a followup of both apache#38468 and apache#38612 that proposes to deduplicate codes in `ArrowConverters` by creating two classes `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`. In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`. While I am here, - I addressed my own comment at apache#38612 (comment) - Kept the support of both max rows and size. Max row size check was removed in apache#38612 ### Why are the changes needed? For better readability and maintenance. ### Does this PR introduce _any_ user-facing change? No, both codes are not related. ### How was this patch tested? This is refactoring so existing CI should cover. Closes apache#38618 from HyukjinKwon/SPARK-41108-followup-1. Authored-by: Hyukjin Kwon <[email protected]> Signed-off-by: Hyukjin Kwon <[email protected]>
What changes were proposed in this pull request?
Control the max size of arrow batch
Why are the changes needed?
as per the suggestion #38468 (comment)
Does this PR introduce any user-facing change?
No
How was this patch tested?
existing tests