@zhengruifeng
Contributor

What changes were proposed in this pull request?

Control the maximum size of an Arrow batch.

Why are the changes needed?

As per the suggestion in #38468 (comment).

Does this PR introduce any user-facing change?

No

How was this patch tested?

Existing tests.

while (rowIter.hasNext && (rowCount == 0 || estimatedBatchSize < maxBatchSize)) {
  val row = rowIter.next()
  arrowWriter.write(row)
  estimatedBatchSize += row.asInstanceOf[UnsafeRow].getSizeInBytes
Contributor Author

This follows how the size is computed in BroadcastExchange.

I am not 100% sure, though. Should I use this instead?

row match {
   case unsafe: UnsafeRow => estimatedBatchSize += unsafe.getSizeInBytes
   case _ => estimatedBatchSize += SizeEstimator.estimate(row)
}

cc @HyukjinKwon
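
For illustration, the dispatch proposed in this comment can be sketched without Spark on the classpath. `BinaryRow`, `GenericRow`, and `roughEstimate` below are hypothetical stand-ins (for `UnsafeRow`, any other row type, and `SizeEstimator.estimate` respectively), not Spark APIs; the per-field byte counts are arbitrary assumptions.

```scala
// Stand-in row types for illustration only; these are NOT Spark classes.
sealed trait Row
final case class BinaryRow(bytes: Array[Byte]) extends Row {
  // A binary row knows its exact serialized size.
  def getSizeInBytes: Int = bytes.length
}
final case class GenericRow(values: Seq[Any]) extends Row

object RowSize {
  // Hypothetical fallback: a crude per-field guess standing in for a real estimator.
  def roughEstimate(row: GenericRow): Long = row.values.map {
    case s: String => 8L + s.length * 2L // assumed header plus 2 bytes per char
    case _         => 8L                 // assumed fixed-width slot
  }.sum

  // Use the exact size when the row type exposes it, otherwise fall back to an estimate.
  def sizeOf(row: Row): Long = row match {
    case b: BinaryRow  => b.getSizeInBytes.toLong
    case g: GenericRow => roughEstimate(g)
  }
}
```

The pattern match avoids the `ClassCastException` that an unconditional cast would throw if a non-binary row ever reached the loop, at the cost of a slower estimate on the fallback path.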

Member

The message size should be based on the Arrow batch, but we only know the size of the batch once the Arrow batch has been created.

So I am fine with the current approach. I believe that an UnsafeRow is generally larger than its share of the resulting Arrow batch.

One nit: we should probably use a lower threshold than maxBatchSize to be conservative. For example, maxBatchSize * 0.7
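
A minimal sketch of that suggestion, assuming the loop from the diff and a safety factor of 0.7. `ConservativeBatching`, `effectiveBudget`, and `takeBatch` are hypothetical names, and rows are represented only by their estimated sizes so the example stays free of Spark and Arrow dependencies.

```scala
object ConservativeBatching {
  // Safety factor taken from the review suggestion; the exact value is an assumption.
  val SafetyFactor: Double = 0.7

  // Shrink the configured limit so that estimation error is less likely to exceed it.
  def effectiveBudget(maxBatchSize: Long): Long = (maxBatchSize * SafetyFactor).toLong

  // Consume row sizes for one batch. As in the reviewed loop, the first row is
  // always taken, so a single oversized row still produces a batch.
  def takeBatch(sizes: Iterator[Long], maxBatchSize: Long): Vector[Long] = {
    val budget = effectiveBudget(maxBatchSize)
    val batch = Vector.newBuilder[Long]
    var rowCount = 0
    var estimated = 0L
    while (sizes.hasNext && (rowCount == 0 || estimated < budget)) {
      val size = sizes.next()
      batch += size
      estimated += size
      rowCount += 1
    }
    batch.result()
  }
}
```

Because the size check runs before each row is added, a batch can overshoot the budget by up to one row; the reduced budget leaves headroom for that overshoot as well as for estimation error.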

Contributor Author

Will update to use maxBatchSize * 0.7.

@HyukjinKwon
Member

Let me merge this and do the refactoring in a follow-up. I am already working on it.

@HyukjinKwon
Member

Merged to master.

HyukjinKwon added a commit that referenced this pull request Nov 11, 2022
…rters codes

### What changes were proposed in this pull request?

This PR is a followup of both #38468 and #38612 that proposes to deduplicate the code in `ArrowConverters` by creating two classes, `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`. In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`.

While I am here,
- I addressed my own comment at #38612 (comment)
- Kept the support of both max rows and size. Max row size check was removed in #38612
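
As a rough illustration of supporting both limits in one iterator, the sketch below groups rows into batches bounded by a row count and an estimated byte size. `BoundedBatchIterator` is a hypothetical helper, not the `ArrowBatchIterator` from this change; treating a non-positive limit as "unlimited" is an assumption made for the example.

```scala
// Hypothetical helper for illustration; not the class introduced by the PR.
final class BoundedBatchIterator[T](
    rows: Iterator[T],
    sizeOf: T => Long,
    maxRowsPerBatch: Int,    // <= 0 means no row limit (assumption)
    maxBytesPerBatch: Long)  // <= 0 means no size limit (assumption)
  extends Iterator[Vector[T]] {

  override def hasNext: Boolean = rows.hasNext

  override def next(): Vector[T] = {
    if (!rows.hasNext) throw new NoSuchElementException("no more batches")
    val batch = Vector.newBuilder[T]
    var count = 0
    var bytes = 0L
    def underRowLimit: Boolean = maxRowsPerBatch <= 0 || count < maxRowsPerBatch
    def underSizeLimit: Boolean = maxBytesPerBatch <= 0 || bytes < maxBytesPerBatch
    // Always take at least one row so every batch makes progress.
    while (rows.hasNext && (count == 0 || (underRowLimit && underSizeLimit))) {
      val row = rows.next()
      batch += row
      bytes += sizeOf(row)
      count += 1
    }
    batch.result()
  }
}
```

For example, `new BoundedBatchIterator[String](Iterator("aa", "bb", "cc"), _.length.toLong, 2, 100L)` yields a batch of two rows followed by a batch of one, since the row limit is reached before the size limit.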

### Why are the changes needed?

For better readability and maintenance.

### Does this PR introduce _any_ user-facing change?

No, both codes are not related.

### How was this patch tested?

This is a refactoring, so existing CI should cover it.

Closes #38618 from HyukjinKwon/SPARK-41108-followup-1.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38612 from zhengruifeng/connect_arrow_batchsize.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022