
Conversation


@HyukjinKwon HyukjinKwon commented Nov 11, 2022

What changes were proposed in this pull request?

This PR is a followup of both #38468 and #38612 that proposes to deduplicate code in ArrowConverters by creating two classes, ArrowBatchIterator and ArrowBatchWithSchemaIterator. In addition, ArrowBatchWithSchemaIterator is reused when creating an empty Arrow batch at createEmptyArrowBatch.

While I am here,
- I addressed my own comment at #38612 (comment)
- Kept the support of both max rows and max size; the max rows check was removed in #38612
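The deduplication described above can be sketched roughly as follows. This is a hypothetical, simplified illustration of the pattern (the names mirror the PR, but the bodies are not Spark's actual code): a base iterator chunks rows into serialized batches, and a subclass only changes how each batch is framed by prepending the serialized schema.

```
// Hypothetical sketch of the deduplication pattern, not Spark's actual code.
import java.io.ByteArrayOutputStream

// Base iterator: groups serialized rows into batches bounded by a record count.
class BatchIterator(rows: Iterator[Array[Byte]], maxRecordsPerBatch: Long)
    extends Iterator[Array[Byte]] {
  override def hasNext: Boolean = rows.hasNext

  override def next(): Array[Byte] = {
    val out = new ByteArrayOutputStream()
    var count = 0L
    while (rows.hasNext && count < maxRecordsPerBatch) {
      out.write(rows.next())
      count += 1
    }
    out.toByteArray
  }
}

// The "with schema" variant only overrides how a batch is framed,
// so the batching logic lives in exactly one place.
class BatchWithSchemaIterator(
    rows: Iterator[Array[Byte]],
    maxRecordsPerBatch: Long,
    schemaBytes: Array[Byte])
  extends BatchIterator(rows, maxRecordsPerBatch) {
  override def next(): Array[Byte] = schemaBytes ++ super.next()
}
```

With this shape, an empty batch (as in createEmptyArrowBatch) falls out naturally from running the schema-framing subclass over an empty row iterator.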

Why are the changes needed?

For better readability and maintenance.

Does this PR introduce any user-facing change?

No, the changes are internal and not user-facing.

How was this patch tested?

This is a refactoring, so the existing CI should cover it.

@HyukjinKwon HyukjinKwon force-pushed the SPARK-41108-followup-1 branch from 22ecb3f to cf2de76 Compare November 11, 2022 07:54
@HyukjinKwon
Member Author

Merged to master.

```
extends ArrowBatchIterator(
    rowIter, schema, maxRecordsPerBatch, timeZoneId, context) {

  private val arrowSchemaSize = SizeEstimator.estimate(arrowSchema)
```
Contributor

What is this supposed to achieve? This uses a lot of reflective code to figure out the size of the schema object. How is this related to the size of the batch?

Member Author

@hvanhovell, this PR is virtually pure refactoring except for the couple of points I mentioned in the PR description. As for the question, this came from #38612: it estimates the size of the batch before creating an Arrow batch.
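A hedged sketch of that idea (SizeEstimator.estimate is a real Spark utility in `org.apache.spark.util`; the surrounding function is purely illustrative and assumes spark-core on the classpath): the running size estimate is seeded with the schema's estimated size, so a per-batch size limit also accounts for the schema that the "with schema" iterator writes out.

```
import org.apache.spark.util.SizeEstimator

// Illustrative only: collect rows until the estimated size (schema included)
// reaches a limit. `schema` stands in for the real Arrow schema object.
def sizeBoundedBatch[T <: AnyRef](
    schema: AnyRef,
    rows: Iterator[T],
    maxEstimatedBatchSize: Long): Seq[T] = {
  // Reflective and approximate, but a one-off cost per iterator.
  var estimatedSize = SizeEstimator.estimate(schema)
  val batch = Seq.newBuilder[T]
  while (rows.hasNext && estimatedSize < maxEstimatedBatchSize) {
    val row = rows.next()
    estimatedSize += SizeEstimator.estimate(row)
    batch += row
  }
  batch.result()
}
```

The estimate is not the exact serialized Arrow size; it is a cheap upper-bound heuristic for deciding when to cut a batch.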

```
    rowIter, schema, maxRecordsPerBatch, timeZoneId, context) {

  private val arrowSchemaSize = SizeEstimator.estimate(arrowSchema)
  var rowCountInLastBatch: Long = 0
```
Contributor

A couple of questions. Why do we need the row count? It is already encoded in the batch itself. If we do need the row count, please make the iterator return it from the next call instead of relying on a side effect.

Member Author

This logic is also from #38468, and this PR is a followup. The return type here is Array[Byte], which is the raw binary record batch, so we cannot get the count from it unless we define another case class to carry the row count. This class is private and is only used in this specific case.
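For illustration, the alternative the reviewer suggests could be sketched like this (all names here are hypothetical, not Spark's API): return the row count from next() as part of the value, instead of exposing it through a mutable field.

```
// Hypothetical sketch: carry the row count with the batch rather than
// surfacing it as mutable iterator state.
final case class SerializedBatch(bytes: Array[Byte], rowCount: Long)

class CountingBatchIterator(underlying: Iterator[(Array[Byte], Long)])
    extends Iterator[SerializedBatch] {
  override def hasNext: Boolean = underlying.hasNext

  override def next(): SerializedBatch = {
    val (bytes, count) = underlying.next()
    SerializedBatch(bytes, count) // no side effect; the caller reads the count here
  }
}
```

The trade-off the author notes is real: returning a wrapper type changes the Array[Byte] return type, which is what the extra case class would be needed for.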

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
…rters codes

### What changes were proposed in this pull request?

This PR is a followup of both apache#38468 and apache#38612 that proposes to deduplicate codes in `ArrowConverters` by creating two classes `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`.  In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`.

While I am here,
- I addressed my own comment at apache#38612 (comment)
- Kept the support of both max rows and size. Max row size check was removed in apache#38612

### Why are the changes needed?

For better readability and maintenance.

### Does this PR introduce _any_ user-facing change?

No, both codes are not related.

### How was this patch tested?

This is refactoring so existing CI should cover.

Closes apache#38618 from HyukjinKwon/SPARK-41108-followup-1.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2023
…tead of `ArrowConveters#toBatchIterator`

### _Why are the changes needed?_

To adapt to Spark 3.4.

The signature of `ArrowConverters#toBatchIterator` changed in apache/spark#38618 (since Spark 3.4).

Before Spark 3.4:

```
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Int,
    timeZoneId: String,
    context: TaskContext): Iterator[Array[Byte]]
```

Spark 3.4

```
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Long,
    timeZoneId: String,
    context: TaskContext): ArrowBatchIterator
```

The return type changed from `Iterator[Array[Byte]]` to `ArrowBatchIterator`.

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #4754 from cfmcgrady/arrow-spark34.

Closes #4754

a3c58d0 [Fu Chen] fix ci
32704c5 [Fu Chen] Revert "fix ci"
e32311a [Fu Chen] fix ci
a76af62 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
453b6a6 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
74a9f7a [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
4ce5844 [Fu Chen] adapt Spark 3.4

Lead-authored-by: Fu Chen <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
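One way a downstream project can absorb the return-type change above, sketched under an assumption suggested by the PR diff (that `ArrowBatchIterator` still extends `Iterator[Array[Byte]]`): depend only on the common `Iterator[Array[Byte]]` interface, so the same caller compiles against both Spark versions. `drainBatches` is a hypothetical caller, not Kyuubi's actual code.

```
// Hypothetical caller that is source-compatible with both signatures,
// assuming ArrowBatchIterator extends Iterator[Array[Byte]].
def drainBatches(batches: Iterator[Array[Byte]]): Long =
  batches.map(_.length.toLong).sum // e.g. total serialized size in bytes
```

The other signature change, `maxRecordsPerBatch: Int` becoming `Long`, needs no caller change in Scala, since Int widens to Long implicitly at call sites.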
pan3793 added a commit to apache/kyuubi that referenced this pull request Apr 23, 2023
…tead of `ArrowConveters#toBatchIterator`

### _Why are the changes needed?_

to adapt Spark 3.4

the signature of function `ArrowConveters#toBatchIterator` is changed in apache/spark#38618 (since Spark 3.4)

Before Spark 3.4:

```
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Int,
    timeZoneId: String,
    context: TaskContext): Iterator[Array[Byte]]
```

Spark 3.4

```
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Long,
    timeZoneId: String,
    context: TaskContext): ArrowBatchIterator
```

the return type is changed from `Iterator[Array[Byte]]` to `ArrowBatchIterator`

### _How was this patch tested?_
- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible

- [ ] Add screenshots for manual tests if appropriate

- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before make a pull request

Closes #4754 from cfmcgrady/arrow-spark34.

Closes #4754

a3c58d0 [Fu Chen] fix ci
32704c5 [Fu Chen] Revert "fix ci"
e32311a [Fu Chen] fix ci
a76af62 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
453b6a6 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
74a9f7a [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
4ce5844 [Fu Chen] adapt Spark 3.4

Lead-authored-by: Fu Chen <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d0a7ca4)
Signed-off-by: Cheng Pan <[email protected]>
@HyukjinKwon HyukjinKwon deleted the SPARK-41108-followup-1 branch January 15, 2024 00:53