[SPARK-41108][SPARK-41005][CONNECT][FOLLOW-UP] Deduplicate ArrowConverters codes #38618
Conversation
The branch was force-pushed from 22ecb3f to cf2de76.
Merged to master.
```scala
    extends ArrowBatchIterator(
      rowIter, schema, maxRecordsPerBatch, timeZoneId, context) {

  private val arrowSchemaSize = SizeEstimator.estimate(arrowSchema)
```
What is this supposed to achieve? This uses a lot of reflective code to figure out the size of the schema object. How is this related to the size of the batch?
@hvanhovell, this PR is virtually pure refactoring except for the couple of points I mentioned in the PR description. As for the question, this came from #38612: the idea is to estimate the size of the batch before creating an Arrow batch.
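For illustration, here is a minimal sketch (not the PR's actual code) of how a schema-size estimate can feed into batch slicing: the schema overhead is counted once per batch, and rows are appended until an estimated size limit is reached. `SizeEstimator` is Spark's real utility; the `Record` type and the slicing logic are assumptions.

```scala
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.util.SizeEstimator

// Hypothetical record type standing in for InternalRow.
case class Record(values: Array[Long])

// Slice rows into batches, closing a batch once its estimated in-memory
// size (schema overhead plus per-row estimates) crosses the limit. Each
// batch always contains at least one row, so a single oversized row still
// makes progress (it may overshoot the limit by that one row).
def sliceBySize(
    rows: Iterator[Record],
    schemaSize: Long, // e.g. SizeEstimator.estimate(arrowSchema)
    maxEstimatedBatchSize: Long): Iterator[Seq[Record]] =
  new Iterator[Seq[Record]] {
    override def hasNext: Boolean = rows.hasNext
    override def next(): Seq[Record] = {
      val batch = ArrayBuffer.empty[Record]
      var estimated = schemaSize // the schema is carried with every batch
      while (rows.hasNext && (batch.isEmpty || estimated < maxEstimatedBatchSize)) {
        val r = rows.next()
        estimated += SizeEstimator.estimate(r)
        batch += r
      }
      batch.toSeq
    }
  }
```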
```scala
      rowIter, schema, maxRecordsPerBatch, timeZoneId, context) {

  private val arrowSchemaSize = SizeEstimator.estimate(arrowSchema)
  var rowCountInLastBatch: Long = 0
```
A couple of questions. Why do we need the row count? It is already encoded in the batch itself. If we do need it, please make the iterator return it from the `next` call instead of relying on a side effect.
This logic is also from #38468; this PR is a follow-up. The return type here is `Array[Byte]`, which is a raw binary record batch, so we cannot get the count from it unless we define another case class to carry the row count. This class is private and only used in this specific case.
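To make the trade-off concrete, a hedged sketch of the two designs under discussion; the names `BatchIteratorWithSideEffect` and `SerializedBatch` are made up for illustration, not Spark's API:

```scala
// Shape used in the PR (simplified): raw bytes out, row count via side effect.
abstract class BatchIteratorWithSideEffect extends Iterator[Array[Byte]] {
  // Updated by next(); callers read it after consuming a batch.
  var rowCountInLastBatch: Long = 0
}

// Reviewer's alternative: return the count alongside the bytes, so no
// mutable state is needed and each element is self-contained.
case class SerializedBatch(bytes: Array[Byte], rowCount: Long)

// A caller of the second design reads the count directly from each element.
def totalRows(batches: Iterator[SerializedBatch]): Long =
  batches.map(_.rowCount).sum
```

The side-effect version avoids introducing a new case class at the cost of requiring callers to read the field immediately after each `next()` call.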
[SPARK-41108][SPARK-41005][CONNECT][FOLLOW-UP] Deduplicate ArrowConverters codes

### What changes were proposed in this pull request?

This PR is a followup of both apache#38468 and apache#38612 that proposes to deduplicate code in `ArrowConverters` by creating two classes, `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`. In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`.

While I am here,
- I addressed my own comment at apache#38612 (comment).
- I kept the support of both max rows and size; the max row size check was removed in apache#38612.

### Why are the changes needed?

For better readability and maintenance.

### Does this PR introduce _any_ user-facing change?

No, both changes are internal and not user-facing.

### How was this patch tested?

This is a refactoring, so the existing CI should cover it.

Closes apache#38618 from HyukjinKwon/SPARK-41108-followup-1.

Authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
…tead of `ArrowConveters#toBatchIterator`

### _Why are the changes needed?_

To adapt to Spark 3.4: the signature of `ArrowConveters#toBatchIterator` was changed in apache/spark#38618 (since Spark 3.4).

Before Spark 3.4:

```scala
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Int,
    timeZoneId: String,
    context: TaskContext): Iterator[Array[Byte]]
```

Spark 3.4:

```scala
private[sql] def toBatchIterator(
    rowIter: Iterator[InternalRow],
    schema: StructType,
    maxRecordsPerBatch: Long,
    timeZoneId: String,
    context: TaskContext): ArrowBatchIterator
```

The return type changed from `Iterator[Array[Byte]]` to `ArrowBatchIterator`.

### _How was this patch tested?_

- [ ] Add some test cases that check the changes thoroughly including negative and positive cases if possible
- [ ] Add screenshots for manual tests if appropriate
- [x] [Run test](https://kyuubi.readthedocs.io/en/master/develop_tools/testing.html#running-tests) locally before making a pull request

Closes #4754 from cfmcgrady/arrow-spark34.

a3c58d0 [Fu Chen] fix ci
32704c5 [Fu Chen] Revert "fix ci"
e32311a [Fu Chen] fix ci
a76af62 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
453b6a6 [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
74a9f7a [Cheng Pan] Update externals/kyuubi-spark-sql-engine/src/main/scala/org/apache/spark/sql/kyuubi/SparkDatasetHelper.scala
4ce5844 [Fu Chen] adapt Spark 3.4

Lead-authored-by: Fu Chen <[email protected]>
Co-authored-by: Cheng Pan <[email protected]>
Signed-off-by: Cheng Pan <[email protected]>
(cherry picked from commit d0a7ca4)
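For downstream callers hit by this binary-incompatible change, one defensive pattern is a reflective shim that works against either signature from a single build. The sketch below is a hypothetical illustration (not Kyuubi's actual fix), assuming the five-parameter method shown above exists on both sides and that `ArrowBatchIterator` is itself an `Iterator[Array[Byte]]`, as the discussion above suggests:

```scala
import org.apache.spark.TaskContext
import org.apache.spark.sql.catalyst.InternalRow
import org.apache.spark.sql.types.StructType

// Hypothetical shim: call whichever 5-parameter ArrowConverters.toBatchIterator
// the running Spark provides (Int in 3.3, Long in 3.4), and treat the result
// uniformly as Iterator[Array[Byte]].
object ToBatchIteratorShim {
  private val module: AnyRef = {
    val cls = Class.forName("org.apache.spark.sql.execution.arrow.ArrowConverters$")
    cls.getField("MODULE$").get(null)
  }

  def toBatchIterator(
      rowIter: Iterator[InternalRow],
      schema: StructType,
      maxRecordsPerBatch: Long,
      timeZoneId: String,
      context: TaskContext): Iterator[Array[Byte]] = {
    val method = module.getClass.getMethods
      .filter(_.getName == "toBatchIterator")
      .find(_.getParameterCount == 5)
      .getOrElse(throw new NoSuchMethodException("toBatchIterator"))
    // The third parameter is maxRecordsPerBatch: box it as Long or Int
    // depending on which signature the running Spark version exposes.
    val records: AnyRef =
      if (method.getParameterTypes()(2) == classOf[Long]) {
        java.lang.Long.valueOf(maxRecordsPerBatch)
      } else {
        java.lang.Integer.valueOf(maxRecordsPerBatch.toInt)
      }
    method.invoke(module, rowIter, schema, records, timeZoneId, context)
      .asInstanceOf[Iterator[Array[Byte]]]
  }
}
```

Since `Int` widens to `Long` at the source level, simply recompiling against each Spark version also works; the shim only matters when one binary must serve both.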
### What changes were proposed in this pull request?

This PR is a followup of both #38468 and #38612 that proposes to deduplicate code in `ArrowConverters` by creating two classes, `ArrowBatchIterator` and `ArrowBatchWithSchemaIterator`. In addition, we reuse `ArrowBatchWithSchemaIterator` when creating an empty Arrow batch at `createEmptyArrowBatch`. While I am here, I addressed my own comment at #38612 (comment) and kept the support of both max rows and size (the max row size check was removed in #38612). A simplified sketch of the resulting structure follows.
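Schematically, the deduplication looks like this; the snippet is a hypothetical, simplified analogue (strings instead of Arrow record batches) rather than the real Spark source:

```scala
import scala.collection.mutable.ArrayBuffer

// Base iterator: slices rows into batches of at most maxRecordsPerBatch,
// serializes each batch, and tracks the last batch's row count.
class BatchIterator(rows: Iterator[String], maxRecordsPerBatch: Long)
    extends Iterator[Array[Byte]] {
  var rowCountInLastBatch: Long = 0

  override def hasNext: Boolean = rows.hasNext
  override def next(): Array[Byte] = {
    val batch = ArrayBuffer.empty[String]
    while (rows.hasNext && batch.size < maxRecordsPerBatch) {
      batch += rows.next()
    }
    rowCountInLastBatch = batch.size
    encode(batch.toSeq)
  }

  protected def encode(batch: Seq[String]): Array[Byte] =
    batch.mkString("\n").getBytes("UTF-8")
}

// Subclass: prepends the schema to every payload so each batch is
// self-describing; encoding an empty row sequence yields a schema-only
// payload, mirroring how createEmptyArrowBatch can reuse the same path.
class BatchWithSchemaIterator(
    schema: String, rows: Iterator[String], maxRecordsPerBatch: Long)
    extends BatchIterator(rows, maxRecordsPerBatch) {
  override protected def encode(batch: Seq[String]): Array[Byte] =
    (schema +: batch).mkString("\n").getBytes("UTF-8")
}
```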
### Why are the changes needed?
For better readability and maintenance.
### Does this PR introduce _any_ user-facing change?

No, both changes are internal and not user-facing.
### How was this patch tested?

This is a refactoring, so the existing CI should cover it.