
Conversation

@xi-db xi-db commented Sep 8, 2025

What changes were proposed in this pull request?

Currently, we enforce gRPC message limits on both the client and the server. These limits are largely meant to protect both sides from potential OOMs by rejecting abnormally large messages. However, there are cases in which the server sends oversized messages that exceed these limits and cause execution failures.

Specifically, the server-to-client large message issue we're solving here comes from the Arrow batch data in ExecutePlanResponse being too large: a single Arrow row can exceed the 128MB message limit, and since Arrow cannot partition a single row any further, the server has to return it in one gRPC message.

To improve Spark Connect stability, this PR implements chunking of large Arrow batches when returning query results from the server to the client. Each ExecutePlanResponse chunk stays within the size limit, and the client reassembles the chunks from a batch before parsing them as an Arrow batch.

(Scala client changes are being implemented in a follow-up PR.)

To reproduce the existing issue we are solving here, run this code on Spark Connect:

```
repeat_num_per_mb = 1024 * 1024 // len('Apache Spark ')
res = spark.sql(f"select repeat('Apache Spark ', {repeat_num_per_mb * 300}) as huge_col from range(1)").collect()
print(len(res))
```

It fails with `StatusCode.RESOURCE_EXHAUSTED` error with message `Received message larger than max (314570608 vs. 134217728)`, because the server is trying to send an ExecutePlanResponse of ~300MB to the client.

With the improvement introduced by the PR, the above code runs successfully and prints the expected result.
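The chunking and client-side reassembly described above can be sketched in plain Python. This is a hypothetical illustration only; the function names, sizes, and shapes below are made up and are not Spark Connect's actual API:

```python
# Hypothetical sketch: the server splits the serialized Arrow batch into
# chunks no larger than max_chunk_size, and the client concatenates the
# chunks back together before parsing them as a single Arrow batch.

def chunk_bytes(batch: bytes, max_chunk_size: int) -> list[bytes]:
    """Server side: split a serialized Arrow batch into size-bounded chunks."""
    return [batch[i:i + max_chunk_size]
            for i in range(0, len(batch), max_chunk_size)]

def reassemble(chunks: list[bytes]) -> bytes:
    """Client side: restore the original batch bytes from received chunks."""
    return b"".join(chunks)

# Scaled-down stand-in for a ~300MB batch against a 128MB limit.
batch = b"x" * 300
chunks = chunk_bytes(batch, 128)
assert all(len(c) <= 128 for c in chunks)
assert len(chunks) == 3              # 128 + 128 + 44
assert reassemble(chunks) == batch
```

The real implementation streams each chunk in its own ExecutePlanResponse; only the reassembled bytes are handed to the Arrow reader.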

Why are the changes needed?

It improves Spark Connect stability when returning large rows.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

New tests on both the server side and the client side.

Was this patch authored or co-authored using generative AI tooling?

No.

@xi-db xi-db changed the title [SPARK-53525] Spark Connect ArrowBatch Result Chunking [SPARK-53525][CONNECT] Spark Connect ArrowBatch Result Chunking Sep 8, 2025
bool reattachable = 1;
}

message ResultChunkingOptions {
Should the client be able to set chunk size?

@hvanhovell hvanhovell Sep 8, 2025

Also, can we name this ResultOptions? I can also see us setting a max arrow batch size here.


Yes, I introduced a client-side option preferred_arrow_chunk_size (this commit).

.build()
response.setArrowBatch(batch)
responseObserver.onNext(response.build())
for (i <- 0 until numChunks) {

Please don't use for comprehensions... Either use an actual loop, or the functional equivalent this translates into.


Yeah, right, updated to an explicit loop with foreach.

val to = math.min(from + maxChunkSize, bytes.length)
val length = to - from

val response = proto.ExecutePlanResponse

You can reuse the builder.


Yes, updated.

" While spark.connect.grpc.arrow.maxBatchSize determines the max size of a result batch," +
" maxChunkSize defines the max size of each individual chunk that is part of the batch" +
" that will be sent in a response. This allows the server to send large rows to clients." +
" However, excessively large plans remain unsupported due to Spark internals and JVM" +

Remove these two lines. They are not related to the conf.


Sounds good, removed.


// Execute plan.
@volatile var done = false
val responses = mutable.Buffer.empty[proto.ExecutePlanResponse]

If done need volatile, then this should be synchronized... Unless you are relying on some happens-before cleverness with the volatile variable (if so, then you need to document this)...


done here doesn't need volatile. I added volatile because other test cases have it. Removed as it is not needed in our case.

}

// Reassemble the chunks into a single Arrow batch and validate its content.
val batchData: ByteString =

When you build this for scala, please use a Concatenating InputStream or something like that.


Yeah, it's a good point, updated the test as well.

@hvanhovell

@xi-db are you adding scala support in a follow-up?

@xi-db

xi-db commented Sep 9, 2025

@xi-db are you adding scala support in a follow-up?

Yes, I'm implementing the scala support as a follow-up.

@xi-db xi-db force-pushed the arrow-batch-chunking branch from fb5689a to b68461f on September 10, 2025 12:09
sessionHolder.session.conf.get(CONNECT_SESSION_RESULT_CHUNKING_MAX_CHUNK_SIZE) > 0 &&
request.getRequestOptionsList.asScala.exists { option =>
option.hasResultChunkingOptions &&
option.getResultChunkingOptions.getAllowArrowBatchChunking == true
@heyihong heyihong Sep 11, 2025

nit: option.getResultChunkingOptions.getAllowArrowBatchChunking should be sufficient, since the default value is false. option.getResultChunkingOptions will return a default message even if it is not set. In proto3, you won’t get a null pointer when accessing an unset field.

request.getRequestOptionsList.asScala.iterator.collectFirst {
case option
if option.hasResultChunkingOptions &&
option.getResultChunkingOptions.hasPreferredArrowChunkSize =>
@heyihong heyihong Sep 11, 2025

nit: option.getResultChunkingOptions.hasPreferredArrowChunkSize should be sufficient

@hvanhovell hvanhovell left a comment

LGTM

@hvanhovell

Merging to master. You can fix the NITS in a follow-up.


@dongjoon-hyun dongjoon-hyun left a comment


+1, LGTM. Thank you, @xi-db and @hvanhovell .

dongjoon-hyun added a commit to apache/spark-connect-swift that referenced this pull request Oct 1, 2025
…th `4.1.0-preview2`

### What changes were proposed in this pull request?

This PR aims to update Spark Connect-generated Swift source code with Apache Spark `4.1.0-preview2`.

### Why are the changes needed?

There are many changes from Apache Spark 4.1.0.

- apache/spark#52342
- apache/spark#52256
- apache/spark#52271
- apache/spark#52242
- apache/spark#51473
- apache/spark#51653
- apache/spark#52072
- apache/spark#51561
- apache/spark#51563
- apache/spark#51489
- apache/spark#51507
- apache/spark#51462
- apache/spark#51464
- apache/spark#51442

To use the latest bug fixes and new messages to develop for new features of `4.1.0-preview2`.

```
$ git clone -b v4.1.0-preview2 https://github.com/apache/spark.git
$ cd spark/sql/connect/common/src/main/protobuf/
$ protoc --swift_out=. spark/connect/*.proto
$ protoc --grpc-swift_out=. spark/connect/*.proto

// Remove empty GRPC files
$ cd spark/connect

$ grep 'This file contained no services' *
catalog.grpc.swift:// This file contained no services.
commands.grpc.swift:// This file contained no services.
common.grpc.swift:// This file contained no services.
example_plugins.grpc.swift:// This file contained no services.
expressions.grpc.swift:// This file contained no services.
ml_common.grpc.swift:// This file contained no services.
ml.grpc.swift:// This file contained no services.
pipelines.grpc.swift:// This file contained no services.
relations.grpc.swift:// This file contained no services.
types.grpc.swift:// This file contained no services.

$ rm catalog.grpc.swift commands.grpc.swift common.grpc.swift example_plugins.grpc.swift expressions.grpc.swift ml_common.grpc.swift ml.grpc.swift pipelines.grpc.swift relations.grpc.swift types.grpc.swift
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Pass the CIs.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #250 from dongjoon-hyun/SPARK-53777.

Authored-by: Dongjoon Hyun <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
asf-gitbox-commits pushed a commit that referenced this pull request Nov 3, 2025
…king - Scala Client

### What changes were proposed in this pull request?

In the previous PR #52271 of Spark Connect ArrowBatch Result Chunking, both Server-side and PySpark client changes were implemented.

In this PR, the corresponding Scala client changes are implemented, so large Arrow rows are now supported on the Scala client as well.

To reproduce the existing issue we are solving here, run this code on Spark Connect Scala client:

```
val res = spark.sql("select repeat('a', 1024*1024*300)").collect()
println(res(0).getString(0).length)
```

It fails with `RESOURCE_EXHAUSTED` error with message `gRPC message exceeds maximum size 134217728: 314573320`, because the server is trying to send an ExecutePlanResponse of ~300MB to the client.

With the improvement introduced by the PR, the above code runs successfully and prints the expected result.

### Why are the changes needed?

It improves Spark Connect stability when returning large rows.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52496 from xi-db/arrow-batch-chuking-scala-client.

Authored-by: Xi Lyu <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
asf-gitbox-commits pushed a commit that referenced this pull request Nov 3, 2025
xi-db added a commit to xi-db/spark that referenced this pull request Nov 8, 2025
dongjoon-hyun pushed a commit that referenced this pull request Nov 8, 2025
… Chunking - Scala Client

### What changes were proposed in this pull request?

(This PR is a backporting PR containing #52496 and the test fix #52941.)

In the previous PR #52271 of Spark Connect ArrowBatch Result Chunking, both Server-side and PySpark client changes were implemented.

In this PR, the corresponding Scala client changes are implemented, so large Arrow rows are now supported on the Scala client as well.

To reproduce the existing issue we are solving here, run this code on Spark Connect Scala client:

```
val res = spark.sql("select repeat('a', 1024*1024*300)").collect()
println(res(0).getString(0).length)
```

It fails with `RESOURCE_EXHAUSTED` error with message `gRPC message exceeds maximum size 134217728: 314573320`, because the server is trying to send an ExecutePlanResponse of ~300MB to the client.

With the improvement introduced by the PR, the above code runs successfully and prints the expected result.

### Why are the changes needed?

It improves Spark Connect stability when returning large rows.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52953 from xi-db/[email protected].

Authored-by: Xi Lyu <[email protected]>
Signed-off-by: Dongjoon Hyun <[email protected]>
asf-gitbox-commits pushed a commit that referenced this pull request Nov 11, 2025
### What changes were proposed in this pull request?

Currently, Spark Connect enforces gRPC message limits on both the client and the server. These limits are largely meant to protect the server from potential OOMs by rejecting abnormally large messages. However, there are several cases where genuine messages exceed the limit and cause execution failures.

To improve Spark Connect stability, this PR implements compression of unresolved proto plans to mitigate oversized messages from the client to the server. The compression applies to ExecutePlan and AnalyzePlan - the only two methods that might hit the message limit. The separate issue of oversized messages from the server to the client is out of scope here (it was already fixed in #52271).

In the implementation,

* Zstandard is leveraged to compress the proto plan, as it showed consistently high performance in our benchmark and strikes a good balance between compression ratio and speed.
* The config `spark.connect.maxPlanSize` is introduced to control the maximum size of a (decompressed) proto plan that can be executed in Spark Connect. It is mainly used to avoid decompression bomb attacks.

(Scala client changes are being implemented in a follow-up PR.)

To reproduce the existing issue we are solving here, run this code on Spark Connect:

```
import random
import string

def random_letters(length: int) -> str:
    return ''.join(random.choices(string.ascii_letters, k=length))

num_unique_small_relations = 5
size_per_small_relation = 512 * 1024
small_dfs = [spark.createDataFrame([(random_letters(size_per_small_relation),)],) for _ in range(num_unique_small_relations)]
result_df = small_dfs[0]
for _ in range(512):
    result_df = result_df.unionByName(small_dfs[random.randint(0, len(small_dfs) - 1)])
result_df.collect()
```

It fails with `StatusCode.RESOURCE_EXHAUSTED` error with message `Sent message larger than max (269178955 vs. 134217728)`, because the client was trying to send an oversized message to the server.

Note: repeated small local relations are just one way of producing a large plan; plan size can also come from repeated subtrees created by plan transformations, serialized UDFs, external variables captured by UDFs, etc.

With the improvement introduced by the PR, the above code runs successfully and prints the expected result.
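The compress-then-cap scheme above can be sketched as follows. This is a hedged illustration only: the actual PR uses Zstandard, while this sketch substitutes Python's stdlib `zlib`, and the `max_plan_size` parameter merely mirrors the role of `spark.connect.maxPlanSize` as a decompression-bomb guard:

```python
# Illustrative sketch only: the PR uses Zstandard; stdlib zlib stands in here.
# decompress_plan refuses to inflate past a cap, mirroring the decompression-
# bomb protection that spark.connect.maxPlanSize provides on the server.
import zlib

MAX_PLAN_SIZE = 1024 * 1024  # illustrative 1MB cap, not Spark's default

def compress_plan(plan: bytes) -> bytes:
    return zlib.compress(plan)

def decompress_plan(data: bytes, max_plan_size: int = MAX_PLAN_SIZE) -> bytes:
    d = zlib.decompressobj()
    out = d.decompress(data, max_plan_size)
    # If input remains unconsumed or more output is pending, the plan would
    # decompress past the cap: reject it instead of inflating further.
    if d.unconsumed_tail or d.decompress(b"", 1):
        raise ValueError("decompressed plan exceeds max_plan_size")
    return out

plan = b"repeated local relation bytes " * 10_000  # highly repetitive payload
wire = compress_plan(plan)
assert len(wire) < len(plan)          # repetitive plans compress very well
assert decompress_plan(wire) == plan  # round-trips while under the cap
```

Capping the output size during decompression (rather than after) means a malicious or buggy client cannot force the server to materialize an arbitrarily large plan in memory.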

### Why are the changes needed?

It improves Spark Connect stability when executing and analyzing large plans.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New tests on both the server side and the client side.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #52894 from xi-db/plan-compression.

Authored-by: Xi Lyu <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>
asf-gitbox-commits pushed a commit that referenced this pull request Nov 11, 2025
zifeif2 pushed a commit to zifeif2/spark that referenced this pull request Nov 22, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025
huangxiaopingRD pushed a commit to huangxiaopingRD/spark that referenced this pull request Nov 25, 2025