
Conversation

@dengziming (Member) commented Nov 14, 2022

What changes were proposed in this pull request?

This PR adds support for local data in LocalRelation. We decided to use the Arrow IPC batch format to transfer data; the schema is embedded in the binary records, so we can remove the `attributes` field from `LocalRelation`.
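For illustration, a minimal PyArrow sketch (with made-up column names, not code from this PR) of why the IPC stream format lets us drop the out-of-band `attributes` field: the schema travels inside the bytes.

```python
import io

import pyarrow as pa

# Build a small record batch; "id" and "name" are illustrative columns.
batch = pa.record_batch(
    [pa.array([1, 2, 3]), pa.array(["a", "b", "c"])], names=["id", "name"])

# Serialize: the IPC stream writer prepends a schema message to the batches.
sink = io.BytesIO()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)
data = sink.getvalue()  # opaque bytes, suitable for a proto `bytes` field

# Deserialize: schema and rows are both recovered from the bytes alone.
reader = pa.ipc.open_stream(data)
print(reader.schema)                     # embedded schema, no separate field
print(sum(b.num_rows for b in reader))   # 3
```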

Why are the changes needed?

Local data is needed for unit testing and validation.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Unit tests.

@amaliujia (Contributor)

Question: can we re-use the Arrow collection work we have already done here?

cc @zhengruifeng

@amaliujia (Contributor) commented Nov 14, 2022

also cc @hvanhovell @amaliujia

@grundprinzip (Contributor) commented Nov 14, 2022

Thanks for the contribution! The overall approach seems good.

The original idea was for the local data to be sent as Arrow IPC batches as well, since that follows the same direction as the return path.

In addition, we have the benefit that the Arrow IPC message actually has a schema embedded, so we can do nice validation on the receive path.

It would be great to figure out if we can get a Python e2e test in there as well, just to make sure we cover the whole scenario. The easiest way might be to convert the Pandas DF into Arrow and then serialize this to the server, as sketched below.

https://arrow.apache.org/docs/python/pandas.html
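One possible shape of that test path (a hedged sketch using pandas and PyArrow; the DataFrame and variable names are made up, not the PR's actual test code):

```python
import io

import pandas as pd
import pyarrow as pa

# A toy DataFrame standing in for the test data.
pdf = pd.DataFrame({"id": [1, 2], "name": ["a", "b"]})

# Convert to Arrow and serialize as an IPC stream, schema included.
table = pa.Table.from_pandas(pdf, preserve_index=False)
sink = io.BytesIO()
with pa.ipc.new_stream(sink, table.schema) as writer:
    writer.write_table(table)
payload = sink.getvalue()  # bytes the client could send to the server
```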

@zhengruifeng (Contributor)

@dengziming thanks for the contributions!

I think we'd better use Arrow batches instead of structs in this proto message.

You may refer to #38468 for how to update the proto message, and to the implementation of `fromBatchIterator` for how to convert Arrow batches into internal rows.

@dengziming (Member Author)

Thank you all for your reviews @zhengruifeng @amaliujia @grundprinzip; there may be some delay since I need some time to get familiar with Arrow. 🤝

@AmplabJenkins

Can one of the admins verify this patch?

@amaliujia (Contributor)

@dengziming thanks!

BTW you can convert this PR to a draft and then re-open it when you think it is ready for review again.

@dengziming dengziming marked this pull request as draft November 16, 2022 02:15
@dengziming dengziming marked this pull request as ready for review November 16, 2022 06:17
@dengziming (Member Author)

I used the Arrow format without a schema here since we already define `attributes` in `LocalRelation`. WDYT? @amaliujia @zhengruifeng @grundprinzip

@grundprinzip (Contributor) left a comment

Very much looking forward to this change!

Contributor:

I'm not sure this needs to be `repeated bytes` here, because `bytes` itself is binary data of "arbitrary" length.

Contributor:

+1

Member Author:

Thank you. I used `repeated bytes` in case the batch size is larger than `maxRecordsPerBatch`, but I think `bytes` is enough here since `LocalRelation` is mostly used in debugging cases.
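For illustration, a single `bytes` field can still carry more rows than one batch holds, since an Arrow IPC stream may contain many record batches (a PyArrow sketch; the chunking below is illustrative, not Spark's actual `maxRecordsPerBatch` handling):

```python
import io

import pyarrow as pa

schema = pa.schema([("id", pa.int64())])

sink = io.BytesIO()
with pa.ipc.new_stream(sink, schema) as writer:
    # Write 10 batches of 1,000 rows each into one stream.
    for start in range(0, 10_000, 1_000):
        writer.write_batch(pa.record_batch(
            [pa.array(range(start, start + 1_000))], schema=schema))
blob = sink.getvalue()  # still a single binary value

reader = pa.ipc.open_stream(blob)
print(sum(b.num_rows for b in reader))  # 10000
```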

Contributor:

If `data` is a regular byte array, it becomes a `ByteString` that you can simply extract here.

Contributor:

`literals` -> `data`?

Contributor:

What about removing this field and reading the schema from the Arrow batch instead? @grundprinzip

Contributor:

In the short term it should work. We probably only need the name and type for the server side to construct such attributes for the local relation.

In the longer term I am not sure; it depends on whether there is other extra information that `LocalRelation` needs from such attributes.

Contributor:

Each `arrow_batch` in `collect` starts with the schema; it will be consistent if we also do this in `createDataFrame`.

Contributor:

Sure, I am not against using the Arrow schema for now.

Member Author:

I found we lack a `fromBatchWithSchemaIterator` method corresponding to `toBatchWithSchemaIterator`, so I will implement one.
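For intuition, the read direction looks roughly like this in Python (a hypothetical sketch; the name and signature are illustrative, not the actual Scala method's):

```python
from typing import Iterator

import pyarrow as pa


def from_batch_with_schema_iterator(data: bytes) -> Iterator[dict]:
    """Consume an IPC stream whose schema is embedded and yield plain rows."""
    reader = pa.ipc.open_stream(data)
    for batch in reader:
        # Each row comes back as a dict keyed by the embedded schema's names.
        yield from batch.to_pylist()
```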

@zhengruifeng (Contributor)

You may reformat the Scala code with:
./build/mvn -Pscala-2.12 scalafmt:format -Dscalafmt.skip=false -Dscalafmt.validateOnly=false -Dscalafmt.changedOnly=false -pl connector/connect

@amaliujia (Contributor)

You can also run the Scala linter locally: ./dev/lint-scala

@dengziming (Member Author)

I resolved the comments and moved the schema into the Arrow batch. There are still some TODOs left, which I will fix once we all agree on this plan. @amaliujia @grundprinzip @zhengruifeng

Member:

Let's keep these newlines. I think the Scala linter would complain about this.

Member Author:

Thank you, I have reverted these changes.

@grundprinzip (Contributor) left a comment

Conceptually the approach looks good. I think we need to improve the testing and fix some of the unnecessary format changes. The connect module uses auto-formatting, so it should be really easy.

One thing I've seen is that in the from-batch-with-schema iterator approach we don't check the schema integrity across batch boundaries. This might be OK in this case.

In addition, I'm wondering if it's not better to just use the from-buffer-with-schema method, because it has the invariant of having exactly one schema, and the message type only has one buffer anyway.

Otherwise I'm good with the approach. Looks very promising.

Contributor:

Why these changes?

Member Author:

Those were made by the IDE format plugin; I have reverted them.

Contributor:

The test here is kind of bare-bones. Before we fully approve the PR, we need to extend the test coverage a bit.

@dengziming dengziming force-pushed the SPARK-41114 branch 2 times, most recently from 5aa4bad to 8e49fd1 on November 21, 2022 16:22
@dengziming (Member Author) left a comment

Thank you for your reviews @grundprinzip; I have fixed most of them. I also changed the test case to be more convincing and added two other cases covering empty data and illegal data.
Do you think we need more test cases here?

Member Author:

Thank you, I have reverted these changes.

Member Author:

Those were made by the IDE format plugin; I have reverted them.

Member:

Suggested change: `Seq())` → `Seq.empty)`

Member:

I think this is too much to have as a common util in the core module. It's only used twice.

Member:

Sorry for the late review. Can we dedup the logic, as `ArrowBatchWithSchemaIterator` does?

Member:

Should we use the same protobuf message you added, @zhengruifeng?

Contributor:

We may want to do this the other way around, right? The row count in the current `ArrowBatch` message is not needed; that information is already encoded inside the Arrow IPC stream.

Contributor:

I'm fine with removing the row count; it was supported in `collect` just because it was in the initial proto message.

Contributor:

If the information is already in the Arrow IPC stream, +1 to removing the row count.

I also don't think the row count was used properly in the initial implementation (e.g., it probably was not used in the CSV version).
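To illustrate the point (a PyArrow sketch, not this PR's code), the count is always recomputable from the stream itself, so a separate proto field adds nothing:

```python
import io

import pyarrow as pa

batch = pa.record_batch([pa.array([10, 20, 30])], names=["v"])

sink = io.BytesIO()
with pa.ipc.new_stream(sink, batch.schema) as writer:
    writer.write_batch(batch)

# Reading the whole stream back yields the row count directly.
table = pa.ipc.open_stream(sink.getvalue()).read_all()
print(table.num_rows)  # 3, recomputed from the stream; no extra field needed
```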

@grundprinzip (Contributor) left a comment

Generally I'm happy with the PR. I just have minor nits on comments, etc. I will approve, but we will need @hvanhovell and/or @cloud-fan to approve as well.

@dengziming (Member Author)

Thank you @grundprinzip for your review. I fixed the comments; let's wait for @hvanhovell and @cloud-fan. 🤝

@amaliujia (Contributor) left a comment

LGTM! Thanks!

@HyukjinKwon (Member)

Merged to master.

@HyukjinKwon (Member)

@dengziming this is really awesome. Thanks for addressing all comments and landing this feature. Since you implemented this, are you also interested in supporting the `spark.createDataFrame(pandasDF)` case too? Pandas is arguably more common than plain `spark.createDataFrame(others)`. It shouldn't be super complicated to implement.

@dengziming (Member Author)

@HyukjinKwon Thank you, I'm glad to give it a try, but I'm new to Python, so it will take me some time to get familiar with it.

@grundprinzip (Contributor)

@dengziming If you don't mind, I would create a quick PR that allows reading the data from a Pandas DF, because that's very quick and helps us get to a useful state quickest.

If you're still interested in doing the Python-side work, maybe you can have a look at `createDataFrame` without Pandas, based on a schema and rows.

@dengziming (Member Author)

@grundprinzip Thank you, I would like to review your code.

@grundprinzip (Contributor)

@dengziming please have a look at #38803

beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 15, 2022
### What changes were proposed in this pull request?
This PR adds support for local data in LocalRelation. We decided to use the Arrow IPC batch format to transfer data; the schema is embedded in the binary records, so we can remove the `attributes` field from `LocalRelation`.

### Why are the changes needed?
Local data is needed for unit testing and validation.

### Does this PR introduce _any_ user-facing change?
No

### How was this patch tested?
Unit tests.

Closes apache#38659 from dengziming/SPARK-41114.

Authored-by: dengziming <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
beliefer pushed a commit to beliefer/spark that referenced this pull request Dec 18, 2022