Conversation

@amaliujia
Contributor

@amaliujia amaliujia commented Oct 3, 2022

What changes were proposed in this pull request?

Add an initial Read API for Spark Connect that allows setting the schema, format, options, and path, and then reading files into a DataFrame.
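
For illustration, a minimal sketch of the reader surface this adds, using the standard PySpark DataFrameReader calls (the path and schema string below are illustrative assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Set schema, format, an option, and a path, then read the files into a DataFrame.
df = (
    spark.read
    .format("csv")
    .schema("name STRING, age INT")  # DDL-formatted schema string
    .option("header", "true")
    .load("/tmp/people.csv")         # illustrative path
)
df.show()
```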

Why are the changes needed?

PySpark readwriter API parity for Spark Connect

Does this PR introduce any user-facing change?

No

How was this patch tested?

UT

@amaliujia
Contributor Author

@amaliujia amaliujia changed the title [SPARK-40539][CONNECT] PySpark readwriter API parity for Spark Connect [SPARK-40539][CONNECT] Initial DataFrame Read API parity for Spark Connect Oct 3, 2022
Contributor

@grundprinzip grundprinzip left a comment

I left some suggestions on some of the pieces. Nothing major.

@AmplabJenkins

Can one of the admins verify this patch?

Contributor

path should end up with options.

Contributor

format is required

Contributor

maybe simply map<string, string>?

Contributor Author

Oh yes, I forgot that proto supports MAP. Will update this.
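
For illustration, a small sketch of how the Python reader accumulates options as string key/value pairs, which is exactly the shape a proto map<string, string> captures (the format, option names, and path below are assumptions):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

reader = spark.read.format("json")
reader = reader.option("multiLine", "true")                   # one string key/value pair
reader = reader.options(mode="PERMISSIVE", encoding="UTF-8")  # several at once
# Every option ultimately travels as a string key mapped to a string value,
# i.e. the shape of a proto map<string, string>.
df = reader.load("/tmp/data.json")
```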

@amaliujia
Contributor Author

@HyukjinKwon @zhengruifeng @cloud-fan

This PR has finally caught up after the blockers were resolved. It is ready for another review now.

Contributor

@grundprinzip grundprinzip left a comment

Please have a look at my main comment in SparkConnectPlanner with regard to calling load() during planning.

Contributor

Suggested change
// Required. Supported formats include: parquet, orc, text, json, parquet, csv, avro.
// Required. Supported formats may include: parquet, orc, text, json, parquet, csv, avro.

The reason is that data source resolution happens on the server side and depends on which DS classes are available on the classpath.

Contributor Author

I believe these formats are called built-in formats, and we can trust that Spark will always support them by default.

Contributor

jdbc is also a built-in format. I think it's OK to just give some examples here.
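
For instance, a sketch of reading through the built-in jdbc format (the connection URL, table, and credentials are assumptions, and a matching JDBC driver has to be available on the classpath):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

df = (
    spark.read
    .format("jdbc")
    .option("url", "jdbc:postgresql://localhost:5432/testdb")  # assumed URL
    .option("dbtable", "public.people")                        # assumed table
    .option("user", "test")                                    # assumed credentials
    .option("password", "secret")
    .load()
)
```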

Contributor Author

@amaliujia amaliujia Oct 19, 2022

I really like how Apache Beam document their proto and I want to match it in connect once the proto becomes stable: https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto

So this part will be revised and expanded anyway (e.g., include the full list, document case sensitivity, document applicable options for each format if there are any, etc.).

Contributor

So my question here is whether the execution has already happened or not. What I mean is: is load() a blocking operation or a logical one? If you call load() and then return the analyzed plan, what happens when you call collect on that plan? Does the load happen again?

Can you verify this?

Contributor

load() just builds the logical plan; it's not an action.

Contributor

Thanks, I was able to verify this for myself as well with some experiments.
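
For reference, a sketch of one such experiment (assuming a local session and a JSON file at /tmp/people.json): load() returns immediately with a plan, and the scan only runs once an action executes.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").getOrCreate()

df = (
    spark.read
    .format("json")
    .schema("name STRING, age INT")  # explicit schema avoids an inference scan
    .load("/tmp/people.json")        # logical only: no job runs here
)

df.explain()         # prints the plan; still nothing has executed
rows = df.collect()  # first action: the files are scanned now
rows = df.collect()  # second action: the scan runs again (no implicit caching)
```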

@amaliujia amaliujia force-pushed the SPARK-40539 branch 2 times, most recently from 81bbfc3 to cb1395f on October 18, 2022 at 18:27
Contributor

I assume these APIs are just copied from pyspark.

Contributor Author

Yes, these are to match the DataFrame API.

Contributor

Is it generated by proto?

Contributor Author

It is not. This class is more or less the same as the DSL that we introduced for Scala: the core idea is to provide a toProto conversion.
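
To make the idea concrete, a hypothetical sketch (not the actual PR code; the class name, fields, and to_proto shape are invented for illustration) of a small client-side node whose only job is the toProto conversion:

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

@dataclass
class Read:
    """Hypothetical client-side plan node mirroring a proto Read message."""
    data_format: str
    options: Dict[str, str] = field(default_factory=dict)
    schema: Optional[str] = None
    paths: List[str] = field(default_factory=list)

    def to_proto(self) -> dict:
        # The real DSL would populate generated protobuf classes; a plain
        # dict stands in here to keep the sketch self-contained.
        return {
            "format": self.data_format,
            "options": dict(self.options),
            "schema": self.schema,
            "paths": list(self.paths),
        }

# Usage: build the node, then serialize it for the server.
plan = Read("csv", {"header": "true"}, "name STRING, age INT", ["/tmp/people.csv"])
print(plan.to_proto())
```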

@cloud-fan
Contributor

thanks, merging to master!

@cloud-fan cloud-fan closed this in 01c7a46 Oct 20, 2022
Member

@HyukjinKwon HyukjinKwon left a comment

late LGTM2

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38086 from amaliujia/SPARK-40539.

Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>