[SPARK-40539][CONNECT] Initial DataFrame Read API parity for Spark Connect #38086
Conversation
grundprinzip left a comment:
I left some suggestions on some of the pieces. Nothing major.
Can one of the admins verify this patch?
python/pyspark/sql/connect/plan.py
path should end up with options.
python/pyspark/sql/connect/plan.py
format is required
maybe simply map<string, string>?
Oh yes, I forgot that proto supports MAP. Will update this.
@HyukjinKwon @zhengruifeng @cloud-fan This PR has finally caught up now that the blockers are resolved. It is ready for another review.
grundprinzip left a comment:
Please have a look at my main comment in SparkConnectPlanner with regard to calling load() during planning.
Suggested change:
- // Required. Supported formats include: parquet, orc, text, json, parquet, csv, avro.
+ // Required. Supported formats may include: parquet, orc, text, json, parquet, csv, avro.
The reason is that the resolution of the data source is happening on the server side and depends on which DS classes are available in the classpath.
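(For illustration, a sketch not taken from this PR: the client only ships the format name, so resolution is entirely a server-side concern. The endpoint and path below are made up.)

```python
from pyspark.sql import SparkSession

# Hypothetical Connect endpoint; assumes a running Spark Connect server
# and a PySpark version with Connect support.
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# Only the string "avro" travels over the wire in the Read relation.
# Whether it resolves to a DataSource is decided on the server, based on
# the jars on the server's classpath (e.g. spark-avro must be installed
# there, not on the client).
df = spark.read.format("avro").load("/data/events.avro")
```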
I believe these formats are called built-in formats, and we can trust that Spark will always support them by default.
jdbc is also a built-in format. I think it's OK to just give some examples here.
I really like how Apache Beam documents its protos, and I want to match that in Connect once the proto becomes stable: https://github.com/apache/beam/blob/master/model/pipeline/src/main/proto/org/apache/beam/model/pipeline/v1/beam_runner_api.proto
So this part will be revised and expanded anyway (e.g. to include the full list, document case sensitivity, document the applicable options for each format if there are any, etc.).
So my question here is whether the execution has already happened or not. What I mean is: is load() a blocking operation or a logical one? If you call load() and then return the analyzed plan, what happens when you call collect() on that plan? Does the load happen again?
Can you verify this?
load() just builds the logical plan; it's not an action.
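(A quick illustration of the laziness, assuming an existing SparkSession `spark` and a made-up path:)

```python
# load() only constructs the logical plan; no files are opened yet.
df = spark.read.format("json").load("/tmp/people.json")

# The scan happens here, when an action executes the plan.
rows = df.collect()

# A second action re-executes the plan from scratch (unless df is cached);
# load() itself is never a separate, repeated step.
count = df.count()
```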
Thanks, I was able to verify this for myself as well with some experiments.
I assume these APIs are just copied from pyspark.
Yes, these are to match the DataFrame API.
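(For reference, the PySpark DataFrameReader chain being mirrored looks like this; the schema string and path are illustrative:)

```python
df = (
    spark.read
    .format("csv")
    .schema("name STRING, age INT")  # DDL-string schema, as PySpark accepts
    .option("header", "true")
    .load("/data/people.csv")
)
```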
python/pyspark/sql/connect/plan.py
Is it generated from the proto?
It is not. This class is more or less the same as the DSL that we introduced for Scala: the core idea is to provide a toProto conversion.
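(A minimal sketch of that idea, with hypothetical names and plain dicts standing in for the generated protobuf classes; the real plan.py builds spark.connect.Relation messages:)

```python
from dataclasses import dataclass, field
from typing import Dict, List

@dataclass
class Read:
    """Client-side plan node: records what the user asked for and knows
    how to serialize itself. Illustrative only, not the real API."""

    format: str
    options: Dict[str, str] = field(default_factory=dict)
    paths: List[str] = field(default_factory=list)

    def to_proto(self) -> dict:
        # The options dict maps one-to-one onto the proto
        # map<string, string> field discussed earlier in this thread.
        return {
            "read": {
                "data_source": {
                    "format": self.format,
                    "options": dict(self.options),
                    "paths": list(self.paths),
                }
            }
        }

plan = Read("csv", {"header": "true"}, ["/data/people.csv"])
print(plan.to_proto())
```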
Thanks, merging to master!
HyukjinKwon left a comment:
late LGTM2
What changes were proposed in this pull request?
Add an initial Read API for Spark Connect that allows setting the schema, format, options, and path, and then reading files into a DataFrame.
Why are the changes needed?
PySpark readwriter API parity for Spark Connect.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
UT.
Closes apache#38086 from amaliujia/SPARK-40539.
Authored-by: Rui Wang <[email protected]>
Signed-off-by: Wenchen Fan <[email protected]>