
Conversation

@grundprinzip (Contributor) commented Oct 10, 2022

What changes were proposed in this pull request?

This change adds basic support for writes through the Spark Connect API. In the current implementation of the DataFrame write API, the DataFrameWriter interface is already about as declarative as it can be.

The write support is implemented as a `Command` and does not return anything.
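For illustration, the client-side call that this `Command` covers looks roughly like the following (a minimal sketch; the session setup and output path are hypothetical):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-example").getOrCreate()
    val df = spark.range(10).toDF("id")

    // Executed as a command: the write runs eagerly and returns Unit;
    // there is no result relation to send back to the client.
    df.write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .save("/tmp/write-example") // hypothetical output path

    spark.stop()
  }
}
```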

Why are the changes needed?

Write support through Spark Connect.

Does this PR introduce any user-facing change?

Experimental API

How was this patch tested?

Added new unit tests for the behavior.

Contributor:

It is a bit weird to have this in the SparkPlanner node, but I guess this is the consequence of the builder() API we have in the DataFrameWriter.

@cloud-fan AFAIK you have been working on making writes more declarative (i.e. planned writes). Do you see a way to improve this?
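To make the builder() consequence concrete, the planner essentially has to replay the proto fields onto the DataFrameWriter. The sketch below is self-contained and hypothetical: a plain case class stands in for the WriteOperation proto, and the handler is illustrative, not the actual SparkConnectPlanner code.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical stand-in for the WriteOperation proto message; the field
// names mirror the proto discussed further down in this review.
case class WriteOp(
    source: String,
    path: Option[String],
    tableName: Option[String],
    mode: SaveMode,
    options: Map[String, String])

// The planner replays the fields onto the builder, because the
// DataFrameWriter exposes no declarative plan it could hand over.
def handleWrite(df: DataFrame, op: WriteOp): Unit = {
  val writer = df.write.format(op.source).mode(op.mode).options(op.options)
  (op.path, op.tableName) match {
    case (Some(p), _) => writer.save(p)
    case (_, Some(t)) => writer.saveAsTable(t)
    case _            => writer.save() // destination comes from the options
  }
}
```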

Contributor:

This is about more than planned writes. We need to create a logical plan for the DataFrame write, instead of putting implementation code in the DataFrame write APIs.
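As a toy illustration of the difference (the types below are invented for this sketch, not Catalyst classes): with a dedicated plan node, the write API only describes the operation, and a single interpreter executes it, so every entry point shares one code path.

```scala
// Toy plan nodes, invented for illustration only.
sealed trait Plan
case class Scan(table: String) extends Plan
case class WriteToDataSource(
    child: Plan,
    source: String,               // e.g. "parquet"
    mode: String,                 // e.g. "overwrite"
    options: Map[String, String]) // path, table, jdbc options, ...
  extends Plan

// The write API merely builds the node; all execution logic lives here.
def execute(plan: Plan): Unit = plan match {
  case WriteToDataSource(child, source, mode, options) =>
    execute(child)
    println(s"write as $source, mode=$mode, options=$options")
  case Scan(table) =>
    println(s"scan $table")
}
```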

@HyukjinKwon changed the title from [CONNECT] [SPARK-40737] Add basic support for DataFrameWriter to [SPARK-40737][CONNECT] Add basic support for DataFrameWriter on Oct 11, 2022
@grundprinzip force-pushed the spark-40737 branch 2 times, most recently from 76aea0e to d67edb8 on October 11, 2022 11:15
@grundprinzip marked this pull request as ready for review on October 11, 2022 11:19
@grundprinzip (Contributor, Author):

@cloud-fan @amaliujia @hvanhovell please take a look!

Contributor:

Should we use IllegalArgumentException here? Or do you feel this needs its own specific exception?

@grundprinzip (Contributor, Author):

I wanted to have a custom exception for when we rethrow.

Member:

If this is a user-facing error, we should actually leverage the error framework we have ... cc @gengliangwang @MaxGekk @itholic

@grundprinzip (Contributor, Author):

I'm happy to fix this as a follow-up; does that make sense?

The errors are reported back through gRPC. If you point me to the right base class, I can fix it then.
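For reference, surfacing a server-side failure over gRPC usually looks something like the sketch below (assuming the io.grpc runtime; the blanket mapping to INTERNAL is illustrative, not what this PR does):

```scala
import io.grpc.{Status, StatusRuntimeException}

// Wrap an arbitrary server-side exception in a gRPC status so the client
// sees a structured error instead of a dropped stream.
def toGrpcError(e: Throwable): StatusRuntimeException =
  Status.INTERNAL
    .withDescription(Option(e.getMessage).getOrElse(e.getClass.getName))
    .withCause(e)
    .asRuntimeException()
```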

@AmplabJenkins:

Can one of the admins verify this patch?

@grundprinzip (Contributor, Author):

@hvanhovell @cloud-fan @HyukjinKwon can you please have a look?

@HyukjinKwon (Member) left a comment:

LGTM otherwise from my end.

@hvanhovell (Contributor):

Merging this one.

Relation input = 1;
// Format value according to the Spark documentation. Examples are: text, parquet, delta.
string source = 2;
// The destination of the write operation must be either a path or a table.
Contributor:

In the DataFrame API, people can do `df.write.format("jdbc").option("table", ...).save()`, so the destination is neither a path nor a table. I think an optional table name is sufficient. If the table name is not given, the destination will be figured out from the write options (path is just one write option).
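Spelled out, the three destination styles look like this (a sketch; the JDBC URL and table names are placeholders, and `dbtable` is used as the standard JDBC option name):

```scala
import org.apache.spark.sql.DataFrame

def destinations(df: DataFrame): Unit = {
  df.write.format("parquet").save("/tmp/out")      // destination: a path
  df.write.format("parquet").saveAsTable("db.tbl") // destination: a table

  // Destination is neither: it is derived entirely from write options.
  df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://host/db") // placeholder URL
    .option("dbtable", "public.tbl")            // placeholder table
    .save()
}
```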

string path = 3;
string table_name = 4;
}
SaveMode mode = 5;
@cloud-fan (Contributor) commented Oct 17, 2022:

We added DataFrameWriterV2 because we believe SaveMode is a bad design. It's confusing when we write to a table, as there are so many possible behaviors: create if not exists, create or replace, replace if exists, append if exists, overwrite data if exists, etc.

Anyway, we need to support save mode in the proto definition to support the existing DataFrame API. If we want to support DataFrameWriterV2 in the Spark Connect client, we should probably have a new proto definition without save mode.
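For comparison, the same write intent in both APIs (a sketch; the table name is a placeholder, and the V2 calls are alternatives; in real code you would pick exactly one):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

def compare(df: DataFrame): Unit = {
  // V1: a single SaveMode enum has to cover every table-writing behavior.
  df.write.mode(SaveMode.Overwrite).saveAsTable("db.tbl")

  // V2 (DataFrameWriterV2): each behavior is a distinct, explicit call.
  df.writeTo("db.tbl").append()              // fails if the table is missing
  df.writeTo("db.tbl").create()              // fails if the table exists
  df.writeTo("db.tbl").createOrReplace()     // create, or fully replace
  df.writeTo("db.tbl").overwritePartitions() // dynamic partition overwrite
}
```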

}
SaveMode mode = 5;
// List of columns to sort the output by.
repeated string sort_column_names = 6;
Contributor:

This should be part of the BucketBy message.
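The DataFrame API supports this reading: `sortBy` is only valid together with `bucketBy`, which argues for nesting the sort columns under the bucketing message in the proto. A sketch (the table name is a placeholder):

```scala
import org.apache.spark.sql.DataFrame

def bucketedWrite(df: DataFrame): Unit = {
  // Valid: sortBy piggybacks on bucketBy.
  df.write
    .bucketBy(4, "id")
    .sortBy("ts")
    .saveAsTable("db.bucketed_tbl")

  // Invalid: sortBy without bucketBy fails with
  // "sortBy must be used together with bucketBy".
  // df.write.sortBy("ts").saveAsTable("db.tbl")
}
```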

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38192 from grundprinzip/spark-40737.

Authored-by: Martin Grund <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>