
Conversation

@grundprinzip (Contributor) commented Oct 10, 2022

What changes were proposed in this pull request?

This change adds basic support for writes through the Spark Connect API. In the current implementation of the DataFrame write API, the DataFrameWriter interface is already about as declarative as it can be.

The write support is implemented as a `Command` and does not return anything.
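For illustration, the client-side call that this `Command` covers looks roughly like the following (a minimal sketch; the session setup and output path are hypothetical):

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object WriteExample {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("write-example").getOrCreate()
    val df = spark.range(10).toDF("id")

    // Executed as a command: the write runs eagerly and returns Unit;
    // there is no result relation to send back to the client.
    df.write
      .format("parquet")
      .mode(SaveMode.Overwrite)
      .save("/tmp/write-example") // hypothetical output path

    spark.stop()
  }
}
```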

Why are the changes needed?

Write support through Spark Connect.

Does this PR introduce any user-facing change?

Experimental API

How was this patch tested?

Added new unit tests for the behavior.

Contributor:

It is a bit weird to have this in the SparkPlanner node, but I guess this is the consequence of the builder() API we have in the DataFrameWriter.

@cloud-fan AFAIK you have been working on making writes more declarative (i.e. planned writes). Do you see a way to improve this?
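To make the builder() consequence concrete, the planner essentially has to replay the proto fields onto the DataFrameWriter. The sketch below is self-contained and hypothetical: a plain case class stands in for the WriteOperation proto, and the handler is illustrative, not the actual SparkConnectPlanner code.

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

// Hypothetical stand-in for the WriteOperation proto message; the field
// names mirror the proto discussed further down in this review.
case class WriteOp(
    source: String,
    path: Option[String],
    tableName: Option[String],
    mode: SaveMode,
    options: Map[String, String])

// The planner replays the fields onto the builder, because the
// DataFrameWriter exposes no declarative plan it could hand over.
def handleWrite(df: DataFrame, op: WriteOp): Unit = {
  val writer = df.write.format(op.source).mode(op.mode).options(op.options)
  (op.path, op.tableName) match {
    case (Some(p), _) => writer.save(p)
    case (_, Some(t)) => writer.saveAsTable(t)
    case _            => writer.save() // destination comes from the options
  }
}
```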

Contributor:

This is about more than planned writes. We need to create a logical plan for the DataFrame write, instead of putting implementation code in the DataFrame write APIs.
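As a toy illustration of the difference (the types below are invented for this sketch, not Catalyst classes): with a dedicated plan node, the write API only describes the operation, and a single interpreter executes it, so every entry point shares one code path.

```scala
// Toy plan nodes, invented for illustration only.
sealed trait Plan
case class Scan(table: String) extends Plan
case class WriteToDataSource(
    child: Plan,
    source: String,               // e.g. "parquet"
    mode: String,                 // e.g. "overwrite"
    options: Map[String, String]) // path, table, jdbc options, ...
  extends Plan

// The write API merely builds the node; all execution logic lives here.
def execute(plan: Plan): Unit = plan match {
  case WriteToDataSource(child, source, mode, options) =>
    execute(child)
    println(s"write as $source, mode=$mode, options=$options")
  case Scan(table) =>
    println(s"scan $table")
}
```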

@HyukjinKwon changed the title from [CONNECT] [SPARK-40737] Add basic support for DataFrameWriter to [SPARK-40737][CONNECT] Add basic support for DataFrameWriter on Oct 11, 2022
@grundprinzip force-pushed the spark-40737 branch 2 times, most recently from 76aea0e to d67edb8 on October 11, 2022 11:15
@grundprinzip marked this pull request as ready for review on October 11, 2022 11:19
@grundprinzip (Contributor, Author):

@cloud-fan @amaliujia @hvanhovell please take a look!

Contributor:

Should we use IllegalArgumentException here? Or do you feel this needs its own specific exception?

@grundprinzip (Contributor, Author):

I wanted to have a custom exception for when we rethrow.

Member:

If this is a user-facing error, we should actually leverage the error framework we have ... cc @gengliangwang @MaxGekk @itholic

@grundprinzip (Contributor, Author):

I'm happy to fix this as a follow-up; does that make sense?

The errors are reported back through gRPC. If you point me to the right base class, I can fix it then.
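For reference, surfacing a server-side failure over gRPC usually looks something like the sketch below (assuming the io.grpc runtime; the blanket mapping to INTERNAL is illustrative, not what this PR does):

```scala
import io.grpc.{Status, StatusRuntimeException}

// Wrap an arbitrary server-side exception in a gRPC status so the client
// sees a structured error instead of a dropped stream.
def toGrpcError(e: Throwable): StatusRuntimeException =
  Status.INTERNAL
    .withDescription(Option(e.getMessage).getOrElse(e.getClass.getName))
    .withCause(e)
    .asRuntimeException()
```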

@AmplabJenkins:

Can one of the admins verify this patch?

@grundprinzip (Contributor, Author):

@hvanhovell @cloud-fan @HyukjinKwon can you please have a look?

@HyukjinKwon (Member) left a comment:

LGTM otherwise from my end.

@hvanhovell (Contributor):

Merging this one.

Relation input = 1;
// Format value according to the Spark documentation. Examples are: text, parquet, delta.
string source = 2;
// The destination of the write operation must be either a path or a table.
Contributor:

In the DataFrame API, people can do `df.write.format("jdbc").option("table", ...).save()`, so the destination is neither a path nor a table. I think an optional table name is sufficient. If the table name is not given, the destination will be figured out from the write options (path is just one write option).
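Spelled out, the three destination styles look like this (a sketch; the JDBC URL and table names are placeholders, and `dbtable` is used as the standard JDBC option name):

```scala
import org.apache.spark.sql.DataFrame

def destinations(df: DataFrame): Unit = {
  df.write.format("parquet").save("/tmp/out")      // destination: a path
  df.write.format("parquet").saveAsTable("db.tbl") // destination: a table

  // Destination is neither: it is derived entirely from write options.
  df.write
    .format("jdbc")
    .option("url", "jdbc:postgresql://host/db") // placeholder URL
    .option("dbtable", "public.tbl")            // placeholder table
    .save()
}
```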

string path = 3;
string table_name = 4;
}
SaveMode mode = 5;
@cloud-fan (Contributor) commented Oct 17, 2022:

We added DataFrameWriterV2 because we believe SaveMode is a bad design. It's confusing when we write to a table, as there are so many possible behaviors: create if not exists, create or replace, replace if exists, append if exists, overwrite data if exists, etc.

Anyway, we need to support save mode in the proto definition to support the existing DataFrame API. If we want to support DataFrameWriterV2 in the Spark Connect client, we should probably have a new proto definition without save mode.
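For comparison, the same write intent in both APIs (a sketch; the table name is a placeholder, and the V2 calls are alternatives; in real code you would pick exactly one):

```scala
import org.apache.spark.sql.{DataFrame, SaveMode}

def compare(df: DataFrame): Unit = {
  // V1: a single SaveMode enum has to cover every table-writing behavior.
  df.write.mode(SaveMode.Overwrite).saveAsTable("db.tbl")

  // V2 (DataFrameWriterV2): each behavior is a distinct, explicit call.
  df.writeTo("db.tbl").append()              // fails if the table is missing
  df.writeTo("db.tbl").create()              // fails if the table exists
  df.writeTo("db.tbl").createOrReplace()     // create, or fully replace
  df.writeTo("db.tbl").overwritePartitions() // dynamic partition overwrite
}
```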

}
SaveMode mode = 5;
// List of columns to sort the output by.
repeated string sort_column_names = 6;
Contributor:

This should be part of the BucketBy message.
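The DataFrame API supports this reading: `sortBy` is only valid together with `bucketBy`, which argues for nesting the sort columns under the bucketing message in the proto. A sketch (the table name is a placeholder):

```scala
import org.apache.spark.sql.DataFrame

def bucketedWrite(df: DataFrame): Unit = {
  // Valid: sortBy piggybacks on bucketBy.
  df.write
    .bucketBy(4, "id")
    .sortBy("ts")
    .saveAsTable("db.bucketed_tbl")

  // Invalid: sortBy without bucketBy fails with
  // "sortBy must be used together with bucketBy".
  // df.write.sortBy("ts").saveAsTable("db.tbl")
}
```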

SandishKumarHN pushed a commit to SandishKumarHN/spark that referenced this pull request Dec 12, 2022
Closes apache#38192 from grundprinzip/spark-40737.

Authored-by: Martin Grund <[email protected]>
Signed-off-by: Herman van Hovell <[email protected]>