
Conversation

zero323 (Member) commented Jan 23, 2020

What changes were proposed in this pull request?

  • Adds `DataFrameWriterV2` class.
  • Adds `writeTo` method to `pyspark.sql.DataFrame`.
  • Adds related SQL partitioning functions (`years`, `months`, ..., `bucket`). A usage sketch follows below.
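
A minimal sketch of the intended usage (the catalog, table, and column names are placeholders; the target catalog must support the DataSource V2 catalog API):

```python
from pyspark.sql.functions import years, bucket

# Placeholder identifier: "testcat.db.events" must resolve to a
# catalog that implements the DSv2 TableCatalog API.
(df.writeTo("testcat.db.events")
    .using("parquet")
    .partitionedBy(years("ts"), bucket(16, "id"))
    .createOrReplace())
```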

Why are the changes needed?

Feature parity.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

Added new unit tests.

TODO: Should we test against org.apache.spark.sql.connector.InMemoryTableCatalog? If so, how to expose it in Python tests?
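
One hypothetical approach, assuming the test-scope catalog class can be put on the Python test classpath (which is exactly the open question above), would be to register it via configuration:

```python
# Hypothetical sketch: register the JVM test catalog under a name and
# write to it through the v2 API; requires the test jar on the classpath.
spark.conf.set(
    "spark.sql.catalog.testcat",
    "org.apache.spark.sql.connector.InMemoryTableCatalog")
df.writeTo("testcat.test_table").create()
```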

SparkQA commented Jan 23, 2020

Test build #117270 has finished for PR 27331 at commit 81fac11.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

SparkQA commented Jan 23, 2020

Test build #117307 has finished for PR 27331 at commit 7a1aa6e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

@zero323 zero323 closed this Jan 25, 2020
@zero323 zero323 deleted the SPARK-29157 branch January 25, 2020 00:01
@zero323 zero323 restored the SPARK-29157 branch January 25, 2020 00:01
@zero323 zero323 reopened this Jan 25, 2020
zero323 (Member, Author) commented Jan 26, 2020

Waiting for resolution of the dev-list discussion: Revert and revisit the public custom expression API for partition (a.k.a. Transform API).

@zero323 zero323 force-pushed the SPARK-29157 branch 2 times, most recently from 41f4e18 to 8de4978 Compare February 4, 2020 15:41
SparkQA commented Feb 4, 2020

Test build #117849 has finished for PR 27331 at commit 8de4978.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

zero323 (Member, Author) commented Mar 3, 2020

@HyukjinKwon Glancing over the discussion, it doesn't seem like the upstream feature is going to be reverted, does it?

HyukjinKwon (Member) commented Mar 3, 2020

No, it seems not. But I am not very sure we should expose these APIs, considering that DSv2 is still under development and incomplete. These APIs are considered exceptions in terms of compatibility across versions, and they are unstable at the moment.

HyukjinKwon (Member) commented:

WDYT @rdblue, @dbtsai, @cloud-fan, @brkyvz?

zero323 (Member, Author) commented Mar 4, 2020

> No, it seems not. But I am not very sure we should expose these APIs, considering that DSv2 is still under development and incomplete. These APIs are not considered exceptions, and they are unstable at the moment.

Makes sense.

rdblue (Contributor) commented Mar 4, 2020

I think it's a good idea to keep Python up to date with the Scala API. Thanks for fixing this, @zero323!

If there is a stated strategy that prevents us from merging this, then I'm fine waiting for now. But if we don't have an existing policy to avoid adding experimental APIs to Python, I think we should add it.

SparkQA commented Mar 4, 2020

Test build #119264 has finished for PR 27331 at commit ac9eab4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class DataFrameWriterV2(object):

HyukjinKwon (Member) commented:

I am okay with adding it in PySpark. I just wanted to make sure whether we're going to change the DSv2 API shape more or not. If that's expected, let's add them later to reduce dev overhead, as new APIs should target 3.1 anyway. If it's expected not to change, I am good with adding it.

@zero323 zero323 changed the title [WIP][SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API [SPARK-29157][SQL][PYSPARK] Add DataFrameWriterV2 to Python API Mar 4, 2020
zero323 (Member, Author) commented Mar 4, 2020

> I am okay with adding it in PySpark. I just wanted to make sure whether we're going to change the DSv2 API shape more or not. If that's expected, let's add them later to reduce dev overhead, as new APIs should target 3.1 anyway. If it's expected not to change, I am good with adding it.

I don't have a strong opinion here, but I guess one consideration is how much keeping this out of PySpark limits the feedback that can be gathered and used to drive the evolution of the API.

SparkQA commented Jun 20, 2020

Test build #124325 has finished for PR 27331 at commit 24ec8f9.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented Jun 21, 2020

@rdblue WDYT? I don't know the details of its stability here. Does it look to you like the APIs are not going to change soon? Just a rough estimate is fine.

rdblue (Contributor) commented Jun 22, 2020

As I said at the time, I think this should have been merged.

HyukjinKwon (Member) commented Jun 23, 2020

Sorry, there's no policy to block this, but I also don't think it's practically a good idea to merge if it's expected to change.

Once you have PySpark and SparkR APIs, you have to fix all of them together, tests included, every time you change the API, and I don't believe all developers are used to all the languages, which is overhead. That's why, as an example, we have a bunch of inconsistencies between the SQL function APIs across languages.

@rdblue, can I ask for a rough estimate of the stability?

zero323 (Member, Author) commented Jun 24, 2020

> it's a good idea to merge if it's expected to change.

Just my 2 cents ‒ if only APIs that have stabilized are exposed, then Python users, who, if I recall correctly, make up around half of the whole user base, are essentially excluded from the shaping and testing process, aren't they? That's a huge issue.

```python
        return self

    @since(3.1)
    def partitionedBy(self, col, *cols):
```
HyukjinKwon (Member) commented Jun 25, 2020

Maybe it's important to describe what is expected for col: only column references and the partition transform functions are allowed, not regular Spark Column expressions.

I still don't like that we made this API look like it takes regular Spark Columns: they are mutually exclusive, given the last discussion on the dev mailing list. This was one of the reasons why Pandas UDFs were redesigned and split into two separate groups. Let's at least clarify it.
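
For example (a sketch with hypothetical table and column names):

```python
from pyspark.sql.functions import col, days

writer = df.writeTo("testcat.db.events")  # hypothetical identifier

# Allowed: simple column references (identity partitioning) and the
# dedicated partition transform functions (years, months, days, hours, bucket).
writer.partitionedBy(col("region"), days("ts"))

# Not allowed: arbitrary Column expressions, e.g.
#   writer.partitionedBy(col("ts") + 1)
# which is rejected as an invalid partition transformation.
```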

HyukjinKwon (Member) commented Jun 25, 2020

@rdblue, @brkyvz, @cloud-fan, should we maybe at least use a different class for these partition column expressions, such as PartitioningColumn, like we do for TypedColumn, and add asPartitioningColumn to Column?

I remember we basically wanted to remove these partitioning-specific expressions in [DISCUSS] Revert and revisit the public custom expression API for partition (a.k.a. Transform API) if we find a better way to do it.

I suspect a PartitioningColumn is okay as a temporary solution (?) because we can guard it by typing, and we can move these partitioning-specific expressions into a separate package. I think this at least makes them distinguishable. I can work on it as well if this sounds fine to you guys. A rough sketch of the idea follows below.
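
A purely illustrative Python analogue of that suggestion (nothing here exists in this PR; `PartitioningColumn` and the stub class are hypothetical): a dedicated wrapper type would let `partitionedBy` guard its inputs by typing instead of accepting any `Column`.

```python
class PartitioningColumn(object):
    """Hypothetical wrapper marking a value as a partition transform,
    analogous to how TypedColumn is distinguished from Column in Scala."""
    def __init__(self, jcol):
        self._jcol = jcol


class DataFrameWriterV2Sketch(object):
    """Illustrative stub, not the class added by this PR."""
    def partitionedBy(self, col, *cols):
        # Guard by typing: only column names and explicit partition
        # transforms pass; arbitrary expressions are rejected up front.
        for c in (col,) + cols:
            if not isinstance(c, (str, PartitioningColumn)):
                raise TypeError(
                    "expected a column name or PartitioningColumn, "
                    "got %s" % type(c).__name__)
        return self
```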

rdblue (Contributor) commented:

I don't see the need for a separation here that doesn't exist in Scala.

HyukjinKwon (Member) commented:

@rdblue, I don't mean that we should do that here. I mean to suggest/discuss making the separation in Scala first, because the confusion propagates to the PySpark API side as well.

They are different things, so I am suggesting making them different. I hope we can focus more on the discussion itself.

rdblue (Contributor) commented Jun 25, 2020

> nobody knows the answer about the stability

Short version: I don't think it is relevant; on its merits, I expect the API to be stable; I don't know what other people will do.

I haven't replied because I don't see how it is an important concern. An experimental API in Scala can be an experimental API in Python. There's no strategy about not porting APIs that might change. And, it's easier to maintain compatibility on the Python side. That's why I think it should be merged, regardless of the question of stability.

But if stability matters to you, I'll quickly try to address it.

  • Stability in an API is hard to judge without real-world use. That's why we wait some amount of time, instead of just declaring it stable right away. I think this is very likely to be stable because it is a translation of the underlying SQL primitives, which are stable. But...
  • I am not in control of the changes. Maintaining an API as stable is a choice; the previous DataFrameWriter is a poor API, but has been maintained as stable. In contrast, the 2.4 DSv2 API was good enough, but rewritten. I can't say what other members of the community might try to change in this API.

HyukjinKwon (Member) commented Jun 26, 2020

> I haven't replied because I don't see how it is an important concern.

@rdblue, I explained multiple times why I think this is relevant and important: once you add these APIs, you have to fix them on the Python and R side too. I don't believe all developers are used to the Python and R sides, given my interactions over many years in Spark dev.
I support adding it for 3.1, but not now at this early stage if it's unstable. As I explained earlier, I take this DSv2 case as an exceptional case. See the concern in #27331 (comment) too.

This isn't a great way to discuss, when you ignore something because you don't think it's important or relevant.

I just wanted to know the rough picture rather than asking you to assert the stability here, because you are the one who drove DSv2 in the community, and I do believe you're the right one to ask. I fully understand that things can change.

I am here to help and make progress rather than to nitpick or assign blame for something not done. I fully understand the pain we had with DSv2. It would be nicer if we could be more cooperative next time.

rdblue (Contributor) commented Jun 26, 2020

> This isn't a great way to discuss

To clarify, this isn't a discussion. There isn't more for me to say since I've already added my perspective.

```python
        self.mode(mode)._jwrite.jdbc(url, table, jprop)


class DataFrameWriterV2(object):
```
A contributor commented:

shall we move it to a new file?

zero323 (Member, Author) commented:

Sure, if you think that's a better approach. I don't have a strong opinion, though the feature is small and unlikely to be used directly.

HyukjinKwon (Member) commented:

Okay, @zero323, can you address the comments except #27331 (comment)? Let's just merge this one. I will make a PR myself to fix the things I pointed out.

SparkQA commented Jul 19, 2020

Test build #126123 has finished for PR 27331 at commit 90ddbcc.

  • This patch fails Python style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126122 has finished for PR 27331 at commit c8fe7e7.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126124 has finished for PR 27331 at commit 3093c35.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Jul 19, 2020

Test build #126131 has finished for PR 27331 at commit 9197c84.

  • This patch fails PySpark pip packaging tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon (Member) commented:

I discussed this offline with @rdblue, @cloud-fan and @brkyvz. I will take a look myself and try to make a fix soon.
Thanks for working on this, @zero323, and thanks for bearing with me here, guys.

HyukjinKwon (Member) commented:

@rdblue, @brkyvz, @cloud-fan, I am merging this since I am going to make a follow-up soon. Let me know if there are any more comments here; I will address them in the follow-up.

HyukjinKwon (Member) left a comment:

LGTM

HyukjinKwon (Member) commented:

Merged to master.

@zero323 zero323 deleted the SPARK-29157 branch July 20, 2020 05:51
holdenk pushed a commit to holdenk/spark that referenced this pull request Oct 27, 2020

Closes apache#27331 from zero323/SPARK-29157.

Authored-by: zero323 <[email protected]>
Signed-off-by: HyukjinKwon <[email protected]>